Preprint reviews by Patrick Schloss

TaxAss: Leveraging Custom Databases Achieves Fine-Scale Taxonomic Resolution

Robin Rebecca Rohwer, Joshua J Hamilton, Ryan J Newton, Katherine D McMahon

Review posted on 09th November 2017

The preprint by Robin Rohwer and colleagues seeks to develop a workflow that complements methods for classifying 16S rRNA gene sequences, providing greater precision than the Wang naive Bayesian classifier alone. This is an issue that many people have raised with me. A lack of classification for a sequence can be blamed on inadequacies of the taxa represented in the database, a lack of taxonomic data (e.g. at the species level) within the database, and the selection of the region within the 16S rRNA gene to classify. This paper is primarily concerned with the first problem, by supplementing the reference with ecosystem-specific sequences, and touches on the second problem by adding finer taxonomic information for the ecosystem-specific sequences. I felt that the authors were a bit conflicted over what they wanted this manuscript to be. Is it a description/announcement of a new method, TaxAss? Is it a validation study? Is it a benchmarking study? Overall, it is a description of a new method that is being used by the authors and others. However, I feel the description needs some help, as there are points in the manuscript that are not clear. Furthermore, I felt that the validation and benchmarking could be improved to quantify the need for the method and to demonstrate that the method overcomes that problem.

General comments...

1. As described in Figure 1, sequences that score above an empirically determined threshold when compared to an ecosystem-specific database are classified against that database, and those that fall below the threshold are classified using a comprehensive database. Perhaps because I am familiar with people using blastn to classify sequences, it was not clear to me as I read the manuscript whether the sequences in the two arms were then classified using the Wang method or blastn. Reading through the source code, it looks like blastn is only used to split the dataset; once split, the data are classified using the Wang method. Perhaps this could be clarified in the text.
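As I read the code, the splitting step amounts to something like the following (a minimal Python sketch of my understanding, not the authors' actual scripts; the `hits` values stand in for the best blastn percent identities, and the 98% cutoff is only an illustrative placeholder for the empirically determined threshold):

```python
def split_by_identity(hits, cutoff=98.0):
    """Partition OTU ids into two classification arms based on their
    best percent identity to the ecosystem-specific database."""
    ecosystem_arm, general_arm = [], []
    for otu, pident in hits.items():
        # above the cutoff -> classify against FreshTrain;
        # below -> fall back to the comprehensive database
        (ecosystem_arm if pident >= cutoff else general_arm).append(otu)
    return ecosystem_arm, general_arm

# hypothetical blastn results: OTU id -> best percent identity to FreshTrain
hits = {"otu1": 99.2, "otu2": 91.0, "otu3": 98.0}
eco, gen = split_by_identity(hits)
```

Both arms would then be passed separately to the Wang classifier with the appropriate reference.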

2. It is not clear how the FreshTrain database was developed or how it is curated to add finer taxonomic names to the sequences. The authors have done this for the readers who are interested in fresh water bacteria, but what steps should someone interested in gut microbiota take to recreate the database to classify their data? More importantly, how did the authors decide on their taxonomic levels of lineage, clade, and tribe? Why not follow the phylogenetic approach used by the greengenes developers for defining family, genus, and species for environmental sequences?

3. I am also not clear on why the authors did not want to pool FreshTrain with one of the comprehensive databases. A simple `cat` command would concatenate the two files, producing a single reference. The downside is that they would need to add the same level of taxonomic detail that is in the FreshTrain database to the greengenes database. Another downside of the greengenes database is that the core reference appears to be mothballed going forward, while RDP and SILVA are still actively developed.

4. One motivation that the authors state for the method is the issue of "forcing". I would call these "false positives", but I get their point. The authors raise this issue numerous times, yet I was unable to find a citation that quantifies forcing, and the authors do not appear to measure the amount of forcing in their data. Perhaps this is what they were getting at in Figure 3? If that is the case, then I am a bit troubled because they are accepting the FreshTrain data as the ground truth when it has not been validated yet. I could also imagine that even with FreshTrain there might be forcing if a taxonomic name is set for the full-length sequence, but two variable-region sequences are identical even though their parent sequences have different taxonomies. More importantly, the source code indicates that the authors are using any confidence score without applying a filter. The suggested confidence score is 80%, not 0%. I don't think that the problem with classifications from the Wang method is forcing; rather, it's that the classifications don't go deep enough. Something may classify as a Bacillus with 20% confidence, and so researchers should work their way up the taxonomy until the classification is above 80%, which might be Firmicutes. In offline conversations with the authors, they reassured me that they are applying an 80% threshold in separate scripts. It would be worth stating in the Methods section that they are using 80% as a threshold.
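To illustrate what I mean by working up the taxonomy, here is a minimal Python sketch (the lineage and confidence values are hypothetical, not drawn from the authors' data):

```python
def deepest_confident(taxonomy, cutoff=80):
    """Return the deepest taxonomic assignment whose bootstrap
    confidence meets the cutoff, walking up from the finest rank."""
    for rank, name, confidence in reversed(taxonomy):
        if confidence >= cutoff:
            return rank, name
    return None  # nothing classifiable even at the coarsest rank

# hypothetical Wang-classifier output for one sequence
lineage = [
    ("phylum", "Firmicutes", 97),
    ("class", "Bacilli", 85),
    ("genus", "Bacillus", 20),
]
deepest_confident(lineage)  # falls back to ("class", "Bacilli")
```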

5. Related to this point, at L122 the authors state that "In a large database an OTU dissimilar to any reference sequences will not be classified repeatably as any one taxon, resulting in a low bootstrap confidence." This is correct, but a bit misleading. I would suggest saying "...repeatedly as any one genus, resulting in a low bootstrap confidence and reclassification at a higher taxonomic level where there is sufficient bootstrap confidence". I am concerned that the results and the discussion of forcing are based on using no confidence threshold rather than the default 80% threshold.

6. To measure forcing, I would like to see the authors run the greengenes and FreshTrain databases back through the classifier using a leave-one-out testing procedure and quantify how many times the incorrect classification is given, when using the 80% (or even the 0%) threshold. Again, I suspect the results would indicate that the problem isn't one of forcing, but of "holding back". To be clear, this isn't necessarily a problem with the Wang method, but the databases. Addressing this point is where I think the authors could really do the field a service. It would be a really helpful contribution to show the percentage of forcing (false positives) and holding back (false negatives?) in a leave-one-out scheme and on a real dataset when classifying with (1) each of the comprehensive databases, (2) using TaxAss with the comprehensive databases and FreshTrain, (3) merging the comprehensive databases with FreshTrain and running them through the Wang classifier.
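A leave-one-out evaluation along these lines could be sketched as follows (a hypothetical harness; `classify(seq_id, db)` stands in for whichever classifier/database combination is being tested and would return a predicted genus, or None when the confidence threshold is not met at the genus level):

```python
def leave_one_out(references, classify):
    """Count forcing (a confident but incorrect call) and holding back
    (no confident call at all) in a leave-one-out pass over a
    reference database. `references` maps sequence id -> true genus."""
    forcing = holding_back = 0
    for seq_id, truth in references.items():
        # withhold the query sequence from the reference database
        db = {s: t for s, t in references.items() if s != seq_id}
        predicted = classify(seq_id, db)
        if predicted is None:
            holding_back += 1   # false negative: too shallow/unclassified
        elif predicted != truth:
            forcing += 1        # false positive: forced into the wrong taxon
    return forcing, holding_back
```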

7. I am not sure what the authors mean by "maintaining richness" as they use it in the manuscript. Could the problem they are trying to address be described better? Also, I would ask whether they know what the *true* richness is and, if not, why they think that one value of richness is better than another. Perhaps this corresponds to what I might call "underclassification" or "false negatives".

L25 - why not include the RDP reference database in this list?

L49 - "Course" should be "Coarse"



Updating the 97% identity threshold for 16S ribosomal RNA OTUs

Robert C Edgar

Review posted on 11th October 2017

The preprint by Robert Edgar sets out to take on the issue of what similarity threshold should be used to delineate bacterial species using partial and full-length 16S rRNA gene sequences. This is well-covered territory, and I'm not sure that many people would defend to the death the assertion that a 97% cutoff describes species-level taxa. It is helpful to have a discussion about the various thresholds people use to bin sequences into OTUs. I think that the broader discussion, and the discussion in this specific preprint in favor of a high threshold (e.g. 99.9 or 100%), has come off as rather dogmatic. My comments below include suggestions for taking a more nuanced view. Ultimately, I think Edgar's and others' goal of pushing the field to a high threshold is an attempt to get a tool to do something it is not capable of doing. Specifically, 16S rRNA gene fragment sequences cannot delineate bacterial species and cannot tell us about phenotype. If scientists have these types of questions, there are far more powerful tools at their disposal than debating the appropriate threshold for defining OTUs.

To be transparent, a considerable amount of the material that Edgar uses as a point of contrast to his work comes from papers that I have published over the past few years, and I am the creator of mothur. As of writing this review, I have not been asked to review this manuscript for a journal, but I would be happy for any editor to use my comments. Judging from the style of writing, my sense is that this preprint is unlikely to have been submitted to a journal yet.

Major comments.

1. The general approach Edgar has taken is to use a variety of metrics to compare the composition of operational taxonomic units (OTUs) generated by database-independent approaches to the taxonomic assignments for those sequences. By identifying the distance threshold that optimizes these metrics, he arrives at the conclusion that the widely used 97% threshold is too low. Although this approach may be new, this conclusion is not (see the numerous papers published by [Tiedje and Konstantinidis](https://www.ncbi.nlm.nih.go...)). I have significant concerns about his method and do not think Edgar has appropriately described its limitations. The approach is problematic because systematists are inconsistent in how they lump and split strains into bacterial species. From the perspective of the 16S rRNA gene, some species are finely split (e.g. Bacillus cereus, subtilis, anthracis) and others are lumped (e.g. Pseudomonas putida). There is broad consensus within microbiology that the 16S rRNA gene is unable to delineate bacterial species or phenotype. Furthermore, a 250 nt region of that gene is even less able to delineate a species. Considering that a minority of bacteria have actually been assigned a species-level classification, using taxonomy as the ground truth for assessing a threshold is problematic. Previous attempts have replaced the DNA-DNA hybridization approach of Stackebrandt and Goebel with genome-scale phylogenies and attempted to correlate that structure with 16S rRNA gene sequence diversity. These caveats, as well as a more thorough review of previous attempts to find a better cutoff, are warranted in a revised manuscript.

2. One of the reasons to favor a less restrictive threshold (e.g. 97%) is that there is considerable intragenomic variation in addition to considerable intraspecies variation. Using a higher threshold risks splitting sequences from the same genome into different OTUs. Previously, Edgar has indicated that he thinks this variation is the result of sequencing artifacts or contamination (see the bottom of page 9); they are not. As an example of intragenomic variation, E. coli ATCC 70096 has 7 copies of the 16S rRNA gene, and 6 of these differ from each other over the full length of the gene. Fortunately, within the V4 region the 7 copies are identical. Alternatively, Staphylococcus aureus ATCC BAA-1718 and Staphylococcus epidermidis ATCC 12228 both have 5 copies of the 16S rRNA gene. Considering the V4 region of these species, 4 of the 5 copies in each genome are identical between the two species. The remaining S. aureus copy is 1 nt different from the other S. aureus copies; however, the remaining S. epidermidis copy is 1.7 and 2.0% different from the other S. epidermidis and S. aureus copies, respectively. The less restrictive threshold would lump the two species together; however, the more restrictive threshold suggested by Edgar would generate 3 OTUs. Neither outcome reflects the biology he claims, and the latter would split sequences from the same strain into different OTUs. Given the ubiquity of these strains in skin-associated communities, it would make sense to offer a more guarded recommendation than to make dogmatic pronouncements about using high thresholds. In the Discussion, Edgar brushes off intraspecies variation concerns and seems to ignore the case where an investigator would like to make an inference regarding the association between the relative abundance of individual OTUs and different treatment groups.
Furthermore, he seems to think it would be possible to correct for the inflated alpha diversity metrics obtained by splitting sequences from the same species into different OTUs; the same could be said of correcting for a lower threshold. Although Edgar's Pcs calculations seem to account for intraspecies variation, they do not seem to factor in intrastrain variation.
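To make the lumping/splitting point concrete, here is a toy single-linkage clustering sketch using distances approximating the percentages quoted above (the distance values are illustrative, and single linkage is only one of several clustering rules an investigator might use):

```python
def single_linkage(dists, ids, threshold):
    """Greedy single-linkage clustering: keep merging any two clusters
    that contain a pair of sequences within `threshold` distance.
    Missing pairs are treated as maximally distant."""
    clusters = [{i} for i in ids]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(dists.get(frozenset((a, b)), 1.0) <= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

# illustrative V4 distances: one variant S. aureus copy (sa), one variant
# S. epidermidis copy (se), and the pool of identical shared copies
dists = {frozenset(("sa", "shared")): 0.004,   # ~1 nt / 250 nt
         frozenset(("se", "shared")): 0.017,
         frozenset(("sa", "se")): 0.020}
single_linkage(dists, ["sa", "se", "shared"], 0.03)  # 1 OTU (97% threshold)
single_linkage(dists, ["sa", "se", "shared"], 0.0)   # 3 OTUs (100% identity)
```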

3. Edgar states "Also, state-of-the-art denoisers have been shown to accurately recover biological sequences from 454 and Illumina amplicon reads (Quince et al., 2009; Callahan et al., 2016; Edgar, 2016) suggesting that the best strategy for amplicon reads is to cluster denoised sequences, in which case the clustering problem is well-modeled by error-free sequences from known species." Again, I would encourage caution in pushing these methods as the strengths and weaknesses of the approaches are not well established. Some of the methods are aggressive in removing rare sequences that may be true sequences, others seem to overfit complicated models, and as described above, others may be splitting 16S rRNA genes from the same genome into different OTUs. Furthermore, the lack of randomness in sequencing errors has not been addressed thoroughly, which creates the possibility that a spurious sequence with sufficient sequencing coverage could be treated as a new OTU rather than be folded into a similar OTU. Finally, these methods have not been well validated for the breadth of sequencing platforms that people are using. I am far more confident in the quality of sequences generated from fully overlapping 250 nt MiSeq reads for the V4 region than I am for single HiSeq reads of the V4 region. There is a trend for people to push the length of the region and throughput at the expense of quality. In short, I agree that a species likely requires a very high threshold for 16S rRNA gene sequences; however, I am not convinced by the papers he has cited that the data accumulated in the literature is of sufficient quality to trust OTUs generated with high thresholds. Combined with the reality of intragenomic variation, I see value in having a more nuanced recommendation.

4. I am happy to receive Edgar's critique regarding the methods used in mothur. However, I do not see how his sections comparing mothur and pairwise alignments, or discussing adverse triplets, help make his points about the OTU threshold. I would suggest removing these sections unless he can tie them in better to his bigger claims; I certainly would not lead off the Discussion with a critique of my use of the Matthews correlation coefficient. That is a weak way to summarize his story. The following two comments address these specific points, which, again, I do not feel have a direct connection to the goal of the paper.

A. The comparison between NAST-based profile alignments and pairwise alignments has previously been published. We too saw that pairwise alignments yield smaller distances than profile alignments (doi: 10.1371/journal.pcbi.1000844 and doi: 10.1371/journal.pone.0008230). By definition, a pairwise alignment optimizes the similarity between the two sequences. In contrast, a profile-based alignment, where the reference is aligned to the secondary structure of the 16S rRNA molecule, incorporates additional information. This frequently increases the distance between sequences because of that extra information. I have also addressed this previously in the literature (doi: 10.1038/ismej.2012.102). I agree that the example Edgar shows is a problem. It is a well-known issue with profile alignments: if there are problems in the reference, there will be problems with the alignment. When using the SILVA reference alignment, such errors can be corrected by fixing the reference alignment. Furthermore, I would point out that an advantage of using a profile alignment like the NAST aligner in mothur is that it is considerably faster than pairwise alignment. Generating all-vs-all pairwise alignments for N sequences takes on the order of N times longer than N profile alignments (i.e. profile alignments scale linearly with the number of sequences while pairwise alignments scale quadratically). With large datasets, pairwise alignments can be prohibitive, while a profile alignment takes only seconds.

B. Regarding the section "Comments on the MCCsw metric"... I readily acknowledge that because evolution does not care to conform to a similarity threshold when creating species, there will be "adverse triplets" around any threshold. As I've pointed out above, there are adverse triplets in the case of S. epidermidis V4 sequences and full-length E. coli 16S rRNA gene sequences. In fact, this is why we developed the MCC metric: it evaluates how well an algorithm balances the need to split and lump similar 16S rRNA gene sequences when assigning sequences to a bin. We have used MCC in a fundamentally different manner than Edgar has in this paper: we used it assuming that the taxonomic databases are not helpful, whereas he uses it assuming that taxonomy is the ground truth. Perhaps there is room for both views, but given the points I raised above, I am happy to stick with my approach over Edgar's.
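For readers unfamiliar with the metric, MCC over sequence pairs can be computed as follows (a minimal sketch; here a "positive" pair is one that the OTU assignments place in the same bin, and a "true" pair is one within the chosen distance threshold):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from pair-level counts:
    tp = pairs correctly clustered together, tn = correctly apart,
    fp = clustered together but too distant, fn = split but similar."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A perfect clustering gives 1.0, a perfectly inverted one gives -1.0, and a random assignment hovers near 0.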


Meta Analysis Of Microbiome Studies Identifies Shared And Disease-Specific Patterns

Claire Duvallet, Sean M. Gibbons, Thomas Gurry, Rafael A. Irizarry, Eric J. Alm

Review posted on 09th June 2017

My research group reviewed the preprint version of this manuscript on May 18, 2017 and we prepared this joint review.

The manuscript from Duvallet and colleagues seeks to create a database of case-control gut microbiome studies from 16S rRNA gene sequences and then use that database to look for consistent signatures of health and disease across a number of diseases. Overall, this is an interesting idea that is similar to an approach our lab and others have pursued to look across studies to identify signatures that are emblematic of lean or obese individuals. Unfortunately, the work has a number of technical problems and attempts to say too much without obtaining a full representation of data from the literature or incorporating the clinical nuances of the diseases they study.

Most of our concerns would be overcome, and the findings and impact of the paper would be greatly strengthened, by testing the hypothesis that the core microbiome identified in Figure 3 is indeed sufficient to classify cases and controls across all of the studies. The authors should test the sensitivity/specificity of the core microbiome for classifying disease and control cases across studies. If the hypothesis is that there is a core microbiome common to all diseases, then the predictive accuracy of models built from these core members should be relatively successful at identifying generalized disease microbiomes regardless of whether the disease is cancer, obesity, diarrhea, etc. Even if the model is not generalizable across all diseases, it would be important to know whether a model for a single disease group is predictive of controls and cases within that disease group. Again, this was an approach that we used with Random Forest modeling, using one study to predict obesity status in other studies.
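The sort of test we have in mind could be as simple as the following sketch (hypothetical scores and labels; the score here would be something like the fraction of disease-associated core taxa detected in a held-out sample):

```python
def sens_spec(scores, labels, cutoff):
    """Sensitivity/specificity of a core-microbiome score: a sample is
    called 'disease' when its score exceeds `cutoff`; labels are True
    for cases and False for controls."""
    tp = sum(1 for s, l in zip(scores, labels) if s > cutoff and l)
    fn = sum(1 for s, l in zip(scores, labels) if s <= cutoff and l)
    tn = sum(1 for s, l in zip(scores, labels) if s <= cutoff and not l)
    fp = sum(1 for s, l in zip(scores, labels) if s > cutoff and not l)
    return tp / (tp + fn), tn / (tn + fp)

# hypothetical held-out study: score = fraction of core taxa detected
sens_spec([0.9, 0.2, 0.8, 0.1], [True, True, False, False], 0.5)
```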

In the re-analysis of the CDI data, we are concerned that the non-CDI diarrheal controls have been grouped with the healthy controls (Figure 2). That seems to be the case at least for the referenced study from our lab (Schubert), as our study had 94 CDI cases, 89 diarrheal controls, and 155 non-diarrheal controls. Thus, the 243 controls reported (Table 1) would appear to represent a pooling of the diarrheal and non-diarrheal controls. We would suggest instead only using the diarrheal controls. We would strongly encourage the authors to confirm for each disease group that the control samples are similar across all studies.

Similar to the concern over the data used for the Schubert CDI data, the definition of ‘cases’ may need reconsidering for some diseases - what is a case for an HIV patient, for instance? Actively replicating virus? Reduced CD4 count? People that are HIV positive are often quite healthy with no detectable viral load. Similarly, IBD encompasses a range of bowel diseases and a ‘case’ of UC is different from a ‘case’ of Crohn’s. Further, a note or clarification of whether any of these patients were on antibiotics (and if this could be a confounding factor) is necessary.

When generating and testing the ‘core’ microbiome across all diseases, we wonder whether the test falsely amplified CDI-associated microbes in the pan-disease core because so many taxa were altered in CDI cases. Similarly, we wondered how the authors controlled for the variation in effect sizes when generating the list of core microbes. We would encourage something like the Z-transform that was used in the Sze and Schloss obesity meta-analysis.
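For concreteness, the weighted Z-method we have in mind combines per-study statistics like this (a minimal sketch of Stouffer's method, not the exact procedure from the Sze and Schloss analysis):

```python
import math

def stouffer(z_scores, weights=None):
    """Combine per-study Z statistics (Stouffer's method); with equal
    weights this is sum(z) / sqrt(k), so no single study's effect
    size dominates the combined estimate."""
    if weights is None:
        weights = [1.0] * len(z_scores)
    num = sum(w * z for w, z in zip(weights, z_scores))
    return num / math.sqrt(sum(w * w for w in weights))

stouffer([2.0, 2.0, 2.0])  # three consistent studies -> 2 * sqrt(3)
```

Weighting by sample size (e.g. sqrt(n) per study) is a common refinement.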

The ROC curves appear to be inverted in the case of Dinh 2015 (Supplemental Figure 5). This can result from inverted category labels and can make the resulting AUC artificially appear lower than 0.5. We suggest the authors recalculate the AUCs from the inverted ROC curves so that all AUCs fall between 0.5 and 1. In addition, the 0.5 AUC baseline is only meaningful when there are equal numbers of cases and controls; a chance-corrected statistic such as Cohen's kappa corrects for the class distribution. If 90% of the samples are cases, then one would expect to be correct 90% of the time, not 50%.
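Both fixes are simple to implement (a minimal sketch; the function names are ours, not the authors'):

```python
def rectify_auc(auc):
    """An AUC below 0.5 usually means the case/control labels were
    inverted; flipping the predicted labels flips the curve."""
    return max(auc, 1.0 - auc)

def baseline_accuracy(n_cases, n_controls):
    """Accuracy of always guessing the majority class; with 90% cases
    this is 0.9, not 0.5, which is why chance-corrected statistics
    such as Cohen's kappa matter under class imbalance."""
    return max(n_cases, n_controls) / (n_cases + n_controls)

rectify_auc(0.3)            # -> 0.7
baseline_accuracy(90, 10)   # -> 0.9
```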

Although the authors picked datasets that specifically dealt with the disease they were interested in, there are a number of other datasets that were not included but could have easily been added. For example, we found 10 studies that included obesity data with their sequence data. The control samples used in the Schubert and Baxter studies from the Schloss lab could also be used to look at the effects of obesity.

The authors should also note that the samples used in the Zackular study are a subset of Baxter and so the studies are not independent.

Comments on overall writing style:

Overall, we felt that the paper’s organization and purpose are unclear. For instance, in the abstract, “Here, we introduce the MicrobiomeHD database, which includes 29 published case-control gut microbiome studies spanning ten different diseases" - is this paper about the database? The database is hardly mentioned in the rest of the paper, and the reader needs more details about how it was assembled. Furthermore, the database does not appear to be comprehensive, and there is no indication of whether it will continue to be maintained over time. Alternatively, is the paper about the ‘core’ microbiome being able to predict healthy/disease status? If so, it needs a direct test of this hypothesis. Is the paper about re-analyzing and confirming the findings of previous studies? Or is it a true meta-analysis in which effect sizes are compared across studies? If either of those is the case, then the paper needs to be structured and concluded in a way that emphasizes those conclusions. The current version is a bit muddled in its structure and purpose.

It oversimplifies the complex nature of these diseases to classify treatment therapies into antibiotics, probiotics, or FMT. Each of these therapies can be affected by differences between human patients. At the very least, the discussion should include more caveats about these treatments, particularly since there are a few studies showing associations between antibiotic use and the long-term development of colorectal cancer (Cao et al 2017), as well as studies showing that manipulating the gut microbiome in mice via fecal transplants can spur the development of CRC tumors in a mouse tumorigenesis model (Zackular et al 2015).


Cohesion: A method for quantifying the connectivity of microbial communities

Cristina Herren, Katherine McMahon

Review posted on 07th March 2017

The preprint from Herren and McMahon describes a new metric - cohesion - to describe the overall connectedness within a community using temporal data. I was excited to see this preprint because I am familiar with McMahon's long history of developing rich time series data for microbial communities in Wisconsin lakes. I also have a lot of my own time series data from humans and mice where we struggle to incorporate time into the analysis to understand the interactions between bacterial populations.

A significant struggle in analyzing time course community data is the ability to synthesize observations for large numbers of taxa over time. Many of the existing methods people use attempt to adapt approaches from cross-sectional studies. For example, a study may sample a large number of lakes, people, soils, etc. and characterize their microbial communities. They'll then calculate correlations across those samples based on the relative abundance of the populations. Alternatively, they'll use presence/absence data to generate co-occurrence matrices. The problem with these studies is that the next step is often to infer something about the interactions between the populations - even if the populations could never possibly co-occur. Herren and McMahon's effort to study the connectedness of individual populations and their cohesion is very welcome because it has the potential to get us closer to describing the actual interactions between populations.

To briefly summarize the approach, the method starts by calculating the Pearson correlation between all pairs of populations across time and then discounts the correlation that would be expected if all interactions were random. This is important because of the compositional nature of the data and the effects of different population sizes. Next, the method calculates the average positive and average negative corrected correlation for each population. These become the positive and negative connectedness values for each population. Finally, the positive and negative cohesion values for each community are calculated by summing, across populations, the product of each population's connectedness value and its relative abundance.
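My reading of the algorithm can be sketched in a few lines of Python (this is my paraphrase, not the authors' code; in particular, `null_corr` stands in for the permutation-based null-model expectation, which here is collapsed to a single constant):

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cohesion(abund, null_corr=0.0):
    """Sketch of the cohesion workflow for a taxa-by-time table
    `abund` (dict: taxon -> list of relative abundances over time)."""
    taxa = list(abund)
    conn_pos, conn_neg = {}, {}
    for t in taxa:
        # null-corrected correlations of taxon t with every other taxon
        corrected = [pearson(abund[t], abund[u]) - null_corr
                     for u in taxa if u != t]
        pos = [c for c in corrected if c > 0]
        neg = [c for c in corrected if c < 0]
        conn_pos[t] = sum(pos) / len(pos) if pos else 0.0
        conn_neg[t] = sum(neg) / len(neg) if neg else 0.0
    # cohesion at each time point: sum of abundance x connectedness
    n_times = len(next(iter(abund.values())))
    coh_pos = [sum(abund[t][i] * conn_pos[t] for t in taxa)
               for i in range(n_times)]
    coh_neg = [sum(abund[t][i] * conn_neg[t] for t in taxa)
               for i in range(n_times)]
    return coh_pos, coh_neg

# toy example: x and y rise together while z declines
abund = {"x": [0.1, 0.2, 0.3], "y": [0.1, 0.2, 0.3], "z": [0.3, 0.2, 0.1]}
coh_pos, coh_neg = cohesion(abund)
```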

The following are general critiques and questions, which I appreciate may be beyond the scope of the current manuscript (note, I am not a reviewer for the manuscript at a journal):

1. To develop the cohesion metric for a community, the authors sum over all of the populations in the community. This raised three questions for me. First, independent of the relative abundances in each sample, is the *number* of positive and negative connections for each population relevant? It might be worthwhile exploring which populations have more positive/negative connections than others. What does that distribution look like? Second, does the connectedness value itself have any interpretive value? Which populations are highly connected with other populations? Finally, the method generates a cohesion value for each time point. If I think of Lake Mendota as a community that was sampled over time, it would be interesting to know whether it has been more cohesive than Lake Monona over the 19 years of sampling. Thinking of my own work, I would be interested in knowing whether mice that are more susceptible to C. difficile colonization are less cohesive than those that are resistant. Again, this would require a composite score, not individual scores for each time point.

2. Continuing on my self-serving thread, I wonder how sensitive the method is to the time interval between samples and to the number of samples. In my experiments I may have 20 daily samples from a mouse - is this sufficient? What if we miss a day - how will having a jump between points affect the metrics? As the authors state, the Lake Mendota dataset has 293 samples collected over 19 years (i.e. about 1.3 samples/month). This is a unique dataset that is unlikely to be replicated elsewhere. What if we were to get more frequent samples? What if they were more spaced out? What if we only had a year's worth of data? It would be interesting to see the authors describe how their cohesion values change when they subset the dataset to simulate more realistic sampling schemes.

3. A significant challenge in developing these types of metrics is not knowing the true value of the metric in nature. I appreciate Herren and McMahon's effort to validate the metrics by comparing their results to count data and by explaining the variation in Bray-Curtis distances. The manuscript reads almost as if they want their method to recapitulate what is seen with those distances. But we already have Bray-Curtis distances; if that's the goal, then why do we need the cohesion metric? It would be interesting to see the authors simulate data from communities with varying levels of cohesion and abundance to confirm that the method recovers the expected cohesion value. Perhaps it would be possible to use an ODE-based model to generate the data instead of drawing from variance/covariance structures. There is one simulation described at the end of the Results (L300); however, it is unclear whether the lack of a meaningful R-squared value was the expected result or not.

4. Throughout the manuscript, the authors make use of parametric statistics such as Pearson's correlation coefficients and the arithmetic mean. Given that relative abundance data are generally not normally distributed and are likely zero-inflated, I wonder why the authors made these choices. I would encourage the authors to instead use Spearman correlation coefficients and median values. Related to this point, a concern with using these correlation coefficients is the problem of double zeros where two populations may be absent from the same communities. These will appear to be more correlated with each other than they really are, which is why we don't use these metrics for community comparison - we use things like Bray-Curtis. I wonder whether subtracting the null model counteracts the problem of double zeroes.
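A toy example makes the double-zero problem concrete: two populations whose nonzero abundances are perfectly anti-correlated can still show a positive Pearson correlation because of shared absences (the counts below are invented for illustration):

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# six shared absences, then perfectly anti-correlated abundances
a = [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]
b = [0, 0, 0, 0, 0, 0, 4, 3, 2, 1]
pearson(a, b)          # ~ +0.5, driven by the shared zeros
pearson(a[6:], b[6:])  # -1.0 once the double zeros are dropped
```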

5. The authors translate their count data into relative abundance data before calculating their correlation and Bray-Curtis values. I wonder if the authors subsampled or rarefied their data to a common number of individuals. Both of these metrics are sensitive to uneven sampling. Even if the counts are converted to relative abundances, this would not remove the effects. For example, if one sample has 1000 individuals and another has 100, the limit of detection on the first would be 10-fold higher than the second. There may be populations that represent 0.5% of both communities that would not be seen in the second. If they haven't already, I would encourage the authors to subsample their dataset to a common number of individuals.
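Subsampling to a common depth is straightforward (a minimal sketch of rarefying a single sample's taxon counts; the authors may of course prefer an existing implementation):

```python
import random

def rarefy(counts, depth, seed=1):
    """Subsample a taxon -> count table to a fixed number of individuals
    without replacement, so all samples share one detection limit."""
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)  # seeded for reproducibility
    out = {}
    for taxon in rng.sample(pool, depth):
        out[taxon] = out.get(taxon, 0) + 1
    return out

# a sample with 1000 individuals rarefied to 100
rarefy({"a": 900, "b": 100}, 100)
```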

6. The "Description of datasets" section of the Methods describes the various datasets in general terms, but what is the nature of the data? How were the phytoplankton counted? How many individuals were sampled from each sample?

7. It would be great to have the code that was used made publicly available on GitHub.

8. The authors present the material in a format that I have not previously seen in the microbial ecology literature (i.e. ISMEJ where this appears to be destined for review). The authors flip back and forth between presenting a different stage of the algorithm and validating that step. I think this is a bit more aligned with how one would present the material in a talk than in a paper. I've seen similar methods development described before where there might be a methods section on algorithm development and then the results section would test the assumptions and performance of the algorithm. I'm curious to see whether this structure persists through the editorial process.



16S rRNA Gene Sequencing as a Clinical Diagnostic Aid for Gastrointestinal-related Conditions

Daniel E Almonacid, Laurens Kraal, Francisco J Ossandon, Yelena V Budovskaya, Juan Pablo Cardenas, Jessica Richman, Zachary S Apte

Review posted on 06th December 2016

To be clear, I was not asked to review this manuscript by a journal and have no connection to uBiome. This review has been cross-posted and makes reference to the version of the preprint posted on October 31, 2016.

Almonacid and colleagues describe the use of 16S rRNA gene sequencing as a clinical diagnostic tool for detecting the presence of bacteria and archaea commonly associated with fecal samples in health and disease. On the whole, the method is not novel in that many people have been doing 16S rRNA gene sequencing of samples for many years now. The potential novelty of the manuscript is that it attempts to place the value of this technology in a clinical diagnostic rather than exploratory setting. The potential impact of this paper is reduced because it is more of a proof of concept than a comparative demonstration relative to other methods. Overall, the methods are poorly described, and there are a number of overly generalized claims that are not supported by the literature or their data. The most glaring problem is that the authors assume that the presence of a V4 sequence identical to that of a pathogen is proof that the organism is present.

Major comments

1. L16-18, 43-51. I'm curious whether the authors actually have citations to back up the primacy of manual culture-based methods in clinical diagnostic laboratories or their limitations. My understanding is that much of clinical diagnostics is highly automated and, while it may use some amount of cultivation, the actual analyses are quite modern. The authors at least need to recognize the high levels of automation and the use of qPCR, ELISA, and mass spectrometry-based approaches in most diagnostic labs. In fact, the authors later use one of these methods, Luminex's xTAG Gastrointestinal Pathogen Panel, to help develop the panel of organisms used in their own method. The authors' new method may be novel, but they should portray its novelty using a relatively modern comparison rather than a straw man. The manuscript would be considerably strengthened by comparing the Luminex method (or any other method) to the current method.

2. The authors have tested whether they are able to distinguish distantly related pathogens, but have not done due diligence in determining whether the approach can distinguish pathogenic and non-pathogenic organisms. As an example, they state that "the pathogen Peptoclostridium difficile is found in ~2% of the healthy cohort which shows that asymptomatic P. difficile colonization is not uncommon in healthy individuals (L211)." This statement is emblematic of a number of problems with the authors' analysis. First, the presence of P. difficile/C. difficile does not mean that it is in fact a pathogen, as there are many non-toxigenic and, thus, non-pathogenic strains of this organism - the V4 region is simply not a virulence factor. Second, there is already a toxin-based assay for toxin-producing strains that is likely more sensitive and specific than this sequence-based approach and much cheaper for this and other pathogens. Third, the V4 region is only about 250 nt in length. There is always the risk that closely related but different organisms may have the same sequence, and that the same organism may generate different sequences because of intra-genomic variation. When I used blastn to compare the region of the P. difficile sequence in Table S2 that would be amplified by their primers to NCBI's reference 16S rRNA gene sequences, it returned two additional P. difficile strains (JCM 1296 and ATCC 9689) that are identical to each other but 1 nt different from the sequence in Table S2. It is interesting that none of the sequences in the NCBI reference were an exact match as required by the current method. When I performed a similar analysis using the authors' E. coli/Shigella sequence, it matched multiple Escherichia and Shigella strains, most of which were not pathogenic. Based on all of this, I am not sure how much utility a clinical diagnostic laboratory would gain from using this method over others. None of these points are considered in the authors' discussion.

3. The authors lay out a "healthy reference range" for each of their 28 targets (L199-210). I worry about such a claim, when really the authors are likely only defining an operational healthy range so that they can optimize the sensitivity and specificity of pathogen detection. Claiming a healthy range as they have assumes that the subjects are truly healthy (there is no indication of whether the subjects were honest in self-reporting) and that the microbial communities did not change between collection and analysis. To this second point, the Methods are poorly described and validated. Specifically, I am unclear what "specifications" were laid out by the NIH Human Microbiome Project that would be relevant for this method (L100-102). Furthermore, what is the composition of the lysis and stabilization buffer that allows samples to be stored at ambient temperatures? The authors need to either provide data or a reference to support this claim, including evidence that the community composition does not change. All this is necessary to report for others hoping to repeat the authors' work and for improving the clarity of the writing.

4. I am impressed by the authors' ability to quantify the relative abundance of these strains using PCR and sequencing. This runs a bit counter to the prevailing wisdom that there are PCR biases at work that would skew the representation of taxa such that the final proportions are not representative of the initial proportions. I'm a bit confused by the description of the experiment. Namely, what was the diluent DNA that is mentioned in the Methods (L142)? Although the quantitative results are impressive, I am a bit concerned that the authors used DNA fragments that overlap the V4 region of the 16S rRNA gene rather than genomic DNA.

5. Similar to the previously described concerns regarding the methods description, the list of accessions in the curated database that is described should be made publicly available since this is a critical component of the method (L171-185). More details are needed that describe how this database was created. The manuscript states "After optimizing the confusion matrices for all preliminary targets...", but it is unclear what "optimizing" means and what was altered to generate better performance. Furthermore, I am curious whether uBiome paid for a license to use the SILVA reference. Unlike many other references, this is not a database that is free for non-academic usage. Considering they are a for-profit company and are likely to commercialize this, they may want to consider a database that is more public. That being said, I don't know why the authors would need to use the SILVA reference since they are not making use of the alignment, taxonomy, or metadata features contained within the database.

Minor comments:

6. L78-80. "Regularly evaluating the microbiome to monitor overall health is therefore gaining traction in contemporary medicine and needs to be part of modern diagnostics."

7. L102-109 include no citations. Although these may be "standard protocols", specific protocols should still be cited as there are no standards and to give credit to those that developed the protocols.

8. L112-125. The authors present a method for denoising and building contigs from their sequence data that uses Swarm. As far as I know, this approach to denoising the data is novel and has not been validated in this paper or others. Alas, I'm not sure why they bothered with the Swarm clustering since they take the contigs and map them against the SILVA reference database for exact matches. The justification for these two steps is not clear and needs to be clarified.

9. L154. "Two out of 35 control samples did not pass our sequencing quality thresholds". If I am right in assuming that this is the previously mentioned 10,000-sequence threshold (L129), then the authors should be specific in stating that here. If there are other thresholds, then those should be stated at some point in the manuscript.

10. "dysbiosis" is used throughout the manuscript. This is a trendy piece of jargon that is pretty meaningless. Furthermore, their method does not really address the whole community, which is usually done when describing a dysbiotic state. This manuscript describes the quantification of single strains.

11. I do not believe that Peptoclostridium difficile is a valid name for Clostridium difficile. At this point, it appears that the most recent valid name is Clostridioides difficile.
