Preprint reviews by Patrick Schloss

Meta Analysis Of Microbiome Studies Identifies Shared And Disease-Specific Patterns

Claire Duvallet, Sean M. Gibbons, Thomas Gurry, Rafael A. Irizarry, Eric J. Alm

Review posted on 09th June 2017

My research group reviewed the preprint version of this manuscript on May 18, 2017 and we prepared this joint review.

The manuscript from Duvallet and colleagues seeks to create a database of case-control gut microbiome studies from 16S rRNA gene sequences and then use that database to look for consistent signatures of health and disease across a number of diseases. Overall, this is an interesting idea that is similar to an approach our lab and others have pursued to look across studies to identify signatures that are emblematic of lean or obese individuals. Unfortunately, the work has a number of technical problems and attempts to say too much without obtaining a full representation of data from the literature or incorporating the clinical nuances of the diseases they study.

Most of our concerns would be overcome, and the findings and impact of the paper would be greatly strengthened, by testing the hypothesis that the core microbiome identified in Figure 3 is indeed sufficient to classify cases and controls across all of the studies. The authors should test the sensitivity/specificity of the core microbiome to successfully classify general disease and control cases across studies. If the hypothesis is that there is a core microbiome that is common to all diseases, then models using these core members should be relatively successful in finding generalized disease microbiomes regardless of whether it is cancer, obesity, diarrhea, etc. Even if the model is not generalizable across all diseases, it would be important to know whether a model for a single disease group is predictive of controls and cases within that disease group. Again, this was an approach that we used with Random Forest modeling to use one study to predict obesity status in other studies.

In the re-analysis of the CDI data, we are concerned that the non-CDI diarrheal controls have been grouped with the healthy controls (Figure 2). That seems to be the case at least for the referenced study from our lab (Schubert), as our study had 94 CDI cases, 89 diarrheal controls and 155 non-diarrheal controls. Thus the number of 243 controls (Table 1) would appear to represent a pooling of the diarrheal and non-diarrheal controls. We would suggest instead only using the diarrheal controls. We would strongly encourage the authors to confirm for each disease group that the control samples are similar across all studies.

Similar to the concern over the data used for the Schubert CDI data, the definition of ‘cases’ may need reconsidering for some diseases - what is a case for an HIV patient, for instance? Actively replicating virus? Reduced CD4 count? People that are HIV positive are often quite healthy with no detectable viral load. Similarly, IBD encompasses a range of bowel diseases and a ‘case’ of UC is different from a ‘case’ of Crohn’s. Further, a note or clarification of whether any of these patients were on antibiotics (and if this could be a confounding factor) is necessary.

When testing and generating the ‘core’ microbiome across all diseases, we wonder whether testing for the core falsely amplified the CDI-related microbes in the pan-disease core because there were so many that were altered in cases. Similarly, we wondered how the authors could control for the variation in effect sizes when generating the list of core microbes. We would encourage something like the Z-transform that was used in the Sze and Schloss obesity meta-analysis.
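As a minimal sketch of the kind of standardization we have in mind, assuming each study contributes per-sample values for the same measurement, the values can be centered and scaled within each study before they are pooled. The study names and numbers below are hypothetical, and this is not necessarily the exact procedure used in the obesity meta-analysis.

```python
from statistics import mean, stdev


def z_transform(values):
    """Center values to mean 0 and scale to unit sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant vector")
    return [(v - m) / s for v in values]


# Hypothetical per-sample values for one measurement in two studies
# that report it on very different scales.
studies = {
    "study_A": [2.0, 4.0, 6.0, 8.0],
    "study_B": [200.0, 400.0, 600.0, 800.0],
}

# Standardize within each study so that pooled values are comparable.
standardized = {name: z_transform(vals) for name, vals in studies.items()}
```

After the transform, both studies are on the same unitless scale, so a study with a large raw effect size no longer dominates the pooled analysis simply because of its scale.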

The ROC curves appeared to be inverted in the case of Dinh 2015 (supplemental figure 5). This can result from inverted categories, and can make the resulting AUC artificially appear lower than 0.5. We suggest the authors recalculate the AUCs from inverse ROC curves, so that all AUCs are between 0.5 and 1. In addition, the 0.5 AUC line only matters if one has an equal number of cases and controls; a statistic such as kappa corrects for the class distribution in the data. If 90% of the samples are cases, then one would expect to be correct 90% of the time by always guessing the majority class, not 50%.
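Both points can be illustrated with a small sketch, assuming binary labels (1 = case, 0 = control) and toy scores that are not taken from the manuscript. The AUC is computed here as the probability that a randomly chosen case scores higher than a randomly chosen control, and kappa is Cohen's kappa.

```python
def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def cohen_kappa(truth, predicted):
    """Agreement between truth and prediction, corrected for chance agreement."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    classes = set(truth) | set(predicted)
    expected = sum((truth.count(c) / n) * (predicted.count(c) / n) for c in classes)
    if expected == 1:
        return 0.0
    return (observed - expected) / (1 - expected)


# Toy scores where the categories are inverted: controls tend to score higher.
scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [1, 1, 0, 1, 0, 0]
auc_inverted = auc(scores, labels)
auc_corrected = auc(scores, [1 - y for y in labels])  # swap case/control

# Toy imbalanced data: 9 cases, 1 control, classifier always predicts "case".
truth = [1] * 9 + [0]
always_case = [1] * 10
accuracy = sum(t == p for t, p in zip(truth, always_case)) / len(truth)
kappa = cohen_kappa(truth, always_case)
```

Swapping the categories turns an AUC below 0.5 into its complement above 0.5, and the always-case classifier reaches 90% accuracy while its kappa is zero.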

Although the authors picked datasets that specifically dealt with the disease they were interested in, there are a number of other datasets that were not included but could have easily been added. For example, we found 10 studies that included obesity data with their sequence data. The control samples used in the Schubert and Baxter studies from the Schloss lab could also be used to look at the effects of obesity.

The authors should also note that the samples used in the Zackular study are a subset of Baxter and so the studies are not independent.

Comments on overall writing style:

Overall we felt that the paper’s organization and purpose are unclear. For instance, in the abstract, “Here, we introduce the MicrobiomeHD database, which includes 29 published case-control gut microbiome studies spanning ten different diseases” - Is this paper about the database? This database is hardly mentioned in the rest of the paper and the reader needs more details about how the database was formed. Furthermore, the database does not appear to be comprehensive and there is no indication of whether the database will continue to be maintained over time. Alternatively, is the paper about the ‘core’ microbiome being able to predict healthy/disease? If so then it needs a direct test of this hypothesis. Is the paper about re-analyzing and confirming the findings of previous studies? Or about doing a true meta-analysis where the effect sizes are compared across studies? If either of those is true then the paper needs to be structured and concluded in a way that emphasizes those conclusions. The current version is a bit muddled in its structure and purpose.

It is oversimplifying the complex nature of the diseases to classify treatment therapies into antibiotics, probiotics or FMT. Each of these therapies can be affected by differences between human patients. At the very least, the discussion should include more caveats about these treatments, particularly because there are a few studies linking antibiotic use to the long-term development of colorectal cancer (Cao et al 2017) as well as studies showing that manipulating the gut microbiome in mice via fecal transplants can spur the development of CRC tumors in a mouse tumorigenesis model (Zackular et al 2015).


Cohesion: A method for quantifying the connectivity of microbial communities

Cristina Herren, Katherine McMahon

Review posted on 07th March 2017

The preprint from Herren and McMahon describes a new metric - cohesion - to describe the overall connectedness within a community using temporal data. I was excited to see this preprint because I am familiar with McMahon's long history of developing rich time series data for microbial communities in Wisconsin lakes. I also have a lot of my own time series data from humans and mice where we struggle to incorporate time into the analysis to understand the interactions between bacterial populations.

A significant struggle in analyzing time course community data is the ability to synthesize observations for large numbers of taxa over time. Many of the existing methods people use attempt to adapt methods from cross sectional studies. For example, a study may sample a large number of lakes, people, soils, etc and characterize their microbial communities. They'll then calculate correlations across those samples based on the relative abundance of the populations. Alternatively, they'll use presence/absence data to generate co-occurrence matrices. The problem with these studies is that the next step is often to infer something about the interactions between the populations - even if the populations would never possibly co-occur. Herren and McMahon's efforts to study the connectedness of individual populations and their cohesion is very welcome because it has the potential to get us closer to describing the actual interactions between populations.

To briefly summarize the approach, the method starts by calculating the Pearson correlation between all pairs of populations across time and then discounts the correlation that would be expected if all interactions were random. This is important because of the compositional nature of the data and the effects of different population sizes. Next, the method calculates the average positive and negative corrected correlation for each population. These become the positive and negative connectedness values for each population. Finally, the positive and negative cohesion values for each community are calculated by determining the sum of the product of the connectedness value and the relative abundance for each population.
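The steps in this summary can be sketched as follows. This is my own reading of the approach rather than the authors' code: the function names are hypothetical, the null-model correlations are assumed to be supplied by the caller (zeros if omitted), and the three-population time series is toy data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def connectedness(abund, null=None):
    """Return (positive, negative) connectedness for each population.

    abund[t][i] is the relative abundance of population i at time t.
    null[i][j] is the correlation expected under a null model; it is an
    assumed input here and defaults to zero when not provided.
    """
    n_pop = len(abund[0])
    series = [[row[i] for row in abund] for i in range(n_pop)]
    pos, neg = [], []
    for i in range(n_pop):
        corrected = [
            pearson(series[i], series[j]) - (null[i][j] if null else 0.0)
            for j in range(n_pop)
            if j != i
        ]
        positives = [c for c in corrected if c > 0]
        negatives = [c for c in corrected if c < 0]
        pos.append(sum(positives) / len(positives) if positives else 0.0)
        neg.append(sum(negatives) / len(negatives) if negatives else 0.0)
    return pos, neg


def cohesion(abund, pos, neg):
    """Return one (positive, negative) cohesion pair per time point."""
    return [
        (sum(a * c for a, c in zip(row, pos)), sum(a * c for a, c in zip(row, neg)))
        for row in abund
    ]


# Toy time series: populations 0 and 1 rise together while population 2 falls.
abund = [
    [0.20, 0.10, 0.70],
    [0.30, 0.15, 0.55],
    [0.40, 0.20, 0.40],
    [0.50, 0.25, 0.25],
]
pos, neg = connectedness(abund)
scores = cohesion(abund, pos, neg)
```

Because cohesion is a weighted sum over populations at a single time point, the sketch returns one pair of values per sample rather than a single value for the whole time series.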

The following are general critiques and questions, which I appreciate may be beyond the scope of the current manuscript (note, I am not a reviewer for the manuscript at a journal):

1. To develop the cohesion metric for a community, the authors sum over all of the populations in the community. This raised three questions for me. First, independent of the relative abundances in each sample, is the *number* of positive and negative connections for each population relevant? It might be worthwhile exploring which populations have more positive/negative connections than others. What does that distribution look like? Second, does the connectedness metric value itself have any value? What are the populations that are highly connected with other populations? Finally, the method generates a cohesion value for each time point. If I think of Lake Mendota as a community that was sampled over time, it would be interesting to know whether it has been more cohesive than Lake Monona over the 19 years of sampling. Thinking of my own work, I would be interested in knowing whether mice that are more susceptible to C. difficile colonization are less cohesive than those that are resistant. Again, this would require a composite score, not individual scores for each time point.

2. Continuing on my self-serving thread, I wonder how sensitive the method is to the time interval between samples and the number of samples. In my experiments I may have 20 daily samples from a mouse - is this sufficient? What if we miss a day - how will having a jump between points affect the metrics? As the authors state, the Lake Mendota dataset has 293 samples collected over 19 years (i.e. 1.3 samples/month). This is a unique dataset that is unlikely to be repeated elsewhere. What if we were to get more frequent samples? What if they were more spaced out? What if we only had a year's worth of data? It would be interesting to see the authors describe how their cohesion values change when they subset the dataset to simulate more realistic sampling schemes.

3. A significant challenge in developing these types of metrics is not knowing what the true value of the metric is in nature. I appreciate Herren and McMahon's effort to validate the metrics by comparing their results to count data and by explaining the variation in Bray-Curtis distances. The manuscript reads almost like they want their method to recapitulate what is seen with those distances. But we already have Bray-Curtis distances; if that's the goal, then why do we need the cohesion metric? It would be interesting to see the authors simulate data from communities with varying levels of cohesion and abundance to see that the method gets back the expected cohesion value. Perhaps it would be possible to generate an ODE-based model to generate the data instead of variance/covariance data. There is one simulation described at the end of the Results (L300); however, it is unclear whether the lack of a meaningful R-squared value was the expected result or not.

4. Throughout the manuscript, the authors make use of parametric statistics such as Pearson's correlation coefficients and the arithmetic mean. Given that relative abundance data are generally not normally distributed and are likely zero-inflated, I wonder why the authors made these choices. I would encourage the authors to instead use Spearman correlation coefficients and median values. Related to this point, a concern with using these correlation coefficients is the problem of double zeros where two populations may be absent from the same communities. These will appear to be more correlated with each other than they really are, which is why we don't use these metrics for community comparison - we use things like Bray-Curtis. I wonder whether subtracting the null model counteracts the problem of double zeroes.

5. The authors translate their count data into relative abundance data before calculating their correlation and Bray-Curtis values. I wonder if the authors subsampled or rarefied their data to a common number of individuals. Both of these metrics are sensitive to uneven sampling. Even if the counts are converted to relative abundances, this would not remove the effects. For example, if one sample has 1000 individuals and another has 100, the limit of detection on the first (0.1%) would be 10-fold lower than on the second (1%). There may be populations that represent 0.5% of both communities that would not be seen in the second. If they haven't already, I would encourage the authors to subsample their dataset to a common number of individuals.

6. The "Description of datasets" section of the Methods describes the various datasets in general terms, but what is the nature of the data? How were the phytoplankton counted? How many individuals were sampled from each sample?

7. It would be great to have the code that was used made publicly available on GitHub.

8. The authors present the material in a format that I have not previously seen in the microbial ecology literature (i.e. ISMEJ where this appears to be destined for review). The authors flip back and forth between presenting a different stage of the algorithm and validating that step. I think this is a bit more aligned with how one would present the material in a talk than in a paper. I've seen similar methods development described before where there might be a methods section on algorithm development and then the results section would test the assumptions and performance of the algorithm. I'm curious to see whether this structure persists through the editorial process.
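To make the subsampling suggestion in point 5 concrete, here is a minimal sketch of rarefying two samples to a common depth, assuming per-taxon integer counts. The taxon names, counts, and seed are hypothetical and are not drawn from the authors' datasets.

```python
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Subsample a sample's counts without replacement to `depth` individuals."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of individuals in the sample")
    # Expand counts into one entry per individual, then draw without replacement.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(pool, depth))


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical samples: taxon_c makes up 0.5% of the deeper sample and
# could not be represented at that proportion in a sample of only 100.
deep = {"taxon_a": 900, "taxon_b": 95, "taxon_c": 5}   # 1000 individuals
shallow = {"taxon_a": 90, "taxon_b": 10}               # 100 individuals

common_depth = min(sum(deep.values()), sum(shallow.values()))
deep_rarefied = rarefy(deep, common_depth, rng)
shallow_rarefied = rarefy(shallow, common_depth, rng)
```

After this step both samples have the same number of individuals, so they share the same limit of detection before relative abundances, correlations, or Bray-Curtis distances are calculated.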


16S rRNA Gene Sequencing as a Clinical Diagnostic Aid for Gastrointestinal-related Conditions

Daniel E Almonacid, Laurens Kraal, Francisco J Ossandon, Yelena V Budovskaya, Juan Pablo Cardenas, Jessica Richman, Zachary S Apte

Review posted on 06th December 2016

To be clear, I was not asked to review this manuscript by a journal and have no connection to uBiome. This review has been cross posted at and makes reference to the version of the preprint posted on October 31, 2016.

Almonacid and colleagues describe the use of 16S rRNA gene sequencing as a clinical diagnostic tool for detecting the presence of bacteria and archaea commonly associated with fecal samples in health and disease. On the whole, the method is not novel in that many people have been doing 16S rRNA gene sequencing of samples for many years now. The potential novelty of the manuscript is that it attempts to place the value of this technology in a clinical diagnostic rather than exploratory setting. The potential impact of this paper is reduced because it is more of a proof of concept than a comparative demonstration relative to other methods. Overall, the methods are poorly described and there are a number of overly generalized claims that are not supported by the literature or their data. The most glaring problem is that the authors assume that the presence of a V4 sequence that is identical to that of a pathogen is proof of the presence of the organism.

Major comments

1. L16-18, 43-51. I'm curious whether the authors actually have citations to back up the primacy of manual culture-based methods in clinical diagnostic laboratories or their limitations. My understanding is that much of clinical diagnostics is highly automated and while it may use some amount of cultivation, the actual analyses are quite modern. The authors at least need to recognize the high levels of automation and use of qPCR, ELISA, and mass spectrometry-based approaches in most diagnostic labs. In fact, the authors later use one of these methods, Luminex's xTAG Gastrointestinal Pathogen Panel, to help develop the panel of organisms used in their own method. The authors' new method may be novel, but they should portray its novelty using a relatively modern comparison rather than a straw man. The manuscript would be considerably strengthened by comparing the Luminex method (or any other method) to the current method.

2. The authors have tested whether they are able to distinguish distantly related pathogens, but have not done due diligence in determining whether the approach can distinguish pathogenic and non-pathogenic organisms. As an example, they state that "the pathogen Peptoclostridium difficile is found in ~2% of the healthy cohort which shows that asymptomatic P. difficile colonization is not uncommon in healthy individuals (L211)." This statement is emblematic of a number of problems with the authors' analysis. First, the presence of P. difficile/C. difficile does not mean that it is in fact a pathogen, as there are many non-toxigenic and, thus, non-pathogenic strains of this organism - the V4 region is simply not a virulence factor. Second, there is already a toxin-based assay for toxin-producing strains that is likely more sensitive and specific than this sequence-based approach and much cheaper for this and other pathogens. Third, the V4 region is only about 250 nt in length. There is always the risk that closely related, but different organisms may have the same sequence and that the same organism may generate different sequences because there is intra-genomic variation. When I used blastn to compare the region of the P. difficile sequence in Table S2 that would be amplified by their primers to NCBI's reference 16S rRNA gene sequences, it returned two additional P. difficile strains (JCM 1296 and ATCC 9689) that are identical to each other but 1 nt different than the sequence in Table S2. It is interesting that none of the sequences in the NCBI reference were an exact match as required by the current method. When I performed a similar analysis using the authors' E. coli/Shigella sequence, it matched multiple Escherichia and Shigella strains, most of which were not pathogenic. Based on all of this, I am not sure how much utility a clinical diagnostic laboratory would gain from using this method over others. None of these points are considered in the authors' discussion.

3. The authors lay out a "healthy reference range" for each of their 28 targets (L199-210). I worry about such a claim, when really the authors are likely only defining an operational healthy range so that they can optimize the sensitivity and specificity of pathogen detection. Claiming a healthy range as they have assumes that the subjects are truly healthy (there is no indication of whether the subjects were honest in self-reporting) and that the microbial communities did not change between collection and analysis. To this second point, the Methods are poorly described and validated. Specifically, I am unclear what "specifications" were laid out by the NIH Human Microbiome Project that would be relevant for this method (L100-102). Furthermore, what is the composition of the lysis and stabilization buffer that allows samples to be stored at ambient temperatures? The authors need to either provide data or a reference to support this claim including evidence that the community composition does not change. All this is necessary to report for others hoping to repeat the authors' work and for improving the clarity of the writing.

4. I am impressed by the authors' ability to quantify the relative abundance of these strains using PCR and sequencing. This runs a bit counter to the prevailing wisdom that there are PCR biases at work that would skew the representation of taxa such that the final proportions are not representative of the initial proportions. I'm a bit confused by the description of the experiment. Namely, what was the diluent DNA that is mentioned in the Methods (L142)? Although the quantitative results are impressive, I am a bit concerned that the authors used DNA fragments that overlap the V4 region of the 16S rRNA gene rather than genomic DNA.

5. Similar to the previously described concerns regarding the methods description, the list of accessions in the curated database that is described should be made publicly available since this is a critical component to the method (L171-185). More details are needed that describe how this database was created. The manuscript states "After optimizing the confusion matrices for all preliminary targets...", but it is unclear what "optimizing" means and what was altered to generate better performance. Furthermore, I am curious whether uBiome paid for a license to use the SILVA reference. Unlike many other references, this is not a database that is free for non-academic usage. Considering they are a for-profit company and are likely to commercialize this, they may want to consider a database that is more public. That being said, I don't know why the authors would need to use the SILVA reference since they are not making use of the alignment, taxonomy, or metadata features contained within the database.
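The concern in major comment 2 about requiring exact matches can be illustrated with a short sketch. The 10 nt sequences and strain labels below are invented stand-ins, not real V4 sequences, and the matching function is a hypothetical simplification rather than the authors' pipeline.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def matches(query, references, max_mismatches=0):
    """Names of reference sequences within `max_mismatches` of the query."""
    return sorted(
        name
        for name, seq in references.items()
        if len(seq) == len(query) and hamming(query, seq) <= max_mismatches
    )


# Invented toy references: two strains share an identical marker region,
# and a third strain of the same species differs by a single nucleotide.
references = {
    "toxigenic_strain": "ACGTACGTAC",
    "nontoxigenic_strain": "ACGTACGTAC",
    "variant_strain": "ACGTACGTAT",
}
query = "ACGTACGTAC"

exact = matches(query, references)                    # exact matches only
near = matches(query, references, max_mismatches=1)   # allow 1 nt difference
```

In this toy case an exact match cannot tell the toxigenic strain from the non-toxigenic one, and it misses the variant strain entirely, which mirrors the two failure modes described in that comment.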

Minor comments:

6. L78-80. "Regularly evaluating the microbiome to monitor overall health is therefore gaining traction in contemporary medicine and needs to be part of modern diagnostics."

7. L102-109 include no citations. Although these may be "standard protocols", specific protocols should still be cited as there are no standards and to give credit to those that developed the protocols.

8. L112-125. The authors present a method for denoising and building contigs from their sequence data that uses Swarm. As far as I know, this approach to denoising the data is novel and has not been validated in this paper or others. Alas, I'm not sure why they bothered with the Swarm clustering since they take the contigs and map them against the SILVA reference database for exact matches. The justification for these two steps is not clear and needs to be clarified.

9. L154. "Two out of 35 control samples did not pass our sequencing quality thresholds". If I am right in assuming that this is the previously mentioned 10,000 sequence threshold (L129), then the authors should be specific in stating that here. If there are other thresholds, then those should be stated at some point in the manuscript.

10. "dysbiosis" is used throughout the manuscript. This is a trendy piece of jargon that is pretty meaningless. Furthermore, their method does not really address the whole community, which is usually done when describing a dysbiotic state. This manuscript describes the quantification of single strains.

11. I do not believe that Peptoclostridium difficile is a valid name for Clostridium difficile. At this point, it appears that the most recent valid name is Clostridioides difficile.
