Review for "TaxAss: Leveraging Custom Databases Achieves Fine-Scale Taxonomic Resolution"

Completed on 9 Nov 2017 by Patrick Schloss. Sourced from https://www.biorxiv.org/content/early/2017/11/05/214288.



Comments to author

Author responses follow the review below.

The preprint by Robin Rohwer and colleagues seeks to develop a workflow that complements existing methods for classifying 16S rRNA gene sequences, achieving greater precision than the Wang naive Bayesian classifier alone. This is an issue that many people have raised with me. A lack of classification for a sequence can be blamed on inadequacies of the taxa represented in the database, a lack of taxonomic data (e.g. at the species level) within the database, and the selection of the region within the 16S rRNA gene to classify. This paper seems primarily concerned with the first problem, by supplementing the database with ecosystem-specific sequences, and touches on the second problem by adding finer taxonomic information for the ecosystem-specific sequences. I felt like the authors were a bit conflicted over what they wanted this manuscript to be. Is it a description/announcement of a new method, TaxAss? Is it a validation study? Is it a benchmarking study? Overall, it is a description of a new method that is being used by the authors and others. However, I feel it needs some help to improve the description, as there are points in the manuscript that are not clear. Furthermore, I felt that the validation and benchmarking could use some help to quantify the need for the method and to demonstrate that the method overcomes that problem.

General comments...

1. As described in Figure 1, sequences that fall below an empirically determined threshold when compared to an ecosystem-specific database are classified using a comprehensive database, and those that are above the threshold are classified against the ecosystem-specific database. Perhaps because I'm familiar with people using blastn to classify sequences, it was not clear to me as I read the manuscript whether the sequences in the two arms were then classified using the Wang method or blastn. Reading through the source code, it looks like blastn is only used to split the dataset and, once split, the data are classified using the Wang method. Perhaps this could be clarified in the text.

2. It is not clear how the FreshTrain database was developed or how it is curated to add finer taxonomic names to the sequences. The authors have done this for readers who are interested in freshwater bacteria, but what steps should someone interested in gut microbiota take to recreate the database to classify their data? More importantly, how did the authors decide on their taxonomic levels of lineage, clade, and tribe? Why not follow the phylogenetic approach used by the Greengenes developers for defining family, genus, and species for environmental sequences?

3. I am also not clear why the authors did not want to pool FreshTrain with one of the comprehensive databases. A simple `cat` command would pool the two files, producing a file that could then be used as a single reference. The downside of this would be that they would need to add the same level of taxonomic detail that is in the FreshTrain database to the Greengenes database. Also, a downside of the Greengenes database is that the core reference appears to be mothballed going forward, while RDP and SILVA are still actively being developed.

4. One motivation that the authors state for the method is the issue of "forcing". I would call these "false positives", but I get their point. The authors raise this issue numerous times. Yet I was unable to find a citation that quantifies forcing, and the authors do not appear to measure the amount of forcing in their data. Perhaps this is what they were getting at in Figure 3? If that is the case, then I am a bit troubled because they are accepting the FreshTrain data as the ground truth, when it has not been validated yet. I could also imagine that even with FreshTrain, there might be forcing if a taxonomic name is set for the full-length sequence, but two variable-region sequences are identical even though their parent sequences have different taxonomies. More importantly, the source code indicates that the authors are using any confidence score without applying a filter. The suggested confidence score is 80%, not 0%. I don't think that the problem with classifications from the Wang method is forcing; rather, it's that the classifications don't go deep enough. Something may classify as a Bacillus with 20% confidence, and so researchers should work their way up the taxonomy until the classification is above 80%, which might be Firmicutes. In offline conversations with the authors, they reassured me that they are applying an 80% threshold in separate scripts. It would probably be worth adding that they are using 80% as a threshold in the Methods section.

5. Related to this point, at L122 the authors state that "In a large database an OTU dissimilar to any reference sequences will not be classified repeatably as any one taxon, resulting in a low bootstrap confidence." This is correct, but is a bit misleading. I would suggest saying "...repeatedly as any one genus, resulting in a low bootstrap confidence and reclassification at a higher taxonomic level where there is sufficient bootstrap confidence". I am concerned that the results and the discussion of forcing are based on not using a confidence threshold rather than the default 80% threshold.

6. To measure forcing, I would like to see the authors run the Greengenes and FreshTrain databases back through the classifier using a leave-one-out testing procedure and quantify how many times the incorrect classification is given, when using the 80% (or even the 0%) threshold. Again, I suspect the results would indicate that the problem isn't one of forcing, but of "holding back". To be clear, this isn't necessarily a problem with the Wang method, but with the databases. Addressing this point is where I think the authors could really do the field a service. It would be a really helpful contribution to show the percentage of forcing (false positives) and holding back (false negatives?) in a leave-one-out scheme and on a real dataset when classifying with (1) each of the comprehensive databases, (2) TaxAss with the comprehensive databases and FreshTrain, and (3) the comprehensive databases merged with FreshTrain and run through the Wang classifier.

7. I am not sure what the authors mean by "maintaining richness" as they use it in the manuscript. Could the problem they are trying to address be described better? Also, I would ask whether they know what the *true* richness is and, if not, why they think that one value of richness is better than another. Perhaps this corresponds to what I might call "underclassification" or "false negatives".

L25 - why not include the RDP reference database in this list?

L49 - "Course" should be "Coarse"



Here are my responses to Pat Schloss's thoughtful review:

1. You are correct: "blastn is only used to split the dataset and once split, the data are classified using the Wang method." We avoid classifying with BLAST because it doesn't take into account the phylogenetic structure of the reference database, which is why the Wang classifier is preferred for classification. We use BLAST to split the dataset because the Wang classifier's algorithm can cause misclassifications if the database is really small and you try to classify a sequence with no similar references (what we called "forcing" in the manuscript). We agree this distinction should be added more clearly to the Fig. 1 description!
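To make that division of labor concrete, here is a minimal sketch of the splitting step, with assumed file names and a placeholder cutoff rather than the actual TaxAss scripts: blastn supplies percent identities, the OTU IDs are split at the cutoff, and each subset is then classified with mothur's Wang implementation against the appropriate database.

```r
# Minimal sketch of the splitting idea (assumed file names, not the TaxAss scripts).
# Assumes blastn was already run against the ecosystem-specific references, e.g.:
#   makeblastdb -in FreshTrain.fasta -dbtype nucl
#   blastn -query otus.fasta -db FreshTrain.fasta -out otus.blast -outfmt 6

blast_cols <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                "qstart", "qend", "sstart", "send", "evalue", "bitscore")
hits <- read.table("otus.blast", col.names = blast_cols, stringsAsFactors = FALSE)

# Best percent identity of each OTU to the ecosystem-specific database.
# (TaxAss recalculates percent identity over the full read length; raw blastn
# pident is used here only to keep the sketch short.)
best <- aggregate(pident ~ qseqid, data = hits, FUN = max)

cutoff <- 98  # placeholder; the paper chooses this empirically per dataset
above  <- best$qseqid[best$pident >= cutoff]
below  <- setdiff(best$qseqid, above)  # OTUs with no blastn hit at all also belong here

writeLines(above, "ids.above_cutoff.txt")
writeLines(below, "ids.below_cutoff.txt")

# Each ID list is used to subset the fasta, and both subsets are then classified
# with the Wang method in mothur, for example:
#   classify.seqs(fasta=otus.above.fasta, template=FreshTrain.fasta, taxonomy=FreshTrain.taxonomy, method=wang)
#   classify.seqs(fasta=otus.below.fasta, template=greengenes.fasta, taxonomy=greengenes.taxonomy, method=wang)
```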

2. We didn't delve into the creation of custom databases in this paper because we felt it was outside the scope: TaxAss is only a tool to use them, not to make them. To learn more about how the FreshTrain was created, we suggest reading the Newton et al. 2011 citation (ref 12). We also cite papers about creating other ecosystem-specific databases, some of which also have really detailed methods descriptions. See the text starting on lines 105 and 295 to find these references.

3. "I am also not clear why the authors did not want to pool FreshTrain with one of the comprehensive databases. A simple `cat` command would pool the two files producing a file that could then be used as a single reference."
This was also my first thought when faced with trying to use the FreshTrain for classification! But it wouldn't work to simply concatenate the two databases, because some reference sequences would end up duplicated (corresponding to the yellow bars in Figure 2). For example, in freshwater the major bacterial lineage acI has a detailed phylogeny in the FreshTrain that includes multiple clades and tribes. Reference sequences for acI also exist in Greengenes, but there they're called ACK-M1 and don't have detailed genera or species under them. If the FreshTrain and Greengenes were concatenated, the Wang classifier would find two equally good references with different names. Then (because it is a stochastic algorithm that bootstraps its classification many times) it would place an acI OTU 50% of the time into "acI" and 50% of the time into "ACK-M1." That would result in a bootstrap confidence of ~50% for each, and the OTU would be called "unclassified." We'll add this to the discussion section about Current FreshTrain Usage.
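As a toy numerical illustration of that 50/50 argument (this is not the Wang algorithm itself, which resamples k-mers of the query sequence; the names are just examples), the arithmetic looks like this:

```r
# Toy illustration only: if each of 100 bootstrap replicates is equally likely to
# match either of two duplicate references with different names, the support for
# each name lands near 50%, well below a typical 80% confidence cutoff.
set.seed(1)
winners <- sample(c("acI", "ACK-M1"), size = 100, replace = TRUE)
table(winners) / 100  # roughly 0.5 each -> the OTU would be reported as unclassified
```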

"Also, a downside of the Greengenes database is that the core reference appears to be moth balled going forward while RDP and SILVA are still actively being developed."
Exactly! This is why TaxAss is great: it gives you the flexibility to pair your existing custom database with the most current version of whichever comprehensive database you prefer. It would take a lot of effort to incorporate the FreshTrain's phylogenetic structure into SILVA, and that effort would need to be repeated every time SILVA or the FreshTrain is updated. That type of ARB work is beyond the scope of many 16S analyses, so TaxAss is a way for everyone to use the most up-to-date version. In regard to Greengenes being "mothballed," we agree that it doesn't look like it will be updated going forward. But we used it for this paper because it is still quite comprehensive and it includes classifications to the species level (SILVA only goes to genus), which allowed for a better comparison since we were placing a lot of emphasis on improved fine-level classifications.

4. "One motivation that the authors state for the method is the issue of 'forcing'. I would call these 'false positives', but I get their point."
I like this suggestion for vocabulary and it seems like forcing is confusing to a lot of people. I'm also considering just calling it misclassification.

"The authors raise this issue numerous times. Yet I was unable to find a citation that quantifies forcing and the authors do not appear to measure the amount of forcing in their data. Perhaps this is what they were getting at in Figure 3? If that is the case, then I am a bit troubled because they are accepting the FreshTrain data as the ground truth, when it has not been validated yet."
You're right that this is what we're trying to get at in Figure 3 (specifically, the red bars over named organisms). To clarify, the ground truth in Figure 3 is TaxAss (FreshTrain + Greengenes), and the forcing is observed through a FreshTrain-only classification. I don't know a way to quantify the exact amount of forcing/false positives because it is impossible to know the "truth," and it would differ with different datasets and different databases. However, we can see with some sequences that it is happening to an extent that would negatively impact our interpretations, so I would argue that is evidence enough that it's a problem.

How we know the forcing/false-positive problem is real: we know that the FreshTrain database is limited in scope and does not contain all organisms in a dataset, and some organisms it doesn't contain have good references in Greengenes, so we know what they are. So when we classify organisms we know are not contained in the FreshTrain using the FreshTrain alone, and we see some of them end up with a high-confidence FreshTrain classification, we know it's a problem! We can't say for sure exactly *how big* a problem it is, but we can look at its practical impact on the datasets we tested. In Fig. 3 and Supp Fig. 3 we define organisms not included in the FreshTrain as those below the BLAST percent identity cutoff. When those are classified as part of a FreshTrain-only classification, we see that most of them end up "unclassified," but that some end up with classifications that we know to be incorrect (forcing). The misclassified sequences seem like a small problem at the phylum level, but at finer taxonomic levels we see that they will negatively impact our analysis. Just based on the rank abundance curves, you can see that incorrect sequences are being lumped in with the major lineages, adding error to our analyses. We are not the first to notice this problem: Woodhouse et al. (Ref. #23) also observed cyanobacteria being forced/misclassified when trying to use the FreshTrain pre-TaxAss.
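For readers who want to see roughly how this check could look in their own data, here is a sketch; the file names, column layout, rank index, and cutoffs are assumptions for illustration, not the manuscript's actual plotting scripts:

```r
# Sketch of flagging candidate forcing/false positives (assumed inputs, not the
# TaxAss scripts): OTUs whose best percent identity to the FreshTrain is below the
# cutoff, yet which received a named, high-bootstrap assignment in a FreshTrain-only run.
best <- read.table("otus.best_pident.tsv", header = TRUE,
                   stringsAsFactors = FALSE)                 # assumed columns: qseqid, pident
fw   <- readLines("otus.freshtrain_only.wang.taxonomy")      # mothur-style taxonomy lines

ids  <- sub("\t.*", "", fw)
tax  <- sub("^[^\t]*\t", "", fw)
rank <- sapply(strsplit(tax, ";"), function(x) x[5])  # assumed index of the "lineage" rank
name <- sub("\\(.*", "", rank)
boot <- as.numeric(sub(".*\\(([0-9]+)\\).*", "\\1", rank))

below  <- ids %in% best$qseqid[best$pident < 98]             # 98 = placeholder identity cutoff
forced <- ids[which(below & name != "unclassified" & boot >= 80)]
length(forced)  # below-cutoff OTUs that still got a confident FreshTrain lineage
```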

"I could also imagine that even with FreshTrain, there might be forcing if a taxonomic name is set for the full length sequence, but two variable region sequences are identical even though their parent sequences have different taxonomies."
Actually, that scenario wouldn't cause forcing; it would cause false negatives/underclassification. It would effectively be like having duplicate reference sequences, as in your comment #3, where the Wang classifier splits 50/50 between the two names and the OTU ends up unclassified.

"More importantly, the source code indicates that the authors are using any confidence score with out applying a filter. The suggested confidence score is 80%, not 0%. ... In offline conversations with the authors, they reassured me that they are applying an 80% threshold in separate scripts. It would probably be worth adding that they are using 80% as a threshold in the Methods section."
We mention the bootstrap cutoff in passing at line 585, but I agree that it would be clearer if we added a bit more detail to the "How to use TaxAss" section. We left a lot of these details out because they're included in the step-by-step directions on the GitHub website (https://htmlpreview.github....), but it sounds like we should add another paragraph to the Methods summarizing these directions.

5. "Related to this point, at L122 the authors state that 'In a large database an OTU dissimilar to any reference sequences will not be classified repeatably as any one taxon, resulting in a low bootstrap confidence.' This is correct, but is a bit misleading. I would suggest saying '...repeatedly as any one genus, resulting in a low bootstrap confidence and reclassification at a higher taxonomic level where there is sufficient bootstrap confidence'."
I'm not sure I understand your point, but I think it's that most bacteria are classified at least to the phylum level by Greengenes, so it's misleading to say they are unclassified. However, many phyla should be unclassified at the phylum level in the FreshTrain database because their references don't exist there at all. What I was trying to say is that if the same thing happened hypothetically in Greengenes (i.e. the phylum-level reference for an OTU is missing because the OTU is highly novel), the Wang classifier would still behave well and call it unclassified, but forcing becomes an issue when the database is smaller. I'm not sure how to make that clearer; suggestions welcome!

"I am concerned that the results and the discussion of forcing are based on not using a confidence threshold rather than the default 80% threshold."
Pat is so worried because we applied the cutoff in an R script, not in the mothur call. But don't worry! The 80% bootstrap confidence threshold was applied prior to all the analyses, just not in the mothur command. This way we were able to change it easily without re-running the classify.seqs command, which was helpful for comparing the bootstrap cutoff's impact during development. Details (a minimal sketch of the cutoff idea follows this list):

- The R script that applies the bootstrap cutoff is find_classification_disagreements.R . The easiest way to see when it's called is using the step-by-step directions:
https://htmlpreview.github.... .

- It's first used in step 13 to define what counts as a "conflict" among coarse-level assignments (i.e. OTUs where both the Greengenes and FreshTrain assignments were above the bootstrap cutoff but were given different names; since the FreshTrain doesn't differ from Greengenes above the family level, we consider these "forcing" and trust Greengenes more).

- In step 13, an intermediate file of bootstrap values is also exported, which is used to apply the bootstrap cutoff in step 14 (plot_classification_disagreements.R) when it compares the percent of reads classified. Step 14 is where Fig 4 is created.

- The bootstrap cutoff is again applied in step 15 when the final taxonomy file is created (also within find_classification_disagreements.R).

- Step 15.5.a is where Fig 2 is created, and its script plot_classification_improvement.R applies the bootstrap cutoff by incorporating an intermediate file of final.taxonomy.pvalues that is created in step 15.

- Step 15.5.b is where Fig 3 is created; it reuses find_classification_disagreements.R and plot_classification_disagreements.R to apply the bootstrap cutoff.

- The function that implements the cutoff within find_classification_disagreements.R is called do.bootstrap.cutoff(), and is defined on line 301 of the script.
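For readers who just want the gist of what that cutoff does, here is a minimal sketch of the idea applied to a single mothur-style taxonomy string. It is not the code of do.bootstrap.cutoff(), and the example lineage names are only illustrative:

```r
# Minimal sketch of an 80% bootstrap cutoff (not the actual do.bootstrap.cutoff()):
# keep a rank only if it and every coarser rank meet the cutoff, and rename the
# remaining ranks "unclassified".
apply_cutoff <- function(tax_string, cutoff = 80) {
  ranks      <- strsplit(tax_string, ";")[[1]]
  boots      <- as.numeric(sub(".*\\(([0-9]+)\\).*", "\\1", ranks))
  names_only <- sub("\\(.*", "", ranks)
  keep       <- cumprod(boots >= cutoff) == 1
  names_only[!keep] <- "unclassified"
  paste(names_only, collapse = ";")
}

apply_cutoff("Bacteria(100);Actinobacteria(100);Actinobacteria(99);Actinomycetales(85);acI(62);acI-B(55);acI-B2(40)")
# "Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;unclassified;unclassified;unclassified"
```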

6. "To measure forcing, I would like to see the authors run the Greengenes and FreshTrain databases back through the classifier using a leave-one-out testing procedure and quantify how many times the incorrect classification is given, when using the 80% (or even the 0%) threshold. Again, I suspect the results would indicate that the problem isn't one of forcing, but of "holding back". ..."
We do see "holding back" as a major problem with using the FreshTrain alone- most sequences not in the FreshTrain database ended up unclassified. However, if that were the only problem you could just do a sequential classification instead of TaxAss- i.e. classify everything with the FreshTrain, then take all the unclassifieds and classify them with Greengenes. That's why we stress forcing so much- it's why TaxAss is necessary.

"It would be a really helpful contribution to show the percentage of forcing (false positives) and holding back (false negatives?) in a leave-one-out scheme and on a real dataset when classifying with (1) each of the comprehensive databases, (2) using TaxAss with the comprehensive databases and FreshTrain, (3) merging the comprehensive databases with FreshTrain and running them through the Wang classifier."
Leave-one-out schemes are often used to test classification algorithms, but I'm not convinced they would actually add that much to our understanding here. That's because the quantitative amount of forcing would be different with different datasets and databases, so very careful quantification of one example wouldn't transfer well to other work. I think knowing qualitatively that it's a problem at a level that impacts analysis is reason enough, and we can see that clearly in Fig. 3.

7. "I am not sure what the authors mean by "maintaining richness" as they use it in the manuscript. Could the problem they are trying to address be described better? Also, I would ask whether they know what the *true* richness is and if not, why they think that one value of richness is better than another. Perhaps this corresponds to what I might call "underclassifciation" or "false negatives".
Yes, when we talk about maintaining richness we mean avoiding the underclassification (false negatives) of sequences not in the FreshTrain. We don't know the total "true" richness, because that's impossible to know without a perfect *true* taxonomy database. But we do know that when organisms with known references in Greengenes end up unclassified in a FreshTrain-only classification, that is an incorrect loss of richness. TaxAss avoids this by classifying those sequences with Greengenes and keeping their known classifications.

It seems like Figure 3 was not explained well because the two confusing ideas here are:
1. forcing/misclassifications/false positives (red bars in Fig 3) and
2. lost richness/underclassification/false negatives (blue bars and red "unclassified" bars in Fig 3).
I welcome any more ideas on how to explain this better!

L25 - "why not include the RDP reference database in this list?"
I did include RDP in the intro on line 103, but I didn't include it in the abstract because I was trying to be more concise and I didn't realize it was still actually being used. Isn't it much smaller than both SILVA and Greengenes now?

L49 - "Course" should be "Coarse"
ahhh I'm so embarrassed!!

Thanks again to Pat Schloss for sharing his thoughts and starting this discussion! I encourage other readers and users of TaxAss to weigh in with their thoughts, critiques, and questions too.