Review for "INC-Seq: Accurate single molecule reads using nanopore sequencing"

Completed on 29 Apr 2016 by Jean-Marc Aury.

License: http://creativecommons.org/licenses/by/4.0/



Comments to author

Author response is in blue.

The manuscript "INC-Seq: Accurate single molecule reads using nanopore sequencing" by Li et al. describes a methodology for obtaining accurate DNA reads using the Oxford Nanopore sequencing technology. The INC-Seq strategy differs from existing methods, which are bioinformatics-only. Here, the authors combine a molecular approach with bioinformatics software to generate accurate reads. The application is restricted to 16S bacterial rRNA profiling, and is well adapted to dealing with highly similar 16S sequences. The INC-Seq software is available online and is easy to install and use.

My main concerns about the manuscript are the weak description of the datasets and a validation protocol that is not sufficiently rigorous.

Major Compulsory Revisions:

1) Clearly, having long and accurate reads is ideal for rRNA classification. The INC-Seq method is able to produce this kind of data, but at the expense of coverage, which is the key to good sensitivity. I think this aspect has to be discussed in more depth in the article. The authors state that INC-Seq enables identification of species at 0.1% abundance; however, is the detection robust at this abundance? Table 2 suggests the opposite, as detection at this level is confounded with false positives.

2) There are several tools dedicated to the error correction of long and noisy reads. It would be of interest to know how INC-Seq performs compared to, for example, Nanocorrect (github.com/jts/nanocorrect) and Canu (github.com/marbl/canu). Even if Nanocorrect is deprecated, it is based on the same idea (pbdagcon). A comparison with "pure bioinformatics tools" could underline the importance of the library construction step.

3) On first reading, it is not easy to tell how many datasets have been used, and whether they were simulated or sequenced. I recommend that the authors name their datasets and add a section describing how each dataset was generated and for which purpose. Furthermore, I think the authors should give complete metrics for each dataset (number of reads, number of bases, N50 read size, average read length …) in the main text.

4) The choice of PBSIM to generate ONT reads is questionable. I don't know how PBSIM models sequencing errors and bias, but ONT and PacBio reads clearly have different error patterns. ONT reads contain non-random deletions, and PacBio reads contain internal stretches that are essentially junk. NanoSim (github.com/bcgsc/NanoSim) is dedicated to ONT, and should handle the specific error pattern better.

5) Checking the error rate of the INC-Seq reads using a simulated dataset is not suitable. The final error rate depends strongly on the error pattern, and if the authors use an incorrect error pattern, then the computed error rate will also be incorrect. For example, ONT reads contain systematic errors in homopolymers, and in these cases, even with high coverage, errors will remain in the INC-Seq reads. The error rate should be evaluated using real data, as is done in section 3.4 and Figure 3. I think the authors should remove section 3.2 from the main text, or merge it with sections 3.4 and 3.5.

6) The comparison of raw ONT and INC-Seq performance when applied to 16S rRNA classification is not fair. Indeed, the simulation does not take into account chimeras produced during INC-Seq library preparation, or the systematic errors that affect consensus quality and, as a consequence, the rRNA classification.

7) The authors state that chimeras obtained from inter-molecular ligation lead to reads that are longer than expected. That is true in the majority of cases, but if the ligated rRNA molecules are nearly identical, ligation can lead to a consensus of the expected length. In these cases, the consensus will be a mixture of the rRNAs. The authors should discuss this issue and estimate the "divergence threshold" above which INC-Seq reads become longer than expected.

8) What is the difference between Table 2 and Figure 4B? Why are both replicates not present in Table 2? Table 2 suggests a high variability in relative abundances. Is the Pearson coefficient well adapted here? In section 3.5, the authors merged the relative abundances of both replicates. Does this suggest that sequencing replicates is necessary when using the INC-Seq method?
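
To make the sensitivity question in point 1 concrete, the following is a minimal sketch of how the probability of detecting a species depends on the number of reads. It assumes that reads are sampled independently from the community (a simple binomial model); the function name, the read counts and the `min_reads` threshold are hypothetical illustration values, not figures taken from the manuscript.

```python
from math import comb


def detection_probability(n_reads: int, abundance: float, min_reads: int = 1) -> float:
    """Probability of seeing at least `min_reads` reads from a species
    present at relative `abundance`, under independent (binomial) sampling."""
    # Probability of observing fewer than `min_reads` reads from the species.
    p_fewer = sum(
        comb(n_reads, k) * abundance**k * (1 - abundance) ** (n_reads - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Hypothetical read counts; 0.001 corresponds to 0.1% relative abundance.
    for n_reads in (500, 1000, 5000):
        p = detection_probability(n_reads, abundance=0.001, min_reads=3)
        print(f"{n_reads} reads: P(at least 3 reads) = {p:.3f}")
```

Under this model the detection probability at a fixed abundance is governed by the number of reads, which is why the loss of coverage deserves discussion. The model ignores false positives, so it gives an optimistic upper bound on robust detection.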

Minor Essential Revisions:

1) Figure 2: please use species names instead of accession identifiers.

2) I think the authors should present complete metrics for raw ONT and INC-Seq reads (number of reads, number of bases, N50 read size, average read length …) in the main text.

3) Figure 1B is cited before Figure 1A in the main text.

4) The authors should remove "long" in the abstract: "…as a strategy for obtaining long and accurate nanopore reads…". INC-Seq allows obtaining accurate nanopore reads, but not long nanopore reads.
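
The read metrics requested in major point 3 and minor point 2 can be computed from a list of read lengths. The sketch below is one way to do this, assuming the usual definition of N50 (the length L such that reads of length at least L contain at least half of all bases); the function name and the example lengths are hypothetical.

```python
def read_length_metrics(lengths):
    """Return number of reads, number of bases, mean read length and N50
    for a non-empty list of read lengths."""
    if not lengths:
        raise ValueError("at least one read length is required")
    total_bases = sum(lengths)
    # Walk from the longest read down until half of all bases are covered.
    running = 0
    n50 = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total_bases:
            n50 = length
            break
    return {
        "reads": len(lengths),
        "bases": total_bases,
        "mean_length": total_bases / len(lengths),
        "n50": n50,
    }


if __name__ == "__main__":
    # Hypothetical read lengths, for illustration only.
    print(read_length_metrics([2, 3, 4, 5, 6]))
```

Reporting these values side by side for each named dataset, for both raw ONT and INC-Seq reads, would make the trade-off between accuracy and yield easy to read.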

Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? If not, please specify what is required in your comments to the authors.

Yes.

Are the conclusions adequately supported by the data shown? If not, please explain in your comments to the authors.

Yes.

Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting? If not, please specify what is required in your comments to the authors.

Yes.

Are you able to assess any statistics in the manuscript or would you recommend an additional statistical review? If an additional statistical review is recommended, please specify what aspects require further assessment in your comments to the editors.

Yes, and I have assessed the statistics in my report.


Quality of written English

Please indicate the quality of language in the manuscript:
Acceptable.

Declaration of competing interests

Please complete a declaration of competing interests, considering the following questions:
1. Have you in the past five years received reimbursements, fees, funding, or salary from an
organisation that may in any way gain or lose financially from the publication of this
manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the
manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that
holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests'
below. If your reply is yes to any, please give details below.

I declare that I have no competing interests; however I should mention that I am part of the
MinION Access Programme (MAP).

I agree to the open peer review policy of the journal. I understand that my name will be included
on my report to the authors and, if the manuscript is accepted for publication, my named report
including any attachments I upload will be posted on the website along with the authors'
responses. I agree for my report to be made available under an Open Access Creative Commons
CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
which I do not wish to be included in my named report can be included as confidential comments
to the editors, which will not be published.

I agree to the open peer review policy of the journal.

Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0140-7/13742_2016_140_AuthorComment_V1.pdf)