Preprint reviews by Jean-Marc Aury

Multi-locus and long amplicon sequencing approach to study microbial diversity at species level using the MinION™ portable nanopore sequencer

Alfonso Benítez-Páez, Yolanda Sanz

Review posted on 17th October 2016

I've read the manuscript by Benitez-Paez et al. titled "Multi-locus and ultra-long amplicon sequencing approach to study microbial diversity at species level using the MinION portable nanopore sequencer" with great interest. The authors describe a method dedicated to metagenomic studies based on amplicon sequencing and focused on microbial diversity using 16S and 23S sequences. The manuscript presents an interesting approach and describes the construction of the rrn database that could be a valuable dataset for the community. However, it will require considerable improvement prior to publication.


Major revisions
1. The authors claim that their approach is more accurate than other approaches based on sequencing of the 16S region. The authors have to benchmark their strategy against 'classical' amplicon pipelines (based on a shorter region, sequenced using, for instance, the Illumina or Nanopore technologies) and show that using a larger sequence allows the microbial diversity to be detected more accurately. Using a larger region for taxonomic assignment requires an adapted database. This database is probably less comprehensive than existing 16S databases, so even if the method is more accurate, it may be less sensitive.

2. The authors chose to multiplex the experiments, and they sequenced two samples using a single flowcell. Some sentences are confusing, for example line 225: "One example was the ability to discern close species, such as Lg and Lf presented distinctively in HM-782D and D6305 samples, respectively". This implies that the right species are identified thanks to the larger amplicon size. But if I understand correctly, it is only due to the multiplexing and the detection of the correct barcode, because the two close species are not in the same sample.

3. I think the authors should use their approach on a more complex sample. They introduced a human fecal DNA sample but they've never presented the results obtained on this more complex sample. In my opinion, the method is less sensitive (than 16S approach) when used on unexplored environmental samples which may contain species without close relatives in public databases. The authors should discuss these aspects (completeness of the rrn database and efficiency of the primers used) in the manuscript as it may indicate the limit of the method to deal with complex samples.

4. Line 382: Figure1 is missing in the manuscript.

5. The figure 3 is of interest because it shows the importance of using a larger region. As it is a key point, I think the authors should introduce this result earlier in the manuscript. Furthermore, the figure is a bit confusing as it suggests that the ITS region is more discriminant than the rrn region. Indeed ITS sequences show a higher variability at 97%, 98% and 99% identity.

Minor revisions
1. Methods are spread across the manuscript (lines 106-113 and lines 156-166); the authors should better organize the description of the methods to avoid redundancy.

2. Line 64: Use ONT instead of ONT nanopore technology.

3. I suggest that the authors describe the organization of the 16S, 23S and ITS sequences, especially the size of these regions.

4. Line 189: "… and not from sequencing." On the contrary, I think sequencing is one of the limits: low-abundance species are observable only with a higher sequencing depth.

5. There are a lot of missing spaces between words throughout the manuscript.

Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included?
If not, please specify what is required in your comments to the authors.
Yes.

Are the conclusions adequately supported by the data shown?
If not, please explain in your comments to the authors.
No.

Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting?
Yes.

Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used?
There are no statistics in the manuscript.

Quality of written English
Please indicate the quality of language in the manuscript:
Needs some language corrections before being published.

Declaration of competing interests
Please complete a declaration of competing interests, consider the following questions:
Have you in the past five years received reimbursements, fees, funding, or salary from an organization that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold any stocks or shares in an organization that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold or are you currently applying for any patents relating to the content of the manuscript?
Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
Do you have any other financial competing interests?
Do you have any non-financial competing interests in relation to this manuscript?
If you can answer no to all of the above, write ‘I declare that I have no competing interests’ below. If your reply is yes to any, please give details below.
I declare that I have no competing interests; however I should mention that I am part of the MinION® Access Programme (MAP).

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal.

Authors' responses to reviews:

Reviewer #1 concerns:
Major revisions
1. The authors claim that their approach is more accurate than other approaches based on sequencing of the 16S region. The authors have to benchmark their strategy against 'classical' amplicon pipelines (based on a shorter region, sequenced using, for instance, the Illumina or Nanopore technologies) and show that using a larger sequence allows the microbial diversity to be detected more accurately. Using a larger region for taxonomic assignment requires an adapted database. This database is probably less comprehensive than existing 16S databases, so even if the method is more accurate, it may be less sensitive.
R./ We thank the reviewer for the constructive criticism. We sequenced the V4-V5 16S amplicons for the mock communities HM782-D and D6305, processed these sequences with the standard protocols used in short-read approaches, and compared the results with those obtained with our ultra-long read approach. The results are reported throughout the manuscript and in Supplementary Tables 1 and 2.

2. The authors chose to multiplex the experiments, and they sequenced two samples using a single flowcell. Some sentences are confusing, for example line 225: "One example was the ability to discern close species, such as Lg and Lf presented distinctively in HM-782D and D6305 samples, respectively". This implies that the right species are identified thanks to the larger amplicon size. But if I understand correctly, it is only due to the multiplexing and the detection of the correct barcode, because the two close species are not in the same sample.
R./ We have re-written this paragraph in order to express our thoughts in a better way (lines 270-282).

3. I think the authors should use their approach on a more complex sample. They introduced a human fecal DNA sample but they've never presented the results obtained on this more complex sample. In my opinion, the method is less sensitive (than 16S approach) when used on unexplored environmental samples which may contain species without close relatives in public databases. The authors should discuss these aspects (completeness of the rrn database and efficiency of the primers used) in the manuscript as it may indicate the limit of the method to deal with complex samples.
R./ We appreciate the reviewer's suggestions. We are aware of the limitations of our approach and have discussed the reviewer's concerns in lines 322-324 and 353-359. This is a preliminary study, and we are working to strengthen our approach by sequencing the rrn region of several human-derived samples with R9.4 chemistry, thus providing invaluable information regarding the 16S, 23S and ITS sequences present in this type of environment.

4. Line 382: Figure1 is missing in the manuscript.
R./ This was an erroneous reference to "Figure 1" in the main text of the manuscript. We have removed this citation.

5. The figure 3 is of interest because it shows the importance of using a larger region. As it is a key point, I think the authors should introduce this result earlier in the manuscript. Furthermore, the figure is a bit confusing as it suggests that the ITS region is more discriminant than the rrn region. Indeed ITS sequences show a higher variability at 97%, 98% and 99% identity.
R./ We have made the changes suggested by the reviewer, and the respective figure and supporting analysis are now cited early in the main text (lines 152-171). Additionally, we have included a statement that explains the higher diversity observed in the ITS region.

Minor revisions
1. Methods are spread across the manuscript (lines 106-113 and lines 156-166); the authors should better organize the description of the methods to avoid redundancy.
R./ We have removed the redundant information regarding the methodology and processing of MinION reads.

2. Line 64: Use ONT instead of ONT nanopore technology.
R./ Changed as suggested.

3. I suggest that the authors describe the organization of the 16S, 23S and ITS sequences, especially the size of these regions.
R./ This information is now stated at lines 328-330 in the revised version of the manuscript.

4. Line 189: "… and not from sequencing." On the contrary, I think sequencing is one of the limits: low-abundance species are observable only with a higher sequencing depth.
R./ We agree with the reviewer. However, our argument was meant to explain the origin of the bias against certain amplicon reads, even though their templates are theoretically present in the same proportions as others for which we retrieved a greater number of sequences. Following this reasoning, we previously demonstrated that the PCR step is the major source of coverage bias when comparisons are made within a sample. For inter-sample comparisons, we definitely think that coverage has to be improved.

5. There are a lot of missing spaces between words throughout the manuscript.
R./ Missing spaces are generated when different MS Word versions introduce changes in the same document. We will therefore try to deliver a revised version that is improved in this respect.


Reviewer #2: The authors present an approach for improved microbial molecular profiling using the Oxford Nanopore Technologies MinION. Given that one of the areas of excitement for the MinION was its potential for environmental monitoring this method has relevance for future studies. The authors attempted a multiplex approach on the MinION - the ability to multiplex samples when using Oxford Nanopore Technologies sequencing has widespread potential for use by other researchers. The authors also created a curated rrn database which appears to be a useful resource for the community.
Major comments:
My biggest issue with the study is that the experiment needs replication:

1. How is it possible to accurately differentiate species which vary by less than the error rate of the sequencing technology? I'm concerned that the assignment of reads to taxa will be to a great extent arbitrary. I feel that this experiment needs replication to ensure that the taxonomic structure of the communities are being captured reproducibly. Alternatively the experiment could be repeated using an alternative technology - the most analogous being PacBio.
R./ We completely understand the reviewer's concerns, but we think that the approach to taxonomic assignment of ultra-long reads presented in this study was not arbitrary at all. We were aware from the beginning of the challenge of making a correct taxonomic assignment to DNA reads with a high error rate. The molecular basis of our approach was to cover a wider genomic region with a greater number of hypervariable sites, which can compensate for the high proportion of errors contained in nanopore data (now stated explicitly in lines 80-83 and 86-90). In addition to gaining more informative regions for our analysis, we have incorporated several filtering steps in our pipeline that ensure we keep reliable data for precise identification, with the side effect of losing a notable proportion of reads. As a consequence, we think that our method is reliable but not definitive and, being aware of that, we stated the limitations of its current state in different paragraphs of the Discussion (see lines 322-324, 346-349, and 353-359). Unfortunately, we cannot replicate the experiment under the same conditions, given that nanopore technology is continually updated and the flowcells, sequencing kits, MinKNOW, and basecaller used in our experiment are no longer distributed. The latest release of this technology shows dramatically lower error and higher throughput, so results should be better. Indeed, new results using R9.4 chemistry and other configuration updates, which are the scope of future manuscripts, are consistent with the preliminary results presented here and have allowed us to gain specificity and sensitivity in our approach. We additionally included further experiments suggested by reviewer #1 to benchmark our MinION approach against common Illumina MiSeq routines for microbial diversity.

2. The authors state that the problems they had generating 2D reads were indicative of bad ligation of the HP adaptor. Is this problem intrinsic to their method or simply an isolated incident for this run? In which case the optimal course of action would be to re-run the experiment.
R./ This seems to be one of the main technical problems of this technology during the hands-on procedures. The proportion of 2D reads reaches roughly 25% of the total number of template reads. Being aware of that, ONT has developed 1D sequencing kits and, according to personal communications with Technical Support staff, the production of 1D libraries will be the direction of future developments of nanopore sequencing technology. Therefore, we see no major concerns in using 1D reads for our aims.

3. The multiplexing of multiple samples on the same MinION flowcell does not appear to work well, at least in this experiment. A very high proportion of the reads are lost when attempting to bin the reads by barcode. Given the relatively low throughput on the MinION multiplexing of a mixed community sample was always an ambitious project to attempt but the capability to multiplex this technology could have important implications for the PromethION, and future iterations of the MinION, where the much greater throughput makes it a more realistic prospect. However it might be more reasonable to repeat this experiment without the multiplexing aspect in order to ensure high coverage of the community and thus an accurate assessment of the taxonomic profile.
R./ We thank the reviewer again for her/his criticisms. One of the aspects explored in this study was the capacity of the MinION to perform microbial diversity studies, which normally involve sequencing several samples in multiplex fashion. We were aware of the potential limitations of the MinION in this respect; even so, the multiplexing of three samples of moderate complexity was almost fully resolved in our experiment. Limitations of the technology in terms of throughput are discussed, and the reviewer's further concerns are answered in the previous points and in the manuscript as well.

Minor comments:
1.In the background section the authors comment that studies of microbial diversity are limited by short-read strategies. Could they elaborate, giving examples, and also specify the technologies they are referring to rather than referring to them as "popular sequencing platforms".
R./ Done as suggested. This information now is presented at lines 80-83.

2.The section on the background on the MinION offers an unnecessarily broad historical overview of the technology and could be restricted to a short comment on data quality and how it has been used in relevant studies.
R./ We appreciate such suggestion and we have reduced the historical overview of the nanopore technology.

3.Can the authors state more explicitly what the limits of detection are? At what % abundance in a sample can a species still be reliably detected? And how great a depth of coverage of the overall community is needed to obtain this? This could be investigated with in silico subsampling of the reads. In the absence of experimental replication, subsampling of reads would also allow the authors to give an indication of the variability of their results. This could be presented as error bars on bar charts (replacing the current pie charts).
R./ We followed an approach similar to that suggested by the reviewer to determine the limits of detection of our approach and the coverage needed. We have rewritten some of these procedures in the M&M section for better understanding (lines 463-468). We calculated the minimum coverage needed from the data obtained for the D6305 community, whose structure is closer to that of complex samples, where species are present in uneven proportions. These data are now presented in lines 250-252.
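
The in silico subsampling discussed in this point can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the function names, the toy read assignments, and the choice of sampling without replacement are all hypothetical. Each read is represented by the taxon it was assigned to; repeated subsamples give a mean and standard deviation of relative abundance (usable as error bars) and the fraction of subsamples in which a taxon is detected at a given depth.

```python
import random
from collections import Counter
from statistics import mean, stdev


def subsample_abundances(assignments, depth, n_replicates=100, seed=0):
    """Per-taxon (mean, sd) of relative abundance (%) over random subsamples."""
    rng = random.Random(seed)
    taxa = sorted(set(assignments))
    per_taxon = {taxon: [] for taxon in taxa}
    for _ in range(n_replicates):
        # Draw `depth` reads without replacement and count taxa.
        counts = Counter(rng.sample(assignments, depth))
        for taxon in taxa:
            per_taxon[taxon].append(100.0 * counts[taxon] / depth)
    return {taxon: (mean(v), stdev(v)) for taxon, v in per_taxon.items()}


def detection_rate(assignments, taxon, depth, n_replicates=100, seed=0):
    """Fraction of subsamples in which `taxon` is seen at least once."""
    rng = random.Random(seed)
    hits = sum(taxon in rng.sample(assignments, depth)
               for _ in range(n_replicates))
    return hits / n_replicates


# Toy data, invented for illustration: 1000 reads, one taxon at 1% abundance.
reads = ["taxon_major"] * 890 + ["taxon_minor"] * 100 + ["taxon_rare"] * 10
for depth in (50, 200, 800):
    summary = subsample_abundances(reads, depth)
    rate = detection_rate(reads, "taxon_rare", depth)
    print(depth, {t: (round(m, 1), round(s, 1)) for t, (m, s) in summary.items()}, rate)
```

Plotting the detection rate against depth for taxa of known abundance would give one possible reading of the "limit of detection" the reviewer asks about.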

4.Can the authors make clear how many reads were assigned to species not in the control communities.
R./ Changes made as suggested. (lines 228-236 and 240-247)

5.A possible explanation for the result that E.coli is preferentially amplified:
- E.coli and mid-GC species are known to sequence well on the MinION. This should result in the true E.coli reads being well sequenced with a low error rate and thus assigned correctly. Whilst other species are sequenced with a higher error rate and then are more likely to have their reads mis-assigned to other species. In addition, random errors will normalise GC content of the reads towards 50% which will then make reads more likely to be mis-assigned to mid-GC species.
- Can the authors correlate coverage bias for/against species with the GC content of the species to attempt to shed some light on this?
R./ We calculated the linear correlation between coverage bias and the GC content of the rrn region, and no significant Pearson's r value was obtained. Moreover, the average GC content of the rrn regions from all species analyzed was 51.8 ± 2.5, suggesting that amplicon GC composition has no major influence on sequencing performance.
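
For readers who want to repeat this kind of check, a minimal sketch of a Pearson correlation between GC content and coverage bias is given here. The numbers are invented for illustration and are not the authors' data; expressing coverage bias as log2(observed / expected reads) is also an assumption, since the response does not say how bias was measured.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical values, one pair per species (not taken from the manuscript).
gc_percent = [49.5, 50.8, 51.2, 52.4, 53.9, 54.1]
coverage_bias = [0.4, -0.7, 1.1, -0.2, 0.6, -0.9]  # assumed: log2(observed / expected)

print(f"Pearson r = {pearson_r(gc_percent, coverage_bias):.2f}")
```

With only a handful of species, an r value alone says little; a significance test or confidence interval would be needed to support a statement such as "no significant correlation".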

6.The authors state that they can distinguish reads generated from the two community samples. Why then did they lose so many reads while de-multiplexing?
R./ We have discussed this issue in the first point of the major concerns.

7.To determine the effectiveness of the different regions of the rrn one experiment would be to split the reads post alignment into 16S, ITS and 23S then re-align the split reads and see how well each region performs compared to using all 3. Ideally this would be a carried out experimentally, with each region amplified and sequenced separately. However the in silico version of the experiment would give some indication of predictive power.
R./ We thank the reviewer for this suggestion, which could be a good approach to explore in future studies. However, we have already demonstrated the discrimination power of the different rrn regions using reference data. Moreover, our data cannot fully support the proposed analysis, given that not all reads reached the expected full rrn length and, therefore, they do not contain the same proportion of information in terms of the 16S, ITS, and 23S regions.

8.The authors state that the ITS has the highest level of variation in the rrn. At what level of relatedness does the ITS become oversaturated with variation and thus lose its usefulness for separating taxa?
R./ We have observed that both sequence and structural variation reside in the ITS and, therefore, this region should theoretically be able to discriminate at the species level. Unfortunately, we can neither precisely determine nor quantify this, given that it requires extensive and time-consuming analyses that are beyond the scope of this manuscript.
Taxonomic assignments were based on alignments of >70% identity. However this is far below the level of variation between species making mis-alignment between species entirely possible. Is taking the best hit for a read the best way to represent the data? I would speculate that results based on this will be highly variable and that the authors would need to replicate their data to prove otherwise. A more representative analysis of the data might be to use software (eg MEGAN) that uses all taxonomic hits for a read (above a reasonable identity threshold) and does a last common ancestor analysis to place the reads to taxa.
R./ We were aware that such a level of sequence identity could be a source of misalignments and therefore of incorrect taxonomic assignments. However, using the described best-hit approach on LAST outputs, we could reconstruct the microbial composition of not one but two different mock communities. Moreover, data on other species potentially present in minor proportions (<1%) in the respective communities (discussed in the points above) support the reliability of this approach, which will evidently gain accuracy with future nanopore chemistry releases. Other methods for taxonomic assignment, such as MEGAN, might not work properly given that they use limited information (e.g. 16S) from reference databases such as SILVA and, as mentioned above, not all our reads necessarily contain this information.
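
The difference between best-hit and last-common-ancestor placement raised in this exchange can be illustrated with a small sketch. It is written in the spirit of an LCA analysis and does not reproduce MEGAN's actual algorithm or the authors' pipeline; the function names, the `margin` parameter, and the toy lineages are hypothetical. Each hit is a pair of a root-to-species lineage and a percent identity; hits within `margin` points of the best identity are kept and the read is placed at their deepest shared rank.

```python
def lowest_common_ancestor(lineages):
    """Longest shared prefix of root-to-leaf lineages (tuples of rank names)."""
    if not lineages:
        return ()
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) > 1:
            break  # lineages diverge at this rank
        shared.append(ranks[0])
    return tuple(shared)


def assign_read(hits, min_identity=70.0, margin=2.0):
    """Place a read at the LCA of all hits close to the best identity."""
    kept = [(lineage, identity) for lineage, identity in hits
            if identity >= min_identity]
    if not kept:
        return ()  # unassigned
    best = max(identity for _, identity in kept)
    near_best = [lineage for lineage, identity in kept
                 if identity >= best - margin]
    return lowest_common_ancestor(near_best)


# Toy hits for one read (hypothetical taxa and identities).
hits = [
    (("Bacteria", "PhylumX", "GenusA", "species_1"), 85.0),
    (("Bacteria", "PhylumX", "GenusA", "species_2"), 84.1),
    (("Bacteria", "PhylumY", "GenusB", "species_3"), 72.0),
]
print(max(hits, key=lambda h: h[1])[0])  # best hit: resolves to species_1
print(assign_read(hits))                 # LCA: ('Bacteria', 'PhylumX', 'GenusA')
```

In this toy case the best hit commits to a species on a margin of under one identity point, while the LCA placement stops at the genus, which is the trade-off between resolution and robustness that the reviewer and authors are debating.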

9.References 12 and 14 are essentially the same.
R./ Duplication was removed.




The use of Oxford Nanopore native barcoding for complete genome assembly

Sion C. Bayliss, Vicky L. Hunt, Maho Yokoyama, Harry A. Thorpe, Edward J. Feil

Review posted on 16th June 2016

The manuscript "The use of Oxford Nanopore native barcoding for complete genome assembly" by Bayliss et al. describes a methodology dedicated to genome assembly of small genomes by combining the MinION device with the Illumina sequencing technology. Several studies have already reported the use of the Oxford Nanopore technology for genome assembly, as mentioned by the authors. Here, authors take advantage of the possibility of multiplexing libraries on a single sequencing run. The manuscript describes the sequencing data and the minor differences between the USA300 reference genome and the MHO_001 strain. The assembly method is classical and has already been used in Risse et al. (GigaScience 2015) to assemble the Bacteroides fragilis genome. My main concern about the manuscript is related to the weak description and analysis of the multiplexed data. In addition, the impact is limited because the method used and the results obtained improve slightly the state of the art for nanopore-based assembly.


Major Compulsory Revisions
1) I recommend that the authors give better insight into the usage of barcodes. Several questions emerge, for example the accuracy of the demultiplexing process (given the high error rate of nanopore reads) and the fraction of cross-contamination (reads that are assigned to the wrong sample). These two pieces of information could affect the required sequencing coverage.
2) The samples used for the pooled library are not defined. Authors should add a full description of the dataset. I couldn't find it at the given URL: http://porecamp.github.io
3) Please describe the Illumina data in more detail. The authors state that the sequencing was performed on MiSeq and HiSeq platforms; which dataset was used to perform the genome assembly with SPAdes? What is the read length?
4) Please describe in more detail how splitbarcodes.py performs the demultiplexing step.

Minor Essential Revisions
1) Line 96: Metrichore instead of Metrichor
2) Line 104: "…BLAST similarity with with previously…"
3) Line 165: "…lost using one method sequencing method preferentially…"
4) Line 100: Does SPAdes output a single contig with the same circularization point as the reference genome? Was this region correctly assembled?

Jean-Marc Aury

Level of interest
Please indicate how interesting you found the manuscript:
An article of limited interest

Quality of written English
Please indicate the quality of language in the manuscript:
Acceptable

Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:
Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold or are you currently applying for any patents relating to the content of the manuscript?
Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
Do you have any other financial competing interests?
Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests; however I should mention that I am part of the MinION® Access Programme (MAP).

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal.

Authors' response to reviews:

Response to reviewers:


We would first like to thank the reviewers for taking time to provide detailed and insightful responses to the manuscript. We are confident that we have addressed these legitimate concerns with significant modifications to the manuscript and additional analysis. We hope that you share our view that the manuscript is now substantially improved, and provides more convincing support for the conclusions drawn. Moreover, we feel that the additional technical analysis widens the appeal of the paper by making the workflow more accessible.




Revisions

Reviewer #1 (JM Aury)

The manuscript "The use of Oxford Nanopore native barcoding for complete genome assembly" by Bayliss et al. describes a methodology dedicated to genome assembly of small genomes by combining the MinION device with the Illumina sequencing technology. Several studies have already reported the use of the Oxford Nanopore technology for genome assembly, as mentioned by the authors. Here, authors take advantage of the possibility of multiplexing libraries on a single sequencing run. The manuscript describes the sequencing data and the minor differences between the USA300 reference genome and the MHO_001 strain. The assembly method is classical and has already been used in Risse et al. (GigaScience 2015) to assemble the Bacteroides fragilis genome. My main concern about the manuscript is related to the weak description and analysis of the multiplexed data. In addition, the impact is limited because the method used and the results obtained improve slightly the state of the art for nanopore-based assembly.

Major Compulsory Revisions

1) I recommend that the authors give better insight into the usage of barcodes. Several questions emerge, for example the accuracy of the demultiplexing process (given the high error rate of nanopore reads) and the fraction of cross-contamination (reads that are assigned to the wrong sample). These two pieces of information could affect the required sequencing coverage.

We thank the reviewer for raising this important point. In response to this and point 4) below we evaluated how the reads were demultiplexed (revising the demultiplexing script in the process, see point 4 below) and aligned the reads to the complete genome produced by the assembly using the BLASR long read aligner.
The revised filtering script (FilterBarcodes.pl) allowed for more stringent parameters to be used for identifying barcodes in the 2D failed reads. This produced a set of 1499 2D failed reads. The failed 2D reads were combined with the 1324 2D pass reads and the Illumina reads to produce an identical assembly to that used during the previous workflow (which used a slightly larger set of nanopore reads, 3105 vs 2821). The pass and fail reads were then aligned to this assembly using BLASR. This resulted in 99.70% of the pass reads and 86.19% of the failed reads aligning to the reference genome. The aligned pass reads had an average 85.87% sequence identity to the reference genome. The aligned fail reads had an average 77.76% sequence identity to the reference genome. We also ran BLASR on the filtered reads that contained barcodes for other, non-target multiplexed samples and the 2D fail reads in which we were unable to detect a barcode. Only a low percentage of these reads aligned to the reference genome, 27/9774 (0.29%) of barcoded non-sample and 722/9501 (7.60%) of unbarcoded reads. In summary, only ~15% of the demultiplexed reads represented either contamination during library prep or a failure to correctly identify the barcode. The majority of the pass reads were correctly demultiplexed by Metrichor.
These methodologies and results have been added to the manuscript (Lines 129 and 154), Figure 1 has been updated accordingly and Table 1 has been added to summarise the BLASR results.


2) The samples used for the pooled library are not defined. Authors should add a full description of the dataset. I couldn't find it at the given URL: http://porecamp.github.io

We apologise for the lack of clarity here. The other isolates were provided by attendees of the Porecamp workshop and these data are being prepared for a separate publication. Inclusion of the very diverse samples from such a broad assemblage of organisms (including eukaryotes) and isolation methodologies would be best served as a separate manuscript. The purview of the current manuscript is focused on the problem of hybrid assembly for bacterial genomes. We have provided additional analysis of the demultiplexed reads (both pass and fail) and confirmation of SNPs by short read variant calling (see Reviewer 2 Point 1 below), and are confident we have addressed these problems and clarified the methodology. We have modified the text to read “The additional DNA samples included in the pooled library were a diverse assemblage of bacterial and eukaryotic DNA samples provided by attendees during the PoreCamp Workshop 2015 at the University of Birmingham. The additional pooled library samples are being prepared for separate publication. Details on the PoreCamp Workshop and associated publications can be found at http://porecamp.github.io/”.


3) Please describe the Illumina data in more detail. The authors state that the sequencing was performed on MiSeq and HiSeq platforms; which dataset was used to perform the genome assembly with SPAdes? What is the read length?

The MiSeq and HiSeq datasets were pooled. The pooled sample represented a sample with sufficient coverage for our analysis. The text now reads “A single 250 bp paired end library was constructed and sequenced on both MiSeq and HiSeq Illumina platforms. The reads from both sequencing runs were combined before downstream analysis”.


4) Please describe in more detail how splitbarcodes.py performs the demultiplexing step.

We have modified the text as follows: “2D reads that failed the sample QC were demultiplexed using an in-house script (FilterBarcodes.pl). The twelve 40 bp barcodes used for library construction were compared in a moving 40 bp window to the sequence in the first and last 150 bp of each read. The barcode requiring the least insertions, deletions or substitutions to be permuted into a sequence in the beginning or end of a read, with a maximum cut-off of 14 permutations, was considered a match. Each read could only be assigned to one individual sample; in the case of a tie the reads were discarded. Sequence preceding or following the presence of a barcode at the beginning or end of a read, respectively, was trimmed as adapter sequence”.
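
The matching rule quoted above can be sketched in a few lines of Python. This is an illustration only and not the authors' FilterBarcodes.pl: the function names (edit_distance, best_window_distance, assign_barcode) are hypothetical, reverse-complemented barcodes are not considered, and adapter trimming is left out. The 150 bp search regions, the barcode-length moving window and the cut-off of 14 edits are taken from the quoted text.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def best_window_distance(barcode, region):
    """Smallest edit distance between the barcode and any barcode-length window."""
    width = len(barcode)
    if len(region) < width:
        return edit_distance(barcode, region)
    return min(edit_distance(barcode, region[i:i + width])
               for i in range(len(region) - width + 1))


def assign_barcode(read, barcodes, search_len=150, max_edits=14):
    """Return the name of the single best-matching barcode, or None.

    None means that no barcode was within max_edits, or that two or more
    barcodes tied for the best score (such reads are discarded).
    """
    regions = (read[:search_len], read[-search_len:])
    scores = {name: min(best_window_distance(seq, region) for region in regions)
              for name, seq in barcodes.items()}
    best = min(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    if best > max_edits or len(winners) != 1:
        return None
    return winners[0]
```

With real 40 bp barcodes the defaults would apply; for short toy barcodes a smaller max_edits is needed, otherwise almost any read would match.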



Minor Essential Revisions

1) Line 96: Metrichore instead of Metrichor

Corrected in text.


2) Line 104: "…BLAST similarity with with previously…"

Corrected to “The two smaller contigs were 100% identical in both aligned sequence and alignment length to previously sequenced S. aureus lineage USA300 plasmids”


3) Line 165: "…lost using one method sequencing method preferentially…"

The line was revised in response to Reviewer 2 comment 7. The text now reads “However, the clear benefit of hybrid sequencing is that it allows for the generation of larger assemblies with less uncertainties than by using a single sequencing technology preferentially over another”.


4) Line 100: Is Spades output a single contig with the same circularization point as the reference genome?

Additional details have been added to the Materials and Methods section: “The contigs were circularised by MUSCLE alignment (default parameters) of identical overlapping regions at the end of contigs and removal of one alternative overlapping sequence using an in-house script (CirculariseOnOverlaps.pl) [8]⁠. Start sites were fixed relative to the beginning of the relevant reference sequence”. The output of SPAdes v3.6.1 typically produces a redundant overlap of length x at the end of the contig ( x often being the length of the K-mer used to produce the contig ).


Was this region correctly assembled?

Despite minor discrepancies we are, in general, highly confident in the assembly. The overlapping region in both plasmids was identical. There are a few mismatches between the overlapping regions in the chromosome. This region lies between 16S and 23S ribosomal RNA genes, which are notoriously hard to assemble correctly. The sequence presented in the manuscript has the fewest differences from the USA300 reference genome and is therefore likely to be closer to the 'true' sequence than the alternative. The intergenic regions in all of the ribosomal RNA operons in MHO_001 show some deviation from the USA300_FPR3757 reference. MAUVE alignments/screenshots of the overlap regions compared to USA300_FPR3757 are available as Supplementary Figures 1 and 2. A CLUSTAL alignment of the region is available as Supplementary Figure 3. The pileup file showing the read coverage of long reads in the region and the genome is shown in Supplementary Figure 4. The data used to generate them are available in the GitHub directory. Text has been added to the results - “There was minor sequence dissimilarity, including a small deletion, in ribosomal RNA operons. This could either reflect evolutionary changes in these highly conserved sequences or minor misassembly; these regions are typically difficult to assemble”. A clustalw alignment of the overlapping regions is available on the GitHub repository.
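
The circularisation idea described in the response to point 4 (a redundant overlap at the contig ends, one copy of which is removed) can be illustrated with the simplified sketch below. It handles only exactly identical overlaps; the authors aligned the overlaps with MUSCLE and used their own script (CirculariseOnOverlaps.pl), which this does not reproduce. The function names and the min_overlap parameter are hypothetical, and fixing the start site relative to a reference is not covered.

```python
def terminal_overlap_length(contig, min_overlap=10):
    """Length of the longest suffix that exactly matches a prefix, or 0."""
    max_len = len(contig) // 2
    for length in range(max_len, min_overlap - 1, -1):
        if contig[:length] == contig[-length:]:
            return length
    return 0


def circularise(contig, min_overlap=10):
    """Remove one copy of an exact terminal overlap; return the contig unchanged if none."""
    length = terminal_overlap_length(contig, min_overlap)
    return contig[:len(contig) - length] if length else contig
```

Overlaps that contain mismatches, such as the chromosomal overlap discussed above, would not be detected by this exact comparison and need an alignment-based approach.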



Reviewer #2 (AB Paez)

The authors report the complete genome assembly of the Staphylococcus aureus MHO_001 strain using hybrid approaches based on analysis of Illumina and MinION data. Although this type of analysis can be found regularly in the scientific literature, the authors pay special attention to the use of a very small subset of MinION reads to get a complete genome assembly of their pathogenic isolate. MinION data was combined with a large amount of Illumina reads, obtained by massively parallel sequencing on MiSeq and HiSeq platforms, to produce three major scaffolds representing the bacterial chromosome of S. aureus MHO_001 and two mobile elements consisting of two plasmids SAP046A.
In my opinion, this work represents an interesting workflow to help in the achievement of valuable information regarding the complete sequencing of bacterial genomes, where the genome structure complexity and the limitations in terms of read length of the second generation of sequencing technologies constitute the major issues. As a consequence, I recommend this manuscript for publication in GigaScience journal once the following minor and major changes have been incorporated into the main text and supporting material:

Major points

1) MinION reads are characterized by their high rate of errors that can reach up to 40% in 1D reads. The authors declare usage of only 2D reads for the hybrid genome assembly; however, even in this type of read the per-base accuracy reaches 85% at most (for R7.3 chemistry). Given this and the low number of MinION reads used for the general aim of the manuscript, I would suggest performing some additional analysis in order to validate the strategy presented. This would involve corroborating by Sanger sequencing, when technically possible, the genetic variants and chromosome rearrangements found in the MHO_001 strain in comparison with the USA300.

In order to address this point we would like to first comment briefly on the hybrid assembly methodology used in this manuscript. SPAdes does not directly use the error-prone long reads to assemble the resulting contigs. Rather it relies on a methodology called 'read threading' to reduce the complexity of the De Bruijn graph, built from the low error rate short reads, to resolve repeats and loops/bulges within the graph. Therefore we would expect the resulting graph to have a similar error rate to an assembly generated from Illumina short reads alone. Illumina short reads are known to have a very low error rate and have become the tool of choice for calling high confidence SNPs in a number of recent resequencing studies (including Croucher 2011, Croucher 2015, Aanensen 2016, Holden 2013).
However, the question of assembly errors relative to the 'true' sequence is an important one and we are unaware of any direct comparisons of a 'read-threaded' genome to identify the potential error rate. In order to address this we took a mapping based approach using the low-error rate short reads independently. The short reads were mapped against USA300_FPR3757 and stringent parameters were used to identify only high confidence SNPs. These mapping based SNPs were compared to the list of SNPs generated by comparing the hybrid assembly to USA300_FPR3757. Excluding SNPs found in regions exclusive to USA300 or repeat regions (which are notoriously problematic for mapping based approaches), 105/114 (92.1%) comparable SNPs were supported by this analysis. On visual investigation of the BAM file, 6/9 of the unsupported SNPs were identified as being well supported variants with 100% read support that had been filtered based on significant strand bias caused by lower than average coverage (having only one supporting read on the reverse strand at 8x coverage). The remaining 3 SNPs (2.6%) were unsupported by mapping and may represent assembly error. Furthermore, mapping of the reads to the genome allowed us to identify the number of reads supporting the edge of a large genomic structural variant. All structural variants were highly supported by 8-10x coverage of long reads and >25x coverage of short reads. The exception to this was one edge of the translocation event which only had moderate coverage with long reads (3-4x) but high coverage with short reads. In addition to this, short reads were mapped to MHO_001. No indels were called using either samtools/GATK or pindel. 5 SNPs were called, 3 of which represented the SNPs unsupported by mapping to USA300 above. The remaining 2 were present in repeat regions.
We have added a spreadsheet of the comparison of SNPs and structural variant analysis as Supplementary Table 1. We have added the following text to the manuscript:

Materials and Method - SNPs were called between the chromosome and reference genome using MAUVE [14]⁠. SNPs were further confirmed by mapping short reads independently to USA300_FPR3757 and calling variants. Mapping was performed using BWA, reads at indel sites were realigned using the GATK toolbox and SNPs were called using samtools [12,15]⁠. The variant call file (VCF) was filtered for variants supported by a minimum read depth of 4 (minimum 2 per strand), >30 map quality, >50 average base quality, no significant strand bias and >75% of reads supporting the variant. Indels were additionally confirmed using pindel [16]⁠. The VCF file was filtered to remove regions unique to MHO_001 or USA300_FPR3757. Repeat regions of >50bp, which are notoriously problematic for short read mapping, were identified using nucmer and removed from the comparison [17]⁠ [Supplementary Table 1].

Results and Discussion - The chromosome showed minor differences to the USA300 reference genome USA300_FPR3757 including 155 SNP differences and the loss and gain of mobile genetic elements (Figure 2). In order to provide an independent confirmation of the 155 SNP differences identified by MAUVE between aligned regions of MHO_001 and USA300_FPR3757 the short reads were mapped to USA300_FPR3757 and variants were called using strict parameters. Of the 155 MAUVE SNPs, 41 (26.5%) were present in repeat regions and excluded from the comparison. Of the remaining 114 SNPs, 111 (97.36%) were supported by short read mapping to USA300_FPR3757. The remaining 3 SNPs (2.6%) were unsupported. No indels were identified by short read mapping to MHO_001 by either GATK/samtools or pindel. In summary, of the 114 SNPs identified by MAUVE that could be robustly investigated by short read mapping, 111 (97.4%) were confirmed using low error rate short reads. Furthermore, the long and short read coverage support at the edge of each of the large structural variants in MHO_001 was 8-10x for nanopore reads, with the exception of the 3' edge of the transposed 13,356 bp insertion sequence (IS) which had a read coverage of 3x, compared to the genomic average of 6.8x coverage. The edge of each structural variant was supported by >25 short reads.
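
The variant filter described in the Materials and Method text above can be expressed as a single predicate. The sketch below is an assumption-laden illustration: the VariantCall record and its field names are hypothetical (they are not VCF tag names), VCF parsing is omitted, the strand bias test is represented only by a precomputed flag, and "minimum 2 per strand" is read here as at least two reads on each strand.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical per-site summary; the field names are illustrative only."""
    depth: int                   # total reads covering the site
    forward_reads: int           # reads on the forward strand
    reverse_reads: int           # reads on the reverse strand
    mapping_quality: float       # mapping quality at the site
    mean_base_quality: float     # average base quality at the site
    significant_strand_bias: bool
    alt_fraction: float          # fraction of reads supporting the variant


def passes_filter(call):
    """Apply the thresholds quoted in the response to one variant call."""
    return (call.depth >= 4
            and call.forward_reads >= 2
            and call.reverse_reads >= 2
            and call.mapping_quality > 30
            and call.mean_base_quality > 50
            and not call.significant_strand_bias
            and call.alt_fraction > 0.75)
```

Under this reading, a site with 8x coverage but a single reverse-strand read fails the per-strand requirement, which is consistent with the SNPs the authors report as filtered despite 100% read support.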


2) The authors declare that S. aureus MHO_001 was isolated from a case of asymptomatic nasal carriage (lines 53-54); however, during my review I was unable to find any ethical or methodological information regarding the collection of this sample and the associated patient consent, if necessary. Please add such information to the main text of the document.

This line was added in text “S. aureus strain MHO_001 was recovered in 2015 from asymptomatic nasal carriage via a standard nasal swab of a healthy individual with informed consent. ”



Minor points:

1) Along description of the MinION sequencing library the authors never declared the pore chemistry used for sequencing (lines 72-77). Please detail the R7.3 or R9 flow cells used to obtain nanopore data.

We used a Flow Cell Mark I R7.3. The text now reads “[samples were] loaded onto a MinION(TM) Flow Cell Mark I R7.3”.


2) In similar manner, please detail MinKNOW and Metrichor versions used to operate the MinION device and perform basecalling, respectively.

The text now reads “[samples were] loaded onto a MinION(TM) Flow Cell Mark I R7.3 on a MinION(TM) Mark I controlled by MinKNOW version 0.50.2.15 software (Oxford Nanopore, UK). Base calling was performed using Metrichor ONT Sequencing Workflow Software v1.19.0 with the Basecall_Barcoding workflow (Oxford Nanopore, UK)”.


3) In lines 81 and 89, accession numbers for read data must be presented primarily as "study accession" (PRJ numbers) and, secondarily, "sample accession" can be specified if needed (ERP or ERS numbers).

Both lines were corrected to “study accession PRJEB14152”.


4) In lines 111-113, describe what type of data is presented within and outside of parenthesis.

In order to address this I have moved the description of this data to the Results and Discussion as it was more pertinent for this section than the Materials and Method. The text now reads “There was a discrepancy observed between the coverage of short and long reads of plasmidic and chromosomal contigs (Figure 2, top and middle panels). The average chromosomal coverage was 49.6x (7.0 SD) with short read data and 6.8x (2.6 SD) with nanopore reads. The average short read coverage of plasmids A and B was 78.35 (8.9 SD) and 7302.04 (85.4 SD) respectively. This represents a coverage increase of 1.5- and 150-fold relative to the chromosome. The opposite trend was observed with long reads; plasmids A and B had an average coverage of 4.05 (2.0 SD) and 2.9 (1.7 SD) respectively, which represents a 40% and 60% decrease in coverage relative to the chromosome. In addition to this the smaller of the two plasmids was only intermittently covered by nanopore reads”.
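
The fold changes quoted in this response follow directly from the reported mean coverages. A short sketch of the arithmetic, using only the numbers given above (the function and variable names are hypothetical):

```python
def fold_change(coverage, chromosome_coverage):
    """Coverage of a contig relative to the chromosome."""
    return coverage / chromosome_coverage


def relative_decrease(coverage, chromosome_coverage):
    """Fractional drop in coverage relative to the chromosome."""
    return 1.0 - coverage / chromosome_coverage


# Mean coverages as reported in the response.
CHROMOSOME = {"short": 49.6, "long": 6.8}
PLASMIDS = {"A": {"short": 78.35, "long": 4.05},
            "B": {"short": 7302.04, "long": 2.9}}

for name, cov in PLASMIDS.items():
    short_fold = fold_change(cov["short"], CHROMOSOME["short"])
    long_drop = relative_decrease(cov["long"], CHROMOSOME["long"])
    print(f"plasmid {name}: short reads {short_fold:.1f}-fold, "
          f"long reads {long_drop:.0%} lower")
```

The unrounded values are slightly different from the rounded figures in the quoted text, which is expected given that the text reports approximations.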


5) Change the expression "close genome" to "complete genome" all across the text.

Modified throughout the manuscript.


6) Split the Figure 2 into panels A, B, C, D, etc, and make description in the legend to better understand the plots presented. Define information shown in all axes.

Figure 2 has been split into panels A, B and C and the figure legend has been revised for clarity. The legend now reads:

Figure 2. Alignment of MHO_001 chromosome (A), plasmid A (B) and plasmid B (C) to the USA300_FPR3757 genome and reference plasmids alongside long and short read coverage. The bottom panels show alignments between MHO_001 and the reference sequences. Contiguous sequences are shown by connecting red lines and inversions are depicted in blue. Coding sequences (CDS) are annotated as blue rectangles with the exception of ribosomal RNA operons which are represented by red rectangles. Those above the line represent open reading frames on the forward strand and those under the line on the reverse strand. Notable mobile genetic elements or genomic features are annotated. A scale bar in basepairs (bp) is present underneath each sequence. The middle panels represent per base read coverage of short reads across the MHO_001 genome. The data was binned every 1000 bp. The y-axis, representing per bin read coverage, has been constrained to 200, 350 and 8000 reads per bin for the MHO_001 chromosome, plasmid A and plasmid B respectively. The top panel represents the per base read coverage of nanopore long reads across the MHO_001 genome. The data was binned every 1000 bp. The y-axis, representing per bin read coverage, has been constrained to 20 reads per bin for each contig.
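
The binning and axis capping mentioned in the legend can be sketched as follows. This assumes that each bin reports the mean per-base depth of its 1000 bp window; the legend does not say whether a mean or another summary was used, and the function names are hypothetical.

```python
def bin_coverage(per_base_depth, bin_size=1000):
    """Mean depth of consecutive bins; the final bin may be shorter than bin_size."""
    bins = []
    for start in range(0, len(per_base_depth), bin_size):
        window = per_base_depth[start:start + bin_size]
        bins.append(sum(window) / len(window))
    return bins


def cap(values, ceiling):
    """Constrain values to an upper limit, as done for the y-axis of each panel."""
    return [min(value, ceiling) for value in values]
```

For example, cap(bin_coverage(depths), 200) would correspond to the short read panel for the chromosome, given a list of per-base depths.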



7) Unless massive sequencing technologies deliver error-free data and the assembly algorithms evolve towards complete accuracy it cannot be possible to tell "...identification of correct assembly..." (lines 163-165). I suggest to modify this statement to "...large assemblies with less uncertainties..." that results more adequate to the current development state of sequencing technologies.

The text now reads “However, the clear benefit of hybrid sequencing is that it allows for the generation of larger assemblies with less uncertainties than by using a single sequencing technology preferentially over another”.


8) In line 121, change MINion to "MinION". Additionally, MinION is a trade mark of the Oxford Nanopore Technologies (ONT), therefore, it must be cited always across the document as MinION(TM).

Modified throughout the manuscript.




INC-Seq: Accurate single molecule reads using nanopore sequencing

Chenhao Li, Kern Rei Chng, Jia Hui Esther Boey, Hui Qi Amanda Ng, Andreas Wilm, Niranjan Nagarajan

Review posted on 29th April 2016

The manuscript "INC-Seq: Accurate single molecule reads using nanopore sequencing" by Li et al. describes a methodology for obtaining accurate DNA reads using the Oxford Nanopore sequencing technology. The INC-Seq strategy differs from existing methods, which are bioinformatics-only methods. Here, the authors combine a molecular approach and bioinformatics software to generate accurate reads. The application is restricted to 16S bacterial rRNA profiling, and is well adapted to deal with highly similar 16S sequences. The INC-Seq software is available online and is easy to install and use.

My main concern about the manuscript is related to the weak description of the datasets and the validation protocol which is not sufficiently rigorous.

Major Compulsory Revisions:

1) Clearly, having long and accurate reads is perfect for rRNA classification. The INC-Seq method is able to produce this kind of data but at the expense of coverage, which is the key to good sensitivity. I think this aspect has to be discussed in more depth in the article. The authors state that INC-Seq enables identification of species at 0.1% abundance; however, is the detection robust at this abundance? Table 2 suggests the opposite, as it is confounded with false positives.

2) There are several tools that are dedicated to the error correction of long and noisy reads. It is of interest to know how INC-Seq performs compared to Nanocorrect (github.com/jts/nanocorrect) and Canu (github.com/marbl/canu), for example. Even if nanocorrect is deprecated, it is based on the same idea (pbdagcon). A comparison with "pure bioinformatics tools" could enhance the importance of the library construction step.

3) It is not easy to know on first reading how many datasets have been used, and whether they have been simulated or sequenced. I recommend that the authors name their datasets and add a section describing how each dataset has been generated and for which goal. Furthermore, I think the authors should give complete metrics for each dataset (number of reads, number of bases, N50 read size, average read length …) in the main text.

4) The choice of using PBSIM to generate ONT reads is weird. I don't know how PBSIM models sequencing error and bias, but ONT and PacBio reads have clearly different error patterns. ONT reads contain non-random deletions and PacBio reads contain internal stretches that are essentially junk. NanoSim (github.com/bcgsc/NanoSim) is dedicated to ONT, and should better handle the specific error pattern.

5) Checking the error rate of the INC-Seq reads using a simulated dataset is not suitable. The final error rate highly depends on the error pattern, and if the authors use a false error pattern, then the computed error rate will be false. For example, ONT reads contain systematic errors in homopolymers, and in these cases, even with a high coverage, errors will remain in the INC-Seq reads. The error rate should be evaluated using real data, as is the case in section 3.4 and Figure 3. I think the authors should remove section 3.2 from the main text, or fuse it with sections 3.4 and 3.5.

6) The comparison of ONT raw and INC-Seq performance when applied to 16S rRNA classification is not fair. Indeed the simulation doesn't take into account chimeras produced during the INC-Seq library preparation, and systematic errors that affect the consensus quality, and as a consequence the rRNA classification.

7) The authors state that chimeras obtained from inter-molecular ligation lead to reads that are longer than expected. That is true in the majority of cases, but if the ligated rRNA molecules are near identical, it can lead to a consensus with the expected length. In these cases, the consensus will be a mix of the rRNAs. The authors should discuss this issue and estimate the "divergence threshold" that leads to INC-Seq reads longer than expected.

8) What is the difference between Table 2 and Figure 4B? Why are both replicates not present in Table 2? Table 2 suggests a high variability in relative abundances. Is the Pearson coefficient well adapted here? In section 3.5, the authors merged relative abundances of both replicates. Does this suggest that sequencing replicates is necessary when using the INC-Seq method?
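
The concern in point 5 can be illustrated with a toy consensus. The sketch below uses a column-wise majority vote over equal-length copies, which is a deliberate simplification and not the INC-Seq consensus procedure; the function name is hypothetical. Real ONT homopolymer errors are mostly insertions and deletions, whereas here a systematic error is modelled as the same substitution in every copy purely to keep the copies aligned.

```python
from collections import Counter


def majority_consensus(copies):
    """Column-wise majority vote over equal-length copies of one template."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*copies))


template = "ACGTAAAAGT"

# Random errors fall at different positions in each copy and are outvoted.
random_error_copies = ["TCGTAAAAGT", "ACGTAAAAGA", "ACCTAAAAGT"]

# A systematic error hits the same position in every copy and survives the vote.
systematic_error_copies = ["ACGTAACAGT"] * 3

print(majority_consensus(random_error_copies) == template)
print(majority_consensus(systematic_error_copies) == template)
```

A simulator that only draws independent random errors exercises the first case and never the second, which is why an error rate measured on such simulated data may be optimistic.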

Minor Essential Revisions:

1) Figure2: please use species name instead of accession identifiers.

2) I think authors should present complete metrics for ONT raw and INC-Seq reads (number of reads, number of bases, N50 read size, average read length …) in the main text.

3) Figure 1B is cited before Figure 1A in the main text.

4) The authors should remove "long" in the abstract: "…as a strategy for obtaining long and accurate nanopore reads…". INC-Seq allows obtaining accurate nanopore reads, but not long nanopore reads.
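
The read metrics requested above (number of reads, number of bases, N50 read size, average read length) are straightforward to compute from a list of read lengths. A minimal sketch, with hypothetical function names:

```python
def n50(lengths):
    """Largest length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def read_metrics(lengths):
    """Summary metrics for one dataset, given the length of every read."""
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "n50": n50(lengths),
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }
```

Reporting these values for the raw ONT reads and for the INC-Seq consensus reads side by side would make the trade-off between accuracy and yield explicit.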

Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? If not, please specify what is required in your comments to the authors.

Yes.

Are the conclusions adequately supported by the data shown? If not, please explain in your comments to the authors.

Yes.

Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting? If not, please specify what is required in your comments to the authors.

Yes.

Are you able to assess any statistics in the manuscript or would you recommend an additional statistical review? If an additional statistical review is recommended, please specify what aspects require further assessment in your comments to the editors.

Yes, and I have assessed the statistics in my report.


Quality of written English

Please indicate the quality of language in the manuscript:
Acceptable.

Declaration of competing interests

Please complete a declaration of competing interests, considering the following questions:
1. Have you in the past five years received reimbursements, fees, funding, or salary from an
organisation that may in any way gain or lose financially from the publication of this
manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the
manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that
holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests'
below. If your reply is yes to any, please give details below.

I declare that I have no competing interests; however I should mention that I am part of the
MinION Access Programme (MAP).

I agree to the open peer review policy of the journal. I understand that my name will be included
on my report to the authors and, if the manuscript is accepted for publication, my named report
including any attachments I upload will be posted on the website along with the authors'
responses. I agree for my report to be made available under an Open Access Creative Commons
CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
which I do not wish to be included in my named report can be included as confidential comments
to the editors, which will not be published.

I agree to the open peer review policy of the journal.

Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0140-7/13742_2016_140_AuthorComment_V1.pdf)




LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads

Review posted on 12th June 2015

The manuscript “Scaffolding draft genomes with nanopore reads” by Warren et al. describes software dedicated to genome assembly improvement, using data provided by the Oxford Nanopore technology. The LINKS algorithm differs from existing methods, which are alignment-based, and is elegant. Moreover, the methods are very well described. The authors used several datasets to assess the performance of the LINKS tool, and I should mention that LINKS is, to my knowledge, the first study describing a scaffolding method based on nanopore reads.

The LINKS software is available online, easy to install and use, and it runs very fast. I would like to congratulate all authors.

My main concern about the manuscript is related to the low quality of the results. The improvements of the continuity with the nanopore reads are not convincing. Indeed, several peer-reviewed articles have already reported near-perfect genome assemblies for bacterial genomes using either a combination of short and long reads (Madoui MA et al, Genome assembly using nanopore-guided long and error free DNA reads, BMC Genomics, 2015), or only Pacific Biosciences long reads (Chin CS et al., Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data, Nature Methods, 2013). Moreover, I’m dubious about the application of the algorithm presented here to scaffold more complex genomes using nanopore reads.

I do not feel adequately qualified to assess the statistics, especially the description of the statistical properties of errors, and the description of mixture models. Furthermore an existing study, which is not cited by Warren et al., already describes error rates of the MinION device (Jain M. et al, Improved data analysis for the MinION nanopore sequencer, Nature methods, 2015). Authors should mention and describe the differences with this existing study.

Major Compulsory Revisions

1) In the last six months, several high-quality genome assemblies for E. coli MG1655 using MinION and Illumina reads have been reported (http://www.genoscope.cns.fr/nas/ and http://schatzlab.cshl.edu/data/nanocorr/). The authors should compare their tool to these existing results and should highlight the benefits of using LINKS. As an example, the optimal coverage needed for the two approaches (de novo assembly vs scaffolding) could be compared.

2) The comparison with the SSPACE long read tool exhibits similar results. However, when LINKS is used in an iterative mode, final assemblies seem to be of higher continuity. I suggest that the authors make it the default mode of the method. Indeed, gradually increasing the distance between k-mer pairs is a good way to take advantage of long reads and is a fair comparison with alignment-based scaffolders.

3) The scaffolding of the white spruce genome assembly shows that LINKS scales well to larger genomes. Using a closely related genome to scaffold a second draft genome is unsafe, as you’ll miss specific structural variations. That’s why it should be combined with other kinds of data, like sequencing data or maps. There is a very limited number of scaffolders (as an example, Gritsenko AA et al, GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies, Bioinformatics, 2012) that can take into account this additional data; I suggest that the authors highlight this feature. Nevertheless, this part of the work is not in accordance with the title of the manuscript. Second, the continuity of the final assembly is not very well documented. The authors only discuss the NG50 (P9;L11-12); I suggest adding a table with all descriptive metrics of both the input and final assembly. Finally, it’s not clear how the quality assessment using gap-filling software and MPET data is achieved. The fact that the validation rate decreased at each iteration is expected, as mentioned by the authors. As a consequence, the authors could not validate scaffold integrity when parameter d is larger than 12kb. In my opinion, the sentence P9L22-23 is a shortcut.

4) As described, for example, in Figure 1 of the following study (Schatz M. et al, Assembly of large genomes using second-generation sequencing, Genome Research, 2010), k-mer uniqueness is highly dependent on the k value and the input genome. I guess using k=15 is not enough for a large majority of genomes, even with pairing information. So to deal with more complex genomes users will have to increase the k parameter, which is not compatible with the current error rate of the nanopore technology. The algorithm itself could be applied to large and complex genomes, but probably not with nanopore reads, as suggested in the title.

5) One key aspect of a scaffolder is the estimation of the size of the gaps; however, this point is never addressed in the manuscript.

6) Table 1 first suggests that iterative mode provides better results (NG50 of 633Kb vs 293Kb, 1D and 1F). However, the NGA50 and NA50 are similar, suggesting that iterative mode produces longer scaffolds but with a higher error rate.
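
The dependence of k-mer uniqueness on k raised in point 4 can be explored with a short sketch. The function name is hypothetical, only one strand is counted (reverse complements are ignored), and the measure used here, the fraction of distinct k-mers that occur exactly once, is one possible definition and not necessarily the one used in the cited study.

```python
from collections import Counter


def unique_kmer_fraction(sequence, k):
    """Fraction of distinct k-mers that occur exactly once in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    if not counts:
        return 0.0
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / len(counts)
```

Evaluating this over a range of k values on a repeat-rich genome would show how large k has to be before most k-mers become unique, which is the quantity to weigh against the read error rate.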

Minor Essential Revisions

1) P3L7: The cited manuscript has now been peer-reviewed, and the correct reference is from Nature Biotechnology. I think the authors should at least mention two other peer-reviewed manuscripts (Chin CS et al., Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data, Nature Methods, 2013) and (Madoui MA et al, Genome assembly using nanopore-guided long and error free DNA reads, BMC Genomics, 2015).

2) P313: “At the moment, sequence reads ….and indels rates”. As mentioned previously, several methods described high-quality genome assemblies using the MinION device.

3) P6L20: ABySS reference is missing.

4) Table S2 is a good summary; I suggest that the authors use it in the main text.

5) P9L19: It is not clear where the 58.6% comes from; can it be found in Figure 3?

6) P10L3-8: The analogy with the 10X Genomics technology is not appropriate; indeed it generates linked short sequences, but saying that you are able to generate linked short sequences starting from long reads is not very informative.

7) P13L13: Using LAST with default parameters is not optimal, as already stated in (Quick J. et al, A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer, GigaScience, 2014) and (Jain M. et al, Improved data analysis for the MinION nanopore sequencer, Nature methods, 2015). Moreover, in Figure S2, the authors used blastn to align nanopore reads, and surprisingly they found that 184 regions of E. coli are devoid of nanopore reads. This is odd, as de novo assembly of this dataset provides a perfect circular genome.

8) Figure 1: If I understand correctly, contig3 and contig2 are overlapping. This is not clear in the second part of the figure, as no k-mer pair establishes a link. In the third part, what do the black arrows mean?

9) Figure 2 is a good way to show at the same time the quality and the continuity of several assemblies. However, this figure is hard to read; I suggest that the authors create a figure for each genome.

10) Figure 3: the panels are not ordered logically compared to the corresponding legend. In the legend, where does the 84,529 number come from?

11) P4L12: Why is Figure S2 cited here?

12) The authors sometimes use version 1.5+ (P9L5) and sometimes version 1.5 (P11L2) of LINKS; why this discrepancy?

Level of interest

An article whose findings are important to those with closely related research interests

Quality of written English

Acceptable

Statistical review

Yes, but I do not feel adequately qualified to assess the statistics.

Declaration of competing interests

I declare that I have no competing interests; however I should mention that I am part of the MinION® Access Programme (MAP).

Authors' response to reviewers: (http://www.gigasciencejournal.com/imedia/5046484471782784_comment.pdf)
