Review for "Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION™ sequencing"

Completed on 18 Jan 2016 by Willem van Schaik.


Comments to author

Author response is in blue.

The manuscript by Cao et al. describes a pipeline for the real-time analysis of sequencing data produced by Oxford Nanopore’s MinION system. This is an exciting development, potentially with major implications for clinical microbiology. Please note that I am not a computer scientist or bioinformatician, and I have therefore focused on the microbiological interpretation of the data in the manuscript.

The manuscript is interesting, but is, in places, poorly organized. I am also not convinced that the typing approach (based on gene absence/presence) developed by the authors will be useful in real-life situations. On the other hand, the near-real time identification of species and antibiotic resistance genes is exciting and is a convincing illustration of the power of the MinION sequencer.

Major points

p. 4, column 1: The authors should also reconsider whether the discussion on the correct identification of K. pneumoniae ATCC700603 fits in the ‘Results’ section. The finding that 20% of reads map to the closely related species K. pseudopneumoniae is unsurprising, as horizontal gene transfer is extremely common in the genus Klebsiella and this may lead to reads of one genome sequence mapping to multiple species. The authors do not satisfactorily explain how they assigned ST-489 to this strain, as in Table 3 two STs have equally high scores (ST-489 and ST-851). It is also confusing to read about STs being assigned to ATCC700603 in this part of the manuscript: the approach to assign STs is explained in the next section of the manuscript. I would urge the authors to re-structure their manuscript, to first give a more general outline of the approaches and pipeline and then to illustrate this by presenting and discussing ‘real-life’ data.

We have restructured the paper to address this point. We now only discuss the results of an analysis against a database which includes K. quasipneumoniae (although it does not include the K. quasipneumoniae strain which we have sequenced). We hope this is now less confusing. It's worth noting that the 20% figure mentioned here (Figure 4b in the previous version) relates to the proportion of reads from a K. quasipneumoniae genome which map to K. variicola better (with higher mapping quality) than they map to K. pneumoniae. This is because we use a competitive alignment procedure for the species typing, and because the initial analysis did not include K. quasipneumoniae in the reference database.
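The effect of a missing reference under competitive assignment can be illustrated with a minimal sketch. This is not the pipeline's code: the function names, the input layout (a per-read dictionary of mapping qualities) and the toy numbers are all assumptions made for illustration only.

```python
from collections import Counter


def assign_reads(read_scores):
    """Assign each read to the species with the highest mapping quality.

    read_scores is a hypothetical layout: {read_id: {species: mapping_quality}}.
    Reads with no alignment at all are skipped.
    """
    counts = Counter()
    for scores in read_scores.values():
        if scores:
            best = max(scores, key=scores.get)
            counts[best] += 1
    return counts


def species_proportions(read_scores):
    """Proportion of assigned reads per species."""
    counts = assign_reads(read_scores)
    total = sum(counts.values())
    return {species: n / total for species, n in counts.items()}


# Toy data (invented): reads from a genome whose own species is absent from
# the database, so every read competes only between two related species.
toy_reads = {
    "read1": {"K. pneumoniae": 40, "K. variicola": 20},
    "read2": {"K. pneumoniae": 35, "K. variicola": 10},
    "read3": {"K. pneumoniae": 50, "K. variicola": 30},
    "read4": {"K. pneumoniae": 45, "K. variicola": 25},
    "read5": {"K. pneumoniae": 15, "K. variicola": 30},
}
proportions = species_proportions(toy_reads)
```

In this toy case one read in five is assigned to K. variicola simply because the true species is not available as a competitor; adding it to the database would give those reads a better target.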

The assignment of STs to the genomes based on in silico MLST using MinION data is only partially successful: the authors can probably assign strains to clonal complexes, but sequence quality is (still?) too low to reliably assign an ST. I believe the authors should make this point more clearly (see also my remarks on K. pneumoniae ATCC700603 above). The use of gene absence/presence for typing purposes is somewhat problematic as genes can be gained and lost quite easily (e.g. by gain or loss of a plasmid) and this would hide the close evolutionary links between closely related isolates. I can see why it may be interesting to perform this analysis in this context (i.e. ‘do we detect the strain which we know we are sequencing?’) but I am highly sceptical whether this is useful for practical typing purposes. The discussion on the different pan-genome sizes of K. pneumoniae, S. aureus and E. coli is naïve. As the authors rightly remark, these values are importantly skewed by the number of genome sequences that have been sequenced. This is illustrated by the recent analysis of 32 S. aureus genomes by Hennig et al. (doi:10.1186/1471-2105-16-S11-S3) resulting in a pan-genome size of 8647 with 1846 core genes (=21%). Based on these points, I would urge the authors to remove this section from their manuscript or, alternatively, to discuss the limitations of this approach.

We now mention the MLST typing results in the abstract, with particular emphasis on the high coverage needed to make confident assignments.

The gene presence/absence typing approach is designed to provide preliminary strain information extremely rapidly, using both 1D and 2D reads. It is primarily designed for the situation in which an exemplar strain has already been sequenced. We argue that this does have applicability, for example in an outbreak situation where it is very useful to know if a known strain is present in a new sample. The bootstrap approach to estimate the confidence intervals does confer some robustness against loss or gain of genes, because it sub-samples the genes observed (with replacement). In order to get a confident assignment, 95% of these bootstrap replicates have to come up with the same strain assignment. If the strain assignment was heavily influenced by a small number of genes acquired via LGT, then the confidence intervals would be very wide.
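The bootstrap idea described above can be sketched as follows. This is a simplified illustration, not the implementation in the manuscript: the likelihood model, the probabilities `p_present` and `p_absent`, the function names and the toy strain profiles are all hypothetical.

```python
import math
import random
from collections import Counter


def best_strain(genes, profiles, p_present=0.9, p_absent=0.1):
    """Return the strain with the highest log-likelihood for the given genes.

    Assumed toy model: an observed gene has probability p_present if the
    strain carries it and p_absent otherwise. Ties go to the first strain
    in alphabetical order.
    """
    def loglik(profile):
        return sum(math.log(p_present if g in profile else p_absent)
                   for g in genes)
    return max(sorted(profiles), key=lambda s: loglik(profiles[s]))


def bootstrap_support(observed_genes, profiles, n_replicates=1000, seed=0):
    """Resample the observed genes with replacement and count strain calls.

    Returns the most frequently called strain and the fraction of
    replicates that support it.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_replicates):
        sample = rng.choices(observed_genes, k=len(observed_genes))
        votes[best_strain(sample, profiles)] += 1
    strain, n = votes.most_common(1)[0]
    return strain, n / n_replicates


# Toy profiles (invented): two strains sharing half of their genes.
profiles = {
    "strainA": {f"g{i}" for i in range(20)},
    "strainB": {f"g{i}" for i in range(10)} | {f"h{i}" for i in range(10)},
}
observed = [f"g{i}" for i in range(20)]
strain, support = bootstrap_support(observed, profiles)
confident = support >= 0.95  # threshold taken from the response above
```

Under this sketch an assignment is reported as confident only when at least 95% of replicates agree. If the call hinged on a handful of genes, replicates that happen to omit those genes would vote for other strains and the support would fall below the threshold.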

It's worth noting that our pipeline did identify the correct strain in each of the 5 samples we tested (including the mixture sample), unlike 'What's in my pot', which uses a k-mer-based approach and did not identify the correct strain for any of the samples we sequenced. We believe that this is because k-mer-based approaches require substantially more high-quality (i.e. 2D) reads in order to infer down to the strain level.

We agree that the discussion on the pan-genome may have distracted the focus of the paper, and have removed this from the manuscript.

We have included the following in the discussion in order to acknowledge potential shortcomings of the presence/absence typing:
Our strain typing module has the advantage of being able to rapidly type a known strain with a small number of low quality (i.e. mostly 1D) reads. Competing approaches which use kmers, such as that implemented in 'WIMP' appear to require substantially more high quality data. The drawback of our approach is that if a large number of genes are lost or gained in a single event, such as the gain or loss of a plasmid, the maximum likelihood strain may be incorrect, although the bootstrap-derived confidence intervals will be wide in this case.

Minor issues:

p. 2. What is meant by ‘At the high level’?

This refers to the description of the framework at a high level.

p. 2. ‘With the emulation, we was able to stream the sequencing data with a hypothetical throughput of 120 times higher what we obtained.’ Change ‘we was’ to ‘we were’. This line is also somewhat confusing in this context, as the reader may wonder why this high ‘hypothetical throughput’ was not reached in one of the runs and/or why a new run was not performed to test whether this hypothetical throughput could be reached. These points are better explained on p. 10 and could be summarized by stating that the pipeline is scalable and could be adapted to much higher data throughputs (i.e. for those that are expected from the PromethION platform).

We have fixed the grammatical error as suggested.

In this experiment, we wanted to test the scalability of our pipeline to a hypothetical throughput that is higher than the capacity of the MinION. We have made this clearer on page 2 and page 10, as suggested by the reviewer.

p. 2. ‘where bioinformatics analysis methods were established’ should read ‘where bioinformatics analysis methods are well-established’ or something along those lines.

Corrected as suggested.

p. 4. Please correct K. variicolla to K. variicola.

Corrected as suggested.

p. 7. I believe the concept of a ‘probabilistic Finite State Machine’ needs more introduction for the non-expert audience at this point in the manuscript.

We have added a new sentence describing the probabilistic Finite State Machine at a high level for a non-expert audience, as suggested.

p. 13. resFinder should be ResFinder.

Corrected as suggested.

p. 15. ‘flank sequences’ should probably read ‘sequences flanking the antibiotic resistance genes’

Corrected as suggested.