Review for "Real-time strain typing and analysis of antibiotic resistance potential using Nanopore MinION sequencing"

Completed on 25 Jul 2015


Comments to author

Author response.

The content of the paper is an interesting proof-of-principle, however as written it is unclear where the novelty actually lies (i.e. which benefits are derived from the MinION device which is external to the paper; and which are actually related to the authors' contribution with respect to sequence analysis methods). The abstract is misleading as it focuses on clinical applications, whereas the only testing conducted was on type strains (of a single species, which should be mentioned).

We have changed the abstract as suggested by the reviewer to de-emphasise the clinical application. We have also rewritten the paper to focus on the methodological aspects of the paper, which is where the novelty of this work lies.

Specific comments:


(1) Should explicitly state that the input DNA was extracted from an isolated bacterial culture. As written this is unclear and could be read as diagnosing DNA extracted directly from a patient specimen.

This is important as species identification (and some claim certain resistances also) can be obtained within minutes from an isolate using mass spec machines which are in all labs, and phenotypic antibiotic susceptibility profiles can be obtained from an isolate within ~12 hours using Vitek machines which are also common in diagnostic labs.

We now explicitly state that input DNA was extracted from isolated bacteria.

(2) Should state the scope of the evidence presented in the paper - i.e. 3 culture collection isolates of K. pneumoniae with reference genomes available for comparison. As written one might assume this was tested on multiple species, and the use of the word 'clinical' three times in the abstract implies that the method was tested on clinical strains, which is not the case.

We have made the changes as suggested by the reviewer. We have removed the term ‘clinical’ from the abstract. We have also expanded the scope of the paper to include a new dataset which is a mixture of E. coli and S. aureus.


The intro names the hurdles to widespread adoption of HTS for identifying infectious agents and guiding patient treatment as (a) lack of portability, (b) high cost of sequencing devices and (c) difficulty obtaining actionable data within a few hours.

This is not an accurate characterisation… (a) current gold standards for species & strain identification and antimicrobial susceptibility profiling are not portable either, they rely on dedicated diagnostic labs usually onsite within the hospital; (b) per-sample costs are generally seen as a greater barrier than establishment costs associated with purchasing the devices; e.g. mass spec technology was widely adopted by hospital labs in the last few years because despite a high up-front cost, the cost per sample is a few cents.

Clearly portability and device cost are promising advantages of the MinION, but it is not correct to characterise these as the major barriers to the use of HTS in routine diagnostics.

We have re-written the background section and removed the inaccuracy.

The focus on reporting time-of-sequencing is a bit problematic here. The speed of data generation is determined by the Nanopore device, not just the analysis method. Nanopore claims their next device release (in a matter of months) will generate data much faster. The rapidity claims in this paper need to explicitly address (a) how much data is needed to get the result, (b) how much analysis time is required to analyse that data to get the result, and (c) how much time it took to generate the data + analyse it in the current test. (c) is only of minor interest at this stage, since it is specific to the model of device that the authors used to generate their limited test data (3 genomes), however this is all that is reported. (a) and (b) are critical to understand how the analysis methods perform in terms of speed & efficiency, which is really the important information for this paper as it allows readers to understand how the method might be expected to perform as the device technology improves.

We now show sequence data yield as well as proportions in Figures 4 and 5 and now report the amount of data generated at all of the critical timepoints as well as the time taken, for example:

"In all three Klebsiella pneumoniae samples, we successfully detected Klebsiella pneumoniae as the major species present in the isolate. This was achieved with as little as 120 sequence reads requiring only 5 minutes of sequencing time (Figures~4a, b and c). For Klebsiella pneumoniae strains ATCC BAA-2146 and ATCC 13883, it required less than 500 reads (10 and 15 minutes of sequencing, respectively) to reach a 95% confidence interval of less than 0.05."

We also report the number of reads required to identify the presence of all antibiotic resistance genes. We also now make it clearer that our pipeline can still run in real-time on a 16-core desktop computer even if the throughput were 120x higher than what we obtained.

In general, the paper lacks a mature discussion of the benefits and challenges offered by real-time (but low per-base quality) data streaming from the Nanopore sequencer, which is the whole impetus for the study, and a justification of why these particular approaches were taken over alternative ways of handling such data.

We have included the following in the discussion to address this point:

"In recent years HTS has become an integrative tool for infectious disease research~\cite{DunneWF2012, FrickeR2014}. There have been several reports emphasizing the use of HTS methods to characterize clinical isolates, to study the spread of drug resistant microorganisms and to investigate outbreak of infections~\cite{HudsonBM2014, PettyZC2014, StoesserGB2014}.
These studies predominantly use massively parallel short-read sequencing technologies such as the Illumina MiSeq, NextSeq or HiSeq. These sequencers achieve a very high base calling accuracy which makes them ideally suited to applications which require accurate calling of single nucleotide polymorphisms (SNPs), including reconstructing the evolutionary history of different bacterial isolates; tracking transmissions during an outbreak; placing a new isolate on a phylogenetic tree and population genetic analyses. However, these technologies sequence a single base per cycle for millions of sequence fragments in parallel, where each cycle takes at least 5 minutes.

The Oxford Nanopore MinION device, on the other hand, generated as many as 500 reads in the first 10 minutes of sequencing in our hands (which is 3 times lower than the theoretical maximum). The error rate of these reads was substantially higher than that of the corresponding MiSeq data. Existing bioinformatics algorithms - which were developed initially for highly accurate Sanger and subsequently short-read sequencing - rely on accurate base calling, making their application to MinION data challenging. As an example, most existing strain typing approaches use an MLST system, either on a pre-defined set of housekeeping genes~\cite{MaidenBF1998}, or on a core gene set~\cite{CodyMR2013}. These approaches are highly standardized, reproducible and portable, and hence are routinely used in laboratories around the world. Rapid genomic diagnosis tools using MLST from high-throughput sequencing, such as SRST2~\cite{InouyeDR2014}, have also been developed. While we showed that MLST can be adapted to identify bacterial strain type from nanopore sequencing, this requires high coverage sequencing of the gene set to overcome the high error rates. Similarly, other researchers have shown that error correction can overcome the high error rate provided enough coverage is obtained.

The main contribution of this manuscript is to demonstrate that despite the higher error rate, it is possible to return clinically actionable information, including species and strain types, from as few as 500 reads. We achieved this by developing novel approaches which are less sensitive to base-calling errors and which use whatever subset of genome-wide information is observed up to a point in time, rather than a panel of pre-defined markers or genes. For example, the strain typing presence/absence approach relies only on being able to identify homology to genes and also allows for a level of incorrect gene annotation."


Species typing

(1) How is the approach different from that of the existing software packages such as MetaPhlAn?

We now discuss differences with MetaPhlAn as follows:

"Our species typing module has some similarities to the approach used by MetaPhlAn~\cite{SegataWB2012}, in that it uses the proportion of reads which map to different taxonomic groupings to estimate the proportion of different species in a sample. MetaPhlAn optimises computational speed by aligning to a precomputed database of sequences which are pervasive within a single taxonomic grouping but not seen outside that grouping. This allows it to blast against a database which is 20 times smaller than a full bacterial genomic database. This was designed to make metagenomic inference feasible on datasets with millions of reads. Our species typing approach, on the other hand, is designed to make an inference using only hundreds of reads, and moreover, continuously updates confidence intervals so the user knows when they can stop sequencing and make a diagnosis."
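The streaming stopping rule described here (sequence until the confidence interval is tight enough to call) can be sketched as follows. This is a hypothetical illustration, not the authors' pipeline: read-to-species assignment is mocked with random draws, and a simple normal-approximation interval stands in for the multinomial confidence intervals the paper computes.

```python
import random
from statistics import NormalDist

def stream_species_typing(read_assignments, max_width=0.05, min_reads=100):
    """Update species counts read-by-read; stop once the 95% CI width
    for the leading species drops below max_width."""
    z = NormalDist().inv_cdf(0.975)  # two-sided 95%
    counts = {}
    top, p, n = None, 0.0, 0
    for n, species in enumerate(read_assignments, start=1):
        counts[species] = counts.get(species, 0) + 1
        top, c = max(counts.items(), key=lambda kv: kv[1])
        p = c / n
        width = 2 * z * (p * (1 - p) / n) ** 0.5  # normal-approximation CI
        if n >= min_reads and width < max_width:
            break
    return top, p, n

random.seed(1)
# mock read stream from a 75%/25% E. coli / S. aureus mixture
reads = random.choices(["E. coli", "S. aureus"], weights=[3, 1], k=5000)
species, prop, n_used = stream_species_typing(reads)
print(species, round(prop, 2), n_used)
```

With a 75/25 mixture the interval width shrinks roughly as 1/sqrt(n), so the call is made after on the order of a thousand mocked reads rather than the full stream.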

(2) "Species typing" was tested with only one species, which is from the type culture collection, and the authors knew what the species was beforehand. This is not actually a test of species typing; it is just confirmation that, for one particular species, this approach retrieves the correct species. This is fine but the way this is discussed in the abstract and results implies more testing than this.

We now include other species in the analysis. To make the problem more challenging, we use a mixture sample of E. coli and S. aureus, and show that the pipeline correctly identified the proportions of the two species in the mixture sample (Figure 4d). We have modified the discussion to make the fact that we knew the species beforehand more salient.

Strain typing

Re K. pneumoniae ATCC 700603. There is some confusion here with K. pneumoniae species designations which should be cleared up before publication as it will confuse readers and create misinterpretations.

From Supplementary Figure 1, it is clear that this strain is a different species of Klebsiella that shares a common ancestor with K. pneumoniae and K. variicola. It has been recognised for ~15 years that strains commonly identified as K. pneumoniae actually include 3 distinct species. These were first dubbed KpI, KpII, KpIII; however these now have species names K. pneumoniae, K. quasipneumoniae, and K. variicola, respectively. See PMID: 26100894 for whole-genome phylogenetic data that clears this up and explains some of the details. The sequenced strain ATCC 700603 is listed in the K. pneumoniae BIGSdb (whole genome MLST) database as a KpII, which indicates it belongs to the K. quasipneumoniae species (defined in PMID: 24958762). Unfortunately there are lots of isolates and genomes in various collections that are labelled as K. pneumoniae that are actually K. quasipneumoniae or K. variicola.

In regard to the current manuscript:

(a) it is not correct to say that "ATCC 700603 is an ancestor of K. pneumoniae and K. variicola"; it is actually a different species of Klebsiella that shares a common ancestor with these other two species. I suggest the authors state this and cite the BIGSdb identification of this as KpII (and thus as a different species K. quasipneumoniae). This would negate the need for Supplementary Note 2.

(b) If the genome of the strain that is being sequenced (ATCC 700603) is the only isolate of its species in the database used for identification, this is not a meaningful test. It would be more appropriate (and more useful for future users) to use a database that contains multiple K. pneumoniae, K. quasipneumoniae, and K. variicola genomes; named as such; and show that the analysis approach is able to correctly differentiate these species and the strain. According to the Methods, suitable reference genomes were already included in their strain typing database from ERP000165 (from PMID: 26100894, which includes several K. quasipneumoniae strains identified as such in the paper, although perhaps not in the ENA).

The reviewer is correct that although these strains were present in a manuscript we cited, they were not available through the ENA. The section is now written as follows to address the points raised by the reviewer.

"Interestingly, we found the analysis of the K. pneumoniae ATCC 700603 sample reported a mixture of about 80% K. pneumoniae and 20% K. variicola. These proportions did not change after sequencing 500 reads (25 minutes), suggesting a stable prediction of species proportions in the sample. Application of our strain typing algorithm (see below) identified the strain type of this sample as ST-489, which was confirmed from the assembly of the MiSeq sequence data for this sample. ST-489 has been reported to have been mis-classified as K. pneumoniae rather than the recently proposed new species K. quasipneumoniae~\cite{BrissePG2014, HoltWZ2015}. Despite this species being missing from our original database, our pipeline reported the sample to be a mixture of the two closest species (K. pneumoniae and K. variicola), highlighting its ability to flag species not previously known. Finally, we selected the assemblies of two K. quasipneumoniae strains, K268An (ST-334) and DR85/08 (ST-734), from Holt et al (2015)~\cite{HoltWZ2015} and added them to our bacterial genome database. We did not include strain ST-489 in this database. The species detection pipeline correctly identified strain ATCC 700603 as K. quasipneumoniae using only 300 reads (Figure~4e)."


The authors show good results using the 7-locus MLST scheme for K. pneumoniae, with 2 uncertain allele calls in cases where the alleles were very similar and sequence data on which to base the calls were limited. They then propose to do more accurate strain-typing using all of the sequenced reads, rather than just those covering the 7 house-keeping genes, and so created an approach based on presence or absence of genes.

This is an interesting and valid approach to take, which is actually supported by recent comparisons of gene content and SNP variation matrices in K. pneumoniae (see PMID: 26100894). However a lot of clinical typing approaches are now using whole-genome MLST (wgMLST) or core genome MLST (cgMLST), including for K. pneumoniae (see BIGSdb and PMID: 25341126). The advantage of this approach over trees, gene content or other similarity-based methods is that, as with 7-gene MLST, it creates a language with which strains can be identified and referred to, which facilitates inter-lab reporting and comparison, recognised as very important in public health and diagnostic settings (see PMID: 23979428). It would greatly improve the paper if the authors could demonstrate the identification of K. pneumoniae clones using this approach as well as simple 7-gene MLST.

This would be a very interesting approach to take, but we feel it is out of scope of the current work.

Strain typing using gene content

The relevance and power of this approach is going to be entirely determined by the population structure of the organism under analysis, and the diversity represented in the database used.

(a) The results as presented are rather uninformative as they do not show the underlying gene content variation in the database used, or give any indication of how close a match is obtained. If the genomes sequenced in the test also appear in the database then naturally one would expect to get high confidence identification of that strain. However this is a pretty unreal and unfair test. How does performance change if you remove the sequenced strain from the database? Does the method actually identify the strain that is closest in terms of gene content and SNPs? That is the performance one would expect from a 'strain typing' method. The K. pneumoniae genome is highly plastic and in real K. pneumoniae populations, new genes are acquired frequently, including via large plasmids that can bring in hundreds of new genes at once. How will this impact accuracy?

In order to investigate this, we removed our three K. pneumoniae strains (ST11, ST489 and ST3) from the database and re-ran the strain typing pipeline. The system reported strain types ST258, 1kgm (a novel ST), and ST380 for our three strains, respectively. We cross-checked against a K. pneumoniae phylogenetic tree from PMID: 26100894, and found that ST258 was indeed the closest to ST11. We were unable to confirm the relatedness of strain 1kgm and ST380 to our ST489 and ST3 strains as they were not in the tree. We will investigate this further and report in another manuscript.

(b) K. pneumoniae is fairly unusual amongst bacterial pathogens in having quite extreme gene content variation. How relevant is this approach to strain typing going to be for other organisms (e.g. S. aureus) that have much less gene content variation?

We now include further tests to show that our method is also able to strain type both E. coli and S. aureus from data collected on a mixture of the two species. This is written up in the paper as follows:

"We streamed sequence reads from the mixture sample through the strain typing systems for E. coli and S. aureus, and in both cases, the correct strain types of the two species in the sample were recovered. The correct type for the E. coli strain in the 75%/25% E. coli / S. aureus mixture was recovered after 25 minutes of sequencing with about 1,000 total reads (or approximately 750 E. coli derived reads) (Figure 5d). The pipeline was able to correctly predict the S. aureus strain (which is known to have much less gene content variation) in this mixture sample after two hours of sequencing with about 2,800 total reads (or approximately 700 S. aureus derived reads)."

Antibiotic resistance genes.

It is not very clear what was done here, why this approach was chosen and how it actually performed. The authors should report the actual genes that were detected in the reference genomes, and which of those were detected in the MinION analysis. Summarising by class of antibiotics makes it very difficult to understand what is happening at the gene level, and where errors might be arising.

We now include information on which actual gene was reported in the results table (Table 4).

Why is the NDM-1 gene last to be detected? Is this on a plasmid? Does the library prep reduce representation of plasmid DNA somehow? Is this just chance? Is GC content relevant here? I notice that the chromosomally encoded SHV gene is always detected in the first hour… is this because the chromosomal DNA is better represented in the library sequenced? Or is it because there are actually two copies of SHV, one on the chromosome and a second on a plasmid, that is providing enhanced read representation of this gene?

These are great observations. There are multiple reasons which could affect the order in which antibiotic resistance genes are detected. Apart from sampling randomness, which plays a large part since only a couple of 2D reads, or ~5 1D reads, are required before presence is assigned, copy number may be important. We do not think chromosomal DNA is better represented in the library, nor do we think that GC bias is likely to be of much importance here (there may be subtle GC biases, but none strong enough to affect the representation of different genes in the sequence data). Full investigation of these issues is beyond the scope of the current paper.


Species typing

More detail should be provided here. Reading between the lines, I am guessing that this process populates a set of counts of reads mapped to each of the possible 1487 species, where a hit to any genome of species i is added to the value Ni, which is a running total of reads assigned to i. The 2 species with the highest Ni values are plotted in the graph, with a CI calculated using Ni and the total read count N. Can more than 2 species be plotted in real time? The authors need to be explicit about how and why they are doing these things. What happens if the top 2 species only account for a small overall proportion of reads? E.g. can results be summarised by genus?

More than two species can be plotted (e.g. Fig 4e). We do not currently support summarising by genus, but this would be good to add in future. We have added more detail in the methods as follows:

"Our species typing method considers the proportions $\{p_1, p_2, \ldots, p_k\}$ of $k$ species in the mixture as the parameters of a $k$-category multinomial distribution, and the read counts $\{c_1, c_2, \ldots, c_k\}$ for the species as an observation from $c_1 + c_2 + \ldots + c_k$ independent trials drawn from the distribution. It then uses the MultinomialCI package in R~\cite{SisonG1995} to calculate the 95% confidence intervals of these proportions from the observation."
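For readers without R, a rough Python analogue of this interval computation can be sketched with the standard library alone. Note this uses a Bonferroni-corrected normal approximation rather than the Sison-Glaz method implemented by MultinomialCI, so the exact bounds differ slightly, but the qualitative behaviour (intervals tightening as read counts grow) is the same.

```python
from statistics import NormalDist

def multinomial_ci(counts, alpha=0.05):
    """Approximate simultaneous (1-alpha) CIs for multinomial proportions.

    Bonferroni-corrected normal approximation, not Sison-Glaz: alpha is
    split across the k categories and each proportion gets a z-interval.
    """
    n = sum(counts)
    k = len(counts)
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    cis = []
    for c in counts:
        p = c / n
        half = z * (p * (1 - p) / n) ** 0.5
        cis.append((max(0.0, p - half), min(1.0, p + half)))
    return cis

# e.g. 400 reads assigned to K. pneumoniae, 90 to K. variicola, 10 elsewhere
for name, (lo, hi) in zip(["K. pneumoniae", "K. variicola", "other"],
                          multinomial_ci([400, 90, 10])):
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

Multiplying the counts tenfold narrows each interval by roughly a factor of sqrt(10), which is the behaviour the convergence plots in the paper rely on.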

Strain typing

It is difficult to follow the formulae as the first two equations are written in normal text and the last as a proper formula. Please use consistent mathematical notation for formulae.

This has been addressed in the Strain Typing subsection of the Methods. The manuscript was rewritten in latex to make this easier.

In general, the problem with the likelihood function being calculated is that its behaviour will be highly dependent on the strain collection used to build the database. The formula probably happens to work well enough in the test case, because (a) the data used to build the database happens to include a wide spectrum of diverse lineages across the whole species of K. pneumoniae (see PMID: 26100894, from which the majority of genomes were sourced), which is not available for most pathogens, and (b) gene content variation in this collection closely mirrors lineage divergence (shown in PMID: 26100894). However for other bacterial pathogens which (a) have a very different population structure and levels of gene content variation, and (b) have much more biased sampling of species-wide diversity in whole genome collections, this approach is unlikely to work well. These issues should at least be discussed in the current paper, as at the moment readers are left with very little information on which to determine how applicable this approach really is to other organisms.

We discuss the effect of gene content bias in the paper as follows:

"The degree of between-strain gene content variation is quite variable across different bacterial species, and this will impact the time taken for our confidence intervals to converge. For example, only ~6% of the 29,886 genes in the K. pneumoniae pan-genome (N = 328 genomes) are core genes\cite{HoltWZ2015}, whereas 45% of the S. aureus pan-genome (N = 10) and 20% of the E. coli pan-genome (N = 22) are core genes\cite{hall2010pan}, although it is important to note that the percentage of core genes is a function of the number and diversity of strains sequenced."

The paper would benefit greatly from an increased focus on the methodology, which is after all the unique work being presented (not the MinION device). The authors need to provide justification for the choice of methods, and discussion of how these methods will perform as databases grow and change.

For example, after some thought I can see that:

Formula 1: the value here will be highly dependent on the strain collection used to build the database… consider how the ability of each gene to uniquely identify a strain will change as the database increases. For a core gene, this value will increase, as the numerator increases at a rate proportionally bigger than that of the denominator. For a rare gene, this value will decrease, as the numerator will not change as more genomes are added but the denominator will increase substantially.

I *think* this is desirable behaviour because this value is actually used in Formula 2 to represent the probability of the gene being observed because other strains besides Sk are present; however as written this is not very clear.

We have rewritten the section on the strain typing methodology, and now include more intuition into how the model will work in different scenarios. The section now includes the following. Note that the size of the database is not directly relevant; rather, the fraction f of strains in the database which contain a specific gene is important. (Note: c = 0.2 is the mixture proportion of the background model.)

"To gain some insight into how this model works in response to gene presence, consider a gene $g$ which is present in a fraction $f$ of strains, including $St_j$ but not including $St_k$.
For simplicity assume that each strain has $N$ genes.
The difference in log-likelihood between $St_j$ and $St_k$ conditional on $g$ can be approximated by $\log(1/c) + \log(1/f)$, showing that a more specific gene has a stronger effect in our model than a common gene in distinguishing strains.

To gain insight into the effect of gene absence in contrast to gene presence, assume instead that the only difference between $St_j$ and $St_k$ is that a single gene ($g$) is deleted in $St_j$, and denote $N = N_j = N_k-1$. If we sequence $N\ln(2)$ genes from $St_j$ without seeing gene $g$, the difference in log-likelihood becomes $N\ln(2)(\log(N)-\log(N-1))\approx 1 \text{ bit}$, corresponding to the likelihood for $St_j$ being twice as big as the likelihood of $St_k$. For example, if a strain has 1000 genes, then we would need to observe $693$ genes without observing $g$ to be able to conclude that the observed data were twice as likely to be generated from the strain with a single gene deletion. For comparison, we would need to sequence only $100$ genes from $St_{k}$ to get an expected log-likelihood difference of 1 bit versus $St_j$, demonstrating the extra information in gene `presence' versus `absence' typing."
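The two approximations in this quoted passage can be checked numerically (a sketch; the logs are interpreted as base-2 so the differences come out in bits, with the illustrative values c = 0.2 from the background model and f = 0.01 for a gene present in 1% of strains):

```python
import math

# Gene presence: delta log-likelihood between St_j (has gene g) and
# St_k (lacks it), approximated as log2(1/c) + log2(1/f)
c, f = 0.2, 0.01
delta_presence = math.log2(1 / c) + math.log2(1 / f)
print(f"rare-gene presence: ~{delta_presence:.1f} bits")

# Gene absence: St_j has one fewer gene than St_k; after sequencing
# N*ln(2) genes from St_j without seeing g, the difference is ~1 bit
N = 1000
genes_seen = N * math.log(2)                      # ~693 genes, as in the text
delta_absence = genes_seen * (math.log2(N) - math.log2(N - 1))
print(f"absence after {genes_seen:.0f} genes: ~{delta_absence:.2f} bits")
```

This reproduces the asymmetry the authors describe: observing one rare gene is worth several bits immediately, while concluding a gene is absent requires sequencing on the order of the whole gene complement.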

Why is 80/20 used? What is the effect of changing this ratio?

80/20 (i.e. $c = 0.2$) is conservative and converges more slowly than c = 0.1. We have investigated using c = 0.1 and c = 0.05 and found that the choice has little effect (confidence intervals converge slightly faster with c = 0.05).

Strain typing & MLST typing

How do these scores behave? The authors discuss the fact that identical scores are obtained for multiple close variants, which is fine. But how should we interpret a small difference in scores? Should these scores be read as a rank? For example, how much variation in scores should be considered to reflect a meaningful allelic difference for MLST typing?

These scores are in bits, so a 1-bit difference corresponds to a 2x difference in likelihood.

Figure 1

Given the focus on clinical application in the abstract and intro, this figure should really make it clear that time 0 is not specimen collection, but isolation of an organism in pure culture. This can take ~24 hours and is the key bottleneck in most current diagnostic tests. Some comparison to current standards (e.g. mass spec species identification, Vitek antibiogram, Illumina MiSeq sequencing) should be shown or at the very least feature in the Discussion.

We now make this clear both in the figure legend, and also include the following in the discussion:

"We have shown that switching from a traditional short-read sequencing pipeline coupled with standard, non-streaming bioinformatics algorithms, to a nanopore sequencing pipeline coupled with streaming bioinformatics algorithms can dramatically cut the time taken from DNA library to results from at least 8 hours down to 30 minutes. With the time for library preparation for nanopore sequencing forecast to be shortened to 10 minutes, the major time bottleneck then becomes the bacterial culture step (which can be 24 hours). The MinION sequencer can be used on clinical samples without culture; however, this dilutes the proportion of bacterial DNA present. Nevertheless, this may become a viable time-sensitive strategy as sequencing yield increases, particularly with high colony-forming-unit (CFU) infections. Another promising option may be to pre-concentrate bacterial DNA~\cite{hasman2013rapid}."


Does the work include all necessary controls?

If not, please comment on the additional controls that are required.


Are the conclusions drawn adequately supported by the data shown?

If not, please explain.

No: The paper lacks clarity concerning what the aims and conclusions are. The abstract and introduction focus on clinical applications of sequencing for infection diagnosis, however the emphasis is on the benefits of the MinION sequencing platform, which is external to the paper.

The novel work presented in the paper is actually the methodological approaches taken to perform sequence analysis of MinION reads; however (a) there is no real justification of the methods chosen (nor comparison or even discussion of alternative methods) and (b) the validation of the methods is superficial (the analysis presented is of 3 type culture collection strains, all of the same species, and no clinical strains). Essentially, this is a proof-of-principle paper showing that potentially clinically useful information can be gained from sequencing DNA of bacterial isolates using the MinION device and fairly basic sequence analysis methods. The conclusions should reflect this.

Currently the only conclusion offered is in the abstract, which states: “Here we demonstrate that Oxford Nanopore sequencing device MinION can identify bacterial species and strain information within 30 minutes of sequencing time, initial drug-resistance profiles within two hours, and complete resistance profiles within 12 hours.” While this statement is not untrue, it seems to imply a lot more validation work than is actually contained in the paper; it would be more appropriate to state the evidence (e.g. “Here we demonstrate using 3 previously sequenced K. pneumoniae strains from the ATCC type collection that Oxford Nanopore…”)

In order to make the aims and conclusions of the study clearer we have rewritten the title and abstract as follows:

"Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION™ sequencing"

"Several systems incorporating real-time analysis of MinION data have been developed recently, such as the cloud based platform Metrichor (Oxford Nanopore), work by Quick et al~\cite{QuickAC2015} and MetaPORE~\cite{GreningerNF2015}, focusing on placing the sample on a phylogenetic tree but without providing an estimate of the confidence in this assignment."

We also include discussion of alternative algorithms, including MetaPhlAn and approaches from Quick et al, and have tried to make clear that these algorithms are not streaming algorithms and do not update the uncertainty in their inference in real-time. We have also added an extra mixture sample of E. coli and S. aureus.

Are sufficient details provided to allow replication and comparison with related analyses that may have been performed?


Does the manuscript adhere to the field standards for experimentation, nomenclature and public availability of data (or any other significant standards)? Is the software freely available, open source and with an appropriate free-to-use license?


Does the method perform better than existing methods (as demonstrated by direct comparison with available methods)?

Reviewer #1: No: There is no comparison with other methods (e.g. Illumina sequencing -> analysis workflow vs MinION sequencing -> analysis workflow; or alternative methods for analysing MinION data). There is no justification/discussion of the particular approaches taken in developing the novel sequence analysis methods. A key advantage of MinION sequencing, which these authors are aiming to take advantage of, is the streaming data; however this is not discussed or explored at all.

We now describe in more detail the Illumina short-read sequencing workflow we applied to the same samples, and compare results from both pipelines. We have justified the approaches we have taken in much more detail. We also discuss other sequence analysis methods and focus on the streaming nature of the algorithms and framework we present.

Is the method likely to be of broad utility? Is any software component easy to install and use?

Please indicate briefly the novel features and/or advantages of the method, and/or please reference the relevant publications and which methods, if any, it should be compared with.

This is really a proof-of-principle rather than a finished method of broad utility.

This reviewer was able to install the 'japsa' java program from the github repository provided under 'Data Availability', but there were no usage instructions.

There is a Supplementary Note 1, which replicates the information given under Data Availability and additionally provides a link to a site where directories can be downloaded for each type of analysis (species typing, strain typing, MLST etc), these contain the reference databases used and bash script containing the commands used to generate the results (this is not linked to from the code repository linked to in the main text).

This is rather confusing and it is unclear how one would use these scripts and the japsa program to analyse other data sets.

We now include brief documentation on how to use our software pipeline to analyse other datasets in the GitHub repository. We also updated the documentation for each of the programs provided in the Japsa package. Our framework is ready for use by other researchers and we expect it will be of broad utility.

Is the paper of broad interest to others in the field, or of outstanding interest to a broad audience of biologists?

If yes, please explain why.