Judith Risse, Marian Thomson, Garry Blakely, Georgios Koutsovoulos, Mark Blaxter, Mick Watson
Review posted on 11th September 2015
This manuscript describes an easy workflow for generating a complete bacterial genome assembly from a hybrid dataset of Illumina MiSeq and Oxford Nanopore MinION reads. It is the first such publication and is therefore of interest to researchers wanting to perform such assemblies. The paper is a useful guide on how to run hybrid assemblies and how to finish the genomes with available tools. The methods used can be considered standard for such data and are available to all researchers. The manuscript is well written.
There is no ground truth for the genome sequence of this species, so non-reference-based methods are needed to validate the assembly. The authors validate their single-contig assembly by mapping the reads back to the assembly, and report some basic mapping metrics. The assembly is compared to the sequence of a closely related species with a reference genome in GenBank, and regions of difference are reported to have been manually inspected. However, we feel that a more thorough documentation of the validity of the genome assembly is in order. Suggestions include showing the alignment to the NC_003228 genome (e.g. using Mauve or mummerplot) and showing alignments of the reads at the regions of difference. This would demonstrate why the hybrid assembly is so useful, and it would give a little more information on the specifics of BE1. Could a round of base polishing (using a tool such as Pilon) be used to validate the per-base accuracy, on the argument that such a program should change few bases if the assembly is of high quality?
One of our concerns with the manuscript as it stands is the lack of sufficient detail to allow the results to be fully reproduced and the pipeline to be reused by other researchers. Version numbers are mentioned for a few of the programs used, but not for all.
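To illustrate the level of detail we have in mind, the following is a minimal Python sketch of a per-step provenance record (tool, version, exact command). The helper name provenance_record is hypothetical and not part of any tool used in the manuscript; the example values are those of the SPAdes run quoted later in this report.

```python
import shlex


def provenance_record(tool, version, argv):
    """Return one tab-separated line: tool name, version, exact command."""
    # shlex.join quotes arguments so the line can be pasted back into a shell
    return "\t".join([tool, version, shlex.join(argv)])


# Example values taken from the SPAdes run quoted later in this report
spades_argv = [
    "spades.py", "-o", "spades_fragilis_raw_ilmn_2D", "-t", "16",
    "-1", "ERR973713_1.fastq.gz", "-2", "ERR973713_2.fastq.gz",
    "--nanopore", "FAA37759_GB2974_MAP005_20150423_2D.fastq",
]
record = provenance_record("SPAdes", "3.5.0", spades_argv)
print(record)
```

One such line per pipeline step, provided as a supplementary file, would probably be enough for others to rerun the analysis without guessing parameters.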
Exact commands are also missing for many programs, for example Trimmomatic, poRe, SPAdes, SSPACE-LongRead, GapFiller, Prokka, bwa, samtools and LAST. One of us nonetheless attempted to reproduce the main results, using educated guesses for the different parameters, and was able to perform many steps, but not always with the same outcome. During the process, the following was noted:
There are seven md5 checksum files in the MinION dataset that do not have a corresponding fast5 file.
The reviewer struggled to get the poRe dependencies installed due to factors beyond his control, and decided to use poretools (poretools.readthedocs.org) instead to extract the 2D reads in fastq format (command: poretools fastq --type 2D FAA37759_GB2974_MAP005_20150423__2D_basecalling_v1.14_2D/ >FAA37759_GB2974_MAP005_20150423_2D.fastq).
Details on the Trimmomatic command were missing, and a best guess based on information from the manuscript ("Sequencing adapters were removed, as were bases less than Q20. Any reads less than 126bp in length after trimming were discarded.") did not result in an identical number of trimmed reads (656976 instead of 898420 reads). The command used was: java -jar trimmomatic-0.33.jar PE -threads 24 -phred33 ERR973713_1.fastq.gz ERR973713_2.fastq.gz ERR973713_forward_paired.fq.gz ERR973713_forward_unpaired.fq.gz ERR973713_reverse_paired.fq.gz ERR973713_reverse_unpaired.fq.gz ILLUMINACLIP:/path/to/Trimmomatic-0.33/adapters/NexteraPE-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:126
Based on the last author's blog post (https://biomickwatson.wordpress.com/2015/08/23/assembling-b-fragilis-from-minionand-illumina-data/), an assembly was performed using untrimmed reads, with the following SPAdes command (version 3.5.0): spades.py -o spades_fragilis_raw_ilmn_2D -t 16 -1 ERR973713_1.fastq.gz -2 ERR973713_2.fastq.gz --nanopore FAA37759_GB2974_MAP005_20150423_2D.fastq. This resulted in an assembly whose five longest scaffolds had lengths of 3980468, 827231, 362398, 13363 and 5146 bp, respectively. With the exception of the second-largest scaffold (a 6 bp difference), these lengths are identical to those reported in the paper.
The manuscript mentions "removal of short and/or low-coverage contigs" but does not describe the cutoffs or how the coverage was obtained. No attempt was made to determine the per-scaffold coverage other than using the numbers reported in the scaffold IDs.
Scaffolding using SSPACE-LongRead resulted in a single scaffold of 5188980 bp after the first round. This is in contrast to the manuscript, where a second scaffolding round was needed to achieve that. Also, the scaffold obtained was 23 bp longer. The command run was: perl SSPACE-LongRead.pl -c top_five_scaffolds.fasta -p FAA37759_GB2974_MAP005_20150423_2D.fastq
Two gaps remained, both just over 300 bp, but neither gap filling nor annotation was attempted.
Reads were mapped using bwa 0.7.12, and comparable mapping statistics were obtained (using samtools flagstat).
No attempts were made to reproduce figures 2 and 3.
Finally, the authors may wish to compare their method with the one described in http://www.biomedcentral.com/1471-2164/16/327
The GigaScience website suggests we look at the following aspects, which we reproduce with our comments below each of them:
Is the rationale for collecting and analyzing the data well defined?
Yes, although the dataset cannot really be called "large-scale" within the context of its field.
Is it clear how data was collected and curated?
Yes.
Is it clear - and was a statement provided - on how data and analyses tools used in the study can be accessed?
Not completely, see above.
Are accession numbers given or links provided for data that, as a standard, should be submitted to a community approved public repository?
Yes.
Is the data available in the public domain under a Creative Commons license?
The ENA seems to have waived rights (https://www.ebi.ac.uk/ena/standards-and-policies), so tentatively yes.
Are the data sound and well controlled?
One can argue about controls for this type of data, but having the corresponding Illumina data from the same samples suffices in our opinion.
Is the interpretation (Analysis and Discussion) well balanced and supported by the data?
Yes.
Are the methods appropriate, well described, and include sufficient details and supporting information to allow others to evaluate and replicate the work?
No, see comments above.
What are the strengths and weaknesses of the methods?
Possible improvements are discussed above. The paper describes the strengths very well. The method does not have any inherent weaknesses.
Have the authors followed best-practices in reporting standards?
The authors have not employed any of the checklists or workflow management systems described under this point.
Can the writing, organization, tables and figures be improved?
Figures are discussed above. The writing and organisation suffice.
Minor and more detailed comments (using page numbering as at the bottom of the provided PDF):
p. 3 First sentence: "... a major cause soft tissue infections." Please add "of".
p. 3 "Illumina's higher-throughput sequencers produce up to 1.8 terabases of sequence per run" Please specify which instrument is meant (HiSeq X), as the NextSeq 500 could also be considered a higher-throughput sequencer.
p. 3 "Whilst PacBio assemblies are of higher quality, they come at approximately 3-4 times the cost". Cost compared to what approach?
p. 4 "By using a hairpin adapter, each molecule is read twice" Can this be made clearer, e.g. "By attaching a hairpin adapter to one end of the target molecules during library preparation, each molecule is read twice"?
p. 4 Which Vrije Universiteit, in which city? There is more than one Vrije Universiteit in the world.
p. 6 "...primed with sequencing buffer then 220ng of freshly prepared library diluted in sequencing…" Should there be a comma after 'buffer'?
p. 7 "Mapping statistics were calculated using count-errors.py, modified slightly to work with our read IDs." Please provide a copy of the final count-errors.py script.
p. 8 "The 2D alignment lengths were all approximately equal to the read length, albeit with a slight tendency for the alignment length to be greater than 2D sequence length" Please explain why; is this due to deletions in the MinION reads?
p. 9 "the assembly was created using free, open-source bioinformatics tools" Is SSPACE-LongRead really open-source software? It is free for academic use, but not open source, as far as I can tell.
Figures: it would help the reader if you could provide the code (probably written in R) that was used to generate figures 1-3.
Signed: Lex Nederbragt and Thomas Haverkamp, Centre for Ecological and Evolutionary Synthesis (CEES), Dept. of Biosciences, University of Oslo, Norway
Level of interest: An article whose findings are important to those with closely related research interests
Quality of written English: Acceptable
Declaration of competing interests: I declare that I have no competing interests.
I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
The reviewed version of the manuscript can be seen here:
All revised versions are also available:
Review posted on 15th August 2012
This paper introduces an update to the SOAPdenovo program, describes its major improvements and shows improved results on two datasets. We reviewed this paper with a group of four people from the same research group.
Major Compulsory Revisions
We have the following major issues with this paper:
The explanation of the improvements in SOAPdenovo2 lacks sufficient detail for them to be fully understood. Papers of this kind usually explain the approaches and algorithms used in much more detail. The authors should look at other papers describing new versions of existing software, such as the recent ALLPATHS_LG paper (Ribeiro et al, 2012, http://genome.cshlp.org/content/early/2012/07/24/gr.141515.112.abstract), or even the article describing the first version of SOAP (Li et al, 2009). The text needs to be improved so that the reader can understand what changes were implemented and exactly how they improved the program.
Even though we were given access to the underlying raw data, and obtained a pre-release version of SOAPdenovo2 from the authors, we could not replicate the results described in the paper due to a lack of detail in the section on 'Testing and Assessment': the exact commands used for the assemblies are not given.
The article is very biased towards assembly of human genomes. However, SOAPdenovo can be, and often is, used for the assembly of bacterial genomes. The authors use the Assemblathon1 data for their analyses of SOAPdenovo2. In the 'Background' section, the GAGE assembly competition is mentioned, which focusses on comparing programs for assembly of bacterial-sized genomes. However, SOAPdenovo2 was not evaluated against the GAGE data, something we feel is an omission.
One of us tested SOAPdenovo2 on the Rhodobacter sphaeroides dataset from GAGE, and ran the same analysis script as was used for the GAGE publication (http://gage.cbcb.umd.edu/results/index.html). We have included a summary of this analysis as a PDF attached to this report. From the results, we find the following:
SOAPdenovo2, like the first version of the program, still produces many errors in contigs and scaffolds (the 'corrected' N50 values are much lower than the N50 values of the sequences generated by SOAPdenovo2).
In our tests of the 'sparse assembly graph' approach, a better assembly was obtained by providing an estimated genome size larger than the real size. Do the authors have an explanation for this effect?
The 'sparse assembly graph' runs improved uncorrected scaffold sizes; however, they resulted in a larger number of scaffolds. Also, the corrected scaffold N50 values of these assemblies were in fact lower than those reported in the GAGE article for SOAPdenovo1.
We did see an improvement in the contigs from SOAPdenovo2 relative to the first version: fewer errors and higher corrected N50 values, but at the cost of higher contig numbers.
In conclusion, we do not see significant improvements using SOAPdenovo2 versus the first version of the program on the Rhodobacter dataset. We feel the authors should document the performance of SOAPdenovo2 on small genomes with an available reference genome, for example using the data that was the basis of the GAGE competition.
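For readers less familiar with the metrics used above, the following minimal Python sketch shows how N50 is computed and why breaking sequences at errors lowers it. It assumes, as we understand the GAGE evaluation, that a 'corrected' N50 is the N50 recomputed after splitting sequences at detected errors. The function names, lengths and breakpoints are invented for illustration and are not results from our Rhodobacter analysis.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")


def split_at(length, breakpoints):
    """Split one sequence length at the given error positions (hypothetical
    0-based offsets) and return the resulting piece lengths."""
    edges = [0] + sorted(breakpoints) + [length]
    return [end - start for start, end in zip(edges, edges[1:]) if end > start]


# Invented toy lengths, not results from the Rhodobacter analysis
scaffolds = [800, 500, 300, 200]
uncorrected = n50(scaffolds)
# Suppose the longest scaffold contains misjoins at offsets 250 and 600
corrected = n50(split_at(800, [250, 600]) + scaffolds[1:])
print(uncorrected, corrected)
```

In this toy case two misjoins in the longest scaffold reduce the N50 from 500 to 300 and raise the sequence count from four to six, which is the kind of trade-off between size, error and sequence number described in the findings above.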
We also tried SOAPdenovo2 on data from one of our own large eukaryotic genomes. The 'default' version of the program crashed; only when we used the sparse assembly graph version did we get the program running. This may have been because we were not able to compile the program on our system and could only use the provided binaries.
GigaScience's description of a technical note requires 'the code described be documented and tested to high standards.' We did not have access to the source code and therefore cannot judge whether the code is well documented. Also, given the few tests reported in the paper, we are uncertain whether the code can be considered 'tested to high standards' (see also above).
The paper makes many claims that are not supported by references or actual data. For example, it is written "Scaffold construction is another area that needs improvement in NGS de novo assembly programs." Can the authors point to some references to back up this claim? Similarly, when discussing the original SOAPdenovo program, the authors give three problematic areas as examples: improper handling of heterozygous contigs, chimeric scaffolds, and false contig relationships. However, no documentation of these problems is provided, such as real tests of assemblies of datasets with a reference genome in which these problems can be shown.
The authors tested new YH 2x100 Illumina data with SOAPdenovo2 but did not show comparable analyses of the same data with the original SOAPdenovo program. To fully demonstrate the improvements made in the upgrade to SOAPdenovo2, the authors should report on the analysis of these new YH data with both versions of the program.
The authors used analyses from the Assemblathon1 (published February 2011) in their comparison of SOAPdenovo2 with the ALLPATHS_LG program. However, new versions of ALLPATHS_LG have been released since February 2011. As such, we feel that the authors should test the most recent version of ALLPATHS_LG against SOAPdenovo2 (using the same data) to ensure a fair comparison between the two programs.
Minor Essential Revisions
There is no reference to Table 2 in the main text.
The DOI link for reference 11 (http://dx.doi.org/10.5524/100038) was not resolving at the time this manuscript was submitted for review.
Level of interest: An article of importance in its field
Quality of written English: Needs some language corrections before being published
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: I declare that I have no competing interests.
Names and affiliations of the reviewers of this report:
Lex Nederbragt, Ole Kristian Tørresen and Karin Lagesen:
Centre for Ecological and Evolutionary Synthesis (CEES), Dept. of Biology, University of Oslo, Oslo, Norway
Jeremy Chase Crawford (currently guest researcher at CEES): Dept. of Integrative Biology & Museum of Vertebrate Zoology, University of California, Berkeley, USA