Review for "NxRepair: Error correction in de novo sequence assembly using Nextera mate pairs"

Completed on 14 Jan 2015 by Torsten Seemann.


Comments to author

Basic reporting

- The introduction refers to "assembly errors" but does not distinguish between types of error, such as SNPs, indels, or contig-joining mistakes.
- "Insert size", "mate pair" and "paired end" are not explained; many readers may not be familiar with these concepts.
- No reference is given for "Nextera Mate Pair".
- The existing tools REAPR and ALE are described, and a "Bayesian" method is mentioned, but no motivation is given for mentioning it.
- The phrase "de novo" should be italicised
- "de Bruijn Graph" should have a lowercase "graph".
- There is a missing space at "W is 200 bases".
- The interval [i-W,i+W] contains 2W+1 positions, not 2W as reported.
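
To illustrate the last point, here is a minimal check. It assumes integer base positions and a closed interval; the centre position 1000 is arbitrary and the helper name `window_size` is hypothetical.

```python
def window_size(i, W):
    """Count the integer positions in the closed interval [i - W, i + W]."""
    # range() excludes its stop value, so add 1 to include i + W.
    return len(range(i - W, i + W + 1))

W = 200  # window half-width in bases, as quoted from the manuscript
size = window_size(1000, W)
print(size, 2 * W + 1)
```

The count does not depend on the centre position, only on W.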

Experimental design

- It is not clear whether you re-sequenced the exact same strains as the reference genomes in NCBI, or where these strains were obtained.
- Versions of software (bwa, samtools, etc.) need to be reported.
- BWA was used with default parameters, which produce many partially mapped reads and alternative mappings. It is unclear how NxRepair handles these.
- It should be made clearer that the same reads are used for both assembly and post-assembly correction.
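
On the BWA point above, one way to make the handling explicit would be to state which alignments are excluded. The following is only a sketch of such a filter and is not taken from the manuscript. The flag bits are the standard SAM FLAG values; the function names and the mapping-quality cutoff are hypothetical.

```python
# Standard SAM FLAG bits (defined in the SAM specification).
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

def keep_alignment(flag, mapq, min_mapq=10):
    """Return True for a mapped, primary alignment with MAPQ >= min_mapq.

    min_mapq is an illustrative threshold, not a value from the paper.
    """
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    return mapq >= min_mapq

def is_clipped(cigar):
    """Return True if the CIGAR string contains soft (S) or hard (H) clipping,
    i.e. the read is only partially aligned."""
    return "S" in cigar or "H" in cigar

# Example: flag 99 is a properly paired, forward-strand first read.
print(keep_alignment(99, 60), is_clipped("100M"), is_clipped("60M40S"))
```

Reporting the equivalent criteria used by the tool would let readers reproduce the results.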

Validity of the findings

- The sequencing data are only available on Illumina BaseSpace. This needs to be rectified by depositing the reads in a Study on NCBI SRA or in ENA so they are guaranteed to remain publicly available.
- Table 1 can be improved by adding the full species name, the genome size, and the global mate pair statistics that were estimated.
- Some measure of the yield, quality, and average read length (after clipping) should be provided.
- It is claimed that NxRepair fixed 6 of 9 genomes, but Table 1 shows changes to only 3 of the 9 genomes.

Comments for the author

- Could this method be incorporated into SPAdes? SPAdes already re-aligns the reads with BWA to correct some errors, so adding an MP consistency check would be good.
- Do you really need the interval tree data structure, or could the stats you need be computed in a 1-pass manner?
- The use of a uniform distribution for the non-MP reads was interesting. I would have thought most non-MP reads were shadow PE reads, so their distribution would be Gaussian with a low mean and small standard deviation, rather than uniform.
- When you break an identified mis-assembly, the trimming step concerns me. Does this mean you are removing a chunk of genomic DNA from the final result, so that genes could be lost?
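
To make the interval tree question above concrete, here is a rough sketch of a single sweep over coordinate-sorted mate pairs. It assumes the per-site statistic is the count and mean insert size of pairs spanning the window [i-W, i+W], and that insert size is end - start + 1. Both are my assumptions, and the statistic actually computed may differ. The function name `spanning_stats` is hypothetical.

```python
import heapq

def spanning_stats(pairs, positions, W):
    """Yield (i, count, mean_insert) for each site i in positions.

    pairs: (start, end) tuples sorted by start (assumed input format).
    positions: sites in increasing order.
    A pair spans the window if start <= i - W and end >= i + W.
    """
    active = []   # min-heap of (end, insert_size) for pairs currently in play
    total = 0     # running sum of insert sizes in the heap
    it = iter(pairs)
    nxt = next(it, None)
    for i in positions:
        lo, hi = i - W, i + W
        # Admit every pair that starts at or before the window's left edge.
        while nxt is not None and nxt[0] <= lo:
            start, end = nxt
            size = end - start + 1
            heapq.heappush(active, (end, size))
            total += size
            nxt = next(it, None)
        # Retire pairs that end before the window's right edge.
        while active and active[0][0] < hi:
            _, size = heapq.heappop(active)
            total -= size
        n = len(active)
        yield i, n, (total / n if n else None)

pairs = [(0, 100), (10, 50), (20, 200)]
result = list(spanning_stats(pairs, [30, 60, 150], W=10))
print(result)
```

Each pair is pushed and popped at most once, so the sweep needs no tree lookups. Whether this covers everything the interval tree is used for is for the authors to judge.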