Preprint reviews by Mihaela Pertea

AGOUTI: improving genome assembly and annotation using transcriptome data

Simo V Zhang, Luting Zhuo, Matthew W Hahn

Review posted on 18th February 2016

In this paper the authors introduce a new scaffolder, AGOUTI, that uses RNA-seq data to improve assemblies as well as gene annotations. This would be a useful tool for the scientific community if it does indeed accomplish its goal of being accurate and effective. However, I am not convinced of its accuracy from the data presented in this manuscript, as I detail below.

My concerns regarding this paper:

1. I think the test data for evaluating AGOUTI is not realistic. The authors simply fragment the genome of C. elegans at random into multiple contigs. As a consequence, the contigs given as input to AGOUTI are perfectly assembled fragments of the genome. A real assembly will contain many errors that are not due to scaffolding alone, but are caused, for instance, by the repeats in the genome. There will likely be many gaps in the genome, and the contigs might contain many misassemblies as well. The ideal assemblies that the authors simulate will not be encountered in real situations, and I believe the accuracies presented here are unrealistic both for the resulting scaffolds and for the resulting gene annotation. I suspect that this ideal input is the reason for the very small number of errors that AGOUTI makes.

The errors in a real assembly will heavily impact the accuracy of any gene-finding tool used on it. I suggest that the authors instead start with simulated reads from C. elegans, assemble them using one of the popular de novo assemblers, and then evaluate AGOUTI on the resulting assembly and gene annotation.
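To make the concern in point 1 concrete, here is a minimal sketch of what random fragmentation produces. It is only an illustration under my own assumptions: the function name `fragment_genome`, the toy sequence, and the parameters are hypothetical and are not taken from the manuscript or from AGOUTI.

```python
import random


def fragment_genome(genome: str, n_contigs: int, seed: int = 0) -> list[str]:
    """Cut `genome` into `n_contigs` pieces at distinct random positions.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if not 1 <= n_contigs <= len(genome):
        raise ValueError("n_contigs must be between 1 and len(genome)")
    rng = random.Random(seed)
    # Choose n_contigs - 1 distinct interior cut points.
    cuts = sorted(rng.sample(range(1, len(genome)), n_contigs - 1))
    bounds = [0] + cuts + [len(genome)]
    return [genome[a:b] for a, b in zip(bounds, bounds[1:])]


# Toy "genome": a random 1,000-base sequence (an assumption for the demo).
_rng = random.Random(1)
genome = "".join(_rng.choice("ACGT") for _ in range(1000))
contigs = fragment_genome(genome, n_contigs=10)

# Every contig is an exact substring of the genome, and concatenating them
# in order reproduces the genome with no gaps, no collapsed repeats, and
# no misjoins.
perfect = "".join(contigs) == genome
```

Contigs obtained this way are error-free by construction, which is exactly the property that a real de novo assembly does not have.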

2. Another concern I have about AGOUTI's performance relates to the comparison with other scaffolders. A large number of scaffolders are available, but the authors pick only one of them because it is the only one that also uses RNA-seq data. There is no point in using more information (RNA-seq data) if the resulting accuracy is no better than that of a scaffolder that does not use it. Therefore the authors cannot simply dismiss scaffolders that do not use RNA-seq data without showing that AGOUTI does a better job than they do.

3. No running times are presented in the manuscript. The user cannot immediately see whether it would be efficient to use this tool instead of other tools, especially on larger data sets. I would have liked to see a comparison of running times with RNAPATH, for instance, since AGOUTI clearly increases the search space compared to that tool.

4. Related to point 3 above, the authors did not specify what AGOUTI does when it has to choose between edges of equal weight. Does it choose one edge at random, or does it consider both possible paths, which would increase the search space even further?
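The two alternatives in point 4 can be sketched as follows. This is a hedged illustration of the question, not a description of how AGOUTI works: the graph, the edge weights, and the helper names `best_edges`, `pick_random`, and `count_paths` are all hypothetical, and the toy graph is assumed to be acyclic.

```python
import random

# Toy contig graph: graph[u][v] is the weight of edge u -> v
# (for example, the number of supporting read pairs). Hypothetical data.
graph = {
    "c1": {"c2": 5, "c3": 5},  # tie between c2 and c3
    "c2": {"c4": 3, "c5": 3},  # tie between c4 and c5
    "c3": {"c4": 2},
    "c4": {},
    "c5": {},
}


def best_edges(edges: dict[str, int]) -> list[str]:
    """Return all neighbours that share the maximum edge weight."""
    if not edges:
        return []
    top = max(edges.values())
    return sorted(v for v, w in edges.items() if w == top)


def pick_random(edges: dict[str, int], rng: random.Random) -> str:
    """Policy A: break ties by choosing one top-weight edge at random."""
    return rng.choice(best_edges(edges))


def count_paths(g: dict[str, dict[str, int]], start: str) -> int:
    """Policy B: follow every tied edge and count the resulting paths.

    Assumes the graph is acyclic.
    """
    nxt = best_edges(g[start])
    if not nxt:
        return 1
    return sum(count_paths(g, v) for v in nxt)


tied = best_edges(graph["c1"])
choice = pick_random(graph["c1"], random.Random(0))
n_paths = count_paths(graph, "c1")
```

Under the random policy a single path is explored, but the result may differ between runs unless the seed is fixed. Under the exhaustive policy the number of paths to examine grows with every tie, which is why the choice matters for the running-time question raised in point 3.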

Minor concern:

The authors should explain Table 4 better. It is not clear what the cases in the table are or how they can occur for non-consecutive contigs.

