Preprint reviews by Rebecca Farrington

Characterization of the transcriptome, nucleotide sequence polymorphism, and natural selection in the desert adapted mouse Peromyscus eremicus

Review posted on 25th September 2014

Basic reporting

The authors aim to identify genetic mechanisms behind osmoregulation using a species of desert rodent. This species was chosen because of its adaptation to extreme water deprivation, even to the point of living its whole life without water. This makes it a good model for identifying genes under selection for osmoregulation. Studies have been done previously in model organisms such as mouse, rat and human, but only on a gene-by-gene basis. This manuscript looks at transcriptome-wide differential expression, selection and nucleotide polymorphism using next-generation sequencing in an effort to advance this area of research.

  1. The authors do a good job of framing their work, showing why the study is needed, its limitations, and how the work can lead to future research.

  2. The assembly and annotation steps were well thought out. Assemblies were error corrected and quality filtered, and several steps were implemented for annotation using closely related species, the Pfam database and extraction of putative coding sequences. The only thing I wonder is why the authors did not pool the samples when assembling. This would not change their downstream pipeline much; however, it would help to recover lowly expressed transcripts. (Are there any citations for this?) Also, I do not understand whether, or why, the additional reads for kidney were not used for assembly.

  3. The authors mention in the results (line 185) “The kidney appears to [be] an outlier in the number of unique sequences, though this could […] result [from] the recovery of more lowly expressed transcripts [caused by] deeper sequencing.” Why would this not also be the case for liver, which has only 3M (5%) fewer sequences?

  4. I am trying to understand the filtering process for the assembled reads. From my understanding, sequences were filtered using Blastn (Page 4, lines 103:109) and then annotated using Blastn, HMMER3 and Transdecoder (Page 4, lines 113:120). Is my understanding correct? If so, why were the assembled sequences filtered with Blastn before being annotated with Blastn and HMMER3? I thought the point of HMMER3 was to retain divergent sequences not detected by Blastn.

  5. For the natural selection results, I think it would be interesting to report more than two genes, perhaps the top and bottom 10 genes from the Tajima’s D analysis.

  6. It would also be nice to have the various parts of the analysis in a repository, for reviewing and open science purposes.
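
To make the suggestion in point 5 concrete, here is a minimal sketch of how Tajima's D could be computed per gene and the extremes pulled out. It uses the standard Tajima (1989) formula and assumes gap-free, equal-length alignments; the function names `tajimas_d` and `extremes` are hypothetical and this is not the authors' pipeline.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for aligned sequences (assumes equal length, no gaps or missing data)."""
    n = len(seqs)
    if n < 4:
        # The variance term of the statistic is zero for n < 4.
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    # S: number of segregating (polymorphic) alignment columns.
    s_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if s_sites == 0:
        return None  # D is undefined without polymorphism

    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    return (pi - s_sites / a1) / sqrt(e1 * s_sites + e2 * s_sites * (s_sites - 1))


def extremes(d_by_gene, k=10):
    """Return (k lowest, k highest) gene names by D, skipping undefined values."""
    ranked = sorted((g for g, d in d_by_gene.items() if d is not None),
                    key=d_by_gene.get)
    return ranked[:k], ranked[-k:]
```

An excess of rare variants (all singletons) gives a negative D, while variants at intermediate frequency give a positive D, so the two tails of the ranked list point at different kinds of candidate genes.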

Overall I believe it is a good paper with interesting analysis, and cool results.

Experimental design

Some more explicit details to enable replication would be welcome, as described above.

Validity of the findings

No comments

Comments for the author

No comments.


Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Keith R. Bradnam, Joseph N. Fass, Anton Alexandrov, Paul Baranay, Michael Bechner, İnanç Birol, Sébastien Boisvert, Jarrod A. Chapman, Guillaume Chapuis, Rayan Chikhi, Hamidreza Chitsaz, Wen-Chi Chou, Jacques Corbeil, Cristian Del Fabbro, T. Roderick Docking, Richard Durbin, Dent Earl, Scott Emrich, Pavel Fedotov, Nuno A. Fonseca, Ganeshkumar Ganapathy, Richard A. Gibbs, Sante Gnerre, Élénie Godzaridis, Steve Goldstein, Matthias Haimel, Giles Hall, David Haussler, Joseph B. Hiatt, Isaac Y. Ho, Jason Howard, Martin Hunt, Shaun D. Jackman, David B Jaffe, Erich Jarvis, Huaiyang Jiang, Sergey Kazakov, Paul J. Kersey, Jacob O. Kitzman, James R. Knight, Sergey Koren, Tak-Wah Lam, Dominique Lavenier, François Laviolette, Yingrui Li, Zhenyu Li, Binghang Liu, Yue Liu, Ruibang Luo, Iain MacCallum, Matthew D MacManes, Nicolas Maillet, Sergey Melnikov, Bruno Miguel Vieira, Delphine Naquin, Zemin Ning, Thomas D. Otto, Benedict Paten, Octávio S. Paulo, Adam M. Phillippy, Francisco Pina-Martins, Michael Place, Dariusz Przybylski, Xiang Qin, Carson Qu, Filipe J Ribeiro, Stephen Richards, Daniel S. Rokhsar, J. Graham Ruby, Simone Scalabrin, Michael C. Schatz, David C. Schwartz, Alexey Sergushichev, Ted Sharpe, Timothy I. Shaw, Jay Shendure, Yujian Shi, Jared T. Simpson, Henry Song, Fedor Tsarev, Francesco Vezzi, Riccardo Vicedomini, Jun Wang, Kim C. Worley, Shuangye Yin, Siu-Ming Yiu, Jianying Yuan, Guojie Zhang, Hao Zhang, Shiguo Zhou, Ian F. Korf

Review posted on 23rd February 2013

Bradnam et al. systematically evaluate and compare 43 de novo genome assemblies of 3 organisms from 21 teams. My lab and I set out to evaluate the paper.

First, reviewing this paper thoroughly is effectively impossible; it's huge and complicated! So apologies for any mistakes below...

At a high level, this paper is a tour de force, analyzing the results of applying a dozen or more different de novo assembly pipelines to three different data sets, and ranking the results by a variety of different metrics. The three genomes chosen, fish, snake, and bird, are all vertebrate genomes, so they're large and repeat-ridden, and in some cases highly polymorphic, which makes this an extra challenging (but realistic) set of assembly problems. The major problem to be overcome by this paper is that we are evaluating fuzzy heuristic-laden software against fuzzy error-prone data in a situation where we don't know the answer, and I think given these constraints the authors do as good a job as is reasonably possible.

The resulting paper does an excellent job of broadly presenting the challenges of assembly, providing a good if rather high level discussion of the various ranking metrics they used. Their broad conclusions were well supported: assemblers do very different things to the same data, and you need to pick an approach and a set of parameters that maximize the sensitivity and specificity for your project goals; and repeats and heterozygosity will kill you.

From a scientific perspective, I was dismayed by their failure to make use of external data such as synteny and gene model concordance to evaluate the assemblies. The CEGMA scores were probably the closest to this but the numbers were surprisingly low to me, so either CEGMA doesn't work that well on vertebrate genomes or the assemblies are actually worse than the paper made clear. The fosmid and optical map analyses were not that convincing, because while they spoke to some sort of basic agreement with orthogonal data, they didn't have the breadth (fosmid) or resolution (optical map) to provide really solid independent evidence of the quality. When I am evaluating assemblies I look for "surprises" in terms of missing gene models and odd rearrangements compared to neighbors, and I feel like there are some reasonably straightforward things that could have been done here. Nonetheless, the analyses that were done were done well and discussed well and led to clearly defensible conclusions.

I was specifically asked to evaluate reproducibility or replicability of the analyses, which I will address below.

A major missing component of the paper was computational cost. As someone who works primarily on how to achieve assemblies on low-cost hardware, I can assure assembler authors that very few people can easily run multiple assemblies on large amounts of data using their assemblers. This (and ease of use, documentation, and community) is honestly going to drive choice of assembly pipelines far more than notional correctness. This is especially true since a conclusion of the paper was "try lots of assemblers because they all do differently weird things on different data", which would lead me to the time-saving argument that a 60% accurate, easily achievable assembly is considerably better than an 80% accurate assembly that cannot readily be computed. Perhaps Assemblathon 3 can be more tuned towards the question of whether or not anyone other than the authors can run these things and achieve good results!

While I'm talking about what I wish could have been done, it would have been nice to have something like RNAseq for the organisms. RNAseq can be used to look at completeness as well, by looking at the intersection between conserved genes and genes that map to the assembly, and I think it would have been invaluable. Internally focused statistical analyses are great, but there's nothing like orthogonal data (as with the fosmids and the optical map) for real evaluation.

I also could not figure out how much of the input data was used for each assembly, and it didn't look like any of the analyses took this into account -- REAPR is the only one that I would have expected to do so, but that paper doesn't seem to be available. How many of the input reads actually map to the genome (or how many of the high-abundance k-mers are present) would have been an interesting metric, although I recognize that repeats and heterozygosity make this a difficult metric to analyze.
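
As a rough illustration of the k-mer version of that metric, here is a minimal sketch that measures what fraction of the high-abundance read k-mers also occur in an assembly. The function name `kmer_recovery`, the default `k=21`, and the `min_count` threshold are my own assumptions; a real analysis on vertebrate-scale data would need a dedicated k-mer counter rather than in-memory Python sets.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Strand-independent form: the smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmers(seq, k):
    """Yield every canonical k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def kmer_recovery(reads, contigs, k=21, min_count=2):
    """Fraction of read k-mers seen at least min_count times that occur in the contigs."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    # Low-count k-mers are mostly sequencing errors, so keep only abundant ones.
    solid = {km for km, c in counts.items() if c >= min_count}
    if not solid:
        return 0.0
    assembled = {km for contig in contigs for km in kmers(contig, k)}
    return len(solid & assembled) / len(solid)
```

A value well below 1 would suggest the assembler discarded or failed to place a good share of the well-supported input, though collapsed repeats and split haplotypes would both distort the number.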

Finally, I think the fact that experts (in many cases the authors of the assembler) are running the assemblers should be mentioned more clearly: these results are presumably the best possible from those software packages, and 3rd-party users are unlikely to do as well.

On reproducibility

We were explicitly asked to assess reproducibility (technically, replicability).

Here my group and I were disappointed, on two accounts.

First, the instructions for replicating the assemblies are in some cases very sparse, and frequently missing entirely. Did the ABySS team really not do any read trimming? (SI file 3, p2-3) Did ALLPATHS really do no trimming? The Ray team should provide the "few modifications" somewhere, too. SGA? PRICE? Am I missing these entries or were they not submitted??

Second, the "forensic" evidence is also somewhat lacking. There appears to have been little standardization of how to report the exact pipeline used, the versions of the software used, the amount of CPU time and memory required, etc. I think this was a missed opportunity. It's probably too late to remedy and shouldn't kill the paper, of course.

Basically, if replicability of assemblies is considered important for publication, the material in this paper needs some skeptical review and revision by the authors. At the very minimum, I would request that each included team provide the parameters and software version they used.
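
To show the kind of standardized report I have in mind, here is a minimal sketch of a per-assembly record that each team could fill in and submit as JSON. The class name `AssemblyRunRecord` and all of its field names are hypothetical, not anything defined by the Assemblathon, and the values in the usage example are made-up placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AssemblyRunRecord:
    """One team's description of how an assembly was produced (hypothetical schema)."""
    team: str
    assembler: str
    version: str
    command_line: str
    parameters: dict = field(default_factory=dict)
    read_trimming: str = "none"          # describe the trimming tool and settings, if any
    cpu_hours: Optional[float] = None    # leave as None when not measured
    peak_memory_gb: Optional[float] = None

    def to_json(self):
        """Serialize with sorted keys so records are easy to diff."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        """Rebuild a record from the JSON produced by to_json."""
        return cls(**json.loads(text))


# Placeholder values only; none of these describe a real submission.
example = AssemblyRunRecord(
    team="example-team",
    assembler="example-assembler",
    version="0.0.0",
    command_line="example-assembler --k 31 reads.fq",
    parameters={"k": 31},
)
print(example.to_json())
```

Even a record this small would answer the questions above about trimming, parameters and versions, and would let a third party estimate whether they have the hardware to repeat the run.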

Miscellaneous comments and questions:

The BCM-HGSC fish assembler used Newbler to assemble Illumina. Any special details on getting this to work? We were under the impression that Newbler couldn't be used on Illumina.

We found the penalty for PacBio/Ns when used for scaffolding to be somewhat odd.

The fosmid data was used by the ABySS fish team in their assembly, and it doesn't sound like anybody else used it. Apart from possible circularity in the analysis, this also might be noted (I didn't see it, but I could have missed it).

The BCM team also claims they used Velvet for their assembly (Table 1), which was also used to assemble the fosmids. This potential circularity presumably was addressed in the withheld-metric analysis but might be worth mentioning somewhere. (Although I didn't see Velvet mentioned in the actual assembly-pipeline details in the SI, so maybe the Table 1 entry is wrong?)

Level of interest: An exceptional article

Quality of written English: Acceptable

Statistical review: Yes, but I do not feel adequately qualified to assess the statistics.

Declaration of competing interests: I declare that I have no competing interests.
