Open preprint reviews by Robert Lanfear

Universal Patterns Of Selection In Cancer And Somatic Tissues

Inigo Martincorena, Keiran M. Raine, Moritz Gerstung, Kevin J. Dawson, Kerstin Haase, Peter Van Loo, Helen Davies, Michael R. Stratton, Peter J. Campbell

This is pretty amazing stuff; I very much enjoyed the ms. I would love to see two things here:

1. A discussion in the main ms (perhaps it is buried in the supps somewhere, which I didn't read in detail) of the power of the new method, e.g. roughly how many mutations one would have to measure to have a hope of using the models you propose. Estimating 192 rates in the trinucleotide model presumably requires a decent amount of variation as input, and this is compounded by trying to estimate mutation rates for each gene, even using a binomial regression.

2. Implementations of the methods, with READMEs etc. to get people started. I notice that you mention this on Twitter, but I think the impact of the ms would be much higher if the implementations were available.

A question related to point 1, on power: it seems like a standard model selection framework (e.g. hLRT or AICc) should work well here to decide how many parameters one can/should estimate for a given dataset, and thus whether fitting a full 192-rate model is justified. Does this seem right to you?
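To make the question concrete, here is a minimal sketch of the kind of comparison I have in mind. It is not the method in the ms: the per-gene, per-context counts are simulated placeholders and a plain Poisson likelihood stands in for the real model; the point is only that AICc (or a likelihood-ratio test for nested models) gives a direct answer to whether 192 free context rates are supported over a single shared rate for a given amount of data.

```python
# A minimal sketch, not the authors' implementation: do the data support a
# full 192-rate trinucleotide model over a single shared rate? Hypothetical
# per-gene, per-context counts are simulated and the two models compared by AICc.
import math
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_contexts = 200, 192

# Hypothetical "true" per-context rates and observed Poisson counts (placeholders).
true_rates = rng.gamma(shape=2.0, scale=0.05, size=n_contexts)
counts = rng.poisson(true_rates, size=(n_genes, n_contexts))

def poisson_loglik(counts, rates):
    # Sum of Poisson log-pmf terms, guarding against zero rates.
    rates = np.maximum(rates, 1e-12)
    lgamma_terms = np.array([[math.lgamma(k + 1) for k in row] for row in counts])
    return float(np.sum(counts * np.log(rates) - rates - lgamma_terms))

def aicc(loglik, k, n):
    # Small-sample corrected AIC; lower is better.
    return 2 * k - 2 * loglik + (2 * k * (k + 1)) / (n - k - 1)

n_obs = counts.size

# Model A: one rate shared by all contexts (1 parameter).
rate_a = counts.mean()
ll_a = poisson_loglik(counts, np.full(n_contexts, rate_a))

# Model B: a separate rate per trinucleotide context (192 parameters).
rates_b = counts.mean(axis=0)
ll_b = poisson_loglik(counts, rates_b)

print("AICc, single rate   :", round(aicc(ll_a, 1, n_obs), 1))
print("AICc, 192 free rates:", round(aicc(ll_b, n_contexts, n_obs), 1))
```

If the richer model does not beat the simpler one on a given dataset, that would seem to be a useful, automatic warning that the data are too sparse for the full parameterisation.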



16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model

Ruibang Luo, Michael C Schatz, Steven Salzberg

My lab just read this preprint, and I thought I'd pass on our thoughts in the hopes that they are helpful.

First, we thought the approach seemed eminently sensible and very useful, and that the analyses were informative.

A few things we thought might be worth considering:

1. Similar to some comments on Twitter, we wondered about the choice of variant callers included as comparisons, and why many commonly used callers were excluded.

2. We found the table really hard to interpret. Perhaps this is just because it's difficult data to represent, but we did wonder about the extent to which the sets of variant calls overlapped or differed, as well as about the FP and FN rates. It also took us quite a bit of head-scratching before we realised how important the dbSNP138 column percentages were to the interpretation.

3. We found the definitions of False Positives and False Negatives confusing (we appreciate that there are limitations to the short format, though). This is mostly because the dbSNP138, Omni 2.5, and other results (final paragraph of the paper) suggest that many or most of the False Positives are probably not False Positives at all. For this reason, we thought it might be clearer to redefine FP and FN simply in terms of differences from the GIAB NA12878 data.

4. Related to 3, but also more generally, we really wanted to see some simple simulations, so that one could determine true FP and FN rates. For my money, a simple and effective approach here is to do something like the NextGenMap paper (https://academic.oup.com/bioin...), which simulated 4 read sets from Human, Arabidopsis, and Drosophila, then another 11 read sets from Arabidopsis with increasing polymorphism (0-10%). Results could be summarised quickly in a single figure comparing the true FP and FN rates under simulation (a sketch of the kind of comparison we mean follows this list).
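To be clear about what we mean by "true FP and FN rates": with a simulated truth set (or the GIAB NA12878 calls), they reduce to set differences between called and true variants. Below is a minimal sketch with hypothetical file names and a deliberately simplified variant representation (no normalisation of indels or multiallelic sites), not the evaluation pipeline used in the preprint.

```python
# Compare a caller's VCF against a known truth set (simulated or GIAB);
# FP and FN are then just set differences on (chrom, pos, ref, alt) records.
def load_variants(vcf_path):
    """Read VCF records into a set of (chrom, pos, ref, alt) tuples.
    No normalisation is done; multiallelic sites are kept as written."""
    variants = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            variants.add((chrom, int(pos), ref, alt))
    return variants

truth = load_variants("truth_simulated_or_giab.vcf")   # hypothetical path
calls = load_variants("caller_output.vcf")             # hypothetical path

tp = len(calls & truth)
fp = len(calls - truth)   # called but not in the truth set
fn = len(truth - calls)   # in the truth set but not called

print(f"TP={tp}  FP={fp}  FN={fn}")
print(f"precision={tp / (tp + fp):.4f}  recall={tp / (tp + fn):.4f}")
```

Run once per caller and per simulated read set, these numbers would populate the single summary figure suggested above.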

A final comment, not really relevant to the current paper: we wondered about extending the method to include technical replicates (i.e. independent extractions and libraries from the same sample), e.g. 3 technical replicates at 20x each, which should allow one to account quite effectively for errors arising from library construction and base-calling.
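As a toy illustration of why replicates help (this assumes a simple binomial error model at a biallelic site, not the 16-genotype model of the paper): because library-construction and base-calling errors are independent across replicates, per-replicate genotype log-likelihoods can simply be summed, so an artefact that looks convincing in one replicate is down-weighted by the other two.

```python
# Toy illustration: combine genotype log-likelihoods from independent
# technical replicates at one biallelic site, then pick the best genotype.
import math

ERR = 0.01  # assumed per-base error rate

def genotype_loglik(ref_reads, alt_reads):
    # Log-likelihood of (hom-ref, het, hom-alt) given read counts at one site.
    # The binomial coefficient is constant across genotypes, so it is dropped.
    logliks = []
    for alt_frac in (ERR, 0.5, 1.0 - ERR):   # expected alt-read fraction per genotype
        logliks.append(ref_reads * math.log(1.0 - alt_frac)
                       + alt_reads * math.log(alt_frac))
    return logliks

# Hypothetical (ref, alt) read counts from three ~20x technical replicates.
replicates = [(12, 8), (14, 6), (11, 9)]

combined = [0.0, 0.0, 0.0]
for ref_reads, alt_reads in replicates:
    for g, ll in enumerate(genotype_loglik(ref_reads, alt_reads)):
        combined[g] += ll   # replicates assumed independent

genotypes = ["hom-ref", "het", "hom-alt"]
best = max(range(3), key=lambda g: combined[g])
print("combined log-likelihoods:", [round(x, 2) for x in combined])
print("call:", genotypes[best])
```

A fuller treatment would add a per-replicate (library-level) error component rather than a single shared error rate, which is where the real gain from replicates would come.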
