My lab just read this preprint, and I thought I'd pass on our thoughts in the hopes that they are helpful.
First, we thought the approach was eminently sensible and very useful, and that the analyses were informative.
A few things we thought might be worth considering:
1. Similar to some comments on Twitter, we wondered about the choice of variant callers to include as comparisons, and why many commonly-used callers were excluded.
2. We found the table really hard to interpret. Perhaps this is just because the data are difficult to represent, but we did wonder about the extent to which the sets of variant calls overlapped or differed, as well as the false positive (FP) and false negative (FN) rates. It also took us quite a bit of head scratching to realise that the dbSNP138 column percentages were so important to the interpretation.
3. We found the definitions of False Positives and False Negatives confusing (we appreciate that there are limitations to the short format though). This is mostly because it seems from the dbSNP138, Omni 2.5, and other results (final paragraph of the paper) that many, if not most, of the False Positives are probably not false positives at all. For this reason, we thought it might be clearer to redefine FP and FN simply in terms of differences from the GIAB NA12878 data.
4. Related to 3, but also more generally, we really wanted to see some simple simulations, so that one could determine the true FP and FN rates. For my money, a simple and effective approach here is to do something like the NextGenMap paper (https://academic.oup.com/bioin...), which simulated 4 read sets from Human, Arabidopsis, and Drosophila, then another 11 read sets from Arabidopsis with increasing polymorphism (0-10%). Results could be summarised quickly in a single figure comparing the true FP and FN rates under simulation.
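To make points 2 and 3 a bit more concrete, here is a minimal sketch of the kind of comparison we had in mind: FP and FN counted purely as differences from a reference call set, plus a pairwise overlap between callers. Everything in it is hypothetical (the function names, the caller labels, and the toy variants), and it ignores things a real comparison would need, such as normalising variant representation and restricting to confidently called regions.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# A variant is keyed by (chrom, pos, ref, alt). All values below are toy data.
Variant = Tuple[str, int, str, str]


def compare_to_truth(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    """Count TP/FP/FN for one call set, defined only relative to the truth set."""
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def pairwise_jaccard(callsets: Dict[str, Set[Variant]]) -> Dict[Tuple[str, str], float]:
    """Jaccard overlap (intersection / union) for every pair of call sets."""
    overlaps = {}
    for (name_a, set_a), (name_b, set_b) in combinations(callsets.items(), 2):
        union = set_a | set_b
        overlaps[(name_a, name_b)] = len(set_a & set_b) / len(union) if union else 1.0
    return overlaps


# Hypothetical truth set standing in for a reference such as GIAB NA12878.
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}

# Hypothetical output of two callers.
callsets = {
    "caller_a": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 75, "T", "C")},
    "caller_b": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")},
}

results = {name: compare_to_truth(calls, truth) for name, calls in callsets.items()}
overlaps = pairwise_jaccard(callsets)

for name, stats in results.items():
    print(name, stats)
print(overlaps)
```

The same two functions would apply unchanged to simulated data as in point 4, where the truth set is known exactly because it was used to generate the reads.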
A final comment, not really relevant to the current paper: we wondered about extending the method to include technical replicates (i.e. independent extractions and libraries from the same sample). For example, including 3 technical replicates at 20x each should allow one to account quite effectively for errors from library construction and base-calling.
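As a rough illustration of why replicates help, the sketch below keeps only variants called in at least 2 of 3 replicates. This is a simple majority vote that we are using as a stand-in, not the preprint's method and not a proposal for how the method itself should be extended. The function name, the `min_support` threshold, and the toy variants are all assumptions.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# A variant is keyed by (chrom, pos, ref, alt). All values below are toy data.
Variant = Tuple[str, int, str, str]


def consensus_calls(replicates: Iterable[Set[Variant]], min_support: int = 2) -> Set[Variant]:
    """Return variants called in at least `min_support` replicates."""
    # Each replicate contributes at most one vote per variant.
    support = Counter(variant for rep in replicates for variant in set(rep))
    return {variant for variant, n in support.items() if n >= min_support}


# Hypothetical calls from three technical replicates of the same sample.
rep1 = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 10, "G", "C")}
rep2 = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
rep3 = {("chr1", 100, "A", "G"), ("chr2", 75, "T", "C")}

consensus = consensus_calls([rep1, rep2, rep3], min_support=2)
print(sorted(consensus))
```

Calls seen in only one replicate are dropped, on the reasoning that errors introduced during library construction or base-calling are unlikely to recur independently at the same site.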