Completed on 1 Jun 2015 by Lachlan Coin.
Schroeder et al. explore the potential benefits of augmenting the reference prior to calling copy number variation. They observe that calling insertions is much harder than calling deletions, and propose an elegant and simple solution: make a new 'expanded' reference consisting of the genome plus all insertions from a second reference, and then focus on calling only deletions in this expanded reference. The authors have written tools for wrapping any CNV detection algorithm, essentially running the caller on the expanded reference and then projecting the calls back to the original co-ordinates. The authors demonstrate that such an approach leads to a higher sensitivity to detect insertions (using a single caller, Delly, in both hg18 and ref+ space). They validate the calls by looking for direct evidence in the sequence data supporting either the inserted or non-inserted sequence (i.e. reads spanning the breakpoint with or without the inserted sequence). It's a nice approach, but I am concerned that its shortcomings have been somewhat overlooked in this paper, and that the benefits have been overstated, as I outline below.
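To make the projection step concrete, the following is a minimal sketch of how a ref+ co-ordinate could be mapped back to hg18. It is an illustration only and not the authors' implementation: the function names, the 0-based co-ordinates, and the convention that each inserted segment is placed immediately before the hg18 base at its insertion point are all assumptions made for this example.

```python
def build_segments(insertions):
    """Lay out inserted segments in ref+ space.

    insertions: list of (hg18_pos, length), sorted by hg18_pos (hypothetical input).
    Returns a list of (refplus_start, refplus_end, hg18_pos), end exclusive.
    """
    segments = []
    shift = 0  # total inserted length placed so far
    for pos, length in insertions:
        start = pos + shift
        segments.append((start, start + length, pos))
        shift += length
    return segments


def to_hg18(x, segments):
    """Map a ref+ co-ordinate x back to hg18.

    Positions inside an inserted segment collapse to its insertion point;
    all other positions are shifted left by the inserted length before them.
    """
    shift = 0
    for start, end, pos in segments:
        if x < start:
            break
        if x < end:
            return pos  # x lies inside inserted sequence
        shift += end - start
    return x - shift
```

Under these assumptions, a deletion call in ref+ that spans an inserted segment corresponds to "no insertion" relative to hg18, while the absence of such a call is read as the insertion being present, which is the asymmetry discussed in point 1 below.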
1. One issue which is not well discussed is what happens when there is not enough information (i.e. not enough read depth or spanning reads) for the caller to make a call on ref+. In this case, Delly would make a call of no CNV. This would tend to create false positive 'duplication' calls when mapped back to hg18. Indeed the authors saw an inflated FDR, and perhaps this is the reason? Of course, the corresponding outcome of not making a call in hg18 is a false negative, hence the lower sensitivity of Delly calls in hg18. I would suggest the authors more clearly acknowledge this potential shortcoming of their approach, and discuss potential ways to alleviate it (e.g. by excluding regions with low coverage).
2. The increased accuracy figure is misleading, because it implies a 67% increase in accuracy for all insertions, when in fact it applies only to those insertions which the authors have included in their augmented reference. These are likely only a fraction of all insertions present in the donor genome. The proposed approach as laid out only accommodates insertions from a single extra genome. So the 67% gain is highly artificial.
3. I found this 67% figure (stated on line 38) a bit confusing, as it seems to contradict line 103 (which says 31%). Probably one is an average per sample and the other is not, and yet standard errors are reported for both? Also, if the authors are reporting sensitivity, it would be good to also see specificity (i.e. not just FDR), so that the reader can directly calculate the accuracy from sensitivity and specificity.
4. Delly is not really designed for insertion detection. The abstract of Delly states that it is for finding deletions and tandem duplications. The authors seem to exclude tandem duplications. So this comparison seems bound to favour ref+, as Delly can find deletions but cannot find insertions very well. Of course a tandem duplication will result in an insertion, but probably most of the Venter insertions included in this benchmark are not tandem duplications. The authors should state what proportion of the Venter insertions are tandem duplications, and thus potentially typeable by Delly. It seems quite likely that the huge increase in sensitivity is mostly a reflection of how much better Delly is at finding deletions than insertions, and the difference would not be as extreme for other callers.
5. I would think a better comparison would use a tool which actively attempts to find both insertions and deletions (e.g. Pindel or Dindel, amongst others).
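On point 3 above, accuracy follows from sensitivity and specificity once the fraction of truly positive sites (the prevalence) is also known. The sketch below shows the standard relationship; the function name and the example values are hypothetical and are not taken from the manuscript.

```python
def accuracy(sensitivity, specificity, prevalence):
    """Overall accuracy as the prevalence-weighted mean of
    sensitivity (true positive rate) and specificity (true negative rate).

    prevalence: fraction of tested sites that truly carry the variant.
    """
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


# Illustrative values only (not from the paper under review).
example = accuracy(sensitivity=0.8, specificity=0.9, prevalence=0.5)
```

Reporting specificity alongside sensitivity would let a reader carry out this calculation for any assumed prevalence, which FDR alone does not allow.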
1. As this paper explores copy number variation only, perhaps the title might use the term 'copy number variation' instead of structural variation. I realise that this approach could be used for other types of structural variation, e.g. inversions, but this has not been explored in this paper.
2. The authors use the term 'adjacency' without explaining what this means. I think most readers would not be familiar with the use of this term in this context.