Review for "Whole-genome characterization in pedigreed non-human primates using Genotyping-By-Sequencing and imputation."

Completed on 11 Apr 2016 by Lachlan Coin.


Comments to author

Bimber et al. present an interesting approach to whole-genome genotyping of variants in an extended primate family, using a combination of deep WGS on a few samples and GBS on the remaining samples, followed by imputation. This is a promising way to cost-effectively genotype extended pedigrees in populations which don't have the benefit of whole-genome genotyping arrays. It is important to find approaches for cost-effective genotyping in this setting, and I think this is a valid approach to try. However, I am not sure that the results presented conclusively demonstrate that this is a good way forward. Also, a few things were hard to follow and could be presented more clearly.

Major comments:

1. What was the rationale for using only chromosome 19 to evaluate the best approach? There is a lot of variation in the genotype accuracy plots, which using more data (i.e. the whole genome) would help to address.

2. Do some individuals consistently have low accuracy, and others higher accuracy, across different chromosomes? One way to show this would be to use a different symbol for each family member. I think this could be informative in terms of seeing which positions in the pedigree were poorly imputed.

3. It seems to me that the genotyping accuracy presented is a bit unsatisfactory: 92% median accuracy is much lower than what is expected from genotyping arrays, low-coverage population sequencing with imputation, etc., and probably makes downstream analyses error-prone. GIGI estimates a posterior probability distribution over genotypes at each marker. Why not use this to identify which genotype calls are more confident, and to set to missing those calls whose confidence falls below a certain threshold? This would allow you to trade off the amount of missing data against the accuracy. It would be preferable, for example, to have 10% missing data but 99% genotyping accuracy.

4. The 25% allele frequency in the family seems to be a very high threshold, and would mean losing a lot of variants. I think it's important to investigate genotyping accuracy in different frequency bands, and also to present the number of variants whose family allele frequency falls in each band.

5. Did you investigate not applying such stringent selection criteria to the framework markers, but instead using all of them?

6. I don't think you have conclusively shown that GBS is a better strategy than alternative approaches. In particular, I am not convinced that a combination of WGS with skim sequencing might not work better. You write that the optimal strategy is one 30X WGS per 3-5 GBS, so let's say that is roughly 1 WGS + 4 GBS, which translates to $3000 + 4 * $50 = $3200. Based on these costings, a GBS run costs the same as 0.5X of sequencing, so for the same money you could do one 16X and four 4X genomes, or one 28X and four 1X genomes. I think it would be worth comparing the results to what you would get if you employed this strategy (or even just five 6X genomes). I think you have enough WGS data in this experiment to evaluate this strategy by downsampling some of the 30X genomes (and using the remaining reads to call the gold-standard genotypes for comparison), and just using the GIGI-pick samples as 30X imputation framework samples. I think this would show more conclusively whether GBS is the most cost-effective approach.

7. Regarding the conclusion that the optimal strategy is one 30X WGS per 3-5 GBS: why do you think it is applicable to all family structures? Also, the process of selecting informative individuals is not clearly explained and appears to have some circularity. For the given family in this paper, "Individuals with WGS data were added consecutively in the following order: B, H, J, F, M, K, P, C, D". The order was suggested by the GIGI-pick algorithm with WGS data from these individuals. How do you know which individuals to select without WGS? Would GIGI-pick select the same individuals if GBS data were used instead?
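
To make the budget arithmetic in major comment 6 explicit, here is a minimal sketch. It uses only the prices quoted in that comment ($3000 per 30X genome, $50 per GBS run) and assumes that sequencing cost scales linearly with depth and that library preparation cost can be ignored; both are simplifying assumptions of mine, and all names in the code are hypothetical.

```python
# Prices quoted in major comment 6; everything else is an illustrative assumption.
WGS_30X_COST = 3000.0              # dollars for one 30X genome
GBS_COST = 50.0                    # dollars for one GBS run
COST_PER_X = WGS_30X_COST / 30.0   # assumes cost is linear in depth


def design_cost(coverages):
    """Total cost of sequencing one genome at each depth (in X) in `coverages`."""
    return sum(depth * COST_PER_X for depth in coverages)


# Reference design: 1 WGS at 30X plus 4 GBS runs.
budget = WGS_30X_COST + 4 * GBS_COST

# Depth of WGS that one GBS run would buy under the linear-cost assumption.
gbs_equivalent_depth = GBS_COST / COST_PER_X

# Alternative designs suggested in the comment.
designs = {
    "1 x 16X + 4 x 4X": [16, 4, 4, 4, 4],
    "1 x 28X + 4 x 1X": [28, 1, 1, 1, 1],
    "5 x 6X": [6, 6, 6, 6, 6],
}

for name, coverages in designs.items():
    cost = design_cost(coverages)
    print(f"{name}: ${cost:.0f} (within budget: {cost <= budget})")
```

Under these assumptions the first two designs use the full budget and the five 6X genomes come in slightly under it, so all three are fair comparators for the 1 WGS + 4 GBS design.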

Minor comments:

1. Regarding the line: “We additionally removed a set of 578 markers that consistently performed poorly in imputation.”

How was this assessed? If it was assessed by comparison to the gold standard genotype results from WGS, then it could result in a circular logic, whereby you improve the apparent imputation results by removing markers which are poorly imputed.

2. How did you assess accuracy at missing data sites? I presume they were left out of this calculation. What was the missing rate?

3. It's very hard to compare Fig. 2C and 2D. I wonder if a log x scale would help.
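
As an illustration of major comment 3 and minor comment 2, the sketch below shows one way to report accuracy over called sites together with the missing rate as a function of a posterior threshold. The data are toy values invented for the example, the genotype coding (0, 1, 2 copies of the alternate allele) and the list-of-tuples layout are my assumptions rather than GIGI's actual output format, and the function names are hypothetical.

```python
def call_with_threshold(posteriors, threshold):
    """Return the most probable genotype, or None if its posterior is below threshold."""
    best = max(range(len(posteriors)), key=lambda g: posteriors[g])
    return best if posteriors[best] >= threshold else None


def accuracy_and_missing(posterior_rows, truth, threshold):
    """Accuracy over called sites only, plus the fraction of sites set to missing."""
    calls = [call_with_threshold(p, threshold) for p in posterior_rows]
    called = [(c, t) for c, t in zip(calls, truth) if c is not None]
    missing_rate = 1.0 - len(called) / len(calls)
    if not called:
        return float("nan"), missing_rate
    accuracy = sum(c == t for c, t in called) / len(called)
    return accuracy, missing_rate


# Toy posteriors over genotypes (0, 1, 2) at five markers, with toy "gold standard" calls.
toy_posteriors = [
    (0.98, 0.01, 0.01),
    (0.05, 0.90, 0.05),
    (0.40, 0.55, 0.05),
    (0.34, 0.33, 0.33),
    (0.01, 0.02, 0.97),
]
toy_truth = [0, 1, 0, 2, 2]

for threshold in (0.0, 0.5, 0.9):
    acc, miss = accuracy_and_missing(toy_posteriors, toy_truth, threshold)
    print(f"threshold={threshold:.1f} accuracy={acc:.2f} missing={miss:.2f}")
```

Plotting accuracy against missing rate over a range of thresholds would show directly whether something like 10% missing data with 99% accuracy is achievable on the real data.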