Preprint reviews by Lachlan Coin

A standalone software platform for the interactive management and pre-processing of ATAC-seq samples

Zeeshan Ahmed, Duygu Ucar

Review posted on 7th May 2017

Ahmed and Ucar present a platform for processing ATAC-seq samples. The platform starts with a fastq file and generates peak calls. The rationale for developing this program is to provide a user-friendly solution for processing ATAC-seq samples; however, it is a general pipeline for finding read-depth peaks and could also be applied to ChIP-seq data, for example.


However, this tool does not really make the analysis 'easy' for the end-user, as quite a bit of investment must be made in installing all the software dependencies. It also seems that command-line tools are required to inspect the results, and there are no tools for assisting in interpretation or visualisation of the peak-calling results.


Major revisions:

1. The authors should do a more thorough comparison with features of other programs for processing ATAC-seq data. The alternatives are only mentioned in passing, and no real comparison of features is made.

2. Consider making a Galaxy workflow incorporating this pipeline. In particular, I am not convinced that a standalone tool is a more straightforward solution than using Galaxy, particularly if the user only has a modest number of samples to process, because the user still has to install all the software dependencies. One advantage of Galaxy is that the user does not need to install any software, as it will already be available on the Galaxy server.

3. One of the major selling points of this preprint is that it makes the analysis 'interactive'. However, there is little evidence of interactivity, and I am not sure what interactive means in the context of a pipeline for processing data. The interactivity seems to consist mostly of being able to modify various parameters in a GUI and pressing the 'run' button. To be truly interactive, some kind of visualisation of the data must be presented, along with tools to choose specific analysis pipelines on the basis of this visual feedback. At the moment it seems that the user must still inspect the results in the file system, so interactivity amounts to being able to change the run parameters in a GUI.

4. The authors have run the pipeline on GM12878 (and presumably other samples) but provide no results from these runs. It would be much easier to judge the utility of the pipeline if some indication of the results that were obtained was presented. This gets back to the point about visualisation – it seems the tool provides no data visualisation?

5. Downstream processing of peak calls. Does the tool offer any downstream processing of peak calls? I don't think the authors can claim this is an ATAC-seq pipeline if it doesn't offer some downstream tools for, e.g., identifying changes in histone positioning, or perhaps some kind of Fourier transform of the peaks – something that would assist in interpreting the output as the result of an ATAC-seq experiment.

Minor revisions:

1. There are too many flow diagrams, and many of them essentially represent the same information in different ways. I understand that each has a subtly different point, e.g. one is to illustrate the file structure created by the program. However, it would be better to have a single flowchart illustrating everything, perhaps with extra figure panels to illustrate other relevant points of interest, like the file structure generated. Also, in Figure 2 it is impossible to get any useful information out of the screenshots included, so I don't see the point of them. Moreover, the point of the tool is to help biologists not familiar with the command line, so why does Figure 2 show commands and outputs on the command line?



Whole-genome characterization in pedigreed non-human primates using Genotyping-By-Sequencing and imputation.

Ben N Bimber, Michael J Raboin, John Letaw, Kimberly Nevonen, Jennifer E Spindel, Susan McCouch, Rita Cervera-Juanes, Eliot Spindel, Lucia Carbone, Betsy Ferguson, Amanda Vinson

Review posted on 14th March 2016

Bimber et al present an interesting approach to whole-genome genotyping of variants in an extended primate family, using a combination of deep WGS on a few samples and GBS on the remaining samples, combined with imputation. This is a very interesting approach to cost-effectively genotype extended pedigrees in populations that don't have the benefit of whole-genome genotyping arrays. It is important to find approaches for cost-effective genotyping in this setting, and I think this is a valid approach to try. However, I am not sure that the results presented conclusively demonstrate that this is a good way forward. Also, there were a few things which were hard to follow and could be presented more clearly.


Major comments:

1. What was the rationale for using chromosome 19 only to evaluate the best approach? There does seem to be a lot of variation in the genotype accuracy plots, which using more data (i.e. the whole genome) would help to address.

2. Do you see that some individuals consistently have low accuracy and others have higher accuracy across different chromosomes? One way to show this would be to use a different symbol for each family member. I think this could be informative in terms of seeing which positions in the pedigree were poorly imputed.

3. It seems to me that the genotyping accuracy presented is a bit unsatisfactory – a median accuracy of 92% is much lower than what is expected from genotyping arrays or low-coverage population sequencing with imputation, and probably makes downstream analyses error-prone. GIGI estimates a posterior probability distribution over genotypes at each marker; why not use this to identify which genotype calls are more confident, and set to missing those calls which do not exceed a confidence threshold? This would allow you to trade off the amount of missing data against accuracy (a minimal sketch of this kind of filter is given after these major comments). It would be preferable, for example, to have 10% missing data but 99% genotyping accuracy.


4. The 25% allele frequency in the family seems to be a very high threshold, and would mean losing a lot of variants. I think it's important to investigate genotyping accuracy in different frequency bands, and also to present the number of variants with a family allele frequency in each range.

5. Did you investigate not applying such a stringent selection criterion to the framework markers, but instead using all of them?

6. I don't think you have conclusively shown that GBS is a better strategy than alternative approaches. In particular, I am not convinced that a combination of WGS with skim sequencing might not work better. You write that the optimal strategy is one 30X WGS per 3-5 GBS samples, so let's say that is roughly 1 WGS + 4 GBS, which translates to $3000 + 4 × $50 = $3200. Based on these costings, a GBS run is the same cost as 0.5X of sequencing, so for the same money you could do one 16X and four 4X genomes, or one 28X and four 1X genomes. I think it would be worth comparing the results to what you would get if you employed this strategy (or even just five 6X genomes). I think you have enough WGS data in this experiment to evaluate this strategy by downsampling some of the 30X genomes (and using the remaining reads to call the gold-standard genotypes for comparison), and using the GIGI-pick samples as the 30X imputation framework samples. I think this would show more conclusively that GBS is the most cost-effective approach.

7. Regarding the conclusion that the optimal strategy is one 30X WGS per 3-5 GBS samples: why do you think this is applicable to all family structures? Also, the process of selecting informative individuals is not clearly explained and appears to have some circularity. For the family in this paper, "Individuals with WGS data were added consecutively in the following order: B, H, J, F, M, K, P, C, D". This order was suggested by the GIGI-pick algorithm using WGS data from these individuals. How do you know which individuals to select without WGS? Would GIGI-pick select the same individuals if GBS data were used instead?
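
To make point 3 above concrete, here is a minimal sketch of the kind of confidence filter I have in mind; the 0.99 threshold and the assumption that the imputed genotype posteriors can be arranged as a markers x individuals x genotypes array are mine for illustration, not part of the authors' pipeline.

    import numpy as np

    def mask_low_confidence(posteriors, threshold=0.99):
        """posteriors: array of shape (n_markers, n_individuals, 3) giving the
        posterior probability of each genotype (e.g. AA/AB/BB) from imputation.
        Returns hard genotype calls (0/1/2), with -1 marking calls set to
        missing because no genotype reaches the confidence threshold."""
        calls = posteriors.argmax(axis=2)
        calls[posteriors.max(axis=2) < threshold] = -1
        return calls

    # Sweeping the threshold gives the missingness-vs-accuracy trade-off:
    # for t in (0.80, 0.90, 0.95, 0.99):
    #     calls = mask_low_confidence(posteriors, t)
    #     print(t, (calls == -1).mean())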

Minor comments:

1. Regarding the line "We additionally removed a set of 578 markers that consistently performed poorly in imputation": how was this assessed? If it was assessed by comparison to the gold-standard genotype results from WGS, then it could result in circular logic, whereby you improve the apparent imputation results by removing the markers which are poorly imputed.

2. How did you assess accuracy at sites with missing data? I presume they were left out of this calculation. What was the missing rate?

3. It's very hard to compare Figures 2C and 2D. I wonder if a log-scaled x axis would help.



Improving the Power of Structural Variation Detection by Augmenting the Reference

Jan Schroeder, Santhosh Girirajan, Anthony T Papenfuss, Paul Medvedev

Review posted on 17th May 2015

Schroeder et al explore the potential benefits of augmenting the reference prior to calling copy number variation. They make the observation that calling insertions is much harder than calling deletions, and propose an elegant and simple solution: make a new 'expanded' reference consisting of the genome plus all insertions from a second reference, and then focus on calling only deletions against this expanded reference. The authors have written tools for wrapping any CNV detection algorithm – essentially running the caller on the expanded reference and then projecting the calls back to the original co-ordinates. The authors demonstrate that such an approach leads to higher sensitivity to detect insertions (using a single caller, Delly, in both hg18 and ref+ space). They validate the calls by looking for direct evidence in the sequence data supporting either the inserted or non-inserted sequence (i.e. reads spanning the breakpoint with or without the inserted sequence). It's a nice approach, but I am concerned that the shortcomings of such an approach have been somewhat overlooked in this paper, and that the benefits have been overstated, as I outline below.
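
For concreteness, my understanding of the back-projection step is along the lines of the following sketch; the function names and the simple offset bookkeeping are my own illustration (assuming insertions are supplied as (hg18 position, inserted length) pairs), not the authors' implementation, which would also need to deal with calls whose breakpoints fall inside inserted sequence.

    import bisect

    def build_offsets(insertions):
        """insertions: list of (hg18_pos, inserted_length) pairs describing where
        the second genome's insertions were spliced into hg18 to build ref+.
        Returns the ref+ start coordinate of each insertion and the cumulative
        inserted length up to and including it."""
        starts, cum, total = [], [], 0
        for pos, length in sorted(insertions):
            starts.append(pos + total)   # ref+ coordinate where this insertion begins
            total += length
            cum.append(total)
        return starts, cum

    def refplus_to_hg18(x, starts, cum):
        """Project a single ref+ coordinate back to hg18; positions falling inside
        inserted sequence have no hg18 equivalent and are returned as None."""
        i = bisect.bisect_right(starts, x) - 1
        if i < 0:
            return x                     # upstream of all insertions
        length_i = cum[i] - (cum[i - 1] if i > 0 else 0)
        if x < starts[i] + length_i:
            return None                  # x lies within the i-th inserted segment
        return x - cum[i]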


Comments

Major

1. One issue which is not well discussed is what happens when there is not enough information for the caller to make a call on ref+ (i.e. not enough read depth or spanning reads). In this case, Delly would make a call of no CNV (as there is not enough information to make a call). This would have the tendency of creating false-positive 'duplication' calls when mapped back to hg18. Indeed the authors saw an inflated FDR, and perhaps this is the reason? Of course, the corresponding outcome of not making a call in hg18 is a false negative, and hence the lower sensitivity of Delly calls in hg18. I would suggest the authors more clearly acknowledge this potential shortcoming of their approach, and discuss potential ways to alleviate the issue (e.g. by excluding regions with low coverage?).
2. The increased accuracy figure is misleading, because it implies a 67% increase in accuracy for all insertions, but in fact it is just for those insertions which they have included in their augmented reference. This is of course likely only a fraction of all insertions present in the donor genome. The proposed approach as laid out only accommodates insertions from a single extra genome. So the 67% gain is highly artificial.
3. I found this 67% figure a bit confusing (stated on line 38), as it seems to contradict line 103 (which says 31%). Probably one is an average per sample and the other is not, and yet standard errors are reported for both? Also, if the authors are reporting sensitivity, it would be good to also see specificity (i.e. not just FDR), so that the reader can directly calculate accuracy from sensitivity and specificity (see the note after these comments).
4. Delly is not really designed for insertion detection: the Delly abstract states that it is for finding deletions and tandem duplications, and the authors seem to exclude tandem duplications. So this comparison seems bound to favour ref+, as Delly can find deletions but it can't find insertions very well. Of course a tandem duplication will result in an insertion, but probably most of the Venter insertions included in this benchmark are not tandem duplications. The authors should state what proportion of the Venter insertions are tandem duplications, and thus potentially typeable by Delly. It seems quite likely that the huge increase in sensitivity is mostly a reflection of how much better Delly is at finding deletions than insertions, and the difference would not be as extreme for other callers.
5. I would think a better comparison would be against a tool which actively attempts to find both insertions and deletions (e.g. Pindel or Dindel, amongst others).
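
As a note on point 3: if π is the proportion of benchmark sites that truly carry an insertion, then overall accuracy = sensitivity × π + specificity × (1 − π), which is why I would like to see specificity reported alongside sensitivity rather than FDR alone.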



Minor:

1. As this paper explores copy number variation only, perhaps the title might use the term 'copy number variation' instead of structural variation. I realise that this approach could be used for other types of structural variation, e.g. inversions, but this has not been explored in this paper.
2. The authors use the term 'adjacency' without explaining what this means. I think most readers would not be familiar with the use of this term in this context.
