Review for "Association Mapping From Sequencing Reads Using K-mers"

Completed on 27 Jan 2018 by Casey Greene. Sourced from https://www.biorxiv.org/content/early/2017/05/23/141267.


Comments to author

I reviewed this paper at a journal. I thought that the journal in question would make the review public, but perhaps that happens only after the paper is accepted. In the interests of improving the discussion of papers before they are published, I'm posting my review here as well.

---

Confidential Competing Interests (required):
None

Reveal reviewer identity to authors (required): Yes

General assessment and major comments (Required):
The authors describe HAWK, a k-mer based approach to association analysis. The idea is certainly clever, and I can imagine this work as a jumping-off point for other approaches to the analysis of genetic variants that differ between groups.

I have some concerns about how the work is presented. The method discusses k-mer association analysis as a technique for "sequencing data." Within the manuscript, the method is applied to simulated E. coli genomic data and to the 1k genomes dataset.
- If the authors want to suggest that this works for E. coli or other bacterial data, they should apply the method to real genome sequencing data from these organisms. It seems like plasmids that vary with the test condition could make the approach somewhat computationally expensive (they would need to be built from k-mers). It'd be nice to see A) if this works in practice; and B) how scaling is affected. If this is not intended to be used for real microbial data, then perhaps the authors should note this.
- The only application in the manuscript is to whole genome data. It seems like this approach would be a relatively inefficient way to deal with RNA-Seq data. Should the domain be refined?
- Is the approach expected to work with exome sequencing data? If so, it would be nice to see an example showing that the capture process doesn't introduce any systematic biases that affect the method's false positive rate.

I downloaded the software and it compiled successfully. It is a bit difficult to use. The documentation is also sparse. It would be helpful to have a wrapper script that would handle the most common workflow as well as documentation with one fully worked example. The version of the source code associated with the published paper should be archived to figshare, zenodo, or a similar service.

Some assertions are made in the introduction about the computational cost of competing methods. It would be helpful to me to see some benchmarking of HAWK alongside these assertions.

"We provide scripts to lookup number... as future work." This is fine to leave for the future, but can you at least provide some documentation of these scripts in the repository's README?

Lines 277-280: Is it possible that certain samples have different contamination? I'm not disputing that this is one possible explanation, but it doesn't seem like other possibilities (contamination, etc.) have been ruled out at this point.

Minor Comments:
In "Counting k-mers", what does "sample" mean in "appear once in a sample"? Does this mean once in a condition, or is there a first stage of per-sample filtering before the k-mers are aggregated?
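
To make the two readings concrete, here is a minimal sketch of how they could differ. This is not HAWK's code: the function names and the toy reads are hypothetical, reverse complements are ignored, and "appear once" is assumed to mean a count of exactly one.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across the reads of a single sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_per_sample(samples, k):
    """Reading A: drop singletons within each sample, then aggregate."""
    total = Counter()
    for reads in samples:
        counts = count_kmers(reads, k)
        total.update({kmer: n for kmer, n in counts.items() if n > 1})
    return total


def filter_per_condition(samples, k):
    """Reading B: aggregate across the condition, then drop singletons."""
    total = Counter()
    for reads in samples:
        total.update(count_kmers(reads, k))
    return Counter({kmer: n for kmer, n in total.items() if n > 1})


# Hypothetical toy condition: two samples, one read each.
condition = [["ACGTA"], ["ACGTT"]]
per_sample = filter_per_sample(condition, k=4)
per_condition = filter_per_condition(condition, k=4)
```

With this toy input, `ACGT` occurs once in each sample, so reading A discards it while reading B keeps it with a count of 2. The two interpretations can therefore yield different k-mer sets, which is why the text should state which one is used.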

The GitHub repo contains .DS_Store files. These should be removed, and .DS_Store should be added to .gitignore.

Both the GPL v2 and v3 licenses appear to be included with the CPP source code.

The source code on the website has a version number, but there are no tags in the GitHub repository. Please tag the corresponding commit with the version number.

In "verification with 1k genomes data": I think line #169 is referring to significant differences between the YRI and TSI samples using the standard calling algorithm. This paragraph could be reworded for clarity.

typo: "While upto 20%"