Minor allele frequency thresholds strongly affect population structure inference with genomic datasets

Ethan B. Linck, C. J. Battey

Review posted on 15th September 2017

Just sitting with @kdmurray801 and discussing this study. We think there are some issues that should be considered.

1. One STRUCTURE run per evolutionary replicate (from the simulations) conflates error due to the evolutionary process with error due to the STRUCTURE inference itself (MCMC-related stochasticity). Using a recommended approach to STRUCTURE analysis (i.e. examining variability among replicate STRUCTURE runs for each value of K considered) would be more realistic, even if it meant reducing the number of evolutionary replicates.
2. MAF threshold is completely confounded with number of loci, so the effect of the MAF threshold cannot be separated from a loss of power due to the smaller marker set that remains as the MAF cutoff increases. Randomly sampling a constant-sized subset of loci within each MAF threshold would overcome this confound. Drawing these random subsets from a larger set of simulated loci would be required if more than a few hundred loci are desired.
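To make point 2 concrete, here is a minimal sketch of the constant-sized subsampling we have in mind. It is not the authors' pipeline: the function names (`minor_allele_freq`, `subsample_by_maf`), the 0/1/2 genotype coding, the absence of missing data, and the toy simulated genotypes are all our own assumptions for illustration.

```python
import numpy as np


def minor_allele_freq(genotypes):
    """Per-locus minor allele frequency.

    Assumes (hypothetically) a diploid individuals x loci array of
    0/1/2 alternate-allele counts with no missing data.
    """
    alt_freq = genotypes.mean(axis=0) / 2.0
    return np.minimum(alt_freq, 1.0 - alt_freq)


def subsample_by_maf(genotypes, thresholds, n_loci, seed=0):
    """For each MAF threshold, draw the same number of loci at random
    (without replacement) from the loci that pass that threshold."""
    rng = np.random.default_rng(seed)
    maf = minor_allele_freq(genotypes)
    subsets = {}
    for t in thresholds:
        passing = np.flatnonzero(maf >= t)
        if passing.size < n_loci:
            raise ValueError(
                f"only {passing.size} loci pass MAF >= {t}; need {n_loci}"
            )
        # Sorted column indices into the genotype matrix.
        subsets[t] = np.sort(rng.choice(passing, size=n_loci, replace=False))
    return subsets


# Toy data standing in for one simulated replicate (illustrative only).
rng = np.random.default_rng(42)
true_freqs = rng.uniform(0.01, 0.5, size=5000)
genotypes = rng.binomial(2, true_freqs, size=(50, 5000))

subsets = subsample_by_maf(genotypes, thresholds=[0.01, 0.05, 0.10], n_loci=500)
for t, idx in subsets.items():
    filtered = genotypes[:, idx]  # same number of loci at every threshold
    print(t, filtered.shape)
```

With the marker count held fixed like this, any remaining difference between thresholds can be attributed to the allele frequency spectrum of the retained loci rather than to the size of the dataset.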
A few more things popped up (mainly to do with the approach to missing data, the estimation of accuracy, and running an ANOVA on data you wouldn't expect to meet its assumptions), but these were the key ones that stood out, and I think they could be dealt with fairly easily.

