Review for "Clusterflock: A Flocking Algorithm for Isolating Congruent Phylogenomic Datasets"

Completed on 21 Aug 2016 by Robert Lanfear.


Comments to author

Author response is in blue.

The authors propose a method that uses a modified flocking algorithm to figure out how many trees are needed to represent a set of alignments of the same taxa. This is an interesting problem, and the proposed solution is a valuable contribution. Nevertheless, there were a number of places in which I thought the study could and perhaps should be improved. I split them into two groups below: those to do with the manuscript, and those to do with the software.

Comments on the manuscript/method:

1. Only a passing mention is given to previous solutions to this problem. Given that there are various previous solutions, it would be useful for the reader to be given some comparison of the relative merits and shortcomings of the solutions, to motivate the current study. It would also be worth noting in the discussion whether and how this new method overcomes the limitations of previous methods. This seems like an important point for readers who are considering which tool to use.

2. There are no simulations. This seems like an important omission to me, because without simulations it's impossible to know when the method works well, and when it doesn't. Although the authors present an analysis of one empirical dataset in which the algorithm appears to do roughly what it should, this is not sufficient to make robust judgements as to the general performance of the algorithm. Thus, without simulations I would argue that the conclusions of the paper are not supported by the data; specifically, it is not possible to claim that 'we show that [clusterflock] is particularly well suited to isolating genes into discrete flocks that share a unique phylogenetic history'. Simulations could obviously take many forms, but a simple approach would be to consider 100 datasets with 1-100 trees underlying them. 100 loci could then be sampled across these trees, and fed into the algorithm.

Repeating each simulation 10 times would require only 1000 analyses, and could give a quite detailed picture of the method's performance. Specific questions to ask would be: what is the false positive rate (i.e. how often do you detect more than one cluster when there is only a single underlying tree)? What is the false negative rate (i.e. how often do you cluster together genes with different underlying trees)? What are the detection limits (e.g. how much data and how different do two trees have to be before you can detect the differences)? What aspects of sequence evolution can mislead the algorithm (e.g. rates of evolution, see below)? How does the ratio of the number of loci to the number of trees affect performance (this seems like a particularly important point to address in a flocking algorithm - it's not obvious to me what will happen to trees that are represented by a single locus, and particularly in the case where most trees are represented by a very small number of loci)?

3. The design choices are described relatively thoroughly in the paper, but very few motivations for these choices are given. Thus, while I might be able to re-implement a similar algorithm by reading the paper, I have no idea why most of the choices were made. It would be nice to include the background to the decisions made when implementing the algorithms, because this would facilitate progress in this area.

4. The use of LD seems reasonable here, but it seems like it could also be misled by genes evolving at different rates. This is because higher rates will tend to exacerbate problems like long-branch attraction. Thus, under parsimony, a slow gene and a fast gene may have quite different most-parsimonious topologies. Given the vast differences in rates between many genes, this seems like a potential issue that could at the very least be explored with simulation, e.g. by simulating 100 genes on the same tree, where 50 evolve slowly and 50 evolve more quickly. By varying the rate ratio of the two genes, one could determine whether this is an issue, and at what kinds of scales it manifests itself.

5. A simple question - could the authors include some information on the relative proportion of the runtimes that are associated with different parts of the algorithm? I ask this because it's easy to think of other options (like calculating ML or NJ trees, and then using any of a number of metrics of tree distances) which might improve accuracy but increase runtimes. However, without knowing what the rate-limiting steps of the algorithm are, it's not possible to know whether such improvements are worth even thinking about.

6. Following from point 5: given that you have to run the algorithm 100 times to get some idea of the robustness of the flocking, how does the aggregated runtime compare to other approaches to this problem? E.g. what about software such as concaterpillar or conclustador? The latter states that it is specifically designed to solve the same problem as clusterflock, so it seems worth comparing the two here. Note that I don't think it's necessary to do better than any other software - this is a very interesting approach that should be described regardless of whether it's better on any particular metric - but it does seem important to make some attempt to compare performance in terms of accuracy and speed.
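
As a rough illustration of the simulation design and scoring suggested in point 2 above, the sketch below enumerates the 100 x 10 = 1000 analyses and defines the two error measures. It is written in Python purely for illustration. All function names are hypothetical, and `simulate_assignment` is only a placeholder that records which tree each locus was drawn from; a real study would simulate trees and sequences and then run clusterflock to obtain the inferred labels.

```python
import random
from itertools import combinations


def simulate_assignment(n_loci, n_trees, rng):
    """Placeholder for a real simulation: assign each locus to a true tree.

    Assumption: trees are drawn uniformly at random, so some trees may end
    up with no loci at all.
    """
    return [rng.randrange(n_trees) for _ in range(n_loci)]


def design(n_loci=100, max_trees=100, replicates=10, seed=0):
    """Yield (n_trees, replicate, true_labels) for every planned analysis."""
    rng = random.Random(seed)
    for n_trees in range(1, max_trees + 1):
        for rep in range(replicates):
            yield n_trees, rep, simulate_assignment(n_loci, n_trees, rng)


def over_split(true_labels, inferred_labels):
    """False positive: a single true tree, but more than one cluster inferred."""
    return len(set(true_labels)) == 1 and len(set(inferred_labels)) > 1


def wrongly_merged_fraction(true_labels, inferred_labels):
    """False negatives: fraction of locus pairs with different true trees
    that were nevertheless placed in the same inferred cluster."""
    different = merged = 0
    for i, j in combinations(range(len(true_labels)), 2):
        if true_labels[i] != true_labels[j]:
            different += 1
            if inferred_labels[i] == inferred_labels[j]:
                merged += 1
    return merged / different if different else 0.0
```

The same `over_split` check would apply to the rate experiment in point 4, where all loci share one tree and any additional cluster is by definition a false positive.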

Comments on the software:

1. The way that github has been used is unconventional, and inconvenient. The only way I could download the software was to download a whole collection of other pieces of software along with it. Please give this software its own repository. This will also facilitate future collaboration and development, since github works fundamentally at the level of the single repository.

2. Please mint a DOI for the released version of the software with Zenodo or some other service. This ensures that the software will stay around if the github repo is deleted, and it also ensures that the ms refers to a persistent and tagged version of the software even if the repo stays around and the software continues to be developed.

3. There are no tests in the software. In this case, tests seem rather vital. The paper describes clusterflock as 'an open source tool', so presumably the intention is that many others will use it. Simulations will form a useful set of tests on their own, and should be included in the repository with a script to run all tests and check that they produce the expected results. (note - the results don't have to be correct, but there should be some checking to make sure that they are expected). Given that the algorithm is stochastic, it might be useful to include an option to provide a random number seed in the code, in particular to facilitate testing. Unit tests would also be useful, to ensure that key functions are behaving as expected. As it stands, software with no tests does not inspire a great deal of confidence.

4. More documentation is needed. I suspect this is particularly the case here, since the vast majority of the end-users of the tool will not know Perl. It would be worth putting together a comprehensive manual, and in particular providing detailed installation instructions and a quickstart guide. For example, although I am quite proficient in a couple of languages I do not use Perl. Even if I had access to a Linux machine to test the software (sadly, I don't, but I hope at least one reviewer does), I'm guessing that getting it up and running would have taken me some time.

5. I searched for a license, and found one in the script. But I am confused. The license states that the work is copyright of the AMNH, but also that it is released under the same terms as Perl itself. These seem incompatible, and also perhaps incompatible with the three dependencies that are packaged in the repo. Can the authors double check this, and when they are sure they have a valid license, include it somewhere obvious in the repository and the manual?

6. Just an observation: 'Clusterflock' is a very popular name for many things, and that makes this tool very hard to find on google. Even typing 'clusterflock phylogenetics github' does not produce a link to the tool. It might be worth considering a name that makes the tool easier to find.
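
To illustrate the seeded testing suggested in software comment 3, here is a minimal sketch in Python (clusterflock itself is written in Perl, so this shows the pattern only). `toy_flocking` is a hypothetical stand-in for a stochastic clustering run and is not clusterflock's algorithm; the point is simply that accepting a seed makes a stochastic result reproducible, so a test can check that the output is the expected one.

```python
import random


def toy_flocking(n_loci, n_flocks, seed=None):
    """Hypothetical stand-in for a stochastic clustering run.

    Uses its own random.Random instance so that the seed fully determines
    the result, independent of any global random state.
    """
    rng = random.Random(seed)
    return [rng.randrange(n_flocks) for _ in range(n_loci)]


def test_same_seed_same_result():
    # Two runs with the same seed must give identical flock labels.
    assert toy_flocking(50, 4, seed=42) == toy_flocking(50, 4, seed=42)


def test_labels_are_valid():
    # Every locus gets exactly one label in the allowed range.
    labels = toy_flocking(50, 4, seed=1)
    assert len(labels) == 50
    assert all(0 <= label < 4 for label in labels)
```

Tests of this shape check that results are as expected rather than that they are correct, which is the distinction drawn in comment 3.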

