Review for "Clusterflock: A Flocking Algorithm for Isolating Congruent Phylogenomic Datasets"

Completed on 14 Mar 2016 by Robert Lanfear.

License: http://creativecommons.org/licenses/by/4.0/



Comments to author

Author responses are given in the linked response document at the end of this report.

I have a few remaining comments:

1. Usability

While I think the method is interesting, the implementation remains very difficult to use. After two hours of attempting to install and run the software (I am a proficient programmer in Python and R, but have zero Perl experience), I gave up. The installation remains complex for non-Perl experts, and the sparse documentation does not help (it has been expanded somewhat, but it remains far too thin to be useful to non-Perl programmers). Because of this, the utility of the tool for its end-users (presumably biologists with multi-locus datasets) is questionable. I do not see this as a barrier to publication of the method, which is itself interesting, but since the primary focus of this paper is the software, it does strike me as an issue.

2. DOI

The authors provide no cogent reason not to provide a DOI for their software, and I don't know what the issue is here. Without a DOI (e.g. through Zenodo), there is no guarantee that the software will stay around. This is a problem both for reproducibility and for the general utility of the work. Given that link rot and lost or broken software are such huge problems in our field, and given that the primary focus of this paper is the provision of 'an open-source tool', I think it is important to properly archive a version of the software with a DOI here. Neither tagging versions on GitHub nor making a copy of the repo on Bitbucket guarantees persistence. But the ~10 minutes it takes to mint a DOI through Zenodo does guarantee persistence. It means that, no matter what the authors decide to do with their GitHub repository, the copy of the code used for this manuscript will remain available and discoverable from the manuscript itself.

A side note: the authors state that they have tagged the current version of the software as 0.1. However, there are no tags or releases on their GitHub repository. Tags and releases are specific mechanisms designed to help people get to particular versions of software: https://help.github.com/articles/creating-releases/ . Minting a DOI with Zenodo would solve this problem too, since Zenodo works with tagged versions of the repository only (a sketch of the tagging step follows below).
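For concreteness, here is a minimal sketch of the tagging step using standard git commands. The tag name v0.1 is my assumption, taken from the version number the authors cite; with the Zenodo-GitHub integration enabled, publishing a release from such a tag triggers the archival deposit and DOI:

    # create an annotated tag for the version described in the manuscript
    git tag -a v0.1 -m "clusterflock 0.1, as described in the manuscript"
    # push the tag to GitHub so it appears under the repository's tags/releases
    git push origin v0.1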

3. Simulations

Can the authors please provide data (in a figure) on the number of clusters returned by clusterflock in each of the simulated datasets versus the number of underlying topologies that were simulated? It is not possible to derive this from the currently presented data, and it is an important part of assessing the accuracy of the algorithm on the simulated datasets. (A sketch of the kind of figure I have in mind follows below.)
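To illustrate the requested figure, here is a minimal matplotlib sketch; all of the values below are placeholders for illustration, not real results from the manuscript:

    # sketch of the requested figure: clusters returned vs. topologies simulated
    import matplotlib.pyplot as plt

    simulated = [2, 5, 10, 25]   # number of topologies simulated (placeholder values)
    recovered = [2, 6, 9, 18]    # clusters returned by clusterflock (placeholder values)

    plt.scatter(simulated, recovered)
    plt.plot([0, 25], [0, 25], linestyle="--")  # y = x line marks perfect recovery
    plt.xlabel("Number of simulated topologies")
    plt.ylabel("Number of clusters returned")
    plt.savefig("clusters_vs_topologies.png")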

4. Data availability

Please provide the output data from the simulations: specifically, the data that could be used to recalculate Figures 3 and 4, i.e. the identity of the simulated topology versus the topology to which clusterflock assigned each locus. (A sketch of a suitable layout follows below.)
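To be concrete, a simple per-locus table would suffice; the layout below is a hypothetical example (the column names and values are mine, chosen to match the comparison described above):

    locus,simulated_topology,assigned_cluster
    locus_0001,T1,C1
    locus_0002,T1,C1
    locus_0003,T2,C3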

5. Discussion of performance

Figures 3 and 4 would benefit from having the expected Jaccard index under random assignment of trees to loci plotted. This way we could see which methods do no better than randomly assigning trees to groups. As far as I can tell, clusterflock with 50% missing data tracks the random expectation very closely (JI = 0.5 with 2 trees; 0.1 with 10 trees; 0.04 with 25 trees; a minimal sketch of this baseline follows below). This in itself is interesting: even with data for 50% of the species, clusterflock does not appear to gain any benefit over randomly assigning trees to groups. Can the authors comment on this particular case? It seems counterintuitive to me that with data for 50% of the species at each locus, the method gains no benefit over random assignment.
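For reference, the random baselines quoted above match 1/k for k candidate trees (0.5, 0.1, and 0.04 for k = 2, 10, and 25). The following minimal simulation reproduces them, under my assumption that the score is per-locus agreement between the simulated topology and a uniformly random assignment:

    import random

    def random_baseline(n_loci, k, trials=1000):
        """Estimate the expected fraction of loci whose randomly assigned
        topology matches the simulated one; analytically this is 1/k."""
        total = 0.0
        for _ in range(trials):
            truth = [random.randrange(k) for _ in range(n_loci)]
            guess = [random.randrange(k) for _ in range(n_loci)]
            total += sum(t == g for t, g in zip(truth, guess)) / n_loci
        return total / trials

    for k in (2, 10, 25):
        print(k, round(random_baseline(1000, k), 3))  # prints ~0.5, ~0.1, ~0.04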

More generally, can the authors comment on the meaning (for biologists) of the fact that clusterflock achieves a JI of only ~0.4 when there are 25 simulated topologies? If the algorithm correctly assigns loci to topologies less than half of the time in these simulations, what does this mean for biological inferences from the data? It seems from the simulated and empirical data that while clusterflock might be useful when the number of clusters is very small (e.g. <10), it might be much less useful with >10 clusters. For example, while the empirical test presented in the paper is compelling, the algorithm would likely have been much less useful had there been many recombination events (as might be the case in many empirical datasets, such as analyses of whole bird genomes from across the avian tree of life).

As above, some comparison with existing approaches to this problem is warranted here: if clusterflock does better than existing approaches (e.g. Concaterpillar, Conclustador), then that is valuable even if the absolute performance remains less than ideal. In that case, biologists should prefer clusterflock because it makes the best available inferences. However, if clusterflock is consistently worse than other methods, then we know that it is a neat method that requires additional development before it is useful. In my opinion, knowing which of these situations applies would vastly strengthen the paper.

Level of interest
Please indicate how interesting you found the manuscript:

An article whose findings are important to those with closely related research interests

Quality of written English
Please indicate the quality of language in the manuscript:

Acceptable

Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:

1. Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organisation that holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this paper?

If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

None.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

I agree to the open peer review policy of the journal.

Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0152-3/13742_2016_152_AuthorComment_V2.pdf)