Review for "Prot-SpaM: Fast alignment-free phylogeny reconstruction based on whole-proteome sequences"

Completed on 18 Jun 2018

Comments to author

The authors present an alignment-free method, Prot-SpaM, that estimates phylogenetic distances between complete or incomplete proteomes. They compare Prot-SpaM with other alignment-free methods in terms of computational time and similarity of the resulting trees to reference trees, using simulated, prokaryotic, and eukaryotic datasets.

Recommendation: The authors should prepare a major revision for a second review.

Minor Comments:

-. References are not ordered.

-. On page 4, the cartoon explaining the concept of spaced-word matches is not correct.

-. The Figure 2 legend says ProtFSWM, not Prot-SpaM.

-. On page 5, the authors check whether the spaced-word matches between the two compared sequences form a one-to-one mapping. Please describe how a one-to-one mapping was defined when matches are one-to-many or many-to-many.

-. On page 5, the authors use the Kimura model to approximate PAM distances. Please cite a reference for the Kimura model. If the model involves parameters, please describe how they were estimated.

-. The authors use BLOSUM62 to distinguish homologous spaced-word matches from random spaced-word matches, but they approximate PAM distances between protein sequences. Please explain the rationale for using different substitution matrices for these two purposes.

-. For Table 1, Table 2, Table 3, Figure 2, and Figure 5, please state which k-mer length was used for the FFP method.

-. On page 8, "One interpretation is that misleading signal stemming from recombination events between Wolbachia strains is less problematic for alignment-free analysis then a reduction in he dataset size." <= Please revise this sentence.

-. On page 9, "we applied Prot-SpaM to all available protein sequences from these 813 taxa. In addition, we ran Prot-SpaM on the protein sequences encoded by the 24 marker genes from Lang et al." <= When Prot-SpaM was applied to these two different types of datasets, were the same selected spaced-word matches and the same patterns used?

-. For Table 3, what is the unit of computational time (seconds/minutes/hours)?
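
For reference on the Kimura comment above: the correction commonly attributed to Kimura (1983) estimates the number of substitutions per site as d = -ln(1 - p - 0.2 p^2), where p is the observed fraction of mismatching amino-acid positions. Whether the manuscript uses exactly this form is an assumption on my part, which is why a citation is requested. A minimal sketch of that formula follows; the function name `kimura_distance` is hypothetical and not taken from the Prot-SpaM code.

```python
import math


def kimura_distance(p: float) -> float:
    """Approximate substitutions per site from the mismatch fraction p.

    Uses d = -ln(1 - p - 0.2 * p**2), the form commonly attributed to
    Kimura (1983). Assumption: this is the correction meant in the
    manuscript; the manuscript itself does not give a reference.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0.0:
        # The logarithm is undefined once p exceeds roughly 0.854.
        raise ValueError("mismatch fraction too large for this correction")
    return -math.log(arg)
```

In this form the correction has no free parameters, and it is undefined for mismatch fractions above roughly 0.854, so it would help if the authors stated how saturated comparisons are handled.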

Major Comments:

-. Since the authors describe a new alignment-free method for whole-proteome phylogeny, please include in the comparison the existing alignment-free method developed specifically for whole-proteome phylogeny ("Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution" by Jun et al., PNAS 2010; 107:133-138).

-. There is no discussion of Table 3, which summarizes the computational time of Prot-SpaM and the other alignment-free methods.

-. The authors claim that Prot-SpaM generates more statistically meaningful trees than other alignment-free methods, but I do not see any description of statistical confidence on internal nodes. Please describe how statistical confidence can be assigned to the internal nodes of trees generated by Prot-SpaM.

-. On page 9, "There are some differences within the clades, though, that should be further investigated." <= Please at least provide the RF and branch score distances between the four trees in Figure 4.

-. In Figure 3, did Prot-SpaM segregate E. coli from Shigella? The paper "Insights from 20 years of bacterial genome sequencing", published in Funct Integr Genomics, 2015; 15:141-161, showed a tree that clearly segregated E. coli from Shigella using the method described in "Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution" by Jun et al., PNAS 2010; 107:133-138. Please discuss which alignment-free methods, including the method of Jun et al., are capable of segregating E. coli from Shigella.

-. On page 8, the authors describe the analysis results for the Wolbachia strains without presenting a tree. Please provide the tree, annotated in accordance with the discussion, to support the analysis results. In the Wolbachia subsection, the authors describe how an alignment-based tree was generated for the Wolbachia II dataset, but there is no discussion of the comparison results with other alignment-free methods for this dataset.

-. The selected spaced-word matches and the patterns are the most important factors of the Prot-SpaM method. It seems that a pattern length of l = 46 (a weight of w = 6 plus 40 don't-care positions) and five patterns were used throughout the study, but I do not see how the authors arrived at these parameter values. First, please describe the optimization procedure for l, w, and the five patterns. For these optimized values, does "selected spaced-word matches" mean spaced-word matches with scores > 0? Please define 'selected' in "selected spaced-word matches". Second, please clarify whether the same settings (l = 46, w = 6, 40 don't-care positions, five patterns) were fixed for every dataset (simulated, prokaryotic, and eukaryotic proteomes, and protein sequences). Please provide the sets of selected spaced-word matches and patterns in the Supplementary material. Third, please report the computational time needed to determine the selected spaced-word matches and patterns. Fourth, Prot-SpaM uses only selected spaced-word matches, which means that only a fraction of each proteome, rather than the whole proteome, is compared when the phylogenies are built. Please report what fraction of the proteomes, on average, Prot-SpaM used for the datasets discussed in the manuscript.

-. Comparing Table 1 (RF distance) with Table 2 (branch score distance): since Prot-SpaM, unlike other alignment-free methods, captures the evolutionary distance between two sequences, it should perform better than the other alignment-free methods against alignment-based reference trees in terms of branch score distance. However, for some datasets the trees produced by CVTree were closer to the alignment-based reference trees than the trees produced by Prot-SpaM in terms of branch score distance, even though CVTree does not capture evolutionary distance at all. Please discuss this issue.

-. In the manuscript, validation relies solely on alignment-based reference trees, which gives the impression that the authors aimed to develop an alignment-free method that produces trees most closely resembling alignment-based trees. For example, the reference tree of 813 prokaryotes was based on 24 marker genes and was found to be very similar to the 16S rRNA-based tree. Under this validation procedure, Prot-SpaM is then shown to produce a tree that closely resembles a 16S rRNA-based tree, which does not require orthology analysis and might not be computationally inferior to Prot-SpaM. Furthermore, even though the pairwise Prot-SpaM distances capture evolutionary distance, the distances on the Prot-SpaM tree cannot be interpreted in terms of substitution rates, since only distance-based tree-building methods are applicable. To investigate other advantages of Prot-SpaM, the method needs to be evaluated on taxonomic classification in comparison with other alignment-free methods, since taxonomic classification captures evolutionary information. Please examine the taxonomic classification performance on the datasets discussed in the manuscript, at least at the species level, in comparison with other alignment-free methods.

-. The following error messages occurred when compiling the code downloaded from "" with 'make'. It seems that sysinfo.h is not provided.

mkdir -p obj
g++ -fopenmp -c -Wall -std=c++11 -I ./include main.cpp -o obj/main.o
In file included from ./include/speedsens.hpp:6:0,
                 from ./include/rasbcomp.hpp:9,
                 from ./include/rasbhari.hpp:8,
                 from ./include/rasbimp.hpp:4,
                 from main.cpp:26:
./include/sensmem.hpp:7:25: fatal error: sys/sysinfo.h: No such file or directory
#include "sys/sysinfo.h"
compilation terminated.
make: *** [obj/main.o] Error 1
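
Regarding the request above for RF distances between the four trees in Figure 4, the following is a rough sketch of how the RF distance between two unrooted trees on the same taxa can be computed as the number of non-trivial bipartitions found in exactly one of the two trees. The nested-tuple tree representation and the function names are my own assumptions for illustration; in practice the authors would likely use an established phylogenetics package. The branch score distance additionally requires branch lengths and is not covered here.

```python
def _collect_clades(tree, clades):
    """Return the leaf set below `tree` and record every internal clade."""
    if isinstance(tree, str):  # a leaf is represented by its taxon name
        return frozenset([tree])
    leaves = frozenset().union(*(_collect_clades(c, clades) for c in tree))
    clades.add(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of names.

    Each bipartition is stored as the side that does not contain a fixed
    reference taxon, so the result does not depend on where the tree
    happens to be rooted.
    """
    clades = set()
    taxa = _collect_clades(tree, clades)
    ref = min(taxa)
    result = set()
    for clade in clades:
        side = taxa - clade if ref in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(taxa) - 1:
            result.add(side)
    return taxa, result


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: splits present in exactly one tree."""
    taxa_a, splits_a = splits(tree_a)
    taxa_b, splits_b = splits(tree_b)
    if taxa_a != taxa_b:
        raise ValueError("both trees must contain the same taxa")
    return len(splits_a ^ splits_b)
```

For example, `rf_distance((("A", "B"), ("C", "D"), "E"), (("A", "C"), ("B", "D"), "E"))` compares two five-taxon trees that share no internal split.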

