A new multi-genomic approach for the study of biogeochemical cycles at global scale: the molecular reconstruction of the sulfur cycle

Valerie De Anda, Icoquih Zapata-Penasco, Bruno Contreras-Moreira, Augusto Cesar Poot-Hernandez, Luis E. Eguiarte, Valeria Souza

Review posted on 28th July 2017

Summary and overall assessment:
On a positive note, I found this manuscript by De Anda and colleagues interesting to read. The authors describe their approach for identifying sulfur metabolic pathways (including significant occurrences linked to genomes and metagenomes) by implementing a computational entropy method. In general, I think the work is well executed and statistically robust, and there is value in this work, which has been well detailed by the authors. As such, I have no doubt that this approach would be very useful to others in the field. I applaud the authors for this extensive work.

However, the manuscript requires substantial revisions and improvements. If I had one major recommendation, it would be that the authors should consider adding a module in the pipeline to identify the completeness of the sulfur metabolic pathways based on the presence (or absence) of molecular markers (Table 1 - high entropy and low standard deviation; also see line numbers 341-343). While the science is technically sound, a major issue is the presentation (particularly the language). From the abstract, right to the end, the manuscript needs a thorough edit.
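
To illustrate the kind of completeness module I have in mind, here is a minimal sketch. It is not the authors' pipeline: the function name, the pathway names and the marker identifiers are all hypothetical placeholders, and I assume only that each pathway can be described by a set of informative marker families (such as those in Table 1) and that the pipeline already yields the set of families detected in a given genome or metagenome.

```python
from typing import Dict, Set


def pathway_completeness(
    markers_by_pathway: Dict[str, Set[str]],
    detected: Set[str],
) -> Dict[str, float]:
    """Return, per pathway, the fraction of its marker families that were detected."""
    report: Dict[str, float] = {}
    for pathway, markers in markers_by_pathway.items():
        if not markers:
            # A pathway without defined markers cannot be scored; report 0.0.
            report[pathway] = 0.0
            continue
        report[pathway] = len(markers & detected) / len(markers)
    return report


# Hypothetical marker sets and detections (placeholders, not data from the manuscript).
markers_by_pathway = {
    "pathway_A": {"PF_A1", "PF_A2", "PF_A3", "PF_A4"},
    "pathway_B": {"PF_B1", "PF_B2"},
}
detected = {"PF_A1", "PF_A3", "PF_B1", "PF_B2", "PF_X9"}

completeness = pathway_completeness(markers_by_pathway, detected)
for pathway, fraction in sorted(completeness.items()):
    print(f"{pathway}: {fraction:.2f}")
```

A simple presence/absence fraction like this could be reported alongside the entropy-based score, so that readers can distinguish a high score driven by a few markers from one supported by a largely complete pathway.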

In general, the abstract and introductory sections need extensive revisions. For instance, the last line of the abstract ends on a rather negative point, which, in my view, fails to capture the significance of the approach followed here.

Specific comments follow below:

L 105-108: I was a bit confused here. How will these comparisons be done based on this work? If I am not wrong, you are comparing biogeochemical pathways rather than analyzing the scores of sulfur metabolic pathways?
L 124: I assume here that whole genome sequencing means complete genomes? The authors should revise the text to reflect this and not use the word 'fully'.
L 147: What was the basis for selecting these values? An explanation (or at least a reference) would be helpful here.
L 152: move the "microbiome" to "… associated microbiome sequences….".
L 156-158: Rephrase the sentence.
L 159: What is the average length of ORFs?
L 166-168: What do you mean by "seven categories of increasing length"? The authors should be careful when discussing length and should provide the range and mean length of the fragments.
L 173: Which "omic" dataset, genomes, metagenomes, or both?
L 229-230: Rephrase the sentence. What do you mean by "computed in 2014"? If the authors are referring to the year in which they accessed the database, I suggest using paraphrases throughout.
L 237-240: Which script was used for the validation of the results? Was this the one placed on GitHub?
L 358: Please rewrite this "….taxonomy according in NCBI…." to "….taxonomy according to NCBI….".
L 380-381: How many genomes were from the metagenome reconstructions?
Figure 1: In the figure, Random Rlist is shown with N=161. Should this be N=1000?
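
Regarding the comments on L 159 and L 166-168, the requested length statistics are straightforward to report. A minimal sketch follows, assuming the fragment (or ORF) lengths are available as a list per size category; the category names and lengths below are made-up placeholders, not values from the manuscript.

```python
from statistics import mean
from typing import Dict, List, Tuple


def length_summary(
    lengths_by_category: Dict[str, List[int]],
) -> Dict[str, Tuple[int, int, float]]:
    """Return (minimum, maximum, mean) length for each non-empty size category."""
    return {
        category: (min(lengths), max(lengths), mean(lengths))
        for category, lengths in lengths_by_category.items()
        if lengths  # skip empty categories, for which no statistics exist
    }


# Hypothetical size categories and lengths (placeholders only).
lengths_by_category = {
    "category_1": [30, 45, 60],
    "category_2": [100, 150, 200],
}

summary = length_summary(lengths_by_category)
for category, (shortest, longest, average) in sorted(summary.items()):
    print(f"{category}: range {shortest}-{longest}, mean {average:.1f}")
```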

Some examples of other language edits follow below:

L 270: Rewrite "…the distribution of the metabolic S-guilds….." to "…the distribution of metabolic S-guilds…".
L 292: Rewrite "…methylated thiols such….." to "…methylated thiols such as…".
L 384: archaon to archaeon.
L 397: bioremediation to bioremediation.
L 400: "…in both in…" to "…in both…"

