preprint reviews by Daan Speth

A new multi-genomic approach for the study of biogeochemical cycles at global scale: the molecular reconstruction of the sulfur cycle

Valerie De Anda, Icoquih Zapata-Penasco, Bruno Contreras-Moreira, Augusto Cesar Poot-Hernandez, Luis E. Eguiarte, Valeria Souza

Review posted on 27th July 2017

In their manuscript, "A new multi-genomic approach for the study of biogeochemical cycles at global scale: the molecular reconstruction of the sulfur cycle", De Anda and colleagues describe a computational approach to characterize the relative importance of sulfur cycling in a (meta)genome dataset.

Their approach, calculating a "sulfur-score" based on the detected vs expected presence of genes involved in sulfur cycling, seems appropriate for this question. I do, however, have several questions on the methodology (and its description) that I'd like to see clarified before recommending this manuscript for publication.

The introduction of the conceptual framework in lines 74-89 was not very clear to me at first reading. For readability of the manuscript, I suggest the authors include some of the information that is present in the methods section, specifically why the minimum ecosystem concept and microbial mats are important.

I'm a little confused as to why the authors use the mean size length metric. Given a well-curated reference database, even short reads should be alignable to protein sequences of any length. The length of the protein will, however, impact the expected number of matches.
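To illustrate the last point, here is a minimal sketch of how protein length affects the expected number of read matches. It assumes reads are sampled uniformly at random along the community DNA and ignores read length and edge effects; the function names and parameters are hypothetical and are not taken from the manuscript.

```python
def expected_read_hits(protein_length_aa, n_reads, community_size_bp, copies=1):
    """Approximate expected number of reads originating from one gene.

    Assumes uniform random sampling of reads along the community DNA,
    so the expectation scales linearly with gene length.
    """
    gene_length_bp = 3 * protein_length_aa  # three nucleotides per codon
    return n_reads * copies * gene_length_bp / community_size_bp


def hits_per_kb(observed_hits, protein_length_aa):
    """Normalize an observed hit count by gene length (hits per kilobase)."""
    gene_length_kb = 3 * protein_length_aa / 1000
    return observed_hits / gene_length_kb


if __name__ == "__main__":
    short = expected_read_hits(200, n_reads=1_000_000, community_size_bp=3_000_000)
    long_ = expected_read_hits(400, n_reads=1_000_000, community_size_bp=3_000_000)
    # A protein twice as long is expected to recruit twice as many reads,
    # but the length-normalized values are the same.
    print(short, long_)
    print(hits_per_kb(short, 200), hits_per_kb(long_, 400))
```

Under these assumptions, normalizing counts by length removes the length effect, which is why it may be simpler to correct for protein length than to match read size to protein size.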

After the authors have selected 152 proteins involved in the S cycle, they only use 112 domains as annotated by InterProScan. Do these domains represent 112 proteins? Why did the authors choose not to generate HMMs for the remaining 40 proteins?

Other than calculating the relative entropy, have the authors used any other check to assess whether the detected Pfam domains were specific for the S-cycle? Many Pfam domains contain proteins with a range of functions.

The purpose of figure 3 is not entirely clear to me. As it stands, it contains too much information to be informative.

The elaborate description of the metagenomes in lines 433-463 seems unnecessary for the flow of the manuscript.

Specific points

Line 32: ROC is used but not defined until later in the text.

Line 46: I suggest changing "apparition" to "appearance".

Line 147: What does a DNA signature of 0.01 mean?


First genomic insights into members of a candidate bacterial phylum responsible for wastewater bulking

Review posted on 04th November 2014

Comments for the author

Summary: In this manuscript the authors describe binning and analysis, using sophisticated bioinformatics, of two genomes from organisms belonging to candidate phylum KSB3. These filamentous organisms have been identified as a causative agent in sludge bulking of a UASB reactor. The genomes are used to construct a metabolic model of the organism and to gain insight into its ecophysiology. Additionally, based on the large number of signal processing genes in both genomes, the authors hypothesize that these organisms must be motile, and they proceed to show gliding motility using microscopy.

Although I think the study is well done and the manuscript is well written, there are a few things I’d like the authors to address:

In the methods section the DNA extraction procedure should briefly be mentioned, since it provides the underlying material for the data generated. Referring to a previous study (which in turn refers to a previous study) is therefore, in my opinion, not sufficient.

Accession codes are provided for the assembled/scaffolded genomes, but I could not find the raw data. The same goes for the study in which most of the sequencing was originally reported. I’d like to see the underlying raw data submitted to NCBI/EBI/DDBJ.

Since UASB14 is the second most abundant organism of a non-bulking UASB it might benefit the paper to briefly discuss its role in the healthy system, and maybe speculate why it is more successful than UASB270.

Although a range of stimulants for motility are mentioned in the methods section, only glucose and maltose are mentioned in the results/discussion. In my opinion the discussion would be more complete if the authors gave their thoughts on the absence of a response to the other stimulants.

In addition to the points above, I have a few minor comments on specific points in the text.

Line 46: “cellular processes, including bulking,” seems to imply bulking is a cellular process. Maybe “cellular processes, including those causing bulking,” fits better?

Line 83: Although I understand what is meant by population genomes, I think it is not (yet) an established term. Perhaps explaining the term in a few words will help establish it more quickly.

Line 175-176: It is unclear what is meant by “a minimum similarity of 98% of the read length”, since the CLC mapper allows specification of ‘minimum similarity’ and ‘fraction of read length’ as separate parameters.

Line 298: The N50 for the assembly is given, but this metric is, in my opinion, quite meaningless for a metagenome since the sampling is (almost) always partial. I would advise removing it.

Line 385: The authors mention CRISPRs are present in all Archaea, but I think this is not true. Unless I’m mistaken, they are absent from at least some thaumarchaea.

Line 402: The authors state the organisms generate ATP by converting acetyl-CoA to acetate. I’d add something along the lines of “in addition to glycolysis”, as is shown in figure 4.

Table 1: There is some inconsistency in the notation where the number of ORFs in cellular processes is mentioned. A percentage is given at “glycoside hydrolases”, it is only explained at “protease/peptidase”, and it is absent at “signalling”.
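As background to the comment on line 298, the sketch below computes N50 using its standard definition: the contig length at which the cumulative length of contigs, sorted from longest to shortest, reaches half the total assembly length. The contig lengths are made-up numbers for illustration only and do not come from the manuscript.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one contig")


if __name__ == "__main__":
    # Hypothetical lengths: two long contigs from an abundant organism.
    abundant = [100, 50]
    # The same contigs plus many short ones from poorly sampled organisms.
    with_rare = abundant + [10] * 20
    print(n50(abundant))   # 100
    print(n50(with_rare))  # 10
```

In this toy example the same two long contigs yield an N50 of 100 or 10 depending on how many short contigs from poorly sampled community members are included, which illustrates how strongly the value depends on what part of the community was assembled.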
