I was asked to review this manuscript for a journal, and decided to share the review here.
The work is concerned with clustering of gene expression profiles from RNA-seq data, using data transformation and Gaussian mixture models. The manuscript is well-written and of high technical quality. I have a few suggestions for improving it further.
1.) The transformation to p_ij and then to g_arcsin, g_logit is interesting, and worthwhile considering. However, as the authors note on p.4, there are also other, obvious, and well-established candidates for transformations, such as log(n+c), VST, rlog, moderated CPM (all should be followed by mean centering). Since the title of the ms is "Transformation and model choice..." I would consider it important to include these in the study. One possible result could be that "it doesn't matter very much", or that one or another of these candidates really does poorly; in any case, this would be interesting for readers, as the choice of transformation often creates a good deal of anxiety.
2.) The arguments for Gaussian mixture models (GMM, e.g. p.16) are well taken, but are a bit old-fashioned, "20th century". Nowadays, there also very good resampling based methods for assessing cluster stability, cluster membership, etc. See, e.g., the clue package on CRAN, or the "Cluster Stability Analysis" Section in the vignette of my Hiiragi2013 Bioconductor package. Adding these methods would be interesting, although I could understand if the authors decide it is out of scope. In that case, I suggest that at least the claims on exclusive utility of GMMs for doing such stability assessments be toned down.
3.) I very much appreciate the provision of an Rmarkdown vignette reproducing all plots in the paper. This is exactly how it should be done. Here two more suggestions, which go beyond with what is required for journal publication, but would greatly increase the impact and quality of this research: To allow execution by readers, I recommend also providing the .Rmd file, not only the rendered PDF. Moreover, to avoid 'code rot' and other reproducibility issues, I recommend submitting the Rmarkdown document (e.g. in the form of a package vignette) to a repository with a build system, such as CRAN or Bioconductor, which will make sure the code actually runs on any regular computer (no dependencies on private files), with current versions of R, etc.
4.) p.4 "As previously noted, each of these transformations seeks to render the data homoskedastic" -- I do not think this is correct. Homoskedasticity is the stated goal of the variance-stablising transformation, but not of the others.
5.) p.4 "... but does not facilitate clustering together features with similar patterns of expression across experiments." -- where is the evidence for this claim? (cf. Point 1 above)
6.) p.4 Notation overload in the equation for p_ij: the symbol j on the right hand side is used for two different things
7.) p.5 "becomes even more apparent when considering the normalized expression profiles p_ij (Figure 1C)." -- is this not a circular argument if Cluster 1 was itself obtained from the p_ij?
8.) p.5 "This means that the vector of values p_i are linearly dependent ... For this reason, we consider two separate transformations of the profiles pij to break the sum constraint ..." -- Even though the sum constraint is replaced by a more complicated constraint, the dependency is not broken. Is this argument really tenable (or needed)?
9.) p.5 What are the values (used by the software) for g_logit for p_ij = 0 or 1? Both can and will happen in practice.
10.) p.7 It would be helpful if plots showing the graphs of these transformations could be provided.
11.) p.10 "filtering genes with mean normalized count less than 50" -- This seems like it could be a too stringent threshold, especially in, say, developmental studies, where certain genes are completely switched off in some of the conditions. Indeed these tend to be the most important genes.
Very small point:
12.) Introduction: "Increasingly complex studies of transcriptome dynamics are now routinely carried out using high-throughput sequencing of RNA molecules, ..."
Indeed what is being sequenced in RNA-Seq are cDNA molecules. Direct sequencing of RNA is also possible but (currently) usually not called RNA-Seq.
p.15 hierarhical clustering
This review was prepared by Wolfgang Huber.