Review for "Heat*seq: an interactive web tool for high-throughput sequencing experiment comparison with public data"

Completed on 20 Apr 2016 by Richard Smith-Unna. Sourced from http://biorxiv.org/content/early/2016/04/18/049254.


Comments to author

The author's response follows the reviewer's comments.

This is potentially a very interesting and useful tool. The most important factor in whether it is useful is how the normalisation is done. You state:

> Heatmaps represent Pearson’s correlation values between experiments calculated using a Gene x Experiment numeric matrix with normalised gene expression values for expression data, a Genomic regions x Experiments binary matrix indicating presence or absence of a peak for TF ChIP-seq data and a Genomic regions x Experiments numeric matrix of expression values for CAGE data.

Can you please expand on this, giving detail about the expression normalisation and representation used for RNAseq and CAGE, and how peak calls are normalised for ChIP-seq. Adding this explanation to the text would, in my opinion, help users feel confident in the tool.



Thanks for your interest and the feedback.

I will describe in more detail what the application does, and why we decided to do it that way.

Short version: no normalisation is done. This is a major limitation of the tool; we have not found any solution general enough and fast enough to be used here. However, I will gladly look into any suggestions.

I’m not quite sure how the “normalised” in “numeric matrix with normalised gene expression values for expression data” made it into this version of the manuscript; it will be removed.

This web application compares one expression file / peak list / CAGE result against all experiments in one dataset. The idea was to develop a lightweight exploratory tool that will never be as precise as an in-depth project with tailored reprocessing and normalisation, but that gives a first answer in minutes instead of weeks.

In order for the application to be fast enough, we pre-computed the correlation coefficient between each pair of experiments within each dataset. We do not compare datasets with each other. We decided not to reprocess the data for each dataset, and simply used the processed data made available by each consortium / team. Each dataset used its own pipeline to analyse all of its experiments in the same way, so there is no strong need to further normalise within a dataset, and we can assess (biological and technical) variability between samples. Thankfully, for most datasets, like Bgee (expression data in mouse), the clustering mostly reflects the tissue origin of the samples. For others, like the (quite strange) ENCODE expression dataset in human, the clustering mostly reflects the fraction of RNA that was studied (i.e. cytosolic, nuclear, polyA-selected, ribosomal-depleted, etc.). This can be seen quite easily using the “RNA fraction (empty to select all)” field.
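As a toy illustration of that pre-computation step (the matrix and its dimensions are made up here; the real per-dataset pipelines differ):

```r
# Hypothetical Gene x Experiment matrix of processed expression values,
# taken as-is from the consortium pipeline (no further normalisation).
set.seed(1)
expr <- matrix(rexp(1000 * 20), nrow = 1000,
               dimnames = list(paste0("gene", 1:1000),
                               paste0("exp", 1:20)))

# Pearson correlation coefficient for every pair of experiments in the
# dataset, computed once, offline; the heatmap displays this matrix.
cor_matrix <- cor(expr, method = "pearson")
dim(cor_matrix)  # 20 x 20
```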

To cite a slightly more detailed version of the manuscript:

"For expression data, a Gene x Experiment numeric matrix containing gene expression values (FPKM) was built to compute the Pearson correlation coefficient for every pair of experiments, after removing genes with zero expression values across. For TF ChIP-seq data, a Genomic regions x Experiments binary matrix indicating presence or absence of a peak in that region for that experiment was built to compute correlation values between each pair of experiments. Genomic regions were defined by merging the complete list of peaks for a dataset using BEDTools (Quinlan and Hall, 2010), to find non-overlapping genomic regions. For CAGE-seq data, we built a Genomic regions x Experiments numeric matrix of non-overlapping regions containing CAGE expression values (RPM) in at least one experiment to calculate the Pearson correlation coefficient for each pair of experiments.
[…]
When the user uploads a file, we compute an approximation of correlation values between the file and every experiment in a dataset using the following rules:
• For HeatRNAseq, if a gene is present in the dataset but not in the user file, it is assigned a zero expression value. If a gene is present in the user file but not in the dataset, it will be ignored.
• For HeatChIPseq, if a peak in the user file does not overlap any regions in the dataset, it will be discarded.
• For HeatCAGEseq, if a CAGE peak in the user file does not overlap any CAGE peak in the dataset, it will be discarded."
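To make the first rule concrete, here is a rough R sketch of the HeatRNAseq matching step (the function and variable names are mine, not the application's):

```r
# Align a user-supplied named vector of expression values with the
# gene universe of a pre-computed dataset matrix.
match_user_file <- function(user_expr, dataset_expr) {
  genes <- rownames(dataset_expr)
  # Genes present in the dataset but absent from the user file
  # get a zero expression value.
  aligned <- setNames(numeric(length(genes)), genes)
  shared <- intersect(genes, names(user_expr))
  aligned[shared] <- user_expr[shared]
  # Genes present only in the user file are ignored, since 'aligned'
  # is restricted to the dataset's genes.
  aligned
}

# Approximate correlation of the user file against every experiment:
# cor(match_user_file(user_expr, dataset_expr), dataset_expr)
```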

There are plenty of reasons why this should not work, and should greatly underestimate the correlation values of the user file against all other experiments in the dataset: differences in library preparation methods and sequencing protocols, as well as differences in the bioinformatics pipelines. And indeed, as reported in supplementary figure 1, the top correlation value may be quite low. But:
- Sometimes, as in the three examples in the application, it works like a charm. We were quite surprised by this. Details to recreate each example are available in the instructions of each sub-application, section 2.2. It is really straightforward.
- The bias is mostly toward underestimating correlation values, and more importantly, it should be somewhat constant across all experiments of a dataset, so the most correlated experiments, even if only weakly correlated, can still be meaningful.

Could we do better?

The Pearson correlation coefficient is insensitive to linear transformations of the data, and many of the biases described above are roughly of that kind. It is also somewhat robust to moderately nonlinear transformations of the data.
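A quick R check of that invariance (toy data):

```r
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)

cor(x, y)           # baseline
cor(3 * x + 10, y)  # linear transformation: exactly the same value

cor(exp(x / 5), y)  # mildly nonlinear: changes only slightly
```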

Since it is a “one against all” case, we cannot apply batch-effect correction methods such as ComBat, because we would have only one experiment in one batch, and hundreds in the other.

Quantile normalisation, upper-quartile normalisation, TMM and z-scores did not improve the situation that much, according to my exploratory tests (the code is not quite as clean as I would like, sorry):

https://github.com/gdevailly/H...
https://github.com/gdevailly/H...
https://github.com/gdevailly/H...
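For illustration, here is a self-contained toy version of one such test — a naive quantile normalisation written from scratch, not the code from the repositories above:

```r
# Naive quantile normalisation: map each column onto the mean of the
# column-wise order statistics (ties resolved with the lower rank).
quantile_normalise <- function(m) {
  ref   <- rowMeans(apply(m, 2, sort))
  ranks <- apply(m, 2, rank, ties.method = "min")
  apply(ranks, 2, function(r) ref[r])
}

# Does normalising the user vector together with the dataset change its
# correlations with each experiment much? (toy data)
set.seed(42)
dataset <- matrix(rexp(500 * 10), nrow = 500)
user    <- dataset[, 1] * runif(500, 0.5, 2)  # distorted copy of exp. 1

raw <- cor(user, dataset)
qn  <- quantile_normalise(cbind(user, dataset))
rbind(raw = raw, normalised = cor(qn[, 1], qn[, -1]))
```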

Should we use Spearman or Kendall instead of Pearson? I am quite open on this question; I have heard about as many arguments in favour of Spearman as in favour of Pearson. However, R’s implementation of the Spearman CC is *much slower* than its implementation of the Pearson CC, which means we might not be able to compute the 8,555 correlation values for the GTEx dataset in less than a minute, as we do now. It should be noted that for binary data (which is how we summarised the ChIP-seq data), it does not make any difference.
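To give a rough idea of the gap, a toy benchmark (much smaller than GTEx):

```r
set.seed(1)
m <- matrix(rnorm(20000 * 100), nrow = 20000)  # 20k genes x 100 experiments

system.time(cor(m, method = "pearson"))   # fast: one pass over the data
system.time(cor(m, method = "spearman"))  # much slower: every column is
                                          # rank-transformed first

# For a binary presence/absence matrix (our ChIP-seq summary), ranking a
# 0/1 vector is a linear transformation, so Spearman equals Pearson:
b <- matrix(rbinom(1000 * 5, 1, 0.3), nrow = 1000)
all.equal(cor(b, method = "pearson"), cor(b, method = "spearman"))  # TRUE
```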

Please, do not hesitate to ask for more information, or to provide feedback.
Yours,

Guillaume Devailly