Preprint reviews by Diego Villar

Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics and epigenetics data

Quan H. Nguyen, Ross L. Tellam, Marina Naval-Sanchez, Laercio Porto-Neto, Bill W. Barendse, Toni Reverter-Gomez, Ben Hayes, James Kijas, Brian P. Dalrymple

Review posted on 21st July 2017

With the aim of predicting regulatory activity in mammalian species where experimental datasets are largely lacking, the authors present an interesting analytical approach based on cross-mapping of publicly available genome-wide biochemical readouts of regulatory activity from the human genome. Conceptually this is a powerful approach, given the plethora of datasets that have been generated by international consortia and individual laboratories, which have largely focused on human cell lines and tissues.

The authors apply this approach to production mammals (cow and pig) as well as to other well-annotated mammalian species. The authors' pipeline description includes details on dataset selection and parameter optimization for the cross-mapping of regulatory regions, as well as a filtering strategy to enrich for bona fide regulatory regions in their predictions (which partially relies on the availability of species-specific data). Lastly, the authors apply their approach to the prediction of regulatory regions in cattle, and discuss the potential relevance of these predictions for the functional annotation of regulatory variants and mutations.

While I found the manuscript very interesting in its conceptual approach and implementation, I have a number of concerns about the presentation of the results and the overall proof-of-principle for the usefulness of the approach.

Major comments:

1) The authors should more clearly establish how important the addition of species-specific data is for their filtering approach. In practice, there may be many situations where the approach is of interest to the research community, but no species-specific data are available. It would thus be interesting to incorporate analyses showing how the filtering and the resulting predictions are affected by the removal of species-specific data (Figure 4, Figure 5 and SVM SNP scoring approach).

2) The authors provide somewhat limited evidence on the applicability of their approach. The manuscript would greatly benefit from some experimental validation of the authors' predictions (Figures 6 and 7), even if minimal. For example, the luciferase experiments the authors suggest in the context of Figure 6 (page 10, lines 1-2, "promoter activity of these three constructs") are straightforward to perform and would provide proof-of-principle for the usefulness of their regulatory predictions.
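To illustrate the kind of analysis suggested in major comment 1, the following is a minimal sketch of an ablation that compares the filtered set obtained with all rules against the set obtained after dropping species-specific rules. Everything here is an assumption for illustration: the feature names, the thresholds, the toy regions, and the choice to retain a region if it passes at least one rule are hypothetical and are not taken from the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical representation: each region maps feature names to values.
Features = Dict[str, float]


@dataclass(frozen=True)
class FilterRule:
    name: str
    species_specific: bool  # True if the rule needs data from the target species
    passes: Callable[[Features], bool]


def retained(regions: Dict[str, Features], rules: List[FilterRule]) -> Set[str]:
    """Return IDs of regions passing at least one rule (assumed combination logic)."""
    return {rid for rid, feats in regions.items()
            if any(rule.passes(feats) for rule in rules)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def ablate_species_specific(regions: Dict[str, Features],
                            rules: List[FilterRule]) -> Dict[str, float]:
    """Compare the filtered set with and without species-specific rules."""
    full = retained(regions, rules)
    reduced = retained(regions, [r for r in rules if not r.species_specific])
    return {
        "n_full": len(full),
        "n_without_species_data": len(reduced),
        "jaccard": jaccard(full, reduced),
    }


# Toy input with hypothetical feature names and thresholds.
regions = {
    "r1": {"cage_tags": 3, "cattle_h3k27ac": 0.1},
    "r2": {"cage_tags": 0, "cattle_h3k27ac": 2.5},
    "r3": {"cage_tags": 5, "cattle_h3k27ac": 3.0},
    "r4": {"cage_tags": 0, "cattle_h3k27ac": 0.0},
}
rules = [
    FilterRule("cage", False, lambda f: f["cage_tags"] >= 2),
    FilterRule("cattle_h3k27ac", True, lambda f: f["cattle_h3k27ac"] >= 1.0),
]

summary = ablate_species_specific(regions, rules)
print(summary)
```

Under the "at least one rule" assumption, removing rules can only shrink the retained set, so the Jaccard index directly measures how much of the filtered set depends on species-specific data. The same comparison could then be carried through to the downstream analyses (Figure 4, Figure 5 and the SVM SNP scoring).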

Minor comments:

- It would be useful to add some numbers to the discussion of conservation and divergence of regulatory activity in the Introduction (page 3, third paragraph). What is "a high level of conservation of …"? (page 3, line 39). This paragraph would be more balanced if the authors gave quantitative estimates on the extent of conservation and divergence in the different biochemical signatures of regulatory activity (TF binding locations, histone mark enrichment and DNA methylation levels), as reported in previous literature.

- The referencing is somewhat sparse and should be revised (e.g. page 7, lines 11-16).

- The figure legends should be carefully revised, as they are often too succinct in the current version (e.g. Figure 1 legend).

- It is not clear why the minMatch threshold was set at 0.2. The authors state that the testing indicated this was the optimal value (page 5, lines 8-9), but they may want to further justify their choice. From the testing results in Figure 2 it could be argued that any value below 0.5 is a sensible choice.

- Please indicate the number of regulatory features in each of the public datasets/annotations used in Figure 2 (ENSEMBL, ENCODE, FANTOM, etc), as part of the Figure legend.

- Table 2 does not seem to be referenced anywhere in the main text.

- The presentation of Figure 3 (page 5, lines 28-41) should be revised to improve clarity and readability.

- Also, in the current version of Figure 3:
  - 3a is difficult to read because it is shown as a barplot with many categories, and the difference between Figures 3c and 3d (enhancers vs promoters?) is unclear from either the figure or its presentation in the Results.
  - The legend states that 3a is a projection of 42 combined Roadmap datasets and 3b is a separate mapping for each tissue, but it looks like it is the other way around.
  - The Venn diagrams in 3c and 3d should be made area-proportional to improve readability. It is also not clear whether the datasets in each are the same or different, with the exception of the yellow FANTOM dataset (enhancers in 3c and promoters in 3d).

- The authors should consider reorganising the results sections, in particular the results of cross-mapping TF binding sites. In my opinion it makes more sense to present promoters and enhancers first, and have TF binding sites last (before presenting the filtering pipeline).

- I would keep Table S5 as a main text table, as this is a useful reference to readers on the final set of regions that are used for much of the paper.

- For clarity, the authors may want to introduce the concept of "Universal Dataset" more clearly in the section on "mapping enhancer datasets before filtering". As presented, this section and the next (filtering pipeline) read as very repetitive.

- The sentence on page 8, lines 10-11 needs rewording as it is currently misleading ("which incorporates the power of cattle specific data to predict a small(er?) set of regions functional in bovine"). The predicted regions may be enriched for functional regulatory regions in cow, but cannot be classed as "functional" as a whole.

- It is not clear what the relationship is between the y-axis in Figure 4b and the RatioP/E in Figure 4a and the main text. Could the authors also use the RatioP/E for Figure 4b instead?

- The authors should consider rewording/clarifying the sentence on page 9, lines 7-8. Do they mean that 92% of non-coding GWAS SNPs are within intronic regions?

- It would be useful to include the cow liver data in Figure 5 for comparison (e.g. what does the enrichment of pleiotropic SNPs look like in the cow liver data, compared to the authors' multi-tissue predictions?).

- In the sentence on page 11, lines 10-12 ("In addition, …"), do the authors mean promoters or enhancers whose sequence is unique to a species (e.g. promoters or enhancers in a cow DNA sequence that cannot be mapped to the human genome)? Also, the sentence on page 11, lines 14-15 needs revision. The number of promoters and enhancers that were present (at the DNA sequence level) in a single species out of twenty (ref. 20) is small, but considerably higher than 1%. For example, in human liver, 4.2% of enhancers cannot be mapped to any of the other species analysed in ref. 20, and the proportion is slightly higher for promoters (5.3%). Therefore "less than 1%" should be reworded to "around 5%".

- Formatting problem with reference 21. Please revise.

- The footnotes of Table 1 need revision. There appears to be no data marked with the "4" superscript.

- The filtering criteria in Table 3 and Figure 4a should be described clearly, either as part of Table 3 or in the legend to Figure 4a. As presented, it is often unclear what the quantities in the filtering rules mean. For example: CAGE >=2, median(log2(Villar)), log2(H3K27Ac), SVM, 3rd quartile(Villar), etc. What are the units of each of these? (Is CAGE the number of tags? Is H3K27Ac normalised RPKM? Is (Villar) also a normalised intensity?) The Supplemental Methods detail the rationale behind each filter and some of the final units for each, but this information should be more clearly displayed in the main manuscript.

- In the Supplemental Methods, please indicate clearly which H3K27ac datasets were used for the gkm-SVM (current wording: "training matrices from a small number of cattle H3K27ac datasets"). Also, the link to the Python code that follows this sentence appears to be faulty.

- Legend to Supplemental Figure 1. Please clarify the meaning of "within regulatory regions". Are these your HPRS predictions? If so, are these the Universal or the Filtered Set?

- Ditto for the legend of Figure S2.

- Figure S5 legend: what is "mapped RNA signal"? Also, please add a scale and units to the data represented as a heatmap on Figure S5c.

- Figure S7. Is the "twelve species" here correct? If so, it is unclear why 12 are used here, as the main text (Table 4) mentions ten.
