Review for "Monitoring changes in the Gene Ontology and their impact on genomic data analysis"

Completed on 5 Jun 2018 by Mark Wass.

Comments to author

Gene Ontology enrichment analysis is widely used across the biological sciences to investigate the functions or processes associated with particular conditions. This manuscript describes a resource called GOTrack that tracks changes in the Gene Ontology between releases over a number of years and enables users to investigate how the results of their enrichment analysis change depending on the dataset that is used. This appears to be a resource that will be useful to many, including those working regularly with GO/GOA (e.g. those developing methods, including methods for protein function prediction) and, more broadly, researchers who want to identify the changes that occur in their system under different settings or conditions.

The manuscript shows that annotations can vary considerably over time and, further, that the results of enrichment analyses can change as a result. This appears to be interesting work, but I also wonder if some of it is simply obvious. It is well established that Gene Ontology annotations are far from complete: I think less than 1% of proteins in UniProt have been assigned annotations with experimental evidence codes. Clearly there is much that is not known about protein function, and therefore the Gene Ontology would be expected to develop over time. Some of this comes across in the discussion, but I feel it would be better if the scene were set in the introduction.

The manuscript shows that the GOTrack tool can be used to assess whether enrichment analyses are stable over time. Surely what researchers are interested in is whether the enrichment analysis that they perform is correct with the latest dataset (which would be expected to contain the latest annotations)? It strikes me that if many new annotations are added in a particular release, using data from a previous release may give different results, and I would want to know whether I can be confident based on the data that has been used rather than on older data. It also seems to be one of those problems where there will never be a correct answer unless all genes are fully annotated, and this is unlikely to happen.

The manuscript states that it is difficult to predict which GO terms will remain enriched in future datasets, and this seems to me to be the most relevant or useful potential feature. Looking at older datasets is fine, but how useful is this?

In one of the examples (for ACTC1) the manuscript refers to IEA annotations (Inferred from Electronic Annotation). My understanding is that these are low confidence and variable; should they really be used?