Preprint reviews by Palle Villesen

Biological classification with RNA-Seq data: Can alternative splicing enhance machine learning classifier?

Nathan T. Johnson, Andi Dhroso, Katelyn J. Hughes, Dmitry Korkin

Review posted on 28th June 2017

Dear authors - interesting work!

Have you considered overfitting/data dredging in your work? You write: "The reported result of assessment is based on the average f-measure for the 10-folds for testing dataset."

When you go from genes to isoforms you also increase the number of predictor variables, which makes overfitting more possible (not necessarily more likely though).

I couldn't find the variance of these F-measures across the CV folds. A very high variance between folds is normally a signature of overfitting.
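
To make the request concrete, here is a minimal sketch of how the per-fold spread could be reported next to the average. It uses plain Python with the standard library only, and the names `f_measure` and `fold_summary` are my own, not from your pipeline. It assumes binary labels with 1 as the positive class and at least two folds.

```python
import statistics


def f_measure(y_true, y_pred, positive=1):
    """F-measure (F1) for the positive class; 0.0 if there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fold_summary(folds):
    """folds: list of (y_true, y_pred) pairs, one per CV test fold.

    Returns (per-fold scores, mean, sample standard deviation).
    Needs at least two folds, because the standard deviation is undefined for one.
    """
    scores = [f_measure(y_true, y_pred) for y_true, y_pred in folds]
    return scores, statistics.mean(scores), statistics.stdev(scores)
```

Reporting the mean together with the standard deviation (or simply listing all 10 fold scores) would let readers judge how stable the classifier is across folds.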

For a full analysis I would suggest you split your datasets into a training set and a validation set. On the training set, estimate MCC or F by CV (internal performance). Then fit the final model to the full training set and evaluate MCC or F on the validation set (external performance). This is very close to what is done in Kaggle competitions etc., where you measure your performance yourself (internal performance) but also need to predict on new data (external performance). If these two measures are very different, the chosen model is not good.
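
A rough sketch of this protocol, under stated assumptions: plain Python with the standard library only, a nearest-centroid classifier as a stand-in for whatever classifier you actually use, binary labels with 1 as the positive class, and simple unstratified folds. All function names are hypothetical.

```python
import random
import statistics


def f_measure(y_true, y_pred, positive=1):
    """F-measure (F1) for the positive class; 0.0 if there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_centroids(X, y):
    """Stand-in classifier (assumption): one mean feature vector per class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [statistics.mean(col) for col in zip(*rows)]
    return model


def predict(model, X):
    """Assign each sample to the class with the nearest centroid."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(model, key=lambda lab: dist2(x, model[lab])) for x in X]


def score(X, y, fit_idx, eval_idx):
    """Fit on the samples in fit_idx, return the F-measure on eval_idx."""
    model = fit_centroids([X[i] for i in fit_idx], [y[i] for i in fit_idx])
    pred = predict(model, [X[i] for i in eval_idx])
    return f_measure([y[i] for i in eval_idx], pred)


def split_indices(n, holdout=0.25, seed=0):
    """Shuffle 0..n-1 and return (training indices, validation indices)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(n * holdout)
    return idx[n_val:], idx[:n_val]


def internal_and_external(X, y, train_idx, val_idx, k=10):
    """Internal: mean k-fold CV F-measure inside the training set only.
    External: refit on the full training set, score once on the validation set.
    """
    folds = [train_idx[i::k] for i in range(k)]
    cv_scores = []
    for fold in folds:
        held_out = set(fold)
        fit_idx = [i for i in train_idx if i not in held_out]
        cv_scores.append(score(X, y, fit_idx, fold))
    internal = statistics.mean(cv_scores)
    external = score(X, y, train_idx, val_idx)
    return internal, external
```

The important design point is that the validation samples never enter the CV loop or any model/feature selection step; they are touched exactly once, at the end. With small or unbalanced classes, stratified folds would be preferable to the simple slicing used here.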

Check "Comparison of RNA-seq and microarray-based models for clinical endpoint prediction". The problem is that when using CV to compare and select the best models, you may end up with the model that accidentally fits your dataset best under CV (data dredging). So basically you would like to see a nice correlation between training performance (internal performance) and validation performance (external performance) - and only use internal performance to rank models/parameters.
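
One possible way to check this, sketched in plain Python (standard library only): compute a rank correlation between internal and external scores across all candidate models, while picking the final model from the internal scores alone. Spearman correlation is my own choice here, not something prescribed by the paper cited above, and the function names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for m in range(i, j + 1):
            result[order[m]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(internal, external):
    """Spearman rank correlation between internal and external scores."""
    return pearson(ranks(internal), ranks(external))


def pick_by_internal(scores):
    """scores: {model name: (internal, external)}.

    The choice uses the internal score only; the external score is for
    reporting and must not influence model selection.
    """
    return max(scores, key=lambda name: scores[name][0])
```

A correlation close to 1 across your gene-level and isoform-level models would suggest that ranking by CV performance carries over to new data; a weak or negative correlation would point towards data dredging.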
