Review for "Enhancing the precision of our understanding about mentalizing in adults with autism"

Completed on 17 Dec 2015 by Jon Brock. Sourced from

Comments to author

Author response is in blue.

Thanks for sharing this. I'm really glad that autism researchers are starting to (a) look at cognitive heterogeneity; and (b) use preprint servers!

Some comments:

First, this is a bugbear of mine, but I find it quite unhelpful to talk about the RMET as a measure of mentalizing. In truth, it's a (relatively difficult) 4AFC test of emotion recognition. We can argue about whether learning the meanings of certain emotion words used in the test is contingent on having a fully functioning "theory of mind". But it's clear that the RMET is measuring something very different to other "mentalizing" tests in which the participant infers mental states based on the protagonist's behaviour and/or events that are witnessed or described.

Second, I agree that there's potentially useful information at the item level that is lost by just totting up the number of correct items. But it's not clear to me that your study is demonstrating this to be true. In other words, what does subdividing the ASD group into "impaired" and "unimpaired" subgroups based on the clustering algorithm tell us that we wouldn't get by subdividing them according to some cut-off based on raw score? We learn that the "unimpaired" group have higher overall scores and higher VIQs, but we kind of know that already.

Third, related to the previous point, you show that a classifier trained on your subgroups in one dataset does a good job of predicting subgroup in an independent dataset; but how much of this "replicability" is driven by differences in overall performance? It would be helpful to get some more explicit details of what went into the classifier, but I assume that it's essentially providing a threshold on a weighted sum of all the items in the test. You've already shown that your subgroups (on which the classifier is trained) differ in overall performance (i.e. the unweighted sum of all the items). So it would be pretty odd if the classifier *didn't* perform well in a replication sample where subgroups also differed in overall performance. Indeed, in the TD group, where there aren't huge differences in overall performance, the classifier doesn't translate to the replication sample.

Hopefully my comments will help you clarify the article. I really like the approach of digging into the item-level data. At the very least I think it tells us something useful about the structure of the RMET, and about which items discriminate well between people who do versus do not have difficulties with labelling complex emotions. I'm just not convinced (yet) of some of the bolder claims you're making!

Finally, some references you may find useful:

Roach, N. W., Edwards, V. T., & Hogben, J. H. (2004). The tale is in the tail: An alternative hypothesis for psychophysical performance variability in dyslexia. Perception, 33(7), 817-830.

Towgood, K. J., Meuwese, J. D., Gilbert, S. J., Turner, M. S., & Burgess, P. W. (2009). Advantages of the multiple case series approach to the study of cognitive deficits in autism spectrum disorder. Neuropsychologia, 47(13), 2981-2988.

Brock, J. (2011). Commentary: Complementary approaches to the developmental cognitive neuroscience of autism – reflections on Pelphrey et al. (2011). Journal of Child Psychology and Psychiatry, 52(6), 645-646.

Hi Jon,

Thanks. These are all very helpful comments and they’re very much appreciated. I’ll go through each below. By the way, I should say that I’m giving my own opinion here, and not necessarily that of the co-authors.

RE: preprints:

Yes, they’re great. We put up another manuscript previously on bioRxiv and had a very good experience with it, though that one was more an imaging methods paper, but it had some important points about statistical power, effect size, sample size, and neural systems for mentalizing. See here for that:

RE: Unhelpful to talk about RMET as ‘mentalizing’ and instead label it as a measure of ‘emotion recognition’:

I wouldn’t characterize the RMET as only ‘emotion recognition’ and therefore ‘not’ mentalizing. ‘Mentalizing’ is a term that covers many ways in which an individual can understand mental states in themselves and others; the array of ways in which we arrive at a solution is quite complex and requires a lot more elucidation. This TICS article summed up points about that quite well, and I’d agree with it. More specifically, on the issue of whether the RMET is measuring something like complex emotion recognition, and whether that is ‘not’ part of what we mean by a term like ‘mentalizing’: I think that’s an empirical point. Anyone who wants to claim that a label like ‘complex emotion recognition’ points to a process completely independent of the processes we refer to as ‘mentalizing’ needs to demonstrate that. So while I would agree with you that complex emotion recognition is involved, and that the RMET is possibly not measuring the same things as other very different tests in the domain, I would be cautious about using only the label ‘emotion recognition’ for it and saying it’s not within the scope of mentalizing. The RMET as a tool is not assaying every aspect of mentalizing, and the paper includes caveats discussing how other measures may give different vantage points on the domain or may shed light on subgroups in ways that the RMET cannot.

RE: Why not just use summary score cut-offs instead of clustering of item-level responses?

The main problem I can see with cut-offs on the summary scores is that where you place them is (mostly) arbitrary. If one visually examines the scatter-boxplots showing the summary scores, it’s really hard to derive a meaningful cut-point there (if you ignore the coloring of the dots, which was added only after the clustering approach had subgrouped them). The approach we took, hierarchical clustering followed by estimating the optimal number of clusters, makes the whole task of subgrouping data-driven and unsupervised, and this makes it less arbitrary where the cut-point ends up being. Our approach leverages the information in the item-level responses: with those responses intact, we get a better idea of how similar or dissimilar individuals are (Hamming distance is the metric we use here).
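As an illustration only (toy data and hypothetical names, not the code used in the paper), the sketch below shows why item-level responses carry information that the sum score discards. It assumes the Hamming distance is expressed as the proportion of items on which two respondents differ; the paper may use the raw count instead. The first two made-up respondents have identical totals, so no cut-off on the total could separate them, yet they disagree on every item.

```python
def hamming(a, b):
    """Proportion of items on which two equal-length response vectors differ."""
    if len(a) != len(b):
        raise ValueError("response vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Toy 6-item responses (1 = correct, 0 = incorrect); the real RMET has 36 items.
subject_a = [1, 1, 1, 0, 0, 0]
subject_b = [0, 0, 0, 1, 1, 1]
subject_c = [1, 1, 1, 0, 0, 1]

# Same sum score for a and b, so the totals cannot tell them apart.
sum_a, sum_b = sum(subject_a), sum(subject_b)

# Item-level dissimilarity: a and b differ everywhere, a and c on one item only.
d_ab = hamming(subject_a, subject_b)
d_ac = hamming(subject_a, subject_c)
```

A pairwise matrix of such distances is the kind of input a hierarchical clustering routine would work from.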

RE: the classifier, overall performance, replicability:

The explicit details about what went into the classifier are quite simple to summarize (sorry if this wasn’t apparent in the paper). The data matrix being worked on in training and test is a 2D matrix with subjects along the rows and RMET items along the columns. In machine learning terms, the items are the ‘features’ and the subjects are the ‘examples’. During training, the SVM model learns how to separate ‘Impaired’ and ‘Unimpaired’ within one dataset (i.e. CARD). Then during test, the trained model is applied to data it has never seen before (i.e. AIMS) and tries to predict each individual as ‘Impaired’ or ‘Unimpaired’. The ground truth for determining whether those predictions are right or wrong is the set of subgroup labels ‘Impaired’ and ‘Unimpaired’, which were defined exclusively by the clustering within the test dataset.

The SVM’s actual task is to separate class 1 from class 2 with a hyperplane in 36 dimensions (since there are 36 items). That’s all pretty hard to visualize, but basically it helps to think about classification as looking for patterns across the features. A classifier can discriminate well for either of two reasons: 1) there are large differences between the classes across all the items (what I think you’re referring to here as ‘overall performance’), or 2) the classes are similar in overall performance but the patterning across items is different, which can still give perfect discrimination. An example of 1 might be if one subgroup scored really poorly on all items and the other scored well on all the items. An example of 2 might be if subgroup 1 got the first half of the items correct and the second half wrong, while subgroup 2 did the reverse. In example 2, both subgroups have equal sum scores, and the only difference is the pattern of response across items. These are just toy examples, and real data is usually a combination of both. Here, I’d say that 1 is happening for sure, but 2 might be happening too. Anyhow, I don’t see this as problematic.
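Toy example 2 can be written out as a short sketch. This is not the paper's classifier: the weights below are hand-picked rather than fitted by an SVM, the test is shrunk to 6 items, and all names are hypothetical. It only illustrates that a linear rule of the form w·x + b can separate two response patterns that a cut-off on the unweighted sum cannot.

```python
def decide(weights, bias, responses):
    """Linear decision rule: which side of the hyperplane w.x + b = 0?"""
    score = sum(w * x for w, x in zip(weights, responses)) + bias
    return "subgroup 1" if score > 0 else "subgroup 2"


# Example 2 from the text, shrunk to 6 items (1 = correct, 0 = incorrect).
subgroup_1 = [1, 1, 1, 0, 0, 0]  # first half right, second half wrong
subgroup_2 = [0, 0, 0, 1, 1, 1]  # the reverse

# Equal weights plus a threshold amount to a cut-off on the sum score.
sum_weights, sum_bias = [1] * 6, -2.5
# Hand-picked (not fitted) weights that contrast the two halves of the test.
pattern_weights, pattern_bias = [1, 1, 1, -1, -1, -1], 0

same_total = sum(subgroup_1) == sum(subgroup_2)
sum_rule = [decide(sum_weights, sum_bias, r) for r in (subgroup_1, subgroup_2)]
pattern_rule = [decide(pattern_weights, pattern_bias, r) for r in (subgroup_1, subgroup_2)]
```

With equal weights both respondents land on the same side of the threshold, whereas the contrast weights put them on opposite sides. In real data a fitted classifier would typically pick up a mixture of both kinds of difference.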

Additionally, without looking at the classifier at all, I’ve done some further analyses that are probably more to your point about how similar individuals within the subgroups are between datasets. What I’ve done now (the latest revision should be up online soon) is to plot the dissimilarities between subgroups and datasets all in one matrix in a new Figure 2. I’ve also quantified, on a per-subject basis, the average between-subject dissimilarity for various subsets of the data. The most important comparisons here are between the dissimilarity of subjects from the same subgroup but different datasets (i.e. Discovery Unimpaired versus Replication Unimpaired) and the dissimilarity of subjects from different subgroups within the same dataset (i.e. Discovery Unimpaired versus Discovery Impaired). The similarity between individuals from the same subgroup (even across datasets) is always higher than the similarity between individuals from different subgroups (even within the same dataset). I think this should help if you’re wondering how similar RMET item-level responses are for individuals of the same subgroup but from different datasets. The between-dataset similarities weren’t plotted in the prior versions of the manuscript, and your comment here really helped push towards pulling out that important observation, so thanks very much for this.
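The comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the response vectors are made up (4 items, two subjects per group) purely to show the computation, the group names are hypothetical, and Hamming distance is taken as the proportion of differing items. It does not reproduce any result from the manuscript.

```python
from itertools import product
from statistics import mean


def hamming(a, b):
    """Proportion of items on which two response vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_dissimilarity(group_a, group_b):
    """Average Hamming distance over all pairs with one subject from each group."""
    return mean(hamming(a, b) for a, b in product(group_a, group_b))


# Made-up item responses (1 = correct, 0 = incorrect), for illustration only.
discovery_unimpaired = [[1, 1, 1, 1], [1, 1, 1, 0]]
discovery_impaired = [[0, 0, 1, 0], [0, 0, 0, 0]]
replication_unimpaired = [[1, 1, 0, 1], [1, 1, 1, 1]]

# Same subgroup, different datasets.
same_subgroup_across_datasets = mean_dissimilarity(
    discovery_unimpaired, replication_unimpaired
)
# Different subgroups, same dataset.
different_subgroups_same_dataset = mean_dissimilarity(
    discovery_unimpaired, discovery_impaired
)
```

If subgroups replicate at the item level, the first quantity should be smaller than the second, which is how the toy data here were constructed.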


Mike Lombardo