Review for "Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature"

Completed on 26 Aug 2016 by Jochen Weber. Sourced from http://biorxiv.org/content/early/2016/08/25/071530.



Comments to author

The author response follows the reviewer's comments below.

Could you please clarify how exactly your effect-size-extraction algorithm used the text of any given manuscript to calculate a standardized effect size (i.e. which parameters and values were put into which formula(s)), particularly given the following arguments?

As you state in the Introduction section, power estimates and NHST are somewhat mutilated constructs that grew out of Fisher's significance-testing approach and the Neyman-Pearson theory. In most of the current neuroscience literature, significance is not assessed from the (raw) statistic alone; instead, a "combined height-and-size thresholding" procedure is applied.

This combined approach typically results in a "significant finding" if (and only if) a contiguous set of voxels (i.e. a cluster) in its entirety surpasses a "cluster-forming threshold" (an uncorrected p-value, typically set between one-tailed p<0.01 and one-tailed p<0.001, inclusive) *AND* the cluster is at least of a size (typically measured in number of voxels) that is determined in one of several ways, including Gaussian Random Fields theory (e.g. in SPM), simulating noise data with a separately estimated smoothness (e.g. in AFNI's AlphaSim), or permutation testing (non-parametrically, e.g. in SnPM).

The relevant statistic for assessing "significance" (and hence the likelihood of detection and power!) is then the p-value of a cluster, not its peak or average t-value. For example, suppose a two-sample t-test with group sizes N1=12 and N2=12 (d.f. = 22) is performed and a whole-brain search is reported (i.e. no spatial prior is used). A typical manuscript would then contain a table of all clusters (with their peak coordinates and sizes) that reach this combined threshold. The authors may, for example, have chosen to apply a CDT of p<0.001 (t[d.f.=22] > 3.505); with an assumed isotropic voxel size of 3mm (27 cubic mm per voxel) and a smoothness estimate of 13.5mm, an application of the AlphaSim algorithm would lead to a required cluster size of approximately 73 voxels. In other words, if all voxels in a contiguous region spanning at least 73 such 3mm-cubed voxels inside the brain mask surpass a t-threshold of 3.505, that region would be considered "significantly different" between the groups.
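(As a quick sanity check, the cluster-defining threshold quoted here can be reproduced with a few lines of Python/SciPy; the ~73-voxel extent itself would come from an AlphaSim- or GRF-style calculation, which this sketch does not reproduce.)

```python
# Reproduce the cluster-defining threshold quoted above (illustrative values only).
from scipy import stats

df = 22          # two-sample t-test, N1 = N2 = 12
cdt_p = 0.001    # one-tailed cluster-forming threshold

t_cdt = stats.t.ppf(1 - cdt_p, df)
print(f"CDT t-threshold for one-tailed p<{cdt_p}, df={df}: {t_cdt:.3f}")  # ~3.505
```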

Please be aware that for this "activation difference", the p-value is still *just* 5% (i.e. the chance of finding such a cluster size at this t-threshold under noise conditions is 1 in 20). That being the case, I think it is equally fair to ACTUALLY compute the standardized effect size for such a cluster from a t-value of 1.717 (d.f.=22; one-tailed p<0.05).
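To make the contrast explicit, here is a minimal sketch (Python/SciPy, assuming two equal groups of 12) of the standardized effect size implied by the cluster-level p of 0.05 versus the one implied by the peak-level cluster-forming threshold:

```python
from math import sqrt
from scipy import stats

n1 = n2 = 12
df = n1 + n2 - 2                          # 22

t_cluster = stats.t.ppf(0.95, df)         # cluster-level significance, one-tailed p < 0.05 (~1.717)
t_cdt = stats.t.ppf(0.999, df)            # cluster-forming threshold, one-tailed p < 0.001 (~3.505)

# Cohen's d for a two-sample t-test: d = t * sqrt(1/n1 + 1/n2)
for label, t in [("cluster-level (p<0.05)", t_cluster), ("CDT peak (p<0.001)", t_cdt)]:
    d = t * sqrt(1 / n1 + 1 / n2)
    print(f"{label}: t = {t:.3f}, implied d = {d:.2f}")
# The CDT-based d is roughly twice the cluster-level d, which is the skew argued below.
```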

If your algorithm uses the (maximum or mean) t-value reported for the cluster, the standardized effect-size distribution MUST necessarily be skewed towards values that are not in line with traditional psychological research, given the widespread application of family-wise-error correction procedures required by the mass-univariate testing approach that still dominates the field.

As a summary prescription and recommendation for the literature/field, I would urge authors of future publications to always report cluster-level p-value estimates (which both GRF and simulation methods can provide by comparing the observed size of each significant cluster with the distribution of cluster sizes under the NULL), so that effect sizes, at least when used for the purpose of power estimates, can be computed appropriately.
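As a rough illustration of what such a simulation-based cluster-level p-value involves, here is a minimal Monte Carlo sketch in Python (the mask shape, smoothness and iteration count are placeholder assumptions, and a real AlphaSim/GRF implementation handles many details omitted here):

```python
import numpy as np
from scipy import ndimage, stats

rng = np.random.default_rng(0)

shape = (40, 48, 40)           # placeholder brain-sized grid of 3 mm voxels
sigma = (13.5 / 3.0) / 2.355   # 13.5 mm FWHM smoothness converted to a voxel-wise Gaussian sigma
z_cdt = stats.norm.ppf(0.999)  # height threshold equivalent to one-tailed p < 0.001
n_iter = 200                   # increase (e.g. to 10000) for a stable estimate

def max_cluster_size(volume, threshold):
    """Return the size (in voxels) of the largest supra-threshold cluster."""
    labels, n_clusters = ndimage.label(volume > threshold)
    if n_clusters == 0:
        return 0
    return int(np.bincount(labels.ravel())[1:].max())

# Null distribution of maximum cluster size under smoothed Gaussian noise
null_max_sizes = []
for _ in range(n_iter):
    noise = ndimage.gaussian_filter(rng.standard_normal(shape), sigma)
    noise /= noise.std()       # re-standardize after smoothing
    null_max_sizes.append(max_cluster_size(noise, z_cdt))
null_max_sizes = np.array(null_max_sizes)

observed_size = 73             # e.g. a reported cluster of 73 voxels
p_cluster = (null_max_sizes >= observed_size).mean()
print(f"simulation-based cluster-level p ~ {p_cluster:.3f}")
```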

Thank you so much for your consideration and efforts!



Thank you very much for your interest and comments!

First, some clarifications:

The method of power calculation was exactly as described in Section 3 of the Supplementary Material. As outlined, this calculation used the non-central t distribution with a mixture model of one-sample, matched and two-sample t-tests, and it gave exactly the same results as the Matlab ‘sampsizepwr’ function for these tests, which served as our benchmark. (However, our algorithms ran much faster and were more flexible/transparent than the built-in Matlab routines, so they were easier to use.)
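For illustration, a power calculation of this kind can be sketched in Python using the non-central t distribution; this mirrors what Matlab's ‘sampsizepwr’ computes for these tests, but it is a simplified re-derivation for a two-sided test with equal group sizes, not the exact implementation described in the Supplementary Material:

```python
from math import sqrt
from scipy import stats

def power_t(d, df, test="two-sample", alpha=0.05):
    """Approximate power of a two-sided t-test at effect size d (Cohen's d).

    'one-sample' covers one-sample and matched/paired designs (n = df + 1);
    'two-sample' assumes equal group sizes (n per group = df/2 + 1).
    """
    if test == "one-sample":
        ncp = d * sqrt(df + 1)
    else:
        n_per_group = df / 2 + 1
        ncp = d * sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # P(|T| > t_crit) under the non-central t with non-centrality parameter ncp
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

print(power_t(0.5, df=22, test="two-sample"))   # medium effect, df = 22
```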

As noted on pages 8-9 of the main text, the input effect sizes (D) were estimated according to the mixture model based on the validation data: D = pr(t1|df) · D_t1 + pr(t2|df) · D_t2.
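Read literally, that mixture could be sketched as follows (a minimal Python illustration; the conversion formulas for D_t1 and D_t2 and the example weight pr(t1|df) are standard textbook assumptions rather than the exact implementation in the Supplementary Material):

```python
from math import sqrt

def effect_size_mixture(t, df, pr_t1):
    """Mixture-model effect size D = pr(t1|df) * D_t1 + pr(t2|df) * D_t2.

    pr_t1 is the (validation-derived) probability that a t value with this df
    comes from a one-sample/matched design; pr(t2|df) = 1 - pr_t1.
    """
    d_t1 = t / sqrt(df + 1)              # one-sample/matched: n = df + 1
    n_per_group = df / 2 + 1             # two-sample, equal groups
    d_t2 = t * sqrt(2 / n_per_group)     # d = t * sqrt(1/n1 + 1/n2)
    return pr_t1 * d_t1 + (1 - pr_t1) * d_t2

print(effect_size_mixture(t=3.0, df=22, pr_t1=0.4))  # illustrative weight only
```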

Further, statistical reporting in journals varies considerably from paper to paper, so the algorithm could find t and F values and associated data as reported in the body text of PDF files; tables are much harder to analyse. However, during the validation exercise we found that overall most t-test statistics were reported in the text. This is especially true for ‘high impact’ journals, which tend to condense statistics and report only some of them in the main text. To be absolutely certain, we only dealt with ‘regular’ reporting of t values, e.g. t(df) = t value; p = p value; D/G, etc. = effect size. (We had to be general enough to parse all journals in the sample, since some journals use fairly different reporting conventions from others.)
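To illustrate what such ‘regular’ reporting looks like to an automated extractor, a minimal regex-based sketch (purely illustrative, not the actual parser used in the study) might be:

```python
import re

# Matches e.g. "t(22) = 3.51, p = 0.001" or "t(22)=3.51, p<.001"
T_PATTERN = re.compile(
    r"t\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)\s*=\s*(?P<t>-?\d+(?:\.\d+)?)"
    r"(?:\s*,\s*p\s*[<=]\s*(?P<p>\.?\d+(?:\.\d+)?))?",
    re.IGNORECASE,
)

text = "The groups differed significantly, t(22) = 3.51, p < .001, d = 1.43."
for match in T_PATTERN.finditer(text):
    print(match.group("df"), match.group("t"), match.group("p"))
```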

The algorithm did not discriminate according to what type of t value was reported; this is not really possible at this level of automation without also interpreting the text in detail (which would require much more intelligent code), because there is no detailed standard notation for reporting fMRI and similar data in the text. So we analysed all reported t values without exception. Importantly, we can of course assume that researchers reported those t values for a reason and most probably used the ‘statistical significance’ associated with them in support of their arguments in the Discussion. As we also note, because pre-registration of study intentions is typically missing, it is also hard to decide which reports related to primary, secondary, etc. hypotheses. Also, as discussed for example by Carp (2012), Kriegeskorte et al. (2009) and others, analysis approaches are extremely variable, so the actually reported data can be highly idiosyncratic.

Finally, reported t values and effect sizes from underpowered studies are most likely highly inflated (exaggerated). So, as we noted, the most interesting test of power is to calculate power to detect small, medium and large effects based on the classical studies of Cohen (1962), Sedlmeier and Gigerenzer (1989) and Rossi (1990), given the reported degrees of freedom. This approach does not depend on any t values reported in the studies.
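In code terms, this classical check reduces to something like the following sketch (Python/SciPy; Cohen's small/medium/large benchmarks, with equal group sizes and a two-sided test assumed for illustration):

```python
from math import sqrt
from scipy import stats

def two_sample_power(d, df, alpha=0.05):
    """Two-sided, two-sample t-test power for Cohen's d, equal group sizes."""
    n_per_group = df / 2 + 1
    ncp = d * sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

df = 22  # e.g. the degrees of freedom reported in a given paper
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(f"{label} effect (d = {d}): power = {two_sample_power(d, df):.2f}")
```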

What to do?

Most importantly, as we note in our Discussion, of course, ‘some teams and subfields may have superb, error-proof research practices, while others may have more frequent problems.’ So, overall, it seems that cognitive neuroscience research really needs some of these excellent (often methodology-focused) research teams with robust, error-proof practices to produce prominent guideline papers for the whole community so that research standards can be raised. These guideline papers should set the expected minimum and optimal standards in the field. They would help reviewers, editors, research councils, the whole scientific community and the public to 1) judge the quality of studies clearly and 2) enforce guidelines where possible.

For example, it seems that we would minimally need to:

1) standardize design, analysis and reporting methods and ‘enforce’ high-quality, standardized papers where possible;

2) enforce pre-study registration of experimental intentions;

3) enforce pre-study power calculations, especially when it comes to running expensive fMRI, MEG, etc. experiments.