Review for "The Reproducibility Of Research And The Misinterpretation Of P Values"

Completed on 9 Aug 2017 by Lasse Kliemann. Sourced from http://www.biorxiv.org/content/early/2017/08/07/144337.



Comments to author

The author's response follows the reviewer's comments below.

I have a couple of questions and comments on this article. Let me start with one particular issue in this post; perhaps the others will become clear to me in the course of the discussion.

Given the way the theoretical computations are conducted in Section 5, it appears to me that it is important to have exactly *one* distribution under H1. This is the situation shown in Figure 1: one density for H0 and one density for H1. With a single density on each side, the likelihood ratio immediately makes sense and is easy to compute.
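For concreteness, here is a minimal R sketch of the two densities as I read Figure 1, assuming the paper's running example of n = 16 per group and a true effect size of 1 SD (so the non-centrality parameter is sqrt(n/2); these numbers are my assumption, taken from the paper's example):

    # Densities of the test statistic under H0 (central t) and under
    # H1 (non-central t), as in Figure 1. Numbers follow the paper's
    # running example: n = 16 per group, true effect size 1 SD.
    n   <- 16
    df  <- 2 * (n - 1)
    ncp <- sqrt(n / 2)   # non-centrality parameter under H1
    curve(dt(x, df), from = -4, to = 8, ylab = "density")   # H0
    curve(dt(x, df, ncp), add = TRUE, lty = 2)              # H1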

Thus we not only need to know that the difference between the two means is 1 under H1; we also need to know which of the two groups has the higher mean under H1. Only with this knowledge can we, by ordering the sample means appropriately in the test statistic, ensure that under H1 the test statistic follows the Student's t distribution with the positive non-centrality parameter shown in Figure 1. So if x and y are the two samples and we know that, under H1, x stems from the group with mean 1, we would use as test statistic:

t(x,y) = (mean(x) - mean(y)) / sqrt(var(x) + var(y)) * sqrt(n)

Here, n = length(x) = length(y). A one-sided rejection region would then be appropriate, so the p-value would be p.val = 1 - pt(t(x,y), df=2*(n-1)).
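To make this concrete, here is a minimal R sketch of that one-sided test; the data are just illustrative, with x drawn from the group with mean 1 as assumed under H1:

    # Illustrative samples; under H1, x comes from the group with mean 1.
    set.seed(1)
    n <- 16
    x <- rnorm(n, mean = 1, sd = 1)
    y <- rnorm(n, mean = 0, sd = 1)

    # Test statistic as defined above (equal group sizes assumed).
    t.stat <- (mean(x) - mean(y)) / sqrt(var(x) + var(y)) * sqrt(n)

    # One-sided p-value with 2*(n-1) degrees of freedom.
    p.val <- 1 - pt(t.stat, df = 2 * (n - 1))

    # Agrees with the pooled-variance t test in base R:
    # t.test(x, y, alternative = "greater", var.equal = TRUE)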

However, a two-sided test is used in the article. Since the two-sidedness strongly influences the likelihood ratio (a factor of 2), this is an important detail.
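To illustrate where that factor of 2 comes from under the "p-equals" reading, here is a rough sketch (my own illustration, not code from the article), again assuming n = 16 per group and non-centrality sqrt(n/2):

    n    <- 16
    df   <- 2 * (n - 1)
    ncp  <- sqrt(n / 2)           # non-centrality under H1
    tobs <- qt(1 - 0.05 / 2, df)  # observed t with two-sided p = 0.05

    # One-sided: densities compared at +tobs only.
    LR.one <- dt(tobs, df, ncp) / dt(tobs, df)

    # Two-sided: under H0 both tails are equally likely, whereas under H1
    # the density at -tobs is negligible, so the ratio roughly halves.
    LR.two <- (dt(tobs, df, ncp) + dt(-tobs, df, ncp)) /
              (dt(tobs, df) + dt(-tobs, df))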

Most certainly I am missing something. Could you help me shed some light on this? Thanks.



@Lasse Kliemann

I'm sorry that it's taken me 6 days to notice your question. I've had lots of feedback but very little has appeared in the comments here. I guess it takes time for people to get used to the idea of post-publication peer review.

The reason for assuming that the distributions are Gaussian in both samples is that that is what the t test assumes. The simulations (and the exact calculations here) obey exactly the assumptions made by the test. Of course in real life that may not happen, but the idea in this paper (and the 2014 paper) is to see what P values tell you in the ideal case. As I say more than once, real life can only be worse than the results here.
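As a rough illustration, a toy version of that kind of simulation might look like this (not the paper's actual script):

    # Toy version of the kind of simulation described (not the paper's
    # actual script): Gaussian samples satisfying the t test assumptions.
    set.seed(42)
    nsim <- 10000
    n    <- 16
    prev <- 0.5                    # prior prevalence of real effects
    real <- runif(nsim) < prev     # which runs have a true effect
    pvals <- sapply(real, function(r) {
      x <- rnorm(n, mean = if (r) 1 else 0)
      y <- rnorm(n, mean = 0)
      t.test(x, y, var.equal = TRUE)$p.value
    })
    sig <- pvals < 0.05
    # Fraction of "significant" runs that had no real effect: about 6%
    # for the "p <= 0.05" criterion (the 26% figure below is for the
    # "p-equals" case, where P is observed to be close to 0.05).
    mean(!real[sig])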

As you say, the calculations are all done for a true difference between means of one SD. But, as I say (e.g. p. 5):

"In order to calculate the FPR we need to postulate an alternative to the null hypothesis. Let’s say that the true effect size is equal to one, the same as the standard deviation of the individual responses. This is not as arbitrary as it seems at first sight, because identical results are obtained with any other true effect size, as long as the sample size is adjusted to keep the power of the test unchanged [10]."

An example is given in ref [10]:

"For example, if the true effect size is reduced from 1 to 0.5 s.d., then to keep the power at 0.8 the sample size has to be increased from 16 to 63. Running the example with these numbers gives a false positive rate of 26%, as before, for a prior prevalence of 0.5 (and much worse for lower prevalences)."

This suggests that the problem might be normalised in a way that would make it more obvious that the results aren't dependent on the effect size that's chosen. But people think in terms of sample size and effect size, so I haven't tried to do that.

Finally, the reason for doing 2-sided tests is simply that that's what most people do. I don't think that it matters very much. In real life there is very little difference between the implications of observing P = 0.03 and P = 0.06. Both mean "worth another look"; neither provides strong evidence that there's a real effect.