May 2008
Sun Mon Tue Wed Thu Fri Sat
        1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31

Authors' Committee

Chair:

Andy Eggers (Gov)

Members:

Weihua An (Soc)
Kevin Bartz (Stats)
Sebastian Bauhoff (HealthPol)
John Graves (HealthPol)
Justin Grimmer (Gov)
Jens Hainmueller (Gov)
Mike Kellermann (Gov)
Ellie Powell (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Kevin Quinn, Jamie Robins, Don Rubin, Chris Winship

Recent Comments

Recent Entries

Categories

Blogroll

Brad DeLong
Cognitive Daily
Complexity & Social Networks
Developing Intelligence
EconLog
The Education Wonks
Empirical Legal Studies
Free Exchange
Freakonomics
Health Care Economist
Junk Charts
Language Log
Law & Econ Prof Blog
Machine Learning (Theory)
Marginal Revolution
Mixing Memory
Mystery Pollster
New Economist
Political Arithmetik
Political Science Methods
Pure Pedantry
Science & Law Blog
Simon Jackman
Social Science++
Statistical modeling, causal inference, and social science

Archives

Notification

Powered by
Movable Type 3.34


« Applied Statistics - Nan Laird & Christoph Lang | Main | Predicting Elections »

31 October 2006

More thoughts on publication bias and p-values

Amy Perfors

In a previous post about the Gerber & Malhotra paper about publication bias in political science, I rather optimistically opined that the findings -- that there were more significant results than would be predicted by chance, and that many of those were suspiciously close to 0.05 -- were probably not deeply worrisome, at least for those fields in which experimenters could vary the number of subjects run based on the significance level achieved thus far.

Well, I now disagree with myself.

This change of mind comes as a result of reading about the Jeffreys-Lindley paradox (Lindley, 1957), a Bayes-inspired critique of significance testing in classical statistics. It says, roughly, that with large enough sample size, a p-value can be arbitrarily close to zero even though the null hypothesis is highly probable (i.e., very close to one). In other words, a classical statistical test might reject the null hypothesis at an arbitrarily low p-value, even though the evidence that it should be accepted is overwhelming. [A discussion of the paradox can be found here].

When I learned about this result a few years ago, it astonished me, and I still haven't fully figured out how to deal with all of the implications. (This is obvious, since I forgot about it when writing the previous post!). As I understand the paradox, the intuitive idea is that, with larger sample size, you will naturally get some data that appears unlikely (and, the more data you collect, the more likely you are to see some really unlikely data). If you forget to compare the probability of that data under the null hypothesis with the probability of the data under the alternative hypotheses, then you might get an arbitrarily low p-value (indicating that the data are unlikely under the null hypothesis) even if the data is even more unlikely under any of the alternatives. Thus, if you just look at the p-value, without taking effect size, sample size, or the comparative posterior probability of each hypothesis under consideration, you are likely to wrongly reject the null hypothesis on the basis of the p-value, even if it is the most likely of all possibilities.

The tie-in with my post before, of course, is that it implies that it isn't necessarily "okay" practice to keep increasing sample size until you achieve statistical significance. Of course, in practice, sample sizes rarely get larger than 24 or 32 -- at the absolute outside, 50 to 100 -- which is much smaller than infinity. Does this practical consideration, then, mean that the practice is okay? As far as I can tell, it is fairly standard (but then, so is the reliance on p-values to the exclusion of effect sizes, confidence intervals, etc., so "common" doesn't mean "okay"). Is this practice a bad idea only if your sample gets extremely large?

Lindley, D.V. (1957) A statistical paradox. Biometrika, 44. 187-192

Posted by Amy Perfors at October 31, 2006 10:00 AM

Comments

On further reflection, I just wanted to clarify two things that weren't necessarily clear in this post:


1) I did not mention that one of the things driving this result is the assumption that we put slightly more prior probability on the null hypothesis than on the alternatives. The result occurs no matter how slight the "more" is, but it is necessary to have that sort of prior. I don't think this affects the implications that much, because in most scientific contexts I can think of, that's a reasonable formal approximation to what we are mentally doing. But I should have mentioned this as a factor.


2) Regarding the question of whether this is worrisome in practice, which came up on the Mixing Memory blog... I think the take-home message of the paradox is that effect size, not just statistical significance level, is important. As long as we report effect sizes whenever they run data until they get a significant result, then I guess my worry is mostly related to factors in the sociology of science and how other scientists tend to think about the results. In other words, because people tend to think categorically, and because a significance level gives a nice "category" to fit results in, even if effect size is reported and it's small people probably notice and remember that it was significant. This tendency is made worse by the fact that sometimes there is no accepted notion of how big of an effect is "interesting", in the same way that there are accepted p-value thresholds. Thus, if we always run subjects until getting a significant p-value -- even if we report effect size -- what ends up staying in memory is just the result and the knowledge that it was significant. It might be better to stop collecting data earlier, thus possibly overlooking findings with small effect sizes, in order to just focus on and pinpoint the interesting and robust results.


Honestly, I don't think this is a big worry in practice, at least for a lot of reported work. I made the post mainly because I think the paradox is cool and wanted to talk about it. :) But effect size does matter, for many diverse reasons: and the more salient we make this point, the more often we emphasize it, the less I worry about being led astray because of cognitive factors like those I detailed in the last paragraph.


Posted by: Amy at November 2, 2006 7:10 PM

Notification

Enter e-mail address to receive notification of new comments to this entry

Post a comment




Remember Me?

(you may use HTML tags for style)