This claim (or some variant) has been invoked by a few researchers when they take a position on issues of replication and the general purpose of research. For example, I have heard this platitude from some quarters as an explanation for why they are unconcerned when an original finding with a d of 1.2 shrinks to a d of .12 in exact replications. Someone recently asked me for advice on how to respond to a person making the above claim and I struggled a bit. My first response was to dig up these two quotes and call it a day.
Cohen (1994): “Next, I have learned and taught that the primary product of research inquiry is one or more measures of effect size, not p values.” (p. 1310).
Abelson (1995): “However, as social scientists move gradually away from reliance on single studies and obsession with null hypothesis testing, effect size measures will become more and more popular” (p. 47).
But I decided to try a bit harder, so here are my random thoughts on how to respond to the above claim.
1. Assume this person is making a claim about the utility of NHST.
One retort is to ask how the researcher judges the outcome of their experiments. They need a method to distinguish the “chance” directional hit from the “real” directional hit. Often the preferred tool is NHST, such that the researcher will judge that their experiment produced evidence consistent with their theory (or that it failed to refute their theory) if the direction of the difference/association was consistent with their prediction and the result was statistically significant at some level (say an alpha of .05). Unfortunately, the beloved p value is determined, in part, by the effect size.
To quote from Rosenthal and Rosnow (2008, p. 55):
Because a complete account of “the results of a study” requires that the researcher report not just the p value but also the effect size, it is important to understand the relationship between these two quantities. The general relationship…is…Significance test = Size of effect * Size of study.
So if you care about the p value, you should care (at least somewhat) about the effect size. Why? The researcher gets to pick the size of the study so the critical unknown variable is the effect size. It is well known that given a large enough N, any trivial difference or non-zero correlation will attain significance (see Cohen, 1994, p. 1000 under the heading “The Nil Hypothesis”). Cohen notes that this point was understood as far back as 1938. Social psychologists can look to Abelson (1995) for a discussion of this point as well (see p. 40).
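The “Size of effect * Size of study” relationship can be sketched in a few lines. This is only a rough illustration under my own assumptions: two independent groups of equal size, a test statistic of d * sqrt(n / 2), and a normal approximation in place of the t distribution. The function name approx_two_sided_p is made up for this example.

```python
import math
from statistics import NormalDist


def approx_two_sided_p(d, n_per_group):
    """Approximate two-sided p value for a two-group comparison.

    Assumes equal group sizes and uses a normal approximation to the
    t distribution, so it is only a rough guide for small samples.
    """
    # Significance test = size of effect * size of study
    test_stat = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(test_stat)))


# Hold a trivial effect (d = .05) fixed and let the study grow.
for n in (20, 200, 2000, 20000):
    print(n, approx_two_sided_p(0.05, n))
```

With the effect size held at a trivial d of .05, the p value drops below .05 once the groups get large enough, which is the point Cohen and Abelson were making.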
To further understand the inherent limitations of this NHST-bound approach, we can (and should) quote from the book of Paul Meehl (Chapter 1978).
“Putting it crudely, if you have enough cases and your measures are not totally unreliable, the null hypothesis will always be falsified, regardless of the truth of the substantive theory. Of course, it could be falsified in the wrong direction, which means that as the power improves, the probability of a corroborative result approaches one-half. However, if the theory has no verisimilitude – such that we can imagine, so to speak, picking our empirical results randomly out of a directional hat apart from any theory – the probability of a refuting by getting a significant difference in the wrong direction also approaches one-half. Obviously, this is quite unlike the situation desired from either a Bayesian, a Popperian, or a commonsense scientific standpoint.” (Meehl, 1978, p. 822).
Meehl gets even more pointed (p. 823):
I am not a statistician, and I am not making a statistical complaint. I am making a philosophical complaint or, if you prefer, a complaint in the domain of scientific method. I suggest that when a reviewer tries to “make theoretical sense” out of such a table of favorable and adverse significance test results, what the reviewer is actually engaged in, willy-nilly or unwittingly, is meaningless substantive constructions on the properties of the statistical power function, and almost nothing else.
Thus, I am not sure that this appeal to directionality with the binary outcome from NHST (i.e., a statistically significant versus not statistically significant result according to some arbitrary alpha criterion) helps make the above argument persuasive. Ultimately, I believe researchers should think about how strongly the results of a study corroborate a particular theoretical idea. I think effect sizes are more useful for this purpose than the p-value. You have to use something – why not use the most direct indicator of magnitude?
A somewhat more informed researcher might tell us to go read Wainer (1999) as a way to defend the virtues of NHST. This paper is called “One Cheer for Null Hypothesis Significance Testing” and appeared in Psychological Methods in 1999. Wainer suggests 6 cases in which a binary decision would be valuable. His example from psychology is testing the hypothesis that the mean human intelligence score at time t is different from the mean score at time t+1.
However, Wainer also seems to find merit in effect sizes. He also writes, “Once again, it would be more valuable to estimate the direction and rate of change, but just being able to state that intelligence is changing would be an important contribution” (p. 213). He concludes that “Scientific investigations only rarely must end with a simple reject-not reject decision, although they often include such decisions as part of their beginnings” (p. 213). So in the end, I am not sure that any appeal to NHST over effect size estimation and interpretation works very well. Relying exclusively on NHST seems way worse than relying on effect sizes.
2. Assume this person is making a claim about the limited value of generalizing results from a controlled lab study to the real world.
One advantage of the lab is the ability to generate a strong experimental manipulation. The downside is that any effect size estimate from such a study may not represent typical real-world dynamics and thus risks misleading uninformed (or unthinking) readers. For example, if we wanted to test the idea that drinking regular soda makes rats fat, we could give half of our rats the equivalent of 20 cans of Coke a day whereas the other half could get 20 cans of Diet Coke per day. Let’s say we did this experiment, the difference was statistically significant (p < .0001), and we got a d of 2.0. The Coke-exposed rats were heavier than the Diet Coke-exposed rats.
What would the effect size mean? Drawing attention to what seems like a huge effect might be misleading because most rats do not drink 20 cans of coke a day. The effect size would presumably fluctuate with a weaker or stronger manipulation. We might get ridiculed by the soda lobby if we did not exercise caution in portraying the finding to the media.
This scenario raises an important point about the interpretation of the effect sizes but I am not sure it negates the need to calculate and consider effect sizes. The effect size from any study should be viewed as an estimate of a population value and thus one should think carefully about defining the population value. Furthermore, the rat obesity expert presumably knows about other effect sizes in the literature and can therefore place this new result in context for readers. What effect sizes do we see when we compare sedentary rats to those who run 2 miles per day? What effect sizes do we see when we compare genetically modified “fat” rats to “skinny” rats? That kind of information helps the researcher interpret both the theoretical and practical importance of the coke findings.
There are probably other ways of being more charitable to the focal argument. Unfortunately, I need to work on some other things and think harder about this issue. I am interested to see if this post generates comments. However, I should say that I am skeptical that there is much to admire about this perspective on research. I have yet to read a study where I wished the authors had omitted the effect size estimate.
Effect sizes matter for at least two other reasons beyond interpreting results. First, we need to think about effect sizes when we plan our studies. Otherwise, we are just being stupid and wasteful. Indeed, it is wasteful and even potentially unethical to expend resources conducting underpowered studies (see Rosenthal, 1994). Second, we need to evaluate effect sizes when reviewing the literature and conducting meta-analyses. We synthesize effect sizes, not p values. Thus, effect sizes matter for planning studies, interpreting studies, and making sense of an overall literature.
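To make the planning point concrete, here is a rough sketch of how the required sample size depends on the effect size you expect. The assumptions are mine: a two-sided test comparing two independent groups of equal size, an alpha of .05, a target power of .80, and a normal approximation (an exact t-based calculation gives slightly larger numbers). The function name n_per_group is made up for this example.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect effect size d.

    Normal-approximation formula: n = 2 * ((z_alpha + z_power) / d) ** 2,
    rounded up to the next whole participant.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# The two d values mentioned at the top of the post, plus a "medium" effect.
for d in (1.2, 0.5, 0.12):
    print(d, n_per_group(d))
```

Planning around a d of 1.2 when the population value is closer to .12 leaves you with a sample that is far too small, which is one way to end up with an underpowered study.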
[Snarky aside, skip if you are sensitive]
I will close with a snarky observation that I hope does not detract from my post. Some of the people making the above argument about effect sizes get testy about the low power of failed replication studies of their own findings. I could fail to replicate hundreds (or more) important effects in the literature by running a bunch of 20 person studies. This should surprise no one. However, a concern about power only makes sense in the context of an underlying population effect size. I just don’t see how you can complain about the power of failed replications and dismiss effect sizes.
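A quick sketch of why power cannot be discussed without an effect size. The assumptions are mine: a “20 person study” means two independent groups of 10, the test is two-sided with an alpha of .05, and a normal approximation stands in for the t distribution. The function name approx_power is made up for this example.

```python
import math
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test for effect size d.

    Uses a normal approximation, so the values are slightly optimistic
    for very small samples.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Power of a 20 person study (10 per group) under different population effects.
for d in (0.2, 0.5, 0.8, 1.2):
    print(d, approx_power(d, 10))
```

The same 20 person study has very different power depending on which population effect size you plug in, so a complaint about low power is implicitly a claim about the effect size.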
Post Script (6 August 2013):
Daniel Simons has written several good pieces on this topic. These influenced my thinking and I should have linked to them. Here they are:
Likewise, David Funder talked about similar issues (see also the comments):
And of course, Lee Jussim (via Brent Roberts)…