“Although well-meaning, many of the suggestions only make sense if you say them very fast.” -Howard Wainer (2011, p. 7) from Uneducated Guesses
I love this phrase: Your ideas only make sense if you say them very fast. I find myself wanting to invoke this idea anytime I hear some of the counterarguments to methodological reform. For example, I think this line applies to NS’s comment about climate change skeptics.
Anyways, I am about 90% done with the articles in the November 2012 special issue of Perspectives on Psychological Science. I enjoyed reading most of the articles and it is a good resource for thinking about reform in psychological research. It should probably be required reading in graduate seminars. So far, the article that generated the strongest initial reaction was the Galak and Meyvis (2012; hereafter G & M) reply to Francis (2012). I think they basically made his point for him. [I should disclose that I think the basic idea pursued by G and M seems plausible and I think their reply was written in a constructive fashion. I just did not find their arguments very convincing.]
Francis (2012) suggests that the 8 studies in their package are less compelling when viewed in the aggregate because the “hit” rate is much higher than one would expect given the sample sizes and effect size in question. The implication is that there is probably publication bias. [Note: People sometimes quibble over how Francis calculates his effect size estimate but that is a topic for another blog post.]
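To make the "hit rate" logic concrete, here is a minimal sketch of the kind of calculation involved. It is not Francis's actual procedure or his numbers: the per-group sample sizes are hypothetical placeholders, the power calculation uses a normal approximation to a two-group comparison, and the function names are my own. The two effect sizes are the pooled estimates discussed in this post (.57 from the published studies, .38 with the file drawer included).

```python
from math import prod, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Uses a normal approximation to the t test, so values are rough
    for very small samples.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic if the true effect is d.
    expected_z = d * sqrt(n_per_group / 2)
    return STD_NORMAL.cdf(expected_z - z_crit) + STD_NORMAL.cdf(-expected_z - z_crit)


def prob_all_significant(d, ns_per_group, alpha=0.05):
    """Chance that every independent study in the set reaches p < alpha."""
    return prod(approx_power(d, n, alpha) for n in ns_per_group)


# HYPOTHETICAL per-group sample sizes for 8 studies; NOT the G & M values.
hypothetical_ns = [20, 25, 30, 30, 35, 40, 45, 50]

for d in (0.57, 0.38):
    p_all = prob_all_significant(d, hypothetical_ns)
    print(f"assumed effect size {d:.2f}: P(all 8 studies significant) = {p_all:.4f}")
```

The point of the exercise is that the probability of going 8 for 8 is the product of the individual powers, so it shrinks quickly when each study is only modestly powered. A perfect record from small samples is exactly what should raise eyebrows.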
I happen to like the “Francis” tables because readers get to see effect size estimates and the sample sizes stripped clean of narrative baggage. Usually the effect sizes are large and the sample sizes are small. This general pattern would seem to characterize the G and M results. (Moreover, the correlation between effect size estimates and sample sizes for the G and M set of results was something like -.88. Ouch!)
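Why is a strong negative correlation between effect size and sample size worrying? A rough simulation sketch, under assumptions I am making up for illustration (a true effect of 0.3, per-group sample sizes drawn uniformly from 10 to 100, a normal approximation to the sampling distribution of the observed effect, and a rule that only results significant in the predicted direction get "published"), shows how selection alone can manufacture such a correlation. None of these values come from G & M or Francis.

```python
import random
from math import sqrt
from statistics import NormalDist


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(true_d=0.3, n_studies=5000, seed=1):
    """Simulate studies, then keep only those significant in the predicted direction."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    all_studies, published = [], []
    for _ in range(n_studies):
        n = rng.randint(10, 100)         # hypothetical per-group sample size
        se = sqrt(2 / n)                 # approximate standard error of d
        d_obs = rng.gauss(true_d, se)    # observed effect size for this study
        all_studies.append((n, d_obs))
        if d_obs / se > z_crit:          # the "file drawer" filter
            published.append((n, d_obs))
    return all_studies, published


all_studies, published = simulate()
r_all = pearson_r(*zip(*all_studies))
r_published = pearson_r(*zip(*published))
print(f"all studies:       r(n, d) = {r_all:.2f}")
print(f"'published' only:  r(n, d) = {r_published:.2f}")
```

Without selection, the observed effect size should be roughly unrelated to sample size. Once only the significant results survive, small studies can make the cut only by overshooting the true effect, so the smallest samples end up with the largest estimates.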
G and M acknowledge that they had studies in their file drawer: “We reported eight successful demonstrations of this phenomenon in our paper, but we also conducted five additional studies whose results either did not reach conventional levels of significance or did reach significance but ended up being rhetorically redundant” (G & M, 2012, p. 595). So there was selective reporting. Case closed in my book. Game over. As an aside, I am not sure I can distinguish those desperate effect sizes who are reaching toward the p < .05 promised land from those who are fleeing from it. Can you? It probably takes ESP.
G and M calculated their overall effect size as a g* of .38 (95% CI .25 to .51) with all studies in the mix whereas Francis reported the average g* from the published work as .57. So it seems to me that their extra data bring down the overall effect size estimate. Is this a hint of the so-called decline effect? G and M seem to want to argue that because the g* estimate is bigger than zero, there is no real issue at stake. I disagree. Scientific judgment is rarely a yes/no decision about the existence of an effect. It is more often about the magnitude of the effect. I worry that the G and M approach distorts effect size estimates and possibly even perpetuates small n studies in the literature.
G and M also stake a position that I fail to understand: “However, as is the case for many papers in experimental psychology, the goal was never to assess the exact size of the effect, but rather to test between competing theoretical predictions” (G & M, 2012, p. 595). People use this or a similar argument to dismiss current concerns about the paucity of exact replications and the proliferation of small sample sizes in the literature. What I do understand about this argument makes me skeptical. Let’s quote from the bible of Jacob Cohen (1990, p. 1309):
“In retrospect, it seems to me simultaneously quite understandable yet also ridiculous to try to develop theories about human behavior with p values from Fisherian hypothesis testing and no more than a primitive sense of effect size.”
So effect sizes matter for theories. Effect sizes tell us something about the magnitude of the associations in question (causal or otherwise) and I happen to think this is critical information for evaluating the truthiness of a theoretical idea. Indeed, I think the field of psychology would be healthier if we focused on getting more precise estimates of particular effects rather than playing this game of collecting “hits” from a bunch of underpowered “conceptual” extensions of a general idea. I actually think G and M should have written this statement: “As is the case for many papers in psychology, our goal was to present as much evidence as possible for our preferred theoretical orientation.”
This strategy seems common but I believe it ultimately produced a JPSP paper on ESP. So perhaps it is time to discard this approach and try something else for a change. Heck, even a 1-year recess might be worth it. That moratorium on NHST worked, right?