Your arguments only make sense if you say them very fast…

“Although well-meaning, many of the suggestions only make sense if you say them very fast.”   -Howard Wainer (2011, p. 7) from Uneducated Guesses

I love this phrase: Your ideas only make sense if you say them very fast. I find myself wanting to invoke this idea anytime I hear some of the counterarguments to methodological reform. For example, I think this line applies to NS’s comment about climate change skeptics.

Anyways, I am about 90% done with the articles in the November 2012 special issue of Perspectives on Psychological Science. I enjoyed reading most of the articles and it is a good resource for thinking about reform in psychological research. It should probably be required reading in graduate seminars. So far, the article that generated the strongest initial reaction was the Galak and Meyvis (2012; hereafter G & M) reply to Francis (2012). I think they basically made his point for him. [I should disclose that I think the basic idea pursued by G and M seems plausible and I think their reply was written in a constructive fashion. I just did not find their arguments very convincing.]

Francis (2012) suggests that the 8 studies in their package are less compelling when viewed in the aggregate because the “hit” rate is much higher than one would expect given the sample sizes and effect size in question. The implication is that there is probably publication bias. [Note: People sometimes quibble over how Francis calculates his effect size estimate but that is a topic for another blog post.]
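To make the logic of the "hit" rate argument concrete, here is a rough sketch of the kind of calculation involved. It is not Francis's exact procedure: I use a normal approximation to the power of a two-group comparison, and the per-group sample sizes are hypothetical placeholders rather than the values from the G and M studies. The only number taken from the discussion here is the average published g* of .57.

```python
from math import prod, sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected test statistic for a standardized mean difference d.
    ncp = d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)


def prob_all_significant(d, sample_sizes, alpha=0.05):
    """Probability that every independent study reaches significance."""
    return prod(approx_power(d, n, alpha) for n in sample_sizes)


# Hypothetical per-group sample sizes for eight studies (NOT the G and M values).
hypothetical_ns = [20, 25, 30, 30, 35, 40, 45, 50]
assumed_d = 0.57  # average published g* reported by Francis

p_all_hits = prob_all_significant(assumed_d, hypothetical_ns)
print(f"Chance that all {len(hypothetical_ns)} studies are hits: {p_all_hits:.3f}")
```

The point of the sketch is only that the product of several modest power values shrinks quickly, so an unbroken string of hits from small samples should raise eyebrows.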

I happen to like the “Francis” tables because readers get to see effect size estimates and the sample sizes stripped clean of narrative baggage.  Usually the effect sizes are large and the sample sizes are small. This general pattern would seem to characterize the G and M results.  (Moreover, the correlation between effect size estimates and sample sizes for the G and M set of results was something like -.88.  Ouch!).

G and M acknowledge that they had studies in their file drawer:  “We reported eight successful demonstrations of this phenomenon in our paper, but we also conducted five additional studies whose results either did not reach conventional levels of significance or did reach significance but ended up being rhetorically redundant” (G & M, 2012, p. 595). So there was selective reporting. Case closed in my book. Game over. As an aside, I am not sure I can distinguish between those desperate effect sizes who are reaching toward the p < .05 promised land from those who are fleeing from it. Can you?  It probably takes ESP.

G and M calculated their overall effect size as a g* of .38 (95% CI .25 to .51) with all studies in the mix, whereas Francis reported the average g* from the published work as .57. So it seems to me that their extra data bring down the overall effect size estimate. Is this a hint of the so-called decline effect? G and M seem to want to argue that because the g* estimate is bigger than zero there is no real issue at stake. I disagree. Scientific judgment is rarely a yes/no decision about the existence of an effect. It is more often about the magnitude of the effect. I worry that the G and M approach distorts effect size estimates and possibly even perpetuates small n studies in the literature.
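A quick simulation shows why selective reporting worries me. This is a minimal sketch under stated assumptions: I treat .38 (the all-studies estimate) as the true effect, pick a hypothetical 20 participants per group, draw each study's estimate from an approximate normal sampling distribution, and "publish" only estimates that are significant in the predicted direction. None of these settings are taken from the actual studies.

```python
import random
from math import sqrt
from statistics import mean


def simulate_file_drawer(true_d, n_per_group, n_studies, z_crit=1.96, seed=2012):
    """Compare the mean effect across all studies with the mean of the 'hits'."""
    rng = random.Random(seed)
    # Approximate standard error of a standardized mean difference (equal groups).
    se = sqrt(2 / n_per_group)
    all_estimates = [rng.gauss(true_d, se) for _ in range(n_studies)]
    # Only significant results in the predicted direction leave the file drawer.
    published = [d for d in all_estimates if d / se > z_crit]
    return mean(all_estimates), mean(published), len(published) / n_studies


mean_all, mean_published, hit_rate = simulate_file_drawer(
    true_d=0.38, n_per_group=20, n_studies=20000
)
print(f"Mean effect, all studies:      {mean_all:.2f}")
print(f"Mean effect, published 'hits': {mean_published:.2f}")
print(f"Share of studies that are hits: {hit_rate:.2f}")
```

With small samples, only the overestimates clear the significance bar, so the published average has to sit above the true value. That is the distortion I have in mind.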

G and M also stake a position that I fail to understand:  “However, as is the case for many papers in experimental psychology, the goal was never to assess the exact size of the effect, but rather to test between competing theoretical predictions” (G & M, 2012, p.  595). People use this or a similar argument to dismiss current concerns about the paucity of exact replications and the proliferation of small sample sizes in the literature.  What I do understand about this argument makes me skeptical. Let’s quote from the bible of Jacob Cohen (1990, p. 1309):

“In retrospect, it seems to me simultaneously quite understandable yet also ridiculous to try to develop theories about human behavior with p values from Fisherian hypothesis testing and no more than a primitive sense of effect size.”

So effect sizes matter for theories. Effect sizes tell us something about the magnitude of the associations in question (causal or otherwise) and I happen to think this is critical information for evaluating the truthiness of a theoretical idea. Indeed, I think the field of psychology would be healthier if we focused on getting more precise estimates of particular effects rather than playing this game of collecting “hits” from a bunch of underpowered “conceptual” extensions of a general idea.  I actually think G and M should have written this statement: “As is the case for many papers in psychology, our goal was to present as much evidence as possible for our preferred theoretical orientation.”

This strategy seems common but I believe it ultimately produced a JPSP paper on ESP.  So perhaps it is time to discard this approach and try something else for a change. Heck, even a 1-year recess might be worth it. That moratorium on NHST worked, right?

“At any given time we know what we are doing….”

Disclaimer: Both Robert MacCallum (e.g., 2003) and George Box (e.g., 1979) have written extensively about the value of models and I will basically parrot their ideas in this post. Moreover, I did not sleep much last night…

Let the postmortem on the 2012 election begin! One story will likely involve the accuracy of well-conducted polls and the success of Nate Silver’s methods over “gut-based” methods favored by pundits and campaign workers. Not surprisingly, I like much of this story, which is nicely summed up by this cartoon (thanks to Skip G. for posting this one on “the” Facebook).

But what if Nate Silver was wrong? What if we woke up today and Romney had won 303 electoral votes and Obama had lost the election? I think Mr. Silver would have been in a much better position than his “gut-based” critics are in today. The reason boils down to the advantages of models. It is really useful to have a formalized recipe for prediction. To quote Box (1979): “The great advantage of the model-based over the ad hoc approach, it seems to me, is that at any given time we know what we are doing.”

I happen to like the model for science depicted by George Box (see Figure A1 in his 1976 Science and Statistics paper). The basic idea is that errors drive the accumulation of knowledge in an iterative cycle. Learning is produced when there is “a discrepancy between what tentative theory suggests should be so and what practice says is so” (p. 791). In other words, there is something to be gained when predictions from models and empirical facts disagree. These errors lead to better models, at least in the ideal case.

So if Mr. Silver’s model was gravely wrong, he could have spent the next days and weeks figuring out where his model went wrong. He has his predicted values and he has the actual values. He can test alternative models to find ones that outperform his original model. He is in a good position to learn something. Compare his plight with that of the gut-based pundit. How are they going to figure out why their predictions failed? What are they going to learn?
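Here is a toy illustration of what having predicted and actual values buys you. Everything in it is hypothetical: the vote shares, the two model names, and the choice of mean absolute error as the scoring rule are my own placeholders, not anything from Mr. Silver's actual methods.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predictions and outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rank_models(models, actual):
    """Return (name, error) pairs sorted from smallest to largest error."""
    scored = [(name, mean_absolute_error(pred, actual)) for name, pred in models.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical vote shares (percent) in four made-up states.
actual_shares = [51.0, 47.5, 53.2, 49.1]
candidate_models = {
    "original": [48.0, 50.0, 50.5, 52.0],     # the model that missed
    "alternative": [50.5, 48.0, 52.0, 49.5],  # a revised model to compare
}

ranking = rank_models(candidate_models, actual_shares)
for name, error in ranking:
    print(f"{name}: mean absolute error = {error:.2f} points")
```

The gut-based pundit has nothing to feed into a function like this, which is the whole problem.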

Moral of my story: Models rule.  I think there might be a bigger lesson in here for “soft” psychology but I am too tired to express it properly.