Full Disclosure: I am second author on the McDonald et al. (2014) commentary.
Some of you may have seen that Psychological Science published our commentary on the Birtel and Crisp (2012) paper. Essentially we tried to replicate two of their studies with larger sample sizes (29 versus 240 and 32 versus 175, respectively) and obtained much lower effect size estimates. It is exciting that Psychological Science published our work and I think this is a hint of positive changes for the field. Hopefully nothing I write in this post undercuts that overarching message.
I read the Crisp and Birtel response and I had a set of reactions (shocking, I know!). I think it is fair that they get the last word in print, but I had some thoughts I wanted to share, so I will air a few in this blog post. Before diving into the issues, I want to reiterate the basic take-home message of McDonald et al. (2014):
“Failures to replicate add important information to the literature and should be a normal part of the scientific enterprise. The current study suggests that more work is needed before Birtel and Crisp’s procedures are widely implemented. Interventions for treating prejudice may require more precise manipulations along with rigorous evaluation using large sample sizes.” (p. xx)
1. Can we get a mulligan on our title? We might want to revise the title of our commentary to make it clear that our efforts applied to only two specific findings in the original Birtel and Crisp (2012) paper. I think we were fairly circumscribed in the text itself, but the title might have opened the door for how Crisp and Birtel (2014) responded. They basically thanked us for our efforts and pointed out that our difficulties with two specific studies say nothing about the entire imagined contact hypothesis. They even argued that we “overgeneralized” our findings to the entire imagined contact literature. To be frank, I do not think they were being charitable to our piece with this criticism because we did not make this claim in the text. But titles are important, and our title might have suggested some sort of overgeneralization. I will let readers make their own judgments. Regardless, I wish we had made the title more focused.
2. If you really believe the d is somewhere around .35, why were the sample sizes so small in the first place? A major substantive point in the Crisp and Birtel (2014) response is that the overall d for the imagined contact literature is somewhere around .35, based on a recent Miles and Crisp (2014) meta-analysis. That is a reasonable point, but I think it actually undercuts the Birtel and Crisp (2012) paper and makes our take-home point for us (i.e., the importance of using larger sample sizes in this literature). None of the original Birtel and Crisp (2012) studies had anywhere near the power to detect a population d of .35. For a simple two-group independent t-test design, achieving power of .80 to detect that effect requires about 260 participants (130 in each group). The largest sample size in Birtel and Crisp (2012) was 32.
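For readers who want to check the power arithmetic, here is a quick sketch of the standard normal-approximation sample-size calculation for a two-group, two-sided test (the function names are mine; an exact t-based calculation, as in G*Power, gives a nearly identical answer of about 130 per group):

```python
import math

def norm_ppf(p, lo=-10.0, hi=10.0):
    # Inverse standard-normal CDF via bisection on math.erf.
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def n_per_group(d, alpha=0.05, power=0.80):
    # Normal-approximation n per group for a two-sided,
    # two-group independent t test: 2 * (z_a/2 + z_b)^2 / d^2.
    z_alpha = norm_ppf(1 - alpha / 2)
    z_beta = norm_ppf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(0.35))  # ~129 per group, i.e. roughly 260 total
```

Against that requirement, the largest Birtel and Crisp (2012) sample of 32 would have power well under .20 to detect a d of .35.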
3. What about the ManyLabs paper? The now famous ManyLabs paper of Klein et al. (in press) reports a replication attempt of an imagined contact study (Study 1 in Husnu & Crisp, 2010). The ManyLabs effort yielded a much lower effect size estimate (d = .13, N = 6,336) than the original report (d = .86 or .84 as reported in Miles & Crisp, 2014; N = 33). This is quite similar to the pattern we found in our work. Thus, I think there is something of a decline effect in operation. There is a big difference in interpretation between a d of .80 and a d around .15. This should be worrisome to the field especially when researchers begin to think of the applied implications of this kind of work.
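One way to see why estimates from samples this small are so unstable is to look at the confidence intervals the sample sizes imply. A rough sketch, using the standard large-sample variance approximation for Cohen's d and assuming roughly equal group sizes (my assumption for illustration; the papers do not report the splits here):

```python
import math

def d_ci(d, n1, n2, z=1.96):
    # Approximate 95% CI for Cohen's d using the standard
    # large-sample variance: (n1+n2)/(n1*n2) + d^2 / (2*(n1+n2)).
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return (d - z * se, d + z * se)

# Original Husnu & Crisp (2010) Study 1: d = .86, N = 33 (split assumed)
print(d_ci(0.86, 17, 16))
# ManyLabs replication: d = .13, N = 6,336 (split assumed)
print(d_ci(0.13, 3168, 3168))
```

The original study's interval runs from a small effect to a very large one (roughly .15 to 1.57), whereas the ManyLabs estimate is pinned down tightly around .13. The point is less that the two results flatly contradict each other than that an N of 33 cannot tell you much about where the true effect lies.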
4. What about the Miles and Crisp Meta-Analysis (2014)? I took a serious look at the Miles and Crisp meta-analysis and I basically came away with the sinking feeling that much more research needs to be done to establish the magnitude of the imagined contact effects. Many of the studies used in the meta-analysis were grossly underpowered. There were 71 studies and only 2 had sample sizes above 260 (the threshold for having a good chance to detect a d = .35 effect using the standard between-participants design). Those two large studies yielded basically null effects for the imagined contact hypothesis (d = .02 and .05, ns = 508 and 488, respectively). The average sample size of the studies in the meta-analysis was 81 (81.27 to be precise) and the median was 61 (Min. = 23 and Max. = 508). A sample size of 123 was in the 90th percentile (i.e., 90% of the samples were below 123) and nearly 80% of the studies had sample sizes below 100.
Miles and Crisp (2014) were worried about sample size but perhaps not in the ways that I might have liked. Here is what they wrote: “However, we observed that two studies had a sample size over 6 times the average (Chen & Mackie, 2013; Lai et al., 2013). To ensure that these studies did not contribute disproportionately to the summary effect size, we capped their sample size at 180 (the size of the next largest study) when computing the standard error variable used to weight each effect size.” (p. 13). Others can weigh in about this strategy but I tend to want to let the sample sizes “speak for themselves” in the analyses, especially when using a random-effects meta-analysis model.
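To illustrate what is at stake in that capping decision, here is a rough sketch of how a study's inverse-variance weight changes when its N is capped, using the standard large-sample variance of d and assuming equal group sizes (I use the d = .02, N = 508 study from my summary above; this is a fixed-effect weight for simplicity, whereas a random-effects weight would add a between-study variance term to each denominator):

```python
import math

def var_d(n1, n2, d):
    # Standard large-sample variance of Cohen's d.
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

# Weight the d = .02, N = 508 study would receive with its true N
# versus with N capped at 180 (equal group sizes assumed).
w_true = 1 / var_d(254, 254, 0.02)
w_capped = 1 / var_d(90, 90, 0.02)
print(round(w_true), round(w_capped))  # → 127 45
```

Capping cuts this large null study's weight by roughly two thirds. And because a random-effects model already shrinks the relative influence of very large studies through its between-study variance term, the extra capping step seems unnecessary to me.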
What’s it all mean?
Not to bring out the cliché, but I think much more work needs to be done here. As it stands, I think the d = .35 imagined contact effect size estimate is probably upwardly biased. Indeed, Miles and Crisp (2014) found evidence of publication bias such that unpublished studies yielded a smaller overall effect size estimate than published studies (though the unpublished studies still produced an estimate reliably larger than zero). However this shakes out, researchers are well advised to use much larger sample sizes than those that currently characterize this literature, based on my summary of the sample sizes in Miles and Crisp (2014). I also think more work needs to be done to evaluate the specific Birtel and Crisp (2012) effects. We have now collected two more unpublished studies with even bigger sample sizes and have yet to obtain effect sizes that approximate the original report.
I want to close by trying to clarify my position. I am not saying that the effect sizes in question are zero or that this is an unimportant research area. On the contrary, I think this is an incredibly important topic and thus it requires even greater attention to statistical power and precision.
Updated 26 Feb 2014: I corrected the sample size from study 1 from 204 to 240.