Full Disclosure: I am second author on the McDonald et al. (2014) commentary.

Some of you may have seen that *Psychological Science* published our commentary on the Birtel and Crisp (2012) paper. Essentially we tried to replicate two of their studies with larger sample sizes (29 versus 240 and 32 versus 175, respectively) and obtained much lower effect size estimates. It is exciting that *Psychological Science* published our work and I think this is a hint of positive changes for the field. Hopefully nothing I write in this post undercuts that overarching message.

I read the Crisp and Birtel response and I had a set of responses (shocking, I know!). I think it is fair that they get the last word in print, but I had some reactions that I wanted to share. Thus, I will air a few of them in this blog post. Before diving into issues, I want to reiterate the basic take-home message of McDonald et al. (2014):

*“Failures to replicate add important information to the literature and should be a normal part of the scientific enterprise. The current study suggests that more work is needed before Birtel and Crisp’s procedures are widely implemented. Interventions for treating prejudice may require more precise manipulations along with rigorous evaluation using large sample sizes.” (p. xx)*

1. **Can we get a mulligan on our title?** We might want to revise the title of our commentary to make it clear that our efforts applied to only two specific findings in the original Birtel and Crisp (2012) paper. I think we were fairly circumscribed in the text itself, but the title might have opened the door for how Crisp and Birtel (2014) responded. They basically thanked us for our efforts and pointed out that our two failures to replicate say nothing about the entire imagined contact hypothesis. They even argued that we “overgeneralized” our findings to the entire imagined contact literature. To be frank, I do not think they were being charitable to our piece with this criticism because we did not make this claim in the text. But titles are important and our title might have suggested some sort of overgeneralization. I will let readers make their own judgments. Regardless, I wish we had made the title more focused.

2. **If you really believe the *d* is somewhere around .35, why were the sample sizes so small in the first place?** A major substantive point in the Crisp and Birtel (2014) response is that the overall *d* for the imagined contact literature is somewhere around .35 based on a recent Miles and Crisp (2014) meta-analysis. That is a reasonable point, but I think it actually undercuts the Birtel and Crisp (2012) paper and makes our take-home point for us (i.e., the importance of using larger sample sizes in this literature). None of the original Birtel and Crisp (2012) studies had anywhere near the power to detect a population *d* of .35. For a simple two-group independent *t*-test design, achieving power of .80 requires about 260 participants (130 in each group). The largest sample size in Birtel and Crisp (2012) was 32.

3. **What about the ManyLabs paper?** The now-famous ManyLabs paper of Klein et al. (in press) reports a replication attempt of an imagined contact study (Study 1 in Husnu & Crisp, 2010). The ManyLabs effort yielded a much lower effect size estimate (*d* = .13, *N* = 6,336) than the original report (*d* = .86, or .84 as reported in Miles & Crisp, 2014; *N* = 33). This is quite similar to the pattern we found in our work. Thus, I think something of a decline effect is in operation. There is a big difference in interpretation between a *d* of .80 and a *d* around .15. This should be worrisome to the field, especially when researchers begin to think about the applied implications of this kind of work.

4. **What about the Miles and Crisp (2014) meta-analysis?** I took a serious look at the Miles and Crisp meta-analysis and I basically came away with the sinking feeling that much more research needs to be done to establish the magnitude of imagined contact effects. Many of the studies used in the meta-analysis were grossly underpowered. There were 71 studies and only 2 had sample sizes above 260 (the threshold for having a good chance of detecting a *d* = .35 effect using the standard between-participants design). Those two large studies yielded basically null effects for the imagined contact hypothesis (*d* = .02 and .05, *n*s = 508 and 488, respectively). The average sample size of the studies in the meta-analysis was 81 (81.27 to be precise) and the median was 61 (Min. = 23, Max. = 508). A sample size of 123 was at the 90th percentile (i.e., 90% of the samples were below 123) and nearly 80% of the studies had sample sizes below 100.
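For readers who want to check the sample-size arithmetic in points 2 and 4, here is a minimal sketch using the standard normal approximation to the two-sample *t*-test (it slightly underestimates the exact *t*-based numbers; assumes SciPy is available):

```python
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided independent t-test."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * (z_a + z_b) ** 2 / d ** 2

def approx_power(d, n_total, alpha=0.05):
    """Approximate power for a total N split into two equal groups."""
    ncp = d * (n_total / 4) ** 0.5  # noncentrality parameter for n/2 per group
    return norm.cdf(ncp - norm.ppf(1 - alpha / 2))

print(round(n_per_group(0.35)))          # ≈ 128 per group, i.e. roughly 260 total
print(round(approx_power(0.35, 61), 2))  # median study in the meta-analysis: ≈ 0.28
print(round(approx_power(0.35, 32), 2))  # largest Birtel & Crisp (2012) sample: ≈ 0.17
```

In other words, the median study in the meta-analysis had barely better than one-in-four odds of detecting a true *d* of .35.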

Miles and Crisp (2014) were worried about sample size, but perhaps not in the ways that I might have liked. Here is what they wrote: “However, we observed that two studies had a sample size over 6 times the average (Chen & Mackie, 2013; Lai et al., 2013). To ensure that these studies did not contribute disproportionately to the summary effect size, we capped their sample size at 180 (the size of the next largest study) when computing the standard error variable used to weight each effect size.” (p. 13). Others can weigh in about this strategy, but I tend to want to let the sample sizes “speak for themselves” in the analyses, especially when using a random-effects meta-analysis model.
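To make concrete what capping does, here is a sketch of an inverse-variance weight for Cohen's *d* with and without a cap on *N*. This uses fixed-effect weights for simplicity (the actual meta-analysis used a random-effects model, which adds a between-study variance term to each standard error), and the example numbers are the *d* = .02, *N* = 508 study mentioned above:

```python
def se_d(d, n1, n2):
    """Standard error of Cohen's d for two independent groups."""
    return ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5

def inv_var_weight(d, n_total, cap=None):
    """Inverse-variance weight, optionally capping total N; equal group sizes assumed."""
    n = min(n_total, cap) if cap is not None else n_total
    return 1 / se_d(d, n / 2, n / 2) ** 2

# One of the two large near-null studies: d = .02, N = 508
print(round(inv_var_weight(0.02, 508)))           # uncapped weight ≈ 127
print(round(inv_var_weight(0.02, 508, cap=180)))  # capped at N = 180: weight ≈ 45
```

Capping at 180 cuts the large study's weight to roughly a third, which is why the choice matters in principle even if, as the authors note below, it barely moved the summary estimate in this case.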

**What’s it all mean?**

Not to trot out the cliché, but I think much more work needs to be done here. As it stands, I think the *d* = .35 imagined contact effect size estimate is probably upwardly biased. Indeed, Miles and Crisp (2014) found evidence of publication bias such that unpublished studies yielded a smaller overall effect size estimate than published studies (though the unpublished studies still produced an estimate reliably larger than zero). However this shakes out, researchers are well advised to use much larger sample sizes than those that have characterized this literature, based on my summary of the sample sizes in Miles and Crisp (2014). I also think more work needs to be done to evaluate the specific Birtel and Crisp (2012) effects. We have now collected two more unpublished studies with even bigger sample sizes and we have yet to obtain effect sizes that approximate the original report.

I want to close by trying to clarify my position. I am not saying that the effect sizes in question are zero or that this is an unimportant research area. On the contrary, I think this is an incredibly important topic and thus it requires even greater attention to statistical power and precision.

Updated 26 Feb 2014: I corrected the sample size for Study 1 from 204 to 240.

They really wrote, “we capped their sample size at 180 (the size of the next largest study) when computing the standard error variable used to weight each effect size”? Good grief. I could have accepted that they made some errors in good faith, but this is beyond the pale.

As one of the authors of that meta-analysis, I just wanted to clarify why we decided to cap the two sample sizes (section 4 / the previous comment). The original version we submitted did not include the Lai et al. study, so when I looked over the distribution of sample sizes, there was just a single outlying study with a much larger sample size than the rest. While I agree there’s a point to be made about many of the existing studies being underpowered, I still needed to figure out the fairest way to incorporate these studies together with one larger study, and the reason I decided to cap the sample size is that I wanted to reassure readers that the effect sizes we reported weren’t unduly driven by single effects (for the same reason, I would have capped/Winsorized outlying effect sizes too, but there weren’t any). The effect on the overall estimate was so small it was eaten by rounding error, so this decision had no effect on our findings.

When we added in the Lai et al. study at the last minute, I retained this approach, but I agree it no longer really makes sense given the ManyLabs study. Given that the sample size capping makes no difference to the numbers themselves (I just re-ran the analyses with the sample sizes of those two studies uncapped, and it changes the effect size by -.002; it still rounds to 0.35), I now feel it would have been more transparent just to leave the sample sizes as they were. So I fully accept your point here. The scale of these new replication studies has really changed the game.

I completely agree with the points in the final section. The estimated effect size in our meta-analysis is our best guess of the true effect size of imagined contact based on the research that has been done so far, and as the future studies you recommend are carried out, this estimate will inevitably change. However, I still think it’s useful to have the estimate of 0.35 right now, even if it’s qualified by the existence of publication bias and criticisms of the size of the studies. Even just from a practical point of view, it gives people an idea of the sort of sample size they will need to obtain in order to demonstrate the effect, as you point out.

While I have to admit to some reservations about addressing the sample size problem by running large, simplified online studies, your general point about sample size is exactly right. I know the future research I am planning includes much larger sample sizes than I have used in the past (I saw a great talk by Uri Simonsohn at the SPSP convention last week on choosing sample sizes, which I think will be made available online soon… the take-home message was to collect as much data as possible, unsurprisingly).

Hi Greg – Yes. That was a direct quotation from the paper. The other Brent (i.e., the big name personality Brent) has been experimenting with different models for their Table 1.