I decided to adopt my own 5% suggestion: dedicate a relatively small percentage of one’s research time to replication efforts. In other words, I spent time this summer and fall working on several replication studies. Truth be told, these efforts amounted to more than 5% of my time, but they have been fruitful in terms of papers. Moreover, these replication attempts were collaborative efforts with graduate students (and some undergraduates). My major role was often as data analyst. Indeed, I like to independently analyze the data to make sure that I come up with the same results. I also find data analysis inherently enjoyable and far more exciting than writing papers, editing papers, or dealing with committees. So basically I got to have fun and perhaps make a few modest contributions to the literature.
One replication effort concerned whether we could duplicate the results of a 2008 Psychological Science paper about the impact of cleanliness on moral judgments. The original finding was that participants primed with cleanliness were less harsh in their moral judgments than control participants. The first two authors on our replication paper are outstanding graduate students in my department (Johnson, Cheung, & Donnellan, in press). For those not interested in gory details: We obtained much smaller estimates than the original paper.
The basic design: Prime/induce cleanliness in one group and use a second group as a control condition. Compare ratings on a series of six moral vignettes that are aggregated into a composite rating. In sum, the IV is cleanliness/no cleanliness and the DV is ratings of wrongness on moral vignettes. The scale for the moral ratings is such that higher scores reflect harsher judgments. Negative effect size estimates indicate that those in the clean group were less harsh (on average) than those in the control group. Materials were obtained from the original researchers and we tried to follow their procedures as closely as possible. All participants were college students.
In Study 1 of the original paper, cleanliness was primed with a scrambled sentence task (N = 40). The cleanliness priming effect was close to the magic p < .05 threshold for the composite DV (d = -.60, 95% CI = -1.23 to .04). It was hard to see that the overall effect was p = .064 in the published paper because the authors reported the dreadful p-rep statistic, so we computed the standard p-value for our report. In other words, those in the cleanliness prime condition rated the vignettes as less wrong than the control group, but only at p = .064. We used the same materials with 208 participants and got an effect size estimate very close to zero (d = -.01, 95% CI = -.28 to .26). One caveat is that there is an additional study on the PsychFileDrawer website that found an effect consistent with the original using a one-tailed significance test (estimated d = -.47, N = 60).
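For readers who want to check intervals like these, here is a minimal sketch of how a 95% CI can be approximated from nothing more than d and the total N. It assumes two equal-sized groups and the usual large-sample normal approximation for the variance of d; neither assumption comes from the papers themselves, and the function name is just for illustration. Exact (noncentral t) intervals can differ slightly in the second decimal.

```python
import math

def d_confidence_interval(d, n_total, z_crit=1.96):
    """Approximate 95% CI for Cohen's d from d and total N.

    Assumption (not from the original papers): the two groups are
    equal in size, and the large-sample normal approximation holds.
    """
    n1 = n2 = n_total / 2.0
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2))
    se = math.sqrt(var_d)
    return d - z_crit * se, d + z_crit * se

# Effect sizes and sample sizes as reported for Study 1
original = d_confidence_interval(-0.60, 40)
replication = d_confidence_interval(-0.01, 208)

print("original:    %.2f to %.2f" % original)
print("replication: %.2f to %.2f" % replication)
```

Under those assumptions the intervals land within rounding distance of the reported ones, which is a handy sanity check on the numbers.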
In Study 2 of the original paper, cleanliness was induced with hand washing (N = 43). Participants were shown a disgusting video clip and then assigned to condition. Participants completed ratings of the same vignettes used in Study 1 but with a slightly different response scale. The effect on the overall composite passed the p < .05 threshold in the original publication (d = -.85, 95% CI = -1.47 to -.22) but not in our replication study with 126 participants (d = .01, 95% CI = -.34 to .36).
We conducted both studies in person like the original authors and the package is now in press for the special issue of Social Psychology about replication studies. This means that everything was preregistered and all of the data, materials, and proposal are posted here: http://osf.io./project/zwrxc/.
Study 1 Redux (Unpublished):
After finishing the package, we decided to try to conduct a much larger replication attempt of Study 1 but using the internet to facilitate data collection. I like narrow confidence intervals so I wanted to try for a very large sample size. A larger sample size would also facilitate tests of moderation. Using the internet would make data collection easier but it might impair data quality. (I have ideas about dealing with that issue but those will appear in another paper, hopefully!) We excluded anyone who said they made up their responses or did not answer honestly. Our sample size for this replication attempt was 731 college students. As with our in-press study, the effect of the sentence unscrambling task on the composite DV was not statistically detectable (t = 0.566, df = 729, p = .578). The d was quite small (d = .04, 95% CI = -.10 to .19) and the entire 95% CI fell below the “magic” |.20| threshold for a so-called small effect. The most interesting finding (to me) was that the Honesty-Humility scale of the HEXACO (Lee & Ashton, 2004) was the best predictor of the moral composite ratings across conditions (r = .356, 95% CI = .291 to .418, p < .05, N = 730) out of the four individual difference measures we included at the end of the study (the other three were a disgust scale, a bodily self-consciousness scale, and a single-item liberalism-conservatism scale). No individual difference we included moderated the null effect of condition (same for gender). So we tried to find moderators, honestly.
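Two of the numbers above are easy to reconstruct by hand, and a short sketch may help readers who want to check them. The first function converts the t statistic to d assuming equal group sizes (an assumption on my part; the actual split is in the posted data). The second builds a CI for a correlation with the standard Fisher z transformation. Function names are purely illustrative.

```python
import math

def d_from_t(t, n_total):
    """Cohen's d from an independent-samples t.

    Assumption: equal group sizes, so
    d = t * sqrt(1/n1 + 1/n2) = 2t / sqrt(N).
    """
    return 2.0 * t / math.sqrt(n_total)

def r_confidence_interval(r, n, z_crit=1.96):
    """95% CI for a correlation via the Fisher z transformation."""
    z = math.atanh(r)               # Fisher z
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

d_online = d_from_t(0.566, 731)
r_lo, r_hi = r_confidence_interval(0.356, 730)

print("d from t: %.3f" % d_online)
print("r CI: %.3f to %.3f" % (r_lo, r_hi))
```

With the reported t and N this gives a d of roughly .04, and the Fisher interval for the Honesty-Humility correlation comes out at about .29 to .42, in line with the values above.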
We have now struck out twice in trying to find the original Study 1 effect. I did a quick and dirty random effects meta-analysis using all four attempts to duplicate Study 1 (the original, the one on the PsychFileDrawer website, and our two studies). The point estimate for the d was -.130 (95% CI = -.377 to .117, p = .303) and the estimate was even closer to zero using a common effect (or fixed effect) model (d = -.022, 95% CI = -.144 to .099, p = .718). I will provide an excerpt from the in-press paper as I think the message is useful to ward off the critics who will blast our research skills and accuse me of some sort of motivated bias…
The current studies suggest that the effect sizes surrounding the impact of cleanliness on moral judgments are probably smaller than the estimates provided by the original studies…It is critical that our work is not considered the last word on the original results…. More broadly, we hope that researchers will continue to evaluate the emotional factors that contribute to moral judgments.
So more research is probably needed to better understand this effect [Don’t you just love Mom and Apple Pie statements!]. However, others can dedicate their time and resources to this effect. We gave it our best shot and pretty much encountered an epic fail, as my 10-year-old would say. My free piece of advice is that others should use very large samples and plan for small effect sizes.
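For anyone who wants to redo the quick and dirty meta-analysis of the four Study 1 attempts mentioned earlier, here is a rough sketch. It rests on several assumptions that are mine rather than anything documented above: equal group sizes in every study, the large-sample variance formula for d, the rounded d values as reported in this post, and the DerSimonian-Laird estimator for the between-study variance (the post does not say which estimator or software was used).

```python
import math

# (label, d, total N) as reported in this post; d values are rounded
studies = [
    ("original", -0.60, 40),
    ("PsychFileDrawer", -0.47, 60),
    ("in-press replication", -0.01, 208),
    ("online replication", 0.04, 731),
]

def d_variance(d, n_total):
    """Large-sample variance of d, assuming two equal-sized groups."""
    n = n_total / 2.0
    return 2.0 / n + d ** 2 / (2.0 * n_total)

def pooled(ds, variances, tau2=0.0):
    """Inverse-variance pooled estimate with a 95% CI.

    tau2 = 0 gives the common (fixed) effect model; tau2 > 0 gives
    a random effects model.
    """
    weights = [1.0 / (v + tau2) for v in variances]
    est = sum(w * d for w, d in zip(weights, ds)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return est, est - 1.96 * se, est + 1.96 * se

def dersimonian_laird_tau2(ds, variances):
    """Between-study variance via DerSimonian-Laird (assumed estimator)."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * di for wi, di in zip(w, ds)) / sum(w)
    q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, ds))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(ds) - 1)) / c)

ds = [d for _, d, _ in studies]
variances = [d_variance(d, n) for _, d, n in studies]

fixed = pooled(ds, variances)
tau2 = dersimonian_laird_tau2(ds, variances)
random_effects = pooled(ds, variances, tau2)

print("common effect:  d = %.3f (%.3f to %.3f)" % fixed)
print("random effects: d = %.3f (%.3f to %.3f)" % random_effects)
```

Under these assumptions the pooled estimates land very close to the ones reported above, and the gap between the two models makes sense: the random effects model spreads the weights more evenly, so the two small studies with large negative estimates pull the pooled d further from zero.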
Note: Thanks to David Johnson and Felix Cheung. I deserve any criticism and they deserve any credit.