Go Big or Go Home – A Recent Replication Attempt

I decided to adopt my own 5% suggestion: dedicate a relatively small percentage of one’s research time to replication efforts. In other words, I spent time this summer and fall working on several replication studies. Truth be told, these efforts amounted to more than 5% of my time, but they have been fruitful in terms of papers. Moreover, these replication attempts were collaborative efforts with graduate students (and some undergraduates). My major role was often as data analyst. Indeed, I like to independently analyze the data to make sure that I come up with the same results. I also find data analysis inherently enjoyable and far more exciting than writing papers, editing papers, or dealing with committees. So basically I got to have fun and perhaps make a few modest contributions to the literature.

One replication effort concerned whether we could duplicate the results of a 2008 Psychological Science paper about the impact of cleanliness on moral judgments. The original finding was that participants primed with cleanliness were less harsh in their moral judgments than control participants. The first two authors on our replication paper are outstanding graduate students in my department (Johnson, Cheung, & Donnellan, in press). For those not interested in gory details: We obtained much smaller estimates than the original paper.

Basic design:

Prime/induce cleanliness in one group and use a second group as a control condition. Compare ratings on a series of six moral vignettes that are aggregated into a composite rating. In sum, the IV is cleanliness/no cleanliness and the DV is ratings of wrongness on moral vignettes. The scale for the moral ratings is such that higher scores reflect harsher judgments. Negative effect size estimates indicate that those in the clean group were less harsh (on average) than those in the control group. Materials were obtained from the original researchers and we tried to follow their procedures as closely as possible. All participants were college students.

Study 1:

Cleanliness was primed with a scrambled sentence task in the original study (N = 40). The cleanliness priming effect was close to the magic p < .05 threshold for the composite DV (d = -.60, 95% CI = -1.23 to .04). It was hard to see that the overall effect was p = .064 in the published paper because the authors reported the dreadful p-rep statistic, so we computed the standard p-value for our report. In other words, those in the cleanliness prime condition rated the vignettes as less wrong than the control group, but the difference did not reach the conventional threshold. We used the same materials with 208 participants and got an effect size estimate very close to zero (d = -.01, 95% CI = -.28 to .26). One caveat is that there is an additional study on the Psych File Drawer website that found an effect consistent with the original using a one-tailed significance test (estimated d = -.47, N = 60).
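For readers who want to check the intervals, here is a minimal sketch of the usual large-sample approximation for a 95% CI around Cohen's d. It assumes equal group sizes (20/20 and 104/104), which the post does not report, so the function name and the splits are illustrative only. The published intervals may have been computed with a different (for example, noncentral t) method, so small differences in the second decimal are expected.

```python
import math


def d_confidence_interval(d, n1, n2, z_crit=1.96):
    """Approximate CI for Cohen's d using the large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z_crit * se, d + z_crit * se


# Original Study 1: d = -.60, N = 40 (ASSUMED split of 20 per group)
orig_lo, orig_hi = d_confidence_interval(-0.60, 20, 20)

# Replication of Study 1: d = -.01, N = 208 (ASSUMED split of 104 per group)
rep_lo, rep_hi = d_confidence_interval(-0.01, 104, 104)

print(f"original:    {orig_lo:.2f} to {orig_hi:.2f}")
print(f"replication: {rep_lo:.2f} to {rep_hi:.2f}")
```

The width of the interval is driven almost entirely by sample size, which is why the replication interval is so much narrower than the original one.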

Study 2:

Cleanliness was induced with hand washing in the original study (N = 43). Participants were shown a disgusting video clip and then assigned to condition. Participants completed ratings of the same vignettes used in Study 1 but with a slightly different response scale. The effect on the overall composite passed the p < .05 threshold in the original publication (d = -.85, 95% CI = -1.47 to -.22) but not in our replication study with 126 participants (d = .01, 95% CI = -.34 to .36).

We conducted both studies in person like the original authors and the package is now in press for the special issue of Social Psychology about replication studies. This means that everything was preregistered and all of the data, materials, and proposal are posted here: http://osf.io./project/zwrxc/.

Study 1 Redux (Unpublished):

After finishing the package, we decided to try to conduct a much larger replication attempt of Study 1 but using the internet to facilitate data collection. I like narrow confidence intervals so I wanted to try for a very large sample size. A larger sample size would also facilitate tests of moderation. Using the internet would make data collection easier but it might impair data quality. (I have ideas about dealing with that issue but those will appear in another paper, hopefully!) We excluded anyone who said they made up their responses or did not answer honestly. Our sample size for this replication attempt was 731 college students. As with our in press study, the effect of the sentence unscrambling task on the composite DV was not statistically detectable (t = 0.566, df = 729, p = .578). The d was quite small (d = .04, 95% CI = -.10 to .19) and the entire 95% CI fell below the “magic” |.20| threshold for a so-called small effect. The most interesting finding (to me) was that the Honesty-Humility scale of the HEXACO (Lee & Ashton, 2004) was the best predictor of the moral composite ratings across conditions (r = .356, 95% CI = .291 to .418, p < .05, N = 730) out of the four individual difference measures we included at the end of the study (the others were a disgust scale, a bodily self-consciousness scale, and a single-item liberalism-conservatism scale). No individual difference we included moderated the null effect of condition (same for gender). So we tried to find moderators, honestly.
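Two of the numbers above can be recovered from the reported statistics with standard textbook formulas. The sketch below converts the t statistic to Cohen's d and builds the CI for the correlation with the Fisher z transformation. The near-equal split of the 731 participants (366/365) is an assumption, since group sizes are not given here, and the function names are made up for illustration.

```python
import math


def d_from_t(t, n1, n2):
    """Cohen's d implied by an independent-samples t statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)


def r_confidence_interval(r, n, z_crit=1.96):
    """CI for a Pearson correlation via the Fisher z transformation."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# t = 0.566 with df = 729 implies N = 731; the 366/365 split is ASSUMED
d = d_from_t(0.566, 366, 365)

# Honesty-Humility correlation: r = .356, N = 730
r_lo, r_hi = r_confidence_interval(0.356, 730)

print(f"d = {d:.2f}")
print(f"r CI = {r_lo:.3f} to {r_hi:.3f}")
```

Under these assumptions the sketch lands on the same d and essentially the same correlation interval as reported in the text.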

We have struck out twice now to find the original Study 1 effect.  I did a quick and dirty random effects meta-analysis using all four attempts to duplicate Study 1 (the original, the one on the PsychFile drawer, and our two studies).  The point estimate for the d was -.130 (95% CI = -.377 to .117, p = .303) and the estimate was even closer to zero using a common effect (or fixed effect) model (d = -.022, 95% CI = -.144 to .099, p = .718).  I will provide an excerpt from the in press paper as I think the message is useful to ward off the critics who will blast our research skills and accuse me of some sort of motivated bias…
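A quick and dirty meta-analysis of this kind can be sketched in a few lines. The version below uses inverse-variance weights for the common effect model and the DerSimonian-Laird estimator of between-study variance for the random effects model. That estimator is an assumption on my part, since the text does not say which one was used, and the group sizes are assumed to be (nearly) equal within each study. The d values are the rounded ones quoted in the post.

```python
import math
from statistics import NormalDist


def d_variance(d, n1, n2):
    """Large-sample sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def pooled(effects, weights):
    """Weighted mean with its 95% CI and two-sided p-value."""
    total = sum(weights)
    est = sum(w * d for w, d in zip(weights, effects)) / total
    se = 1 / math.sqrt(total)
    p = 2 * (1 - NormalDist().cdf(abs(est / se)))
    return est, est - 1.96 * se, est + 1.96 * se, p


def meta_analysis(effects, variances):
    """Common effect and DerSimonian-Laird random effects estimates."""
    w = [1 / v for v in variances]
    fixed = pooled(effects, w)
    # Cochran's Q and the method-of-moments estimate of tau^2
    q = sum(wi * (d - fixed[0]) ** 2 for wi, d in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    w_re = [1 / (v + tau2) for v in variances]
    return {"fixed": fixed, "random": pooled(effects, w_re), "tau2": tau2}


# (d, n1, n2): original, Psych File Drawer, in-press replication, internet study.
# Group splits are ASSUMED to be equal or near-equal.
studies = [(-0.60, 20, 20), (-0.47, 30, 30), (-0.01, 104, 104), (0.04, 366, 365)]
effects = [s[0] for s in studies]
variances = [d_variance(*s) for s in studies]

result = meta_analysis(effects, variances)
for model in ("random", "fixed"):
    est, lo, hi, p = result[model]
    print(f"{model}: d = {est:.3f}, 95% CI = {lo:.3f} to {hi:.3f}, p = {p:.3f}")
```

The random effects model gives the two small original-direction studies relatively more weight than the common effect model does, which is why its point estimate sits further from zero.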

The current studies suggest that the effect sizes surrounding the impact of cleanliness on moral judgments are probably smaller than the estimates provided by the original studies…It is critical that our work is not considered the last word on the original results…. More broadly, we hope that researchers will continue to evaluate the emotional factors that contribute to moral judgments.

So more research is probably needed to better understand this effect [Don’t you just love Mom and Apple Pie statements!]. However, others can dedicate their time and resources to this effect. We gave it our best shot and pretty much encountered an epic fail, as my 10-year-old would say. My free piece of advice is that others should use very large samples and plan for small effect sizes. [Note: The epic fail comment was ill advised. I apologize.]
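To put a rough number on "very large samples", here is a minimal sketch of the normal-approximation sample size formula for a two-group comparison. The choices of 80% power and a two-tailed alpha of .05 are conventional defaults, not values taken from the studies above, and the function name is illustrative.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect effect size d (two-tailed test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Planning for the conventional "small" effect versus the original estimate
print(n_per_group(0.20))  # 393 per group
print(n_per_group(0.60))  # 44 per group
```

Planning around d = .20 rather than d = .60 raises the required sample roughly ninefold, because the required n scales with 1/d².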

Note: Thanks to David Johnson and Felix Cheung.  I deserve any criticism and they deserve any credit.

Update (2 Jan 2014).  David placed the write-up on the Psych File Drawer website. Plus the data are available as well.

David Johnson, Felix Cheung, Brent Donnellan. Cleanliness primes do not influence moral judgment. (2014, January 01). Retrieved 10:05, January 02, 2014 from http://www.PsychFileDrawer.org/replication.php?attempt=MTcy

Update (23 May 2014). Identified passages that were problematic.



8 thoughts on “Go Big or Go Home – A Recent Replication Attempt”

  1. Pingback: Data for Schnall et al. (2008) available | Thinking is For Doing

  2. Pingback: Replication studies, ceiling effects, and the psychology of science | Is Nerd

  3. The title of this post seems ill-advised. 126 subjects hardly puts you in the realm of saying that you completely eschewed type II error. Moreover, small research is there to suggest what big research projects should spend their time studying.
    I think Dan Gilbert is an ass and wrong about replication but I don’t think you handled the way you presented this data very well.

    • Hi – I agree about the 126 and I have apologized for the title. In our defense, we collected the additional internet-based data for Study 1 to increase our power and thus reduce the Type II error rate for the particular effect studied in Study 1. The size of the unpublished data was the origin of the title in case you are wondering (N roughly 730). Note that I emphasized effect size estimates and CIs in everything above. I was not hiding the imprecision in the point estimates. We redid Study 1 at a large scale to see if we could detect a smaller effect size. We wanted more power. I love power. Trust me. The N = 4 (95% CI = 2 to 6) regular readers of this blog can attest to my sample size rants.

      Study 2 is harder to implement at a large scale. (It is also more subject to charges of experimenter bias: the RAs know who is washing hands in this paradigm. I am too dumb to think of a way to have the Ps wash hands without RAs knowing it while making sure there is fidelity to the manipulation in a way that would make it reasonably efficient to collect a large sample. If we were to redo Study 2 people would say I was biased against getting the effect even though our RAs were blind to the hypothesis.)

      So for productive next steps, I think an in-lab large scale version of Study 1 is the way to go. It is easy to make it double-blind with packets. The criticism of the internet stuff is reasonable to a degree I guess. However, there are concerns about the DVs so thinking of additional moral stories and adjusting response scales in some conditions would be productive. So I think there are lots of productive directions to take Study 1 to test ideas about sample moderators and measurement concerns. But I stand by my concern for having a large sample because I think the effect sizes are going to be smaller than was found in the original. Getting the DV issues dealt with in Study 1 will make future attempts of Study 2 better presuming researchers use the same DVs.

      Again I tried to emphasize the need for more research with larger samples to avoid Type II error in the above post. Some folks actually hate Type I/II distinctions and prefer to think about errors of sign and errors of magnitude. I am sympathetic to wanting to nail down precise effect size estimates so I do worry about errors of magnitude when it comes to estimating the effect size.

      Sorry to be long-winded. It is actually fun to talk about research issues rather than some of the other stuff in this broader discussion.

  4. Pingback: Friday links: does Gaad exist, stories behind classic ecology papers, evolution of chess, and more | Dynamic Ecology

  5. Pingback: There is no ceiling effect in Johnson, Cheung, & Donnellan (2014) | [citation needed]

  6. Pingback: Almost no education research is replicated, new article shows @insidehighered | To Talk Like This and Act Like That
