Apology

There has been a lot of commentary about the tone of my 11 December 2013 blog post. I’ve tried to keep a relatively low profile during the events of the last week.  It has been one of the strangest weeks of my professional life. However, it seems appropriate to make a formal apology.

1. I apologize for the title. I intended it as a jokey reference to the need to conduct high-powered replication studies. It was ill-advised.

2. I apologize for the now infamous “epic fail” remark (“We gave it our best shot and pretty much encountered an epic fail as my 10-year-old would say”). It was poor form and contributed to hurt feelings. I should have been more thoughtful.

I will do my best to uphold the virtues of civility in future blog posts.

-brent donnellan


Random Reflections on Ceiling Effects and Replication Studies

In a blog post from December of 2013, I described our attempts to replicate two studies testing the claim that priming cleanliness makes participants less judgmental on a series of six moral vignettes. My original post has recently received criticism for its timing and its tone. In terms of timing, I blogged about a paper that had been accepted for publication, and there was no embargo on the work. In terms of tone, I tried to ground everything I wrote in data, but I also editorialized a bit. It can be hard to know what might be taken as offensive when you are describing an unsuccessful replication attempt. The title (“Go Big or Go Home – A Recent Replication Attempt”) might have been off-putting in hindsight. In the grand scope of discourse in the real world, however, I think my original blog post was fairly tame.

Most importantly: I was explicit in the original post about the need for more research. I will state again for the record: I don’t think this matter has been settled and more research is needed. We also said this in the Social Psychology paper.  It should be widely understood that no single study is ever definitive.

As noted in the recent news article in Science about the special issue of Social Psychology, there is some debate about ceiling effects in our replication studies. We discuss this issue at some length in our rejoinder to the commentary. I will provide some additional context and observations in this post. Readers just interested in the gory details can skip to #4. This is a long and tedious post, so I apologize in advance.

1. The original studies had relatively small sample sizes. There were 40 total participants in the original scrambled sentence study (Study 1) and 43 total participants in the original hand washing study (Study 2). It takes 26 participants per cell to have an approximately 80% chance of detecting a d of .80 with alpha set to .05 using a two-tailed significance test. A d of .80 would be considered a large effect size in many areas of psychology.
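The 26-per-cell figure is easy to check. Here is a minimal sketch of that power calculation using the noncentral t distribution in scipy; the function name is mine, and this is just one of several ways to run the computation (dedicated tools like G*Power will give the same answer).

```python
from scipy import stats


def n_per_cell(d, alpha=0.05, power=0.80):
    """Smallest n per group for a two-sample, two-tailed t-test
    (equal group sizes) to reach the target power at effect size d."""
    n = 2
    while True:
        df = 2 * n - 2                       # degrees of freedom
        nc = d * (n / 2) ** 0.5              # noncentrality parameter
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        # power = P(|T'| > t_crit) under the noncentral t
        achieved = (1 - stats.nct.cdf(t_crit, df, nc)
                    + stats.nct.cdf(-t_crit, df, nc))
        if achieved >= power:
            return n
        n += 1


print(n_per_cell(0.80))  # 26 participants per cell, as stated above
```

For context, a medium effect of d = .50 under the same settings requires 64 participants per cell, which is why small-sample studies are only well powered for large effects.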

2. The overall composite did not attain statistical significance using the conventional alpha level of .05 with a two-tailed test in the original Study 1 (p = .064).  (I have no special love for NHST but many people in the literature rely on this tool for drawing inferences).  Only one of the six vignettes attained statistical significance at the p < .05 level in the original Study 1 (Kitten). Two different vignettes attained statistical significance in the original Study 2 (Trolley and Wallet).  The kitten vignette did not. Effect size estimates for these contrasts are in our report.  Given the sample sizes, these estimates were large but they had wide confidence intervals.

3. The dependent variables were based on moral vignettes created for a different study originally conducted at the University of Virginia. These measures were originally pilot tested with 8 participants according to a PSPB paper (Schnall, Haidt, Clore, & Jordan, 2008, p. 1100). College students from the United States were used to develop the measures that served as the dependent variables. There was no a priori reason to think the measures would “not work” for college students from Michigan. We registered our replication plan, and Dr. Schnall was a reviewer on the proposal. No special concerns were raised about our procedures or the nature of our sample. Our sample sizes provided over .99 power to detect the original effect size estimates.

4. The composite DVs were calculated by averaging across the six vignettes, and those variables had fairly normal distributions in our studies. In Study 1, the mean for our control condition was 6.48 (SD = 1.13, Median = 6.67, Skewness = -.55, Kurtosis = -.24, n = 102) whereas it was 5.81 in the original paper (SD = 1.47, Median = 5.67, Skewness = -.33, Kurtosis = -.44, n = 20). The average was higher in our sample, but the scores theoretically range from 0 to 9. We found no evidence of a priming effect using the composites in Study 1. In Study 2, the mean for our control condition was 5.65 (SD = 0.59, Median = 5.67, Skewness = -.31, Kurtosis = -.19, n = 68) whereas it was 5.43 in the original paper (SD = 0.69, Median = 5.67, Skewness = -1.58, Kurtosis = 3.45, n = 22). The scores theoretically range from 1 to 7. We found no hand washing effect using the composites in Study 2. These descriptive statistics provide additional context for the discussion about ceiling effects. The raw data are posted, and critical readers can and should verify these numbers. I have a standing policy to donate $20 to the charity of choice of the first person who notes a significant (!) statistical mistake in my blog posts.
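For anyone checking these descriptives against the posted raw data, be aware that different packages use different skewness and kurtosis estimators. A quick sketch of the bias-corrected versions (the kind SPSS reports) in scipy; the toy scores below are made up for illustration, not our data:

```python
from scipy.stats import kurtosis, skew

# Stand-in scores for a composite DV; with the real data you would
# load the posted file and average the six vignette columns instead.
scores = [4.0, 5.5, 6.0, 6.5, 6.5, 7.0, 7.0, 7.5, 8.0, 8.5]

# bias=False gives the bias-corrected G1/G2 estimators (SPSS-style);
# fisher=True reports *excess* kurtosis, so a normal curve scores 0.
print("skewness:", round(skew(scores, bias=False), 2))
print("kurtosis:", round(kurtosis(scores, fisher=True, bias=False), 2))
```

Matching the estimator matters: scipy's defaults (`bias=True`) will not reproduce SPSS output, which is an easy way to "find" a discrepancy that isn't there.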

Schnall (2014) undertook a fairly intense screening of our data. This is healthy for the field, and the Open Science Framework facilitated this inquiry because we were required to post the data. Dr. Schnall noted that the responses to the individual moral vignettes tended toward the extreme in our samples. I think the underlying claim is that students in our samples were so moralistic that any cleanliness priming effects could not have overpowered their pre-existing moral convictions. This is what the ceiling effect argument translates to in real world terms: The experiments could not have worked in Michigan because the samples tended to have a particular mindset.

It might be helpful to be a little more concrete about the distributions. For many of the individual vignettes, the “Extremely Wrong” option was a common response. Below is a summary of the six vignettes and some descriptive information about the control-condition data from Study 1 in both studies (ours and the original). I think readers will have to judge for themselves what kinds of distributions to expect from samples of college students. Depending on your level of self-righteousness, these results could be viewed positively or negatively. Remember, we used their original materials.

  • Dog (53% versus 30%):  Morality of eating a pet dog that was just killed in a car accident.
  • Trolley (2% versus 5%):  Morality of killing one person in the classic trolley dilemma.
  • Wallet (44% versus 20%): Morality of keeping cash from a wallet found on the street.
  • Plane (43% versus 30%): Morality of killing an injured boy to save yourself and another person from starving after a plane crash.
  • Resume (29% versus 15%):  Morality of enhancing qualifications on a resume.
  • Kitten (56% versus 70%): Morality of using a kitten for sexual gratification.

Note: All comparisons are from the Control conditions for our replication Study 1 compared to Study 1 in Schnall et al. (2008).  Percentages reflect the proportion of the sample selecting the “extremely wrong” option (i.e., selecting the “9” on the original 0 to 9 scale).  For example, 53% of our participants thought it was extremely wrong for Frank to eat his dead dog for dinner whereas 30% of the participants in the original study provided that response.

To recap, we did not find evidence for the predicted effects, and we basically concluded that more research was necessary. Variable distributions are useful pieces of information, and non-parametric tests were consistent with the standard t-tests we used in the paper. Moreover, their kitten distribution was at least as extreme as ours, and yet they found the predicted result for this particular vignette in Study 1. Thus, I worry that the ceiling argument only applies when the results run counter to the original predictions.
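The consistency check mentioned above is simple to reproduce in spirit: run the standard t-test and its rank-based counterpart (the Mann-Whitney U test, which does not assume normality and is less sensitive to piled-up extreme responses) on the same two groups and see whether the conclusions agree. A sketch with made-up numbers, not our actual scores:

```python
from scipy.stats import mannwhitneyu, ttest_ind

# Hypothetical control vs. primed groups on a 0-9 composite;
# both pile up near the top, mimicking a skewed, ceiling-ish DV.
control = [6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.0, 9.0, 9.0, 9.0]
primed  = [6.0, 7.0, 7.5, 8.0, 8.5, 8.5, 9.0, 9.0, 9.0, 9.0]

t_stat, t_p = ttest_ind(control, primed)
u_stat, u_p = mannwhitneyu(control, primed, alternative="two-sided")

# With near-identical groups, both tests point the same way: no effect.
print(f"t-test p = {t_p:.3f}, Mann-Whitney p = {u_p:.3f}")
```

When the parametric and rank-based tests tell the same story, distributional quirks are a less plausible explanation for a null result.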

One reading of our null results is that there are unknown moderators of the cleanliness priming effects. We have tested for some moderators (e.g., private body consciousness, political orientation) in our replication report and rejoinder, but there could be other possibilities. For example, sample characteristics can make it difficult to find the predicted cleanliness priming results with particular measures.  If researchers have a sample of excessively moralistic/judgmental students who think using kittens for sexual gratification is extremely wrong, then cleaning primes may not be terribly effective at modulating their views. Perhaps a different set of vignettes that are more morally ambiguous (say more in line with the classic trolley problem) will show the predicted effects.  This is something to be tested in future research.

The bottom line for me is that we followed through on our research proposal and we reported our results. The raw data were posted. We have no control over the distributions. At the very least, researchers might need to worry about using this particular measure in the future based on our replication efforts. In short, the field may have learned something about how to test these ideas going forward. In the end, I come full circle to the original conclusion of the December blog post: more research is needed.

Postscript

I am sure reactions to our work and the respective back-and-forth will break on partisan grounds. The “everything is fine” crew will believe that Dr. Schnall demolished our work, whereas the “replication is important” crew will think we raised good points. This is all well and good as far as the inside baseball and the sort of political theater that exists in our world. However, I hope these pieces do not just leave a bad taste in people’s mouths. I feel bad that this single paper and exchange have diverted attention from the important example of reform set by Lakens and Nosek. They are helping to shape the broader narrative about how to do things differently in psychological science.

 

Quick Update on Timelines (23 May 2014)

David sent Dr. Schnall the paper we submitted to the editors on 28 October 2013 with a link to the raw materials. He wrote “I’ve attached the replication manuscript we submitted to Social Psychology based on our results to give you a heads up on what we found.”  He added: “If you have time, we feel it would be helpful to hear your opinions on our replication attempt, to shed some light on what kind of hidden moderators or other variables might be at play here.”

Dr. Schnall emailed back on 28 October 2013 asking for 2 weeks to review the material before we proceeded. David emailed back on 31 October 2013 apologizing for any miscommunication and explaining that we had already submitted the paper. He added that we were still interested in her thoughts.

That was the end of our exchanges. We learned about the ceiling effect concern when we received the commentary in early March of 2014.

Warm Water and Loneliness Again?!?!

Call me Captain Ahab…

This is a dead horse, but I finally got around to writing up some useful new data in this saga. Researchers at the University of Texas at Austin tried to replicate the basic survey findings in a large Introductory Psychology course back in the Fall of 2013. They emailed me the results in November, and they were consistent with the general null effects we had been getting in our work. I asked them if I could write it up for the Psychology File Drawer, and they were amenable. Here is a link to a more complete description of the results and here is a link to the PFD reference.

The basic details…

There was no evidence for an association between loneliness (M = 2.56; SD = .80, alpha = .85) and the Physical Warmth Index (r = -.03, p = .535, n = 365; 95% CI = -.14 to .07).  Moreover, the hypothesis relevant correlation between the water temperature item and the loneliness scale was not statistically distinguishable from zero (r = -.08, p = .141, n = 365, 95% CI = -.18 to .03).
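Confidence intervals like these can be reproduced with the standard Fisher r-to-z transformation. A minimal sketch (the function name is mine); for the water-temperature correlation it recovers roughly the reported interval, with small rounding differences possible depending on the exact method used:

```python
import math


def r_confidence_interval(r, n, z_crit=1.96):
    """95% CI for a Pearson correlation via the Fisher z transform."""
    z = math.atanh(r)                    # map r to the z metric
    se = 1 / math.sqrt(n - 3)            # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # map back to the r metric


lo, hi = r_confidence_interval(-0.08, 365)
print(f"95% CI: {lo:.2f} to {hi:.2f}")  # about -.18 to .02
```

Note how wide the interval is even with n = 365; intervals this wide are exactly why single studies, null or not, should not be treated as definitive.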

One possible issue is that the U of T crew used a short 3-item measure of loneliness developed for large-scale survey work, whereas the other studies have used longer measures. Fortunately, other research suggests this short measure correlates above .80 with the parent instrument, so I do not think this is a major limitation. But I can see others holding a different view.

One of the reviewers of the Emotion paper seemed concerned about our motivations. The nice thing about these data is that we had nothing to do with the data collection, so this criticism is not terribly valid. Other parties can try this study too: the U of T folks figured out a way to study this issue with 6 items!