Random Reflections on Ceiling Effects and Replication Studies

In a blog post from December of 2013, I described our attempts to replicate two studies testing the claim that priming cleanliness makes participants less judgmental on a series of six moral vignettes. My original post has recently received criticism for my timing and my tone. In terms of timing, I blogged about a paper that had been accepted for publication, and there was no embargo on the work. In terms of tone, I tried to ground everything I wrote in data, but I also editorialized a bit. It can be hard to know what might be taken as offensive when you are describing an unsuccessful replication attempt. The title (“Go Big or Go Home – A Recent Replication Attempt”) might have been off-putting in hindsight. In the grand scope of discourse in the real world, however, I think my original blog post was fairly tame.

Most importantly: I was explicit in the original post about the need for more research. I will state again for the record: I don’t think this matter has been settled and more research is needed. We also said this in the Social Psychology paper.  It should be widely understood that no single study is ever definitive.

As noted in the recent news article in Science about the special issue of Social Psychology, there is some debate about ceiling effects in our replication studies. We discuss this issue at some length in our rejoinder to the commentary. I will provide some additional context and observations in this post. Readers interested only in the gory details can skip to #4. This is a long and tedious post, so I apologize in advance.

1. The original studies had relatively small sample sizes. There were 40 total participants in the original scrambled-sentence study (Study 1) and 43 total participants in the original hand-washing study (Study 2). It takes 26 participants per cell to have approximately an 80% chance of detecting a d of .80 with alpha set to .05 using a two-tailed significance test. A d of .80 would be considered a large effect size in many areas of psychology.
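The power figure above is easy to check numerically. Below is a minimal sketch using SciPy's noncentral t distribution; the function name is my own, and the numbers plugged in come from the text:

```python
from scipy import stats

def power_two_sample_t(n_per_group, d, alpha=0.05):
    """Approximate power of a two-tailed independent-samples t-test
    for a true standardized mean difference d."""
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-tailed critical value
    # Power = P(|T'| > t_crit) under the noncentral t distribution
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

# 26 per cell, d = .80, alpha = .05, two-tailed -> roughly .80, as stated above
print(round(power_two_sample_t(26, 0.80), 2))
```

With 20 per cell (close to the original Study 1), the same function returns noticeably less than .80, which is the point of the sample-size observation.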

2. The overall composite did not attain statistical significance using the conventional alpha level of .05 with a two-tailed test in the original Study 1 (p = .064).  (I have no special love for NHST but many people in the literature rely on this tool for drawing inferences).  Only one of the six vignettes attained statistical significance at the p < .05 level in the original Study 1 (Kitten). Two different vignettes attained statistical significance in the original Study 2 (Trolley and Wallet).  The kitten vignette did not. Effect size estimates for these contrasts are in our report.  Given the sample sizes, these estimates were large but they had wide confidence intervals.

3. The dependent variables were based on moral vignettes created for a different study originally conducted at the University of Virginia. These measures were originally pilot tested with 8 participants according to a PSPB paper (Schnall, Haidt, Clore, & Jordan, 2008, p. 1100). College students from the United States were used to develop the measures that served as the dependent variables. There was no a priori reason to think the measures would “not work” for college students from Michigan. We registered our replication plan, and Dr. Schnall was a reviewer on the proposal. No special concerns were raised about our procedures or the nature of our sample. Our sample sizes provided over .99 power to detect the original effect size estimates.

4. The composite DVs were calculated by averaging across the six vignettes and those variables had fairly normal distributions in our studies.  In Study 1, the mean for our control condition was 6.48 (SD = 1.13, Median = 6.67, Skewness = -.55, Kurtosis = -.24, n = 102) whereas it was 5.81 in the original paper (SD = 1.47, Median = 5.67, Skewness = -.33, Kurtosis = -.44, n = 20).   The average was higher in our sample but the scores theoretically range from 0 to 9.  We found no evidence of a priming effect using the composites in Study 1.   In Study 2, the mean for our control condition was 5.65 (SD = 0.59, Median = 5.67, Skewness = -.31, Kurtosis = -.19, n = 68) whereas it was 5.43 in the original paper (SD = 0.69, Median = 5.67, Skewness = -1.58, Kurtosis = 3.45, n = 22).  The scores theoretically range from 1 to 7.  We found no hand washing effect using the composites in Study 2.  These descriptive statistics provide additional context for the discussion about ceiling effects.  The raw data are posted and critical readers can and should verify these numbers.  I have a standing policy to donate $20 to the charity of choice for the first person who notes a significant (!) statistical mistake in my blog posts.
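For readers who want to verify descriptives like these from the posted raw data, the computation is short. This is only a sketch: the scores below are hypothetical stand-ins (in practice you would load the posted file, e.g. with `pd.read_csv`), and pandas reports the bias-corrected (SPSS-style) skewness and excess kurtosis used above:

```python
import pandas as pd

# Hypothetical composite scores standing in for one condition's posted data
scores = pd.Series([6.7, 7.0, 5.3, 6.0, 6.7, 7.3, 4.7, 6.3, 6.7, 5.7])

print("Mean:", scores.mean())
print("SD:", scores.std())          # sample standard deviation (n - 1)
print("Median:", scores.median())
print("Skewness:", scores.skew())   # bias-corrected, as in SPSS output
print("Kurtosis:", scores.kurt())   # excess kurtosis, bias-corrected
```

A median above the mean with negative skewness, as in the composites reported above, indicates scores bunching toward the top of the scale, which is what the ceiling-effect debate turns on.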

Schnall (2014) undertook a fairly intense screening of our data. This is healthy for the field, and the Open Science Framework facilitated this inquiry because we were required to post the data. Dr. Schnall noted that the responses to the individual moral vignettes tended toward the extreme in our samples. I think the underlying claim is that students in our samples were so moralistic that any cleanliness priming effects could not have overpowered their pre-existing moral convictions. This is what the ceiling effect argument translates to in real-world terms: the experiments could not have worked in Michigan because the samples tended to have a particular mindset.

It might be helpful to be a little more concrete about the distributions. For many of the individual vignettes, the “Extremely Wrong” option was a common response. Below is a summary of the six vignettes and some descriptive information about the data from the control condition of Study 1 in each of the two studies (ours and the original). I think readers will have to judge for themselves what kinds of distributions to expect from samples of college students. Depending on your level of self-righteousness, these results could be viewed positively or negatively. Remember, we used their original materials.

  • Dog (53% versus 30%):  Morality of eating a pet dog that was just killed in a car accident.
  • Trolley (2% versus 5%):  Morality of killing one person in the classic trolley dilemma.
  • Wallet (44% versus 20%): Morality of keeping cash from a wallet found on the street.
  • Plane (43% versus 30%): Morality of killing an injured boy to save yourself and another person from starving after a plane crash.
  • Resume (29% versus 15%):  Morality of enhancing qualifications on a resume.
  • Kitten (56% versus 70%): Morality of using a kitten for sexual gratification.

Note: All comparisons are from the Control conditions for our replication Study 1 compared to Study 1 in Schnall et al. (2008).  Percentages reflect the proportion of the sample selecting the “extremely wrong” option (i.e., selecting the “9” on the original 0 to 9 scale).  For example, 53% of our participants thought it was extremely wrong for Frank to eat his dead dog for dinner whereas 30% of the participants in the original study provided that response.

To recap, we did not find evidence for the predicted effects and we basically concluded more research was necessary.  Variable distributions are useful pieces of information and non-parametric tests were consistent with the standard t-tests we used in the paper. Moreover, their kitten distribution was at least as extreme as ours, and yet they found the predicted result on this particular vignette in Study 1. Thus, I worry that any ceiling argument only applies when the results are counter to the original predictions. 
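The claim that nonparametric tests agreed with the t-tests is straightforward to check against the posted data. Here is a minimal sketch; the two arrays are hypothetical stand-ins for the ratings in the two conditions, not the actual posted values:

```python
from scipy import stats

# Hypothetical ratings for two conditions (stand-ins for the posted data)
control = [6.7, 7.0, 5.3, 6.0, 6.7, 7.3, 4.7, 6.3]
primed  = [6.3, 6.8, 5.5, 6.2, 6.5, 7.0, 5.0, 6.1]

t_stat, t_p = stats.ttest_ind(control, primed)
u_stat, u_p = stats.mannwhitneyu(control, primed, alternative="two-sided")

# With mildly non-normal distributions the two tests typically lead to the
# same inference; large divergence would itself be informative
print(f"t-test p = {t_p:.3f}, Mann-Whitney p = {u_p:.3f}")
```

Running both tests is a useful robustness check precisely when skewed or ceiling-prone distributions make the t-test's normality assumption questionable.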

One reading of our null results is that there are unknown moderators of the cleanliness priming effects. We have tested for some moderators (e.g., private body consciousness, political orientation) in our replication report and rejoinder, but there could be other possibilities. For example, sample characteristics can make it difficult to find the predicted cleanliness priming results with particular measures.  If researchers have a sample of excessively moralistic/judgmental students who think using kittens for sexual gratification is extremely wrong, then cleaning primes may not be terribly effective at modulating their views. Perhaps a different set of vignettes that are more morally ambiguous (say more in line with the classic trolley problem) will show the predicted effects.  This is something to be tested in future research.

The bottom line for me is that we followed through on our research proposal and we reported our results. The raw data were posted. We have no control over the distributions. At the very least, researchers might need to worry about using this particular measure in the future based on our replication efforts. In short, the field may have learned something about how to test these ideas in the future. In the end, I come full circle to the original conclusion in the December blog post: more research is needed.


I am sure reactions to our work and the respective back-and-forth will break on partisan grounds. The “everything is fine” crew will believe that Dr. Schnall demolished our work, whereas the “replication is important” crew will think we raised good points. This is all well and good as it relates to the inside baseball and the sort of political theater that exists in our world. However, I hope these pieces do not just leave a bad taste in people’s mouths. I feel bad that this single paper and exchange have diverted attention from the important example of reform taken by Lakens and Nosek. They are helping to shape the broader narrative about how to do things differently in psychological science.


Quick Update on Timelines (23 May 2014)

David sent Dr. Schnall the paper we submitted to the editors on 28 October 2013 with a link to the raw materials. He wrote “I’ve attached the replication manuscript we submitted to Social Psychology based on our results to give you a heads up on what we found.”  He added: “If you have time, we feel it would be helpful to hear your opinions on our replication attempt, to shed some light on what kind of hidden moderators or other variables might be at play here.”

Dr. Schnall emailed back on 28 October 2013 asking for 2 weeks to review the material before we proceeded. David emailed back on 31 October 2013 apologizing for any miscommunication and explaining that we had already submitted the paper. He added that we were still interested in her thoughts.

That was the end of our exchanges. We learned about the ceiling effect concern when we received the commentary in early March of 2014.


Author: mbdonnellan

Professor, Social and Personality Psychology, Texas A&M University

27 thoughts on “Random Reflections on Ceiling Effects and Replication Studies”

  1. Well, that seems quite reasonable. Looking at the percentages, though, it may well be that the groups already differ significantly in their moral preferences. If they are indeed significantly different in their moral attitudes, then it’s likely not a very close replication (i.e., you have a different type of sample).

    1. By the way – if the sample is indeed so different, then that is fascinating to know, of course! It may add to any model (whether the effect is true or not). It will help create boundary conditions for the effect (if it is true), and it adds information that in the old days would have been unknown!

    2. This is a reasonable explanation. A good first step would be to get more baseline data on these vignettes to see what the distributions look like in a number of samples from other universities. This would give us a broader frame of reference for interpreting the current studies. Researchers at other universities could test whether college student characteristics are moderators by trying the sentence-scrambling study, as it is the easier of the two studies to implement. We could also try to run the sentence-scrambling priming study on mTurk, but some people don’t think priming on the internet is valid.

      1. Perhaps with much, much larger samples? (I could understand that MTurk simply introduces A LOT of noise.) That said, I also wonder about how effective the SST is. But a university more comparable to the original one would be great.

    3. I think this is an important point, especially in the context of whether there are hidden moderators of the cleanliness effect (e.g., population differences). But we had no a priori predictions about differences between our samples (nor did Dr. Schnall mention such suspicions when she reviewed our proposal). Even if it is the case that our samples are different, this information is useful for those planning to do research in this area, as it suggests that there are moderators that need to be ferreted out.

      1. entirely agreed. And an update of the model can be one function of replications.

      2. We have the task programmed in Qualtrics and we are happy to share it. We can even try an mTurk sample once we get some cash and amend the IRB.

      3. Also, I don’t think we should expect that original researchers know everything about the model. Being wrong about your model is something that all scientists experience in their lives. We usually are wrong. It’s a sign of progress. We should also try to be open to that end (let’s forgive each other for not having all predictions yet).

      4. This is – by the way – a good example of why the results should be reviewed too. This is both helpful for the replication authors and original authors, so that together we get to better models.

    4. in order to keep it in perspective – do not forget that those %s from original study are based on 20(!) participants. as everything in that study, these estimates are highly unreliable.

      1. This is the second forum where I’ve seen a post from you that sounds like a broad-brush, emotional attack on the original paper (“…as everything in that study [sic]… highly unreliable.”). I have no ax to grind here whatsoever–I’m a computer scientist who is fascinated by the replication controversy, but has no a priori opinion about the people, studies, events or methods being discussed. But from where I sit, your behavior seems to lend some credence to those who are complaining about bullying.

      2. (i’m a little late here, sorry)
        Gregg – ‘reliability’ is a technical term. almost by definition, a sample statistic that is based on very few observations is an unreliable estimate of the true (population) parameter. so saying that the results are unreliable is not an emotional attack but a scientific criticism. you can disagree (e.g., if you think the sample was not small), but it seems wrong to call this an emotional attack, much less bullying.

  2. Also, I am very disturbed by this comment:

    “The “everything is fine” crew will believe that Dr. Schnall demolished our work whereas the “replication is important” crew will think we raised good points.”

    I really like that more replications are done. But, let’s not divide us into camps. There’s a lot of people in the “in-between” area, that also criticise bad replications, but also see the merit of good replications. We know from the literature what categorization does to our perception of people…

      1. I agree, but as predicted, such has been the case. As a graduate student, I have been extremely disappointed by the discourse, or lack thereof, following the publication of the special issue, specifically the statements made by Dan Gilbert and JP de Ruiter in a back-and-forth on Twitter. It has been disheartening to watch psychologists whom I admire behave in such a way.

  3. I have a new policy on comments that I made prior to posting this based on the discussion at Science. I will only approve comments from people using real names.

    1. I will amend this on a case by case basis. I just had a colleague point out that new people might have good ideas but need some protection. I would prefer to have everything out in the open but I also value free speech. But I won’t tolerate nasty comments from anonymous sources!

  4. I’m surprised that no one above has yet mentioned the ideas Danny Kahneman has expressed about grace, politeness, collegiality, and collaborative efforts. I’m doing this from memory, so it’s possible the ones I have in mind might not have been published yet. As I recall, he has written about this issue twice: once pointing out the nature of the problem & subsequently speaking more to the issue of etiquette. Perhaps the latter has appeared only online; if so, I hope subsequent posts will cite the source (I’m not in a good position to look it up right now). Agree with Danny or not, it’s at least worthwhile to include his thoughts in the discussion. In my opinion, they are relevant when we consider not only distributive justice (our verdict on the presumed correctness of one set of results or another) but also the other elements identified as concerns many people have (evidenced by a vast amount of empirical literature in psychology that I assume is replicable!): interpersonal justice, informational justice, and procedural justice–see especially the work by scholars such as Tom Tyler, E. Allan Lind, and Bob Bies. The interpersonal realm pertains to treating people with respect and dignity, and I think at this point we ought to bend over backwards in our attempts to live up to those ideals. Issues of informational justice apparently were centered on transparency at the outset, which is all to the good. The aftermath, however, has suggested we might not yet have worked out all the necessary details. I think that overlaps with procedural justice (e.g., see criteria suggested by Gerald Leventhal), and those “how to” details might take longer than we realize to achieve something like a decent amount of consensus. Like beauty, after all, fairness is in the eye of the beholder!

  5. As is too often the case, the essential aspect of this issue is sidestepped or unnoticed. What, pray tell, is the theory (and I do not mean stipulation of folk-psychological likelihood) and conceptual apparatus that mediates and enables the presumed priming effect? What is priming? I submit there are a number of variants and that they do not all follow from a single (or dual) mechanism(s).

    I could go on at length, but I already have done so in print and will not here. I just want to make my point — psychology (generally, but not exclusively) is more concerned with demonstration-driven than theory-driven findings. And by “theory” I mean serious, conceptually sophisticated and (possibly) parametrically predictive structures that both explain as well as predict (more than a simple effect present or absent).

    In a real sense, much of what passes as psychological “science” is little more than folk using the right techniques (cf. Feynman) to flash their demos, but lacking conceptual heft to qualify as nomologically meaningful explication of nature.

    This, not simply hand wringing about replication, will ultimately either sink psychology or force a Kuhnian shift. It seems we have a sort of neo-behaviorism in which stipulated magical mental entities are now allowed with apparent impunity.
