How Do You Feel When Something Fails To Replicate?

Short Answer: I don’t know, I don’t care.

There is an ongoing discussion about the health of psychological science and the relative merits of different research practices that could improve the field. This productive discussion occasionally spawns a parallel conversation about the “psychology of the replicators” or an extended meditation about their motives, emotions, and intentions. Unfortunately, I think that parallel conversation is largely counter-productive. Why? We have limited insight into what goes on inside the minds of others. More importantly, feelings have no bearing on the validity of any result. I am a big fan of this line from Kimble (1994, p. 257): “How you feel about a finding has no bearing on its truth.”

A few people seem to think that replicators are predisposed to feeling ebullient (gleeful?) when they encounter failures to replicate. This is not my reaction. My initial response is fairly geeky: my impulse is to calculate the effect size estimate and precision of the new study and compare them to the old study. I do not get too invested when a small N replication fails to duplicate a large N original study. I am more interested when a large N replication fails to duplicate a small N original study.

I then look to see whether the design was difficult to implement or fairly straightforward, which provides context for interpreting the new evidence. This helps to anticipate the reactions of people who will argue that the replicators lacked the skill and expertise to conduct the study or that their motivations influenced the outcome. The often vague “lack of expertise” and “ill-intentioned” arguments are more persuasive when critics offer a plausible account of how these factors might have biased a particular replication effort. This would be akin to offering an alternative theory of the crime in legal proceedings. In many cases, it seems unlikely that these factors are especially relevant. For example, a few people claimed that we lacked the expertise to conduct survey studies of showering and loneliness, but these critics failed to offer a well-defined explanation for our particular results beyond some low-level mud-slinging. A failure to detect an effect is not prima facie evidence of a lack of expertise.

After this largely intellectual exercise is concluded, I might experience a change in mood or some sort of emotional reaction. More often this amounts to feelings of disappointment about the quality of the initial study and some anxiety about the state of affairs in the field (especially if the original study was of the small N, big effect size variety). A larger N study carries more weight than a smaller N study, so my degree of worry scales with the sample size of the replication. Of course, single studies are just data points that should end up as grist for the meta-analytic mill, so there might be some anticipation about what will happen in yet another replication attempt.

Other people might have different modal emotional reactions. But does it matter?  And does it have anything at all to do with the underlying science or the interpretation of the replication?  My answers are No, No, and No. I think the important issues are the relevant facts – the respective sample sizes, effect size estimates, and procedures.

(Hopefully) The Last Thing We Write About Warm Water and Loneliness

Our rejoinder to the Bargh and Shalev response to our replication studies has been accepted for publication after peer review. The Bargh and Shalev response is available here. A pdf of our rejoinder is available here. Here are the highlights of our piece:

  1. An inspection of the size of the correlations from their three new studies suggests their new effect size estimates are closer to our estimates than to those reported in their 2012 paper. The new studies all used larger sample sizes than the original studies.
  2. We have some concerns about the validity of the Physical Warmth Extraction Index and we believe the temperature item is the most direct test of their hypotheses. If you combine all available data and apply a random-effects meta-analytic model, the overall correlation is .017 (95% CI = -.02 to .06, based on 18 studies involving 5,285 participants). A sketch of this kind of calculation appears after this list.
  3. We still have no idea why 90% of the participants in their Study 1a responded that they took less than 1 shower/bath per week. No other study using a sample from the United States even comes close to this distribution. Given this anomaly, we think results from Study 1a should be viewed with extreme caution.
  4. Acquiring additional data from outside labs is probably the most constructive step forward. Additional cross-cultural data would also be valuable.
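
For readers curious about the mechanics of the random-effects estimate mentioned in point 2, here is a minimal sketch in Python. It assumes the usual Fisher z transformation and DerSimonian-Laird estimation of the between-study variance; the correlations and sample sizes in the example are made-up placeholders, not the actual 18 studies.

```python
import numpy as np

def random_effects_meta(rs, ns):
    """DerSimonian-Laird random-effects meta-analysis of correlations.

    Correlations are Fisher z-transformed, pooled with inverse-variance
    weights that incorporate the between-study variance (tau^2), and the
    pooled estimate is transformed back to the r metric.
    """
    rs, ns = np.asarray(rs, float), np.asarray(ns, float)
    z = np.arctanh(rs)                   # Fisher z transform
    v = 1.0 / (ns - 3.0)                 # within-study variance of z
    w = 1.0 / v
    z_fixed = np.sum(w * z) / np.sum(w)
    q = np.sum(w * (z - z_fixed) ** 2)   # heterogeneity statistic
    df = len(rs) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)        # between-study variance estimate
    w_star = 1.0 / (v + tau2)
    z_re = np.sum(w_star * z) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    lo, hi = z_re - 1.96 * se, z_re + 1.96 * se
    return np.tanh(z_re), (np.tanh(lo), np.tanh(hi))

# Placeholder inputs: the per-study correlations and Ns would go here.
r_pooled, ci = random_effects_meta(rs=[0.02, -0.01, 0.05], ns=[200, 365, 150])
print(round(r_pooled, 3), tuple(round(x, 3) for x in ci))
```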

This has been an interesting adventure and we have learned a lot about self-reported bathing/showering habits. What more could you ask for?


Is Obama a Narcissist?

Warning: For educational purposes only. I am a personality researcher, not a political scientist!

Short Answer: Probably Not.

Longer Answer: There has been a fair bit of discussion about narcissism and the current president (see here for example). Some of this stemmed from recent claims about his use of first-person pronouns (i.e., purportedly greater use of “I-talk”). A big problem with that line of reasoning is that the empirical evidence linking narcissism with I-talk is surprisingly shaky. Thus, Obama’s use of pronouns is probably not very useful when it comes to making inferences about his levels of narcissism.

Perhaps a better way to gauge Obama’s level of narcissism is to see how well his personality profile matches a profile typical of someone with Narcissistic Personality Disorder (NPD).  The good news is that we have such a personality profile for NPD thanks to Lynam and Widiger (2001).  Those researchers asked 12 experts to describe the prototype case of NPD in terms of the facets of the Five-Factor Model (FFM). In general, they found that someone with NPD could be characterized as having the following characteristics…

High Levels: Assertiveness, Excitement Seeking, Hostility, and Openness to Actions (i.e., a willingness to try new things)

Low Levels: Agreeableness (all aspects), Self-Consciousness, Warmth, Openness to Feelings (i.e., a lack of awareness of one’s emotional state and some elements of empathy)

The trickier issue is finding good data on Obama’s actual personality. My former students Edward Witt and Robert Ackerman did some research on this topic that can be used as a starting point.  They had 86 college students (51 liberals and 35 conservatives) rate Obama’s personality using the same dimensions Lynam and Widiger used to generate the NPD profile.  We can use the ratings of Obama averaged across the 86 different students as an informant report of his personality.

Note: I know this approach is far from perfect and it would be ideal to have non-partisan expert raters of Obama’s personality (specifically on the 30 facets of the FFM). If you have such a dataset, send it my way (self-reported data from the POTUS would be welcome too)! Moreover, Witt and Ackerman found that liberals and conservatives differed somewhat when rating Obama’s personality. For example, conservatives saw him as higher in hostility and lower in warmth than liberals did. Thus, the profile I am using might reflect a rosier view of Obama’s personality than a profile generated from a sample with more conservatives (send me such a dataset if you have it!). An extremely liberal sample might generate an even more positive profile than what they obtained.

With those caveats out of the way, the next step is simple: Calculate the Intraclass Correlation Coefficient (ICC) between his informant-rated profile and the profile of the prototypic person with NPD. The answer is basically zero (ICC = -.08; Pearson’s r = .06).  In short, I don’t think Obama fits the bill of the prototypical narcissist. More data are always welcome but I would be somewhat surprised if Obama’s profile matched well with the profile of a quintessential narcissist in another dataset.
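
In case anyone wants to tinker with the profile-matching step, here is a minimal sketch in Python. The post does not say exactly which ICC variant was computed, so the sketch uses the double-entry ICC that is common in profile-similarity research; treat it as an illustration rather than the exact analysis, and note that the facet scores below are made-up placeholders rather than the actual ratings.

```python
import numpy as np

def profile_similarity(profile_a, profile_b):
    """Pearson r and double-entry ICC between two trait profiles.

    The double-entry ICC correlates the profiles after entering each pair
    of scores twice, once in each order, which penalizes differences in
    both profile shape and profile elevation.
    """
    a, b = np.asarray(profile_a, float), np.asarray(profile_b, float)
    r = np.corrcoef(a, b)[0, 1]
    double_x = np.concatenate([a, b])
    double_y = np.concatenate([b, a])
    icc_de = np.corrcoef(double_x, double_y)[0, 1]
    return r, icc_de

# Placeholder facet scores (a real analysis would use all 30 FFM facets):
npd_prototype = [4.2, 4.5, 1.5, 2.0, 3.8]   # expert-rated NPD prototype
obama_ratings = [3.1, 2.9, 3.5, 3.9, 3.0]   # averaged informant ratings
print(profile_similarity(npd_prototype, obama_ratings))
```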

As an aside, Ashley Watts and colleagues evaluated levels of narcissism in the first 43 presidents and they used historical experts to rate presidential personalities. Their paper is extremely interesting and well worth reading. They found these five presidents had personalities with the highest relative approximation to the prototype of NPD: LBJ, Nixon, Jackson, Johnson, and Arthur.  The five lowest presidents were Lincoln, Fillmore, Grant, McKinley, and Monroe. (See Table 4 in their report).

Using data from the Watts et al. paper, I computed standardized scores for the estimates of Obama’s grandiose and vulnerable narcissism levels from the Witt and Ackerman profile. These scores indicated Obama was below average by over .50 SDs for both dimensions (Grandiose: -.70; Vulnerable: -.63).   The big caveat here is that the personality ratings for Obama were provided by undergrads and the Watts et al. data were from experts.  Again, however, there were no indications that Obama is especially narcissistic compared to the other presidents.

Thanks to Robert Ackerman, Matthias Mehl, Rich Slatcher, Ashley Watts, and Edward Witt for insights that helped with this post.

Postscript 1: This is a light-hearted post. However, the procedures I used could make for a fun classroom project for Personality Psychology 101. Have the students rate a focal individual such as Obama or a character from TV, movies, etc., and then compare the consensus profile to the PD profiles. I have all of the materials to do this if you want them. The variance in the ratings across students is also potentially interesting.

Postscript 2: Using this same general procedure, Edward Witt, Christopher Hopwood, and I concluded that Anakin Skywalker did not strongly match the profile of someone with BPD and neither did Darth Vader (counter to these speculations).  They were more like successful psychopaths.  But that is a blog post for another day!

Silly Questions to Ask Children

I have been working on a project designed to measure a certain individual difference in children as early as 5 years of age. There are a number of concerns about the use of self-reports with young children, so this has been an overarching issue in the project. To partially address it, we came up with a handful of items that would be useful for detecting unusual responses in children. These items might be used to identify children who did not understand how to use the response scale or to flag children who were giving responses that would be considered invalid. There is a cottage industry of these kinds of scales for adult personality inventories but fewer options for kids. (And yes, I know about the controversies in the literature over these kinds of scales.)

Truth be told, I like writing items and I think this is true for many researchers. I am curious about how people respond to all sorts of questions especially silly ones.  It is even better if the silly ones tap something interesting about personality or ask participants about dinosaurs.

Here are a few sample items:

1. How do you feel about getting shots from the doctor?

2. How do you feel about getting presents for your birthday?

And my favorite item ever….

3. How would you feel about being eaten by a T-Rex?

The fact that we have asked over 800 kids this last question is sort of ridiculous but it makes me happy. I predicted that kids would report negative responses for this one. This was true for the most part, but 11.3% of the sample registered a positive response. In fact, the T-Rex item sparked a heated conversation in my household this morning. My spouse (AD), a former school teacher, thought some kids might think it was cool to see a T-Rex. She thought it was a bad item. My youngest child (SD) thought it would be bad to be eaten by said T-Rex even if it was cool to see one in person. I think SD was on my side.

I have had enough controversy over the past few weeks so I wanted to move on from this breakfast conversation. Thus, I did what any sensible academic would do – I equivocated. I acknowledged that items usually reflect multiple sources of variance and all have some degree of error. I also conceded that this item might pick up on sensation-seeking tendencies. There could be some kids who might find it thrilling to be eaten by a T-Rex. Then I took SD to school and cried over a large cup of coffee.

But I still like this item and I think most people would think it would suck to be eaten by a T-Rex. It might also be fun to crowd source the writing of additional items. Feel free to make suggestions.

PS: I want to acknowledge my two collaborators on this project – Michelle Harris and Kali Trzesniewski. They did all of the hard work collecting these data.

Apology

There has been a lot of commentary about the tone of my 11 December 2013 blog post. I’ve tried to keep a relatively low profile during the events of the last week.  It has been one of the strangest weeks of my professional life. However, it seems appropriate to make a formal apology.

1. I apologize for the title. I intended it as a jokey reference to the need to conduct high-powered replication studies. It was ill-advised.

2. I apologize for the now infamous “epic fail” remark (“We gave it our best shot and pretty much encountered an epic fail as my 10 year old would say”). It was poor form and contributed to hurt feelings. I should have been more thoughtful.

I will do better to make sure that I uphold the virtues of civility in future blog postings.

-brent donnellan

Random Reflections on Ceiling Effects and Replication Studies

In a blog post from December of 2013, I described our attempts to replicate two studies testing the claim that priming cleanliness makes participants less judgmental on a series of 6 moral vignettes. My original post has recently received criticism for its timing and its tone. In terms of timing, I blogged about a paper that was accepted for publication and there was no embargo on the work. In terms of tone, I tried to ground everything I wrote in data but I also editorialized a bit. It can be hard to know what might be taken as offensive when you are describing an unsuccessful replication attempt. The title (“Go Big or Go Home – A Recent Replication Attempt”) might have been off-putting in hindsight. In the grand scope of discourse in the real world, however, I think my original blog post was fairly tame.

Most importantly: I was explicit in the original post about the need for more research. I will state again for the record: I don’t think this matter has been settled and more research is needed. We also said this in the Social Psychology paper.  It should be widely understood that no single study is ever definitive.

As noted in the recent Science news article about the special issue of Social Psychology, there is some debate about ceiling effects in our replication studies. We discuss this issue at some length in our rejoinder to the commentary. I will provide some additional context and observations in this post. Readers just interested in the gory details can skip to #4. This is a long and tedious post, so I apologize in advance.

1. The original studies had relatively small sample sizes. There were 40 total participants in the original scrambled sentence study (Study 1) and 43 total participants in the original hand washing study (Study 2). It takes 26 participants per cell to have an approximately 80% chance to detect a d of .80 with alpha set to .05 using a two-tailed significance test. A d of .80 would be considered a large effect size in many areas of psychology.
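
As a rough check on that per-cell figure, here is a minimal power-analysis sketch in Python using statsmodels (the original calculation may have used different software, so treat this as illustrative; the 100-per-cell figure in the second call is a hypothetical example, not our actual cell size).

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Participants per cell for 80% power to detect d = .80 (alpha = .05, two-tailed)
n_per_cell = analysis.solve_power(effect_size=0.80, alpha=0.05, power=0.80,
                                  alternative="two-sided")
print(n_per_cell)  # roughly 25.5, i.e., 26 per cell after rounding up

# Conversely, the achieved power for a hypothetical 100 participants per cell:
achieved = analysis.solve_power(effect_size=0.80, nobs1=100, alpha=0.05,
                                alternative="two-sided")
print(achieved)  # well above .99 for an effect of this size
```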

2. The overall composite did not attain statistical significance using the conventional alpha level of .05 with a two-tailed test in the original Study 1 (p = .064).  (I have no special love for NHST but many people in the literature rely on this tool for drawing inferences).  Only one of the six vignettes attained statistical significance at the p < .05 level in the original Study 1 (Kitten). Two different vignettes attained statistical significance in the original Study 2 (Trolley and Wallet).  The kitten vignette did not. Effect size estimates for these contrasts are in our report.  Given the sample sizes, these estimates were large but they had wide confidence intervals.

3. The dependent variables were based on moral vignettes created for a different study originally conducted at the University of Virginia. These measures were pilot tested with 8 participants according to a PSPB paper (Schnall, Haidt, Clore, & Jordan, 2008, p. 1100). College students from the United States were used to develop the measures that served as the dependent variables. There was no a priori reason to think the measures would “not work” for college students from Michigan. We registered our replication plan and Dr. Schnall was a reviewer on the proposal. No special concerns were raised about our procedures or the nature of our sample. Our sample sizes provided over .99 power to detect the original effect size estimates.

4. The composite DVs were calculated by averaging across the six vignettes and those variables had fairly normal distributions in our studies.  In Study 1, the mean for our control condition was 6.48 (SD = 1.13, Median = 6.67, Skewness = -.55, Kurtosis = -.24, n = 102) whereas it was 5.81 in the original paper (SD = 1.47, Median = 5.67, Skewness = -.33, Kurtosis = -.44, n = 20).   The average was higher in our sample but the scores theoretically range from 0 to 9.  We found no evidence of a priming effect using the composites in Study 1.   In Study 2, the mean for our control condition was 5.65 (SD = 0.59, Median = 5.67, Skewness = -.31, Kurtosis = -.19, n = 68) whereas it was 5.43 in the original paper (SD = 0.69, Median = 5.67, Skewness = -1.58, Kurtosis = 3.45, n = 22).  The scores theoretically range from 1 to 7.  We found no hand washing effect using the composites in Study 2.  These descriptive statistics provide additional context for the discussion about ceiling effects.  The raw data are posted and critical readers can and should verify these numbers.  I have a standing policy to donate $20 to the charity of choice for the first person who notes a significant (!) statistical mistake in my blog posts.
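
For anyone inclined to take up the $20 challenge, here is a minimal sketch of how these descriptives could be recomputed from the posted raw data. The file and column names are placeholders (substitute the actual posted files), and scipy’s bias-corrected skewness and excess kurtosis should come close to the values reported above, although small formula differences across packages are possible.

```python
import pandas as pd
from scipy.stats import skew, kurtosis

# Placeholder file/column names -- swap in the actual posted data and codebook.
df = pd.read_csv("study1_replication.csv")
control = df.loc[df["condition"] == "control", "moral_composite"]

print("n =", int(control.count()))
print("mean =", round(control.mean(), 2))
print("SD =", round(control.std(), 2))                        # sample SD (ddof = 1)
print("median =", round(control.median(), 2))
print("skewness =", round(skew(control, bias=False), 2))      # bias-corrected skew
print("kurtosis =", round(kurtosis(control, bias=False), 2))  # excess kurtosis
```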

Schnall (2014) undertook a fairly intense screening of our data. This is healthy for the field, and the Open Science Framework facilitated this inquiry because we were required to post the data. Dr. Schnall noted that the responses to the individual moral vignettes tended toward the extreme in our samples. I think the underlying claim is that students in our samples were so moralistic that any cleanliness priming effects could not have overpowered their pre-existing moral convictions. This is what the ceiling effect argument translates to in real-world terms: the experiments could not have worked in Michigan because the samples tended to have a particular mindset.

It might be helpful to be a little more concrete about the distributions. For many of the individual vignettes, the “Extremely Wrong” option was a common response. Below is a summary of the six vignettes along with descriptive information from the control conditions of the two Study 1s (ours and the original). I think readers will have to judge for themselves what kinds of distributions to expect from samples of college students. Depending on your level of self-righteousness, these results could be viewed positively or negatively. Remember, we used their original materials.

  • Dog (53% versus 30%):  Morality of eating a pet dog that was just killed in a car accident.
  • Trolley (2% versus 5%):  Morality of killing one person in the classic trolley dilemma.
  • Wallet (44% versus 20%): Morality of keeping cash from a wallet found on the street.
  • Plane (43% versus 30%): Morality of killing an injured boy to save yourself and another person from starving after a plane crash.
  • Resume (29% versus 15%):  Morality of enhancing qualifications on a resume.
  • Kitten (56% versus 70%): Morality of using a kitten for sexual gratification.

Note: All comparisons are from the Control conditions for our replication Study 1 compared to Study 1 in Schnall et al. (2008).  Percentages reflect the proportion of the sample selecting the “extremely wrong” option (i.e., selecting the “9” on the original 0 to 9 scale).  For example, 53% of our participants thought it was extremely wrong for Frank to eat his dead dog for dinner whereas 30% of the participants in the original study provided that response.

To recap, we did not find evidence for the predicted effects and we basically concluded more research was necessary.  Variable distributions are useful pieces of information and non-parametric tests were consistent with the standard t-tests we used in the paper. Moreover, their kitten distribution was at least as extreme as ours, and yet they found the predicted result on this particular vignette in Study 1. Thus, I worry that any ceiling argument only applies when the results are counter to the original predictions. 
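
For what it is worth, checking that a nonparametric test tells the same story as a standard t-test is straightforward. Here is a minimal sketch with simulated placeholder data (a real check would read in the posted composite scores rather than simulating them):

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(0)
# Placeholder data roughly matching the composite scale; not the actual scores.
control = rng.normal(6.5, 1.1, 102)
primed = rng.normal(6.5, 1.1, 100)

t_stat, p_t = ttest_ind(control, primed)                          # parametric
u_stat, p_u = mannwhitneyu(control, primed, alternative="two-sided")  # nonparametric
print(round(p_t, 3), round(p_u, 3))  # the two p-values should tell a similar story
```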

One reading of our null results is that there are unknown moderators of the cleanliness priming effects. We have tested for some moderators (e.g., private body consciousness, political orientation) in our replication report and rejoinder, but there could be other possibilities. For example, sample characteristics can make it difficult to find the predicted cleanliness priming results with particular measures.  If researchers have a sample of excessively moralistic/judgmental students who think using kittens for sexual gratification is extremely wrong, then cleaning primes may not be terribly effective at modulating their views. Perhaps a different set of vignettes that are more morally ambiguous (say more in line with the classic trolley problem) will show the predicted effects.  This is something to be tested in future research.

The bottom line for me is that we followed through on our research proposal and we reported our results. The raw data were posted. We have no control over the distributions. At the very least, researchers might need to worry about using this particular measure in the future based on our replication efforts. In short, the field may have learned something about how to test these ideas in the future. In the end, I come full circle to the original conclusion in the December blog post: more research is needed.

Postscript

I am sure reactions to our work and the respective back-and-forth will break on partisan grounds. The “everything is fine” crew will believe that Dr. Schnall demolished our work whereas the “replication is important” crew will think we raised good points. This is all fine and good as it relates to the insider baseball and sort of political theater that exists in our world. However, I hope these pieces do not just create a bad taste in people’s mouths. I feel badly that this single paper and exchange have diverted attention from the important example of reform set by Lakens and Nosek. They are helping to shape the broader narrative about how to do things differently in psychological science.


Quick Update on Timelines (23 May 2014)

David sent Dr. Schnall the paper we submitted to the editors on 28 October 2013 with a link to the raw materials. He wrote “I’ve attached the replication manuscript we submitted to Social Psychology based on our results to give you a heads up on what we found.”  He added: “If you have time, we feel it would be helpful to hear your opinions on our replication attempt, to shed some light on what kind of hidden moderators or other variables might be at play here.”

Dr. Schnall emailed back on 28 October 2013 asking for 2 weeks to review the material before we proceeded. David emailed back on 31 October 2013 apologizing for any miscommunication and that we had submitted the paper. He added we were still interested in her thoughts.

That was the end of our exchanges. We learned about the ceiling effect concern when we received the commentary in early March of 2014.

Warm Water and Loneliness Again?!?!

Call me Captain Ahab…

This is a dead horse, but I got around to writing up some useful new data in this saga. Researchers at the University of Texas at Austin tried to replicate the basic survey findings in a large Introductory Psychology course back in the Fall of 2013. They emailed me the results back in November, and the results were consistent with the general null effects we had been getting in our work. I asked them if I could write it up for the Psychology File Drawer and they were amenable. Here is a link to a more complete description of the results and here is a link to the PFD reference.

The basic details…

There was no evidence for an association between loneliness (M = 2.56, SD = .80, alpha = .85) and the Physical Warmth Index (r = -.03, p = .535, n = 365, 95% CI = -.14 to .07). Moreover, the hypothesis-relevant correlation between the water temperature item and the loneliness scale was not statistically distinguishable from zero (r = -.08, p = .141, n = 365, 95% CI = -.18 to .03).
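
For readers who want to reproduce intervals like these, here is a minimal sketch using the Fisher z approximation. I am assuming this is roughly how the CIs were computed; a different method could shift the bounds by a point or two in the second decimal.

```python
import numpy as np
from scipy.stats import norm

def r_confidence_interval(r, n, level=0.95):
    """Approximate CI for a Pearson correlation via the Fisher z transform."""
    z = np.arctanh(r)                      # transform r to z
    se = 1.0 / np.sqrt(n - 3)              # standard error of z
    zcrit = norm.ppf(1 - (1 - level) / 2)  # e.g., 1.96 for a 95% interval
    lo, hi = np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)
    return lo, hi

print(r_confidence_interval(-0.03, 365))  # Physical Warmth Index
print(r_confidence_interval(-0.08, 365))  # water temperature item
```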

One possible issue is that the UT Austin crew used a short 3-item measure of loneliness developed for large-scale survey work, whereas the other studies have used longer measures. Fortunately, other research suggests this short measure correlates above .80 with the parent instrument, so I do not think this is a major limitation. But I can see others holding a different view.

One of the reviewers of the Emotion paper seemed concerned about our motivations. The nice thing about these data is that we had nothing to do with the data collection, so that criticism does not really apply here. Other parties can try this study too; the UT Austin folks figured out a way to study this issue with just 6 items!