Careless Responders and Factor Structures

Warning: This post will bore most people.  Read at your own risk. I also linked to some  articles behind pay walls. Sorry!

I have a couple of research obsessions that interest me more than they should. This post is about two in particular: 1) the factor structure of the Rosenberg Self-Esteem Scale (RSE); and 2) the impact that careless responding can have on the psychometric properties of measures.  Like I said, this is a boring post.

I worked at the same institution as Neal Schmitt for about a decade and he once wrote a paper in 1985 (with Daniel Stults) illustrating how careless respondents can contribute to “artifact” factors defined by negatively keyed items (see also Woods, 2006).  One implication of Neal’s paper is that careless responders (e.g., people who mark a “1” for all items regardless of the content) confound the evaluation of the dimensionality of scales that include both positively and keyed items.  This matters for empirical research concerning the factor structure of the RSE.  The RSE is perfectly balanced (it has 5 positively-keyed items and 5 negatively-keyed items). Careless responders might contribute to method artifacts when evaluating the structure of the RSE.

This issue then raises a critical issue — how do you identify careless responders? There is an entire literature on this subject (see e.g., Meade & Craig, 2012) that is well worth reading. One option is to sprinkle directed response items throughout a survey (i.e., “Please mark 4 for quality control purposes”). The trick is that participants can be frustrated by too many of these so these items have to be used judiciously. A second option is to include scales developed explicitly to identify careless responders (see e.g., Marjanovic, Struthers, Cribbie, & Greenglass, 2014).  These are good strategies for new data collections. They are not suitable for identifying careless respondents from existing datasets (see Marjanovic, Holden, Struthers, Cribbie, & Greenglass, 2015).  This could be a concern as Meade and Craig found that between 10% and 12% of undergraduate participants to a long survey could be flagged as careless responders using a cool latent profile technique. My take away from their paper is that many datasets might have some degree of contamination.  Yikes!

I experimented with different methods for detecting careless responders on an ad-hoc basis several years ago for a conference talk.  One approach took advantage of the fact that the RSE is a balanced scale. Thus, I computed absolute value discrepancy scores between the positively and negatively keyed items.  [I’m sure someone had the idea before me and that I read about it but simply forgot the source. I also know that some people believe that positively and negatively keyed items reflect different constructs. I’m kind of skeptical of that argument.]

For example, image Dr. Evil responds a “1” to all 10 of the RSE items assessed on a 5 point Likert-type scale.  Given that half of the RSE items are reverse scored, 5 of Dr. Evil’s 1s will be transformed to 5s.  Her/his average for the positively keyed items will be 1 whereas the average for the negatively keyed items will be a 5.  This generates a value of 4 on the discrepancy index (the maximum in this example).  I basically found that selecting people with smaller discrepancy scores cleaned up the evaluation of the factor structure of the RSE.  I dropped the 10% of the sample with the highest discrepancy scores but this was made on a post hoc basis.

[I know there are all sorts of limitations and assumptions with this approach. For example, one obvious limitation is that Dr. Super Evil who responds a 3 to all items, regardless of her/his true feelings, earns a discrepancy score of 0 and is retained in the analysis. Dr. Super Evil is a real problem. I suspect she/he is friends with the Hamburglar.]

Marjanovic et al. (2015) recently published an interesting approach for detecting careless responding.  They propose calculating the standard deviation of the set of items designed to assess the same construct for each person (called the inter-item standard deviation or ISD).  Here the items all need to be keyed in the correct direction and I suspect this approach works best for scales with a mix of positive and negatively keyed items given issues of rectangular responding. [Note: Others have used the inter-item standard deviation as an indicator of substantive constructs but these authors are using this index as a methodological tool.]

Marjanovic et al. (2015) had a dataset with responses to Marjanovic et al. (2014) Conscientious Responders Scale (CRS) as well as responses to Big Five scales.  A composite based on the average of the ISDs for each of the Big Five scales was strongly negatively correlated with responses to the CRS (r = -.81, n = 284). Things looked promising based on the initial study. They also showed how to use a random number generator to develop empirical benchmarks for the ISD.  Indeed, I got a better understanding of the ISD when I simulated a dataset of 1,000 responses to 10 hypothetical items in which item responses were independent and drawn from a distribution whereby each of the five response options has a .20 proportion in the population.  [I also computed the ISD when preparing my talk back in the day but I focused on the discrepancy index – I just used the ISD to identify the people who gave all 3s to the RSE items by selecting mean = 3 and ISD = 0.  There remains an issue with separating those who have “neutral” feelings about the self from people like Dr. Super Evil.]

Anyways, I used their approach and it works well to help clean up analyses of the factor structure of the RSE.  I first drew a sample of 1,000 from a larger dataset of responses to the RSE (the same dataset I used for my conference presentation in 2009).  I only selected responses from European American students to avoid concerns about cultural differences.  The raw data and a  brief description are available.  The ratio of the first to second eigenvalues was 3.13 (5.059 and 1.616) and the scree plot would suggest 2 factors. [I got these eigenvalues from Mplus and this is based on the correlation matrix with 1.0s on the diagonal.  Some purists will kill me. I get it.]

I then ran through a standard set of models for the RSE.  A single factor model was not terribly impressive (e.g., RMSEA = .169, TLI = .601, SRMR = .103) and I thought the best fit was a model with a single global factor and correlated residuals for the negatively and positively keyed items minus one correlation (RMSEA = .068, TLI = .836, SRMR = .029).  I computed the internal consistency coefficient (alpha = .887, average inter-item correlation = .449). Tables with fit indices, the Mplus syntax, and input data are available.

Using the Marjanovic et al (2015) approach with random data, I identified 15% of the sample that could be flagged as random responders (see their paper for details). The RSE structure looked more unidimensional with this subset of 850 non-careless responders. The ratio of the first to second eigenvalues was 6.22 (6.145 and 0.988) and the models tended to have stronger factor loadings and comparatively better fit (even adjusting for the smaller sample size).  Consider that the average loading for the single factor model for all participants was .67 and this increased to .76 with the “clean” dataset. The single global model fit was still relatively unimpressive but better than before (RMSEA = .129, TLI = .852, SRMR = .055) and the single global model with correlated item residuals was still the best (RMSEA = .063, TLI = .964, SRMR = .019).  The alpha was even a bit better (.926, average inter-item correlation = .570).

So I think there is something to be said for trying to identify careless responders before undertaking analyses designed to evaluate the structure of the Rosenberg and other measures as well.  I also hope people continue to develop and evaluate simple ways for flagging potential careless responders for both new and existing datasets.  This might not be “sexy” work but it is important and useful.


Updates (1:30 CST; 2 June 2015): A few people sent/tweeted links to good papers.

Huang et al. (2012). Detecting and deterring insufficient effort responding to surveys.

Huang, Liu, & Bowling (2015). Insufficient effort responding: Examining an insidious confound in survey data.

Maniaci & Roggee (2014). Caring about carelessness: Participant inattention and its effects on research.

Reise & Widaman (1999). Assessing the fit of measurement models at the individual level: A comparison of item response theory and covariance structure approaches.

(1:00 CST; 3 June 2015): Even More Recommendations!  Sanjay rightly pointed out that my post was stupid. But the references and suggested readings are gold!  So even if my post wasted your time, the references should prove useful.

DeSimone, Harms, & DeSimone (2014).  Best practice recommendations for data screening.

Hankins (2008). The reliability of the twelve-item General Health Questionnaire (GHQ-12) under realistic assumptions.

See also: Graham, J. M(2006). Congeneric and (essentially) tau-equivalent estimates of score reliability what they are and how to use them. {Good stuff pointing to limitations with alpha and alternatives}

Savalei & Falk (2014).  Recovering substantive factor loadings in the presence of acquiescence bias: A Comparison of three approaches.



A Partial Defense of the Pete Rose Rule

I tweeted this yesterday: Let’s adopt a Pete Rose Rule for fakers = banned for life.  Nothing questionable about fraud.  Jobs and funds are too scarce for 2nd chances.

My initial thought was that people who have been shown by a preponderance of the evidence to have passed faked datasets as legitimate should be banned from receiving grants and publishing papers for life.   [Pete Rose was a baseball player and manager in professional baseball who bet on games when he was a manager. This made him permanently ineligible to participate in the activities of professional baseball.]

Nick Brown didn’t like this suggestion and provided a thoughtful response on his blog.  My post is an attempt to defend my initial proposal. I don’t want to hijack his comments with a lengthy rejoinder. You can get banned for life from the Olympics for doping so I don’t think it is beyond the pale to make the same suggestion for science.  As always, I reserve the right to change my mind in the future!

At the outset, I agree with his suggestion that it is not 100% feasible given that there is no overall international governing body for scientific research like there is for professional sports or the Olympics. However, the research world is often surprisingly small and I think it would be possible to impose an informal ban that would stick. And I think it could be warranted because faking data is exceptionally damaging to science. I also think it is rare so perhaps it is not worth thinking about too much.

Fakers impose huge costs on the system.  First, they make journals and scientists look bad in the eyes of the public. This is unfortunate because the “public” ultimately pays for a considerable amount of scientific research.  Faked data undermine public confidence in scientists and this often bleeds over into discussions about unrelated issues such as climate change or whether vaccines cause autism.  Likewise, as Dr. Barr pointed out in a comment on Nick’s blog, there is a case to be made for taking legal action for fraud in some cases.

Second, it takes resources to investigate the fakers. At the risk of speaking in broad generalities, I suspect that huge amounts of time are invested when it comes to the investigation of faked data. It takes effort to evaluate the initial charge and then determine what was and was not faked for people with long CVs. Efforts also need to be expended determine whether co-authors were innocent or co-conspirators.  This is time and money NOT spent on new research, teaching students, reviewing papers, etc.

Third, fakers impose costs on their peers.  Academics is a competitive enterprise.  We are judged by the number and quality of our work.  I suspect it is much easier to pump out papers based on fake data than real data.  This matters because there are limited numbers of positions and grant dollars.  A grad student faker who gets a paper in say Science will have a huge advantage on the job market.  There are far more qualified people than university positions.  Universities that have merit-based systems end up paying superstars more than mere mortals.  A superstar who faked her/his/their way to an impressive CV could easily have a higher salary than an honest peer who can’t compete with faked data.  Likewise, fakers cause their peers to waste limited resources when researchers attempt to extend (or just replicate) interesting results.

To my mind, faking data is the worst crime in science because it undermines the integrity of the system.  Thus, I believe that it warrants a serious punishment once it is established after a thorough judicial process or a confession.  You might think a lifetime ban is too severe but I am not so sure.

Moreover, let’s say the field decides to let a faker back in the “game” after some kind of rehabilitation.  Is this wise? I worry that it would impose additional and never-ending costs on the system.  The rehabilitated faker is going to continue to drain the system until retirement. For example, it would cost resources to double-check everything she or he does in the future.  How am I supposed to treat a journal submission from a known faker? It would require extra effort, additional reviews, and a lot of teeth gnashing. I would think a paper from a faker would need to independently replicated before it was taken seriously (I think this is true of all papers, but that is a topic for another day).  Why should a known faker get grants when so many good proposals are not funded because of a lack of resources? Would you trust a rehabilitated faker to train grad students in your program?

So my solution is to kick the “convicted” faker out of the game forever.  There are lots of talented and bright people who can’t get into the game as it stands.  There are not enough resources to go around for deserving scientists who don’t cheat.  I know that I would personally never vote to hire a faker in my department.

But I am open-minded and I know it sounds harsh. I want to thank Nick for forcing me to think more about this. Comments are welcome!

Replication Project in Personality Psychology – Call for Submissions

Richard Lucas and I are editing a special issue of the Journal of Research in Personality dedicated to replication (Click here for complete details). This blog post describes the general process and a few of my random thoughts on the special issue. These are my thoughts and Rich may or may not share my views.  I also want to acknowledge that there are multiple ways of doing replication special issues and we have no illusions that our approach is ideal or uncontroversial.  These kinds of efforts are part of an evolving “conversation” in the field about replication efforts and experimentation should be tolerated.  I also want to make it clear that JRP has been open to replication studies for several years.  The point of the special issue is to actively encourage replication studies and try something new with a variant of pre-registration.

What is the General Process?

We modeled the call for papers on procedures others have used with replication special issues and registered reports (e.g., the special issue of Social Psychology, the Registered Replication Reports at PoPS).  Here is the gist:

  • Authors will submit proposals for replication studies by 1 July 2015. These extended abstracts will be screened for methodological rigor and the importance of the topic.
  • Authors of selected proposals will then be notified by 15 August 2015.
  • There is a deadline of 15 March 2016 to submit the finished manuscript.

We are looking to identify a set of well-designed replication studies that provide valuable information about findings in personality psychology (broadly construed). We hope to include a healthy mix of pre-registered direct replications involving new data collections (either by independent groups or adversarial collaborations) and replications using existing datasets for projects that are not amenable to new data collection (e.g., long-term longitudinal studies).  The specific outcome of the replication attempt will not be a factor in selection.  Indeed, we do not want proposals to describe the actual results!

Complete manuscript will be subjected to peer-review but the relevant issues will be adherence to the proposed research plan, the quality of the data analysis, and the reasonableness of the interpretations.  For example, proposing to use a sample size of 800 but submitting a final manuscript with 80 participants will be solid grounds for outright rejection.  Finding a null result after a good faith attempt that was clearly outlined before data collection will not be grounds for rejection.  Likewise, learning that a previously used measure had subpar psychometric properties in a new and larger sample is valuable information even if it might explain a failure to find predicted effects.  At the very least, such information about how measures perform in new samples provides important technical insights.

Why Do This?

Umm, replication is an important part of science?!?! But beyond that truism, I am excited to learn what happens when we try to organize a modest effort to replicate specific findings in personality psychology. Personality psychologists use a diverse set of methods beyond experiments such as diary and panel studies.  This creates special challenges and opportunities when it comes to replication efforts.  Thus, I see this special issue as a potential chance to learn how replication efforts can be adapted to the diverse kinds of studies conducted by personality researchers.

For example, multiple research groups might have broadly similar datasets that target similar constructs but with specific differences when it comes to the measures, timing of assessments, underlying populations, sample sizes, etc. This requires careful attention to methodological similarities and differences when it comes to interpreting whether particular findings converge across the different datasets.  It would be ideal if researchers paid some attention to these issues before the results of the investigations were known.  Otherwise, there might be a tendency to accentuate differences when results fail to converge. This is one of the reasons why we will entertain proposals that describe replication attempts using existing datasets.

I also think it is important to address a perception that Michael Inzlicht described in a recent blog post.  He suggested that some social psychologists believe that some personality psychologists are using current controversies in the field as a way to get payback for the person-situation debate.  In light of this perception, I think it is important for more personality researchers to engage in formal replication efforts of the sort that have been prominent in social psychology.  This can help counter perceptions that personality researchers are primarily interested in schadenfreude and criticizing our sibling discipline. Hopefully, the cold war is over.

[As an aside, I think it the current handwringing about replication and scientific integrity transcends social and personality psychology.  Moreover, the fates of personality and social psychology are intertwined given the way many departments and journals are structured.  Social and personality psychology (to the extent that there is a difference) each benefit when the other field is vibrant, replicable, and methodologically rigorous.  Few outside of our world make big distinctions between social and personality researchers so we all stand to lose if decision makers like funders and university administrators decide to discount the field over concerns about scientific rigor.]

What Kinds of Replication Studies Are Ideal?

In a nut-shell: High quality replications of interesting and important studies in personality psychology.  To offer a potentially self-serving example, the recent replication of the association between I-words and narcissism is a good example.  The original study was relatively well-cited but it was not particularly strong in terms of sample size.  There were few convincing replications in the literature and it was often accepted as an article of faith that the finding was robust.  Thus, there was value in gaining more knowledge  about the underlying effect size(s) and testing to see whether the basic finding was actually robust.  Studies like that one as well as more modest contributions are welcome.  Personally, I would like more information about how well interactions between personality attributes and experimental manipulations tend to replicate especially when the original studies are seemingly underpowered.

What Don’t You Want to See?

I don’t want to single out too many specific topics or limit submissions but I can think of a few topics that are probably not going to be well received.  For instance, I am not sure we need to publish tons of replications showing there are 3 to 6 basic trait domains using data from college students.  Likewise, I am not sure we need more evidence that skilled factor analysts can find indications of a GFP (or general component) in a personality inventory.  Replications of well-worn and intensely studied topics are not good candidates for this special issue. The point is to get more data on interesting and understudied topics in personality psychology.

Final Thought

I hope we get a number of good submissions and the field learns something new in terms of specific findings. I also hope we also gain insights about the advantages and disadvantages of different approaches to replication in personality psychology.

Is Obama a Narcissist?

Warning: For educational purposes only. I am a personality researcher not a political scientist!

Short Answer: Probably Not.

Longer Answer: There has been a fair bit of discussion about narcissism and the current president (see here for example). Some of this stemmed from recent claims about his use of first person pronouns (i.e., a purported use of greater “I-talk”). A big problem with that line of reasoning is that the empirical evidence linking narcissism with I-talk is surprisingly shaky.  Thus, Obama’s use of pronouns is probably not very useful when it comes to making inferences about his levels of narcissism.

Perhaps a better way to gauge Obama’s level of narcissism is to see how well his personality profile matches a profile typical of someone with Narcissistic Personality Disorder (NPD).  The good news is that we have such a personality profile for NPD thanks to Lynam and Widiger (2001).  Those researchers asked 12 experts to describe the prototype case of NPD in terms of the facets of the Five-Factor Model (FFM). In general, they found that someone with NPD could be characterized as having the following characteristics…

High Levels: Assertiveness, Excitement Seeking, Hostility, and Openness to Actions (i.e., a willingness to try new things)

Low Levels: Agreeableness (all aspects), Self-Consciousness, Warmth, Openness to Feelings (i.e., a lack of awareness of one’s emotional state and some elements of empathy)

The trickier issue is finding good data on Obama’s actual personality. My former students Edward Witt and Robert Ackerman did some research on this topic that can be used as a starting point.  They had 86 college students (51 liberals and 35 conservatives) rate Obama’s personality using the same dimensions Lynam and Widiger used to generate the NPD profile.  We can use the ratings of Obama averaged across the 86 different students as an informant report of his personality.

Note: I know this approach is far from perfect and it would be ideal to have non-partisan expert raters of Obama’s personality (specifically the 30 facets of the FFM). If you have such a dataset, send it my way (self-reported data from the POTUS would be welcome too)! Moreover, Witt and Ackerman found that liberals and conservatives had some differences when it came to rating Obama’s personality.  For example, conservatives saw him higher in hostility and lower in warmth than liberals.  Thus, the profile I am using might tend to have a rosier view of Obama’s personality than a profile generated from another sample with more conservatives (send me such a dataset if you have it!). An extremely liberal sample might generate an even more positive profile than what they obtained.

With those caveats out of the way, the next step is simple: Calculate the Intraclass Correlation Coefficient (ICC) between his informant-rated profile and the profile of the prototypic person with NPD. The answer is basically zero (ICC = -.08; Pearson’s r = .06).  In short, I don’t think Obama fits the bill of the prototypical narcissist. More data are always welcome but I would be somewhat surprised if Obama’s profile matched well with the profile of a quintessential narcissist in another dataset.

As an aside, Ashley Watts and colleagues evaluated levels of narcissism in the first 43 presidents and they used historical experts to rate presidential personalities. Their paper is extremely interesting and well worth reading. They found these five presidents had personalities with the highest relative approximation to the prototype of NPD: LBJ, Nixon, Jackson, Johnson, and Arthur.  The five lowest presidents were Lincoln, Fillmore, Grant, McKinley, and Monroe. (See Table 4 in their report).

Using data from the Watts et al. paper, I computed standardized scores for the estimates of Obama’s grandiose and vulnerable narcissism levels from the Witt and Ackerman profile. These scores indicated Obama was below average by over .50 SDs for both dimensions (Grandiose: -.70; Vulnerable: -.63).   The big caveat here is that the personality ratings for Obama were provided by undergrads and the Watts et al. data were from experts.  Again, however, there were no indications that Obama is especially narcissistic compared to the other presidents.

Thanks to Robert Ackerman, Matthias Mehl, Rich Slatcher, Ashley Watts, and Edward Witt for insights that helped with this post.

Postscript 1:  This is light hearted post.  However, the procedures I used could make for a fun classroom project for Personality Psychology 101.  Have the students rate a focal individual such as Obama or a character from TV, movies, etc. and then compare the consensus profile to the PD profiles. I have all of the materials to do this if you want them.  The variance in the ratings across students is also potentially interesting.

Postscript 2: Using this same general procedure, Edward Witt, Christopher Hopwood, and I concluded that Anakin Skywalker did not strongly match the profile of someone with BPD and neither did Darth Vader (counter to these speculations).  They were more like successful psychopaths.  But that is a blog post for another day!

Silly Questions to Ask Children

I have been working on a project designed to measure a certain individual difference in children as early as 5 years of age. There are a number of concerns about the use of self-reports with young children so this has been an overarching concern in this project. To partially address this issue, we came up with a handful of items that would be useful for detecting unusual responses in children. These items might be used to identify children who did not understand how to use the response scale or flag children who were giving responses that would be considered invalid.  There is a cottage industry of these kinds of scales for adult personality inventories but fewer options for kids.  (And yes I know about those controversies in the literature over these kinds of scales.)

Truth be told, I like writing items and I think this is true for many researchers. I am curious about how people respond to all sorts of questions especially silly ones.  It is even better if the silly ones tap something interesting about personality or ask participants about dinosaurs.

Here are a few sample items:

1. How do you feel about getting shots from the doctor?

2. How do you feel about getting presents for your birthday?

And my favorite item ever….

3. How would you feel about being eaten by a T-Rex?

The fact that we have asked over 800 kids this last question is sort of ridiculous but it makes me happy. I predicted that kids should report negative responses for this one. This was true for the most part but 11.3% of the sample registered a positive response. In fact, the T-Rex item sparked a heated conversation in my household this morning. My spouse (AD) is a former school teacher and AD thought some kids might think it was cool to see a T-Rex. She thought it was a bad item. My youngest child (SD) thought it would be bad to be eaten by said T-Rex even if it was cool to see one in person. I think SD was on my side.

I have had enough controversy over the past few weeks so I wanted to move on from this breakfast conversation. Thus, I did what any sensible academic would do – I equivocated. I acknowledged that items usually reflect multiple sources of variance and all have some degree of error. I also conceded that this item might pick up on sensation seeking tendencies. There could be some kids who might find it thrilling to be eaten by a T-Rex.Then I took SD to school and cried over a large cup of coffee.

But I still like this item and I think most people would think it would suck to be eaten by a T-Rex. It might also be fun to crowd source the writing of additional items. Feel free to make suggestions.

PS: I want to acknowledge my two collaborators on this project – Michelle Harris and Kali Trzesniewski. They did all of the hard work collecting these data.

Random Reflections on Ceiling Effects and Replication Studies

In a blog post from December of 2013, I  described our attempts to replicate two studies testing the claim that priming cleanliness makes participants less judgmental on a series of 6 moral vignettes. My original post has recently received criticism for my timing and my tone. In terms of timing, I blogged about a paper that was accepted for publication and there was no embargo on the work. In terms of tone, I tried to ground everything I wrote with data but I also editorialized a bit.  It can be hard to know what might be taken as offensive when you are describing an unsuccessful replication attempt. The title (“Go Big or Go Home – A Recent Replication Attempt”) might have been off putting in hindsight. In the grand scope of discourse in the real world, however, I think my original blog post was fairly tame.

Most importantly: I was explicit in the original post about the need for more research. I will state again for the record: I don’t think this matter has been settled and more research is needed. We also said this in the Social Psychology paper.  It should be widely understood that no single study is ever definitive.

As noted in the current news article for Science about the special issue of Social Psychology, there is some debate about ceiling effects with our replication studies. We discuss this issue at some length in our rejoinder to the commentary. I will provide some additional context and observations in this post.  Readers just interested in gory details can read #4. This is a long and tedious post so I apologize in advance.

1. The original studies had relatively small sample sizes. There were 40 total participants in the original scrambled sentence study (Study 1) and 43 total participants in the original hand washing study (Study 2). It takes 26 participants per cell to have an approximately 80% change to detect a d of .80 with alpha set to .05 using a two-tailed significance test.  A d of .80 would be considered a large effect size in many areas of psychology.

2. The overall composite did not attain statistical significance using the conventional alpha level of .05 with a two-tailed test in the original Study 1 (p = .064).  (I have no special love for NHST but many people in the literature rely on this tool for drawing inferences).  Only one of the six vignettes attained statistical significance at the p < .05 level in the original Study 1 (Kitten). Two different vignettes attained statistical significance in the original Study 2 (Trolley and Wallet).  The kitten vignette did not. Effect size estimates for these contrasts are in our report.  Given the sample sizes, these estimates were large but they had wide confidence intervals.

3. The dependent variables were based on moral vignettes created for a different study originally conducted at the University of Virginia.These measures were originally pilot tested with 8 participants according to a PSPB paper (Schnall, Haidt, Clore, & Jordan, 2008, p.1100). College students from the United States were used to develop the measures that served as the dependent variables. There was no a priori reason to think the measures would “not work” for college students from Michigan. We registered our replication plan and Dr. Schnall was a reviewer on the proposal.  No special concerns were raised about our procedures or the nature of our sample. Our sample sizes provided over .99 power to detect the original effect size estimates.

4. The composite DVs were calculated by averaging across the six vignettes and those variables had fairly normal distributions in our studies.  In Study 1, the mean for our control condition was 6.48 (SD = 1.13, Median = 6.67, Skewness = -.55, Kurtosis = -.24, n = 102) whereas it was 5.81 in the original paper (SD = 1.47, Median = 5.67, Skewness = -.33, Kurtosis = -.44, n = 20).   The average was higher in our sample but the scores theoretically range from 0 to 9.  We found no evidence of a priming effect using the composites in Study 1.   In Study 2, the mean for our control condition was 5.65 (SD = 0.59, Median = 5.67, Skewness = -.31, Kurtosis = -.19, n = 68) whereas it was 5.43 in the original paper (SD = 0.69, Median = 5.67, Skewness = -1.58, Kurtosis = 3.45, n = 22).  The scores theoretically range from 1 to 7.  We found no hand washing effect using the composites in Study 2.  These descriptive statistics provide additional context for the discussion about ceiling effects.  The raw data are posted and critical readers can and should verify these numbers.  I have a standing policy to donate $20 to the charity of choice for the first person who notes a significant (!) statistical mistake in my blog posts.

Schnall (2014) undertook a fairly intense screening of our data.  This is healthy for the field and the open science framework facilitated this inquiry because we were required to post the data. Dr. Schnall noted that the responses to the individual moral vignettes tended toward the extreme in our samples.  I think the underlying claim is that students in our samples were so moralistic that any cleanliness priming effects could not have overpowered their pre-existing moral convictions.  This is what the ceiling effect argument translates to in real world terms: The experiments could not have worked in Michigan because the samples tended to have a particular mindset.

It might be helpful to be a little more concrete about the distributions.  For many of the individual vignettes, the “Extremely Wrong” option was a common response.  Below is a summary of the six vignettes and some descriptive information about the data from the control conditions of Study 1 across the two studies (ours and the original).  I think readers will have to judge for themselves as to what kinds of distributions to expect from samples of college students.  Depending on your level of self-righteousness, these results could be viewed positively or negatively.   Remember, we used their original materials.

  • Dog (53% versus 30%):  Morality of eating a pet dog that was just killed in a car accident.
  • Trolley (2% versus 5%):  Morality of killing one person in the classic trolley dilemma.
  • Wallet (44% versus 20%): Morality of keeping cash from a wallet found on the street.
  • Plane (43% versus 30%): Morality of killing an injured boy to save yourself and another person from starving after a plane crash.
  • Resume (29% versus 15%):  Morality of enhancing qualifications on a resume.
  • Kitten (56% versus 70%): Morality of using a kitten for sexual gratification.

Note: All comparisons are from the Control conditions for our replication Study 1 compared to Study 1 in Schnall et al. (2008).  Percentages reflect the proportion of the sample selecting the “extremely wrong” option (i.e., selecting the “9” on the original 0 to 9 scale).  For example, 53% of our participants thought it was extremely wrong for Frank to eat his dead dog for dinner whereas 30% of the participants in the original study provided that response.

To recap, we did not find evidence for the predicted effects and we basically concluded more research was necessary.  Variable distributions are useful pieces of information and non-parametric tests were consistent with the standard t-tests we used in the paper. Moreover, their kitten distribution was at least as extreme as ours, and yet they found the predicted result on this particular vignette in Study 1. Thus, I worry that any ceiling argument only applies when the results are counter to the original predictions. 

One reading of our null results is that there are unknown moderators of the cleanliness priming effects. We have tested for some moderators (e.g., private body consciousness, political orientation) in our replication report and rejoinder, but there could be other possibilities. For example, sample characteristics can make it difficult to find the predicted cleanliness priming results with particular measures.  If researchers have a sample of excessively moralistic/judgmental students who think using kittens for sexual gratification is extremely wrong, then cleaning primes may not be terribly effective at modulating their views. Perhaps a different set of vignettes that are more morally ambiguous (say more in line with the classic trolley problem) will show the predicted effects.  This is something to be tested in future research.

The bottom line for me is that we followed through on our research proposal and we reported our results.  The raw data were posted.  We have no control over the distributions. At the very least, researchers might need to worry about using this particular measure in the future based on our replication efforts. In short, the field may have learned something about how to test these ideas in the future.  In the end, I come full circle to the original conclusion in the December blog post– More research is needed.  


I am sure reactions to our work and the respective back-and-forth will break on partisan grounds.  The “everything is fine” crew will believe that Dr. Schnall demolished our work whereas the “replication is important” crew will think we raised good points.  This is all fine and good as it relates to the insider baseball and sort of political theater that exists in our world.  However, I hope these pieces do not just create a bad taste in people’s mouth.  I feel badly that this single paper and exchange have diverted attention from the important example of reform taken by Lakens and Nosek.  They are helping to shape the broader narrative about how to do things differently in psychological science.


Quick Update on Timelines (23 May 2014)

David sent Dr. Schnall the paper we submitted to the editors on 28 October 2013 with a link to the raw materials. He wrote “I’ve attached the replication manuscript we submitted to Social Psychology based on our results to give you a heads up on what we found.”  He added: “If you have time, we feel it would be helpful to hear your opinions on our replication attempt, to shed some light on what kind of hidden moderators or other variables might be at play here.”

Dr. Schnall emailed back on 28 October 2013 asking for 2 weeks to review the material before we proceeded. David emailed back on 31 October 2013 apologizing for any miscommunication and that we had submitted the paper. He added we were still interested in her thoughts.

That was the end of our exchanges. We learned about the ceiling effect concern when we received the commentary in early March of 2014.

Things that make me skeptical…

Simine Vazire crafted a thought provoking blog post about how some in the field respond to counter-intuitive findings.  One common reaction among critics of this kind of research is to claim that the results are unbelievable.   This reaction seems to fit with the maxim that extraordinary claims should require extraordinary evidence (AKA the Sagan doctrine).  For example, the standard of evidence needed to support the claim that a high-calorie/low nutrient diet coupled with a sedentary life style is negatively associated with morbidity might be different than the standard of proof needed to support the claim that attending class is positively associated with exam performance.  One claim seems far more extraordinary than the other.  Put another way: Prior subjective beliefs about the truthiness of these claims might differ and thus the research evidence needed to modify these pre-existing beliefs should be different.

I like the Sagan doctrine but I think we can all appreciate the difficulties that arise when trying to determine standards of evidence needed to justify a particular research claim.  There are no easy answers except for the tried and true response that all scientific claims should be thoroughly evaluated by multiple teams using strong methods and multiple operational definitions of the underlying constructs.  But this is a “long term” perspective and provides little guidance when trying to interpret any single study or package of studies.  Except that it does, sort of.  A long term perspective means that most findings should be viewed with a big grain of salt, at least initially.  Skepticism is a virtue (and I think this is one of the overarching themes of Simine’s blog posts thus far).   However, skepticism does not preclude publication and even some initial excitement about an idea.  It simply precludes making bold and definitive statements based on initial results with unknown generality.  More research is needed because of the inherent uncertainty of scientific claims. To quote a lesser known U2 lyric – “Uncertainty can be a guiding light”.

Anyways, I will admit to having the “unbelievable” reaction to a number of research studies.  However, my reaction usually springs from a different set of concerns rather than just a suspicion that a particular claim is counter to my own intuitions.  I am fairly skeptical of my own intuitions. I am also fairly skeptical of the intuitions of others.  And I still find lots of studies literally unbelievable.

Here is a partial list of the reasons for my skepticism. (Note: These points cover well worn ground so feel free to ignore if it sounds like I am beating a dead horse!)

1.  Large effect sizes coupled with small sample sizes.  Believe it or not, there is guidance in the literature to help generate an expected value for research findings in “soft” psychology.  A reasonable number of effects are between .20 and .30 in the r metric and relatively few are above .50 (see Hemphill, 2003; Richard et al., 2003).   Accordingly, when I read studies that generate “largish” effect size estimates (i.e., r ≥ |.40|), I tend to be skeptical.  I think an effect size estimate of .50 is in fact an extraordinary claim.

My skepticism gets compounded when the sample sizes are small and thus the confidence intervals are wide.  This means that the published findings are consistent with a wide range of plausible effect sizes so that any inference about the underlying effect size is not terribly constrained.  The point estimates are not precise. Authors might be excited about the .50 correlation but the 95% CI suggests that the data are actually consistent with anything from a tiny effect to a massive effect.  Frankly, I also hate it when the lower bound of the CI falls just slightly above 0 and thus the p value is just slightly below .05.  It makes me suspect p-hacking was involved.   (Sorry, I said it!)

2. Conceptual replications but no direct replications.  The multi-study package common to such prestigious outlets like PS or JPSP has drawn critical attention in the last 3 or so years.  Although these packages seem persuasive on the surface, they often show hints of publication bias on closer inspection.   The worry is that the original researchers actually conducted a number of related studies and only those that worked were published.   Thus, the published package reflects a biased sampling of the entire body of studies.  The ones that failed to support the general idea were left to languish in the proverbial file drawer.  This generates inflated effect size estimates and makes the case for an effect seem far more compelling than it should be in light of all of the evidence.  Given these issues, I tend to want to see a package of studies that reports both direct and conceptual replications.  If I see only conceptual replications, I get skeptical.  This is compounded when each study itself has a modest sample size with a relatively large effect size estimate that produces a 95% CI that gets quite close to 0 (see Point #1).

3. Breathless press releases.  Members of some of my least favorite crews in psychology seem to create press releases for every paper they publish.  (Of course, my perceptions could be biased!).  At any rate, press releases are designed by the university PR office to get media attention.  The PR office is filled with smart people trained to draw positive attention to the university using the popular media.  I do not have a problem with this objective per se.  However, I do not think this should be the primary mission of the social scientist.  Sometimes good science is only interesting to the scientific community.  I get skeptical when the press release makes the paper seem like it was the most groundbreaking research in all of psychology.  I also get skeptical when the press release draws strong real world implications from fairly constrained lab studies.  It makes me think the researchers overlooked the thorny issues with generalized causal inference.

I worry about saying this but I will put it out there – I suspect that some press releases were envisioned before the research was even conducted.  This is probably an unfair reaction to many press releases but at least I am being honest.  So I get skeptical when there is a big disconnect between the press release and the underlying research like when sweeping claims are made on a study of say 37 kids.  Or big claims about money and happiness are drawn from priming studies involving pictures of money.

I would be interested to hear what makes others skeptical of published claims.


A little background tangential to the main points of this post:

One way to generate press excitement is to quote the researcher(s) as being shocked by the results.  Unfortunately, I often think some of shock and awe expressed in these press releases is disingenuous.  Why?  Researchers designed the studies to test specific predictions in the first place.  So they had some expectations as to what they would find.  Alternatively, if someone did obtain a shocking initial result, they should conduct multiple direct replications to make sure the original result was not simply a false positive.  This kind of narrative is not usually part of the press release.

I also hate to read press releases that generalize the underlying results well beyond the initial design and purpose of the research.  Sometimes the real world implications of experiments are just not clear.  In fact, not all research is designed to have real world implications.  If we take the classic Mook reading at face value, lots of experimental research in psychology has no clear real world implications.   This is perfectly OK but it might make the findings less interesting to the general public.  Or at least it probably requires more background knowledge to make the implications interesting.  Such background is beyond the scope of the press release.