Is Obama a Narcissist?

Warning: For educational purposes only. I am a personality researcher, not a political scientist!

Short Answer: Probably Not.

Longer Answer: There has been a fair bit of discussion about narcissism and the current president (see here for example). Some of this stemmed from recent claims about his use of first person pronouns (i.e., a purported use of greater “I-talk”). A big problem with that line of reasoning is that the empirical evidence linking narcissism with I-talk is surprisingly shaky.  Thus, Obama’s use of pronouns is probably not very useful when it comes to making inferences about his levels of narcissism.

Perhaps a better way to gauge Obama’s level of narcissism is to see how well his personality profile matches a profile typical of someone with Narcissistic Personality Disorder (NPD).  The good news is that we have such a personality profile for NPD thanks to Lynam and Widiger (2001).  Those researchers asked 12 experts to describe the prototype case of NPD in terms of the facets of the Five-Factor Model (FFM). In general, they found that someone with NPD could be characterized as having the following characteristics…

High Levels: Assertiveness, Excitement Seeking, Hostility, and Openness to Actions (i.e., a willingness to try new things)

Low Levels: Agreeableness (all aspects), Self-Consciousness, Warmth, Openness to Feelings (i.e., a lack of awareness of one’s emotional state and some elements of empathy)

The trickier issue is finding good data on Obama’s actual personality. My former students Edward Witt and Robert Ackerman did some research on this topic that can be used as a starting point.  They had 86 college students (51 liberals and 35 conservatives) rate Obama’s personality using the same dimensions Lynam and Widiger used to generate the NPD profile.  We can use the ratings of Obama averaged across the 86 different students as an informant report of his personality.

Note: I know this approach is far from perfect and it would be ideal to have non-partisan expert raters of Obama’s personality (specifically the 30 facets of the FFM). If you have such a dataset, send it my way (self-reported data from the POTUS would be welcome too)! Moreover, Witt and Ackerman found that liberals and conservatives had some differences when it came to rating Obama’s personality.  For example, conservatives saw him higher in hostility and lower in warmth than liberals.  Thus, the profile I am using might tend to have a rosier view of Obama’s personality than a profile generated from another sample with more conservatives (send me such a dataset if you have it!). An extremely liberal sample might generate an even more positive profile than what they obtained.

With those caveats out of the way, the next step is simple: Calculate the Intraclass Correlation Coefficient (ICC) between his informant-rated profile and the profile of the prototypic person with NPD. The answer is basically zero (ICC = -.08; Pearson’s r = .06).  In short, I don’t think Obama fits the bill of the prototypical narcissist. More data are always welcome but I would be somewhat surprised if Obama’s profile matched well with the profile of a quintessential narcissist in another dataset.
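The profile-matching step can be sketched in a few lines. This is a minimal sketch, assuming the double-entry intraclass correlation that is often used for profile similarity; the post does not say which ICC variant was computed. The function names and the facet scores below are hypothetical and are not the actual Obama or NPD ratings.

```python
from math import sqrt


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def double_entry_icc(x, y):
    """Double-entry ICC: Pearson r after entering each pair twice, (x, y) and (y, x).

    Assumption: this is one common profile-similarity index, not necessarily
    the one used in the post.
    """
    return pearson_r(list(x) + list(y), list(y) + list(x))


# Hypothetical facet scores (NOT real data): same shape, different elevation.
prototype = [4.0, 4.5, 1.5, 2.0, 4.2, 1.8]
target = [3.0, 3.5, 0.5, 1.0, 3.2, 0.8]

shape_only = pearson_r(prototype, target)           # ignores the elevation gap
shape_and_level = double_entry_icc(prototype, target)  # penalizes the elevation gap
print(round(shape_only, 2), round(shape_and_level, 2))
```

The point of the design choice is that Pearson's r only captures similarity in the shape of two profiles, whereas an ICC of this kind also penalizes differences in overall elevation, which is one reason the two indices can diverge.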

As an aside, Ashley Watts and colleagues evaluated levels of narcissism in the first 43 presidents and they used historical experts to rate presidential personalities. Their paper is extremely interesting and well worth reading. They found these five presidents had personalities with the highest relative approximation to the prototype of NPD: LBJ, Nixon, Jackson, Johnson, and Arthur.  The five lowest presidents were Lincoln, Fillmore, Grant, McKinley, and Monroe. (See Table 4 in their report).

Using data from the Watts et al. paper, I computed standardized scores for the estimates of Obama’s grandiose and vulnerable narcissism levels from the Witt and Ackerman profile. These scores indicated Obama was below average by over .50 SDs for both dimensions (Grandiose: -.70; Vulnerable: -.63).   The big caveat here is that the personality ratings for Obama were provided by undergrads and the Watts et al. data were from experts.  Again, however, there were no indications that Obama is especially narcissistic compared to the other presidents.

Thanks to Robert Ackerman, Matthias Mehl, Rich Slatcher, Ashley Watts, and Edward Witt for insights that helped with this post.

Postscript 1:  This is a lighthearted post.  However, the procedures I used could make for a fun classroom project for Personality Psychology 101.  Have the students rate a focal individual such as Obama or a character from TV, movies, etc. and then compare the consensus profile to the PD profiles. I have all of the materials to do this if you want them.  The variance in the ratings across students is also potentially interesting.

Postscript 2: Using this same general procedure, Edward Witt, Christopher Hopwood, and I concluded that Anakin Skywalker did not strongly match the profile of someone with BPD and neither did Darth Vader (counter to these speculations).  They were more like successful psychopaths.  But that is a blog post for another day!

More Null Results in Psychological Science — Comments on McDonald et al. (2014) and Crisp and Birtel (2014)

Full Disclosure:  I am second author on the McDonald et al. (2014) commentary.

Some of you may have seen that Psychological Science published our commentary on the Birtel and Crisp (2012) paper.  Essentially we tried to replicate two of their studies with larger sample sizes (29 versus 240 and 32 versus 175, respectively) and obtained much lower effect size estimates. It is exciting that Psychological Science published our work and I think this is a hint of positive changes for the field.  Hopefully nothing I write in this post undercuts that overarching message.

I read the Crisp and Birtel response and I had a set of responses (shocking, I know!). I think it is fair that they get the last word in print but I had some reactions that I wanted to share.  Thus, I will air a few in this blog post. Before diving into issues, I want to reiterate the basic take home message of McDonald et al. (2014):

“Failures to replicate add important information to the literature and should be a normal part of the scientific enterprise. The current study suggests that more work is needed before Birtel and Crisp’s procedures are widely implemented. Interventions for treating prejudice may require more precise manipulations along with rigorous evaluation using large sample sizes.” (p. xx)

1.  Can we get a mulligan on our title? We might want to revise the title of our commentary to make it clear that our efforts applied to only two specific findings in the original Birtel and Crisp (2012) paper. I think we were fairly circumscribed in the text itself but the title might have opened the door for how Crisp and Birtel (2014) responded.  They basically thanked us for our efforts and pointed out that our two difficulties say nothing about the entire imagined contact hypothesis.  They even argued that we “overgeneralized” our findings to the entire imagined contact literature.  To be frank, I do not think they were being charitable to our piece with this criticism because we did not make this claim in the text.  But titles are important and our title might have suggested some sort of overgeneralization.  I will let readers make their own judgments.  Regardless, I wish we had made the title more focused.

2.  If you really believe the d is somewhere around .35, why were the sample sizes so small in the first place?  A major substantive point in the Crisp and Birtel (2014) response is that the overall d for the imagined contact literature is somewhere around .35 based on a recent Miles and Crisp (2014) meta-analysis.  That is a reasonable point but I think it actually undercuts the Birtel and Crisp (2012) paper and makes our take home point for us (i.e., the importance of using larger sample sizes in this literature).  None of the original Birtel and Crisp (2012) studies had anywhere near the power to detect a population d of .35.  If we take the simple two-group independent t-test design, the power requirements for .80 suggest the need for about 260 participants (130 in each group).   The largest sample size in Birtel and Crisp (2012) was 32.
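The arithmetic behind the "about 260 participants" figure can be sketched as follows. This is a rough sketch, assuming a two-sided test at alpha = .05 and a normal approximation to the t distribution, so it lands a participant or so per group below the exact t-based answer. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-group comparison (normal approximation)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate two-sided power with n participants in each of two groups."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    noncentrality = d * sqrt(n / 2)
    return (STD_NORMAL.cdf(noncentrality - z_alpha)
            + STD_NORMAL.cdf(-noncentrality - z_alpha))


needed = n_per_group(0.35)            # per-group n for 80% power at d = .35
power_small = approx_power(0.35, 16)  # a total N of 32, assumed split evenly
print(needed, round(power_small, 2))
```

Under these assumptions, a study with 16 people per group has well under a one-in-four chance of detecting a population d of .35.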

3. What about the ManyLabs paper?  The now famous ManyLabs paper of Klein et al. (in press) reports a replication attempt of an imagined contact study (Study 1 in Husnu & Crisp, 2010).  The ManyLabs effort yielded a much lower effect size estimate (d = .13, N = 6,336) than the original report (d = .86 or .84 as reported in Miles & Crisp, 2014; N = 33).  This is quite similar to the pattern we found in our work.  Thus, I think there is something of a decline effect in operation.  There is a big difference in interpretation between a d of .80 and a d around .15.  This should be worrisome to the field especially when researchers begin to think of the applied implications of this kind of work.

4. What about the Miles and Crisp Meta-Analysis (2014)? I took a serious look at the Miles and Crisp meta-analysis and I basically came away with the sinking feeling that much more research needs to be done to establish the magnitude of the imagined contact effects.  Many of the studies used in the meta-analysis were grossly underpowered.  There were 71 studies and only 2 had sample sizes above 260 (the threshold for having a good chance to detect a d = .35 effect using the standard between-participants design).  Those two large studies yielded basically null effects for the imagined contact hypothesis (d = .02 and .05, ns = 508 and 488, respectively). The average sample size of the studies in the meta-analysis was 81 (81.27 to be precise) and the median was 61 (Min. = 23 and Max. = 508).  A sample size of 123 was in the 90th percentile (i.e., 90% of the samples were below 123) and nearly 80% of the studies had sample sizes below 100.

Miles and Crisp (2014) were worried about sample size but perhaps not in the ways that I might have liked.   Here is what they wrote: “However, we observed that two studies had a sample size over 6 times the average (Chen & Mackie, 2013; Lai et al., 2013). To ensure that these studies did not contribute disproportionately to the summary effect size, we capped their sample size at 180 (the size of the next largest study) when computing the standard error variable used to weight each effect size.” (p. 13).  Others can weigh in about this strategy but I tend to want to let the sample sizes “speak for themselves” in the analyses, especially when using a random-effects meta-analysis model.
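To get a feel for what capping does, here is a minimal sketch using the usual large-sample variance of d and simple inverse-variance weights. The equal split into two groups and the function names are assumptions; I do not know the exact formula Miles and Crisp used, and in a random-effects model the between-study variance is added to each study's variance, which dampens the difference.

```python
def variance_of_d(d, total_n):
    """Large-sample variance of a standardized mean difference.

    Assumption: two groups of equal size (total_n / 2 each).
    """
    n1 = n2 = total_n / 2
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def inverse_variance_weight(d, total_n):
    """Fixed-effect style weight: the reciprocal of the sampling variance."""
    return 1 / variance_of_d(d, total_n)


# d = .02 with N = 508 comes from the post; 180 is the cap quoted above.
weight_actual = inverse_variance_weight(0.02, 508)
weight_capped = inverse_variance_weight(0.02, 180)
ratio = weight_actual / weight_capped
print(round(weight_actual, 1), round(weight_capped, 1), round(ratio, 2))
```

Under these assumptions the capped study carries a bit more than a third of the weight it would otherwise receive, which matters when the large studies are the ones showing near-zero effects.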

What’s it all mean?

Not to bring out the cliché but I think much more work needs to be done here.  As it stands, I think the d = .35 imagined contact effect size estimate is probably upwardly biased.  Indeed, Miles and Crisp (2014) found evidence of publication bias such that unpublished studies yielded a smaller overall effect size estimate than published studies (but the unpublished studies still produce an estimate that is reliably larger than zero).  However this shakes out, researchers are well advised to use much larger sample sizes than tend to characterize this literature based on my summary of the sample sizes in Miles and Crisp (2014).  I also think more work needs to be done to evaluate the specific Birtel and Crisp (2012) effects.  We now have collected two more unpublished studies with even bigger sample sizes and we have yet to get effect sizes that approximate the original report.

I want to close by trying to clarify my position.  I am not saying that the effect sizes in question are zero or that this is an unimportant research area.  On the contrary, I think this is an incredibly important topic and thus it requires even greater attention to statistical power and precision.


Updated 26 Feb 2014: I corrected the sample size for Study 1 from 204 to 240.

I don’t care about effect sizes — I only care about the direction of the results when I conduct my experiments

This claim (or some variant) has been invoked by a few researchers when they take a position on issues of replication and the general purpose of research.  For example, I have heard this platitude from some quarters when they were explaining why they are unconcerned when an original finding with a d of 1.2 reduces to a d of .12 upon exact replications. Someone recently asked me for advice on how to respond to someone making the above claim and I struggled a bit.  My first response was to dig up these two quotes and call it a day.

Cohen (1990): “Next, I have learned and taught that the primary product of research inquiry is one or more measures of effect size, not p values.” (p. 1310).

Abelson (1995): “However, as social scientists move gradually away from reliance on single studies and obsession with null hypothesis testing, effect size measures will become more and more popular” (p. 47).

But I decided to try a bit harder so here are my random thoughts on how to respond to the above claim.

1.  Assume this person is making a claim about the utility of NHST. 

One retort is to ask how the researcher judges the outcome of their experiments.  They need a method to distinguish the “chance” directional hit from the “real” directional hit.  Often the preferred tool is NHST such that the researcher will judge that their experiment produced evidence consistent with their theory (or it failed to refute their theory) if the direction of the difference/association was consistent with their prediction and the p value was statistically significant at some level (say an alpha of .05).  Unfortunately, the beloved p-value is determined, in part, by the effect size.

To quote from Rosenthal and Rosnow (2008, p. 55):

Because a complete account of “the results of a study” requires that the researcher report not just the p value but also the effect size, it is important to understand the relationship between these two quantities.  The general relationship…is…Significance test = Size of effect * Size of study.

So if you care about the p value, you should care (at least somewhat) about the effect size.  Why? The researcher gets to pick the size of the study so the critical unknown variable is the effect size.  It is well known that given a large enough N, any trivial difference or non-zero correlation will attain significance (see Cohen, 1994, p. 1000 under the heading “The Nil Hypothesis”). Cohen notes that this point was understood as far back as 1938.  Social psychologists can look to Abelson (1995) for a discussion of this point as well (see p. 40).
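The "size of effect times size of study" relationship is easy to see with a few lines of code. This is a minimal sketch for a two-group design with equal group sizes, where the test statistic is roughly d * sqrt(N) / 2 for total sample size N. It uses a normal approximation to the t distribution, and the function name and the example d of .05 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_two_sided_p(d, total_n):
    """Approximate two-sided p value for a two-group comparison.

    Assumptions: equal group sizes and a normal approximation to t.
    """
    z = d * sqrt(total_n) / 2  # size of effect * size of study
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same trivial effect (hypothetical d = .05) at increasing sample sizes.
for total_n in (100, 1_000, 10_000, 100_000):
    print(total_n, round(approx_two_sided_p(0.05, total_n), 4))
```

With d held fixed, the p value shrinks as N grows, so a significant result on its own says little about whether the effect is large enough to matter.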

To further understand the inherent limitations of this NHST-bound approach, we can (and should) quote from the book of Paul Meehl (Chapter 1978).

Putting it crudely, if you have enough cases and your measures are not totally unreliable, the null hypothesis will always be falsified, regardless of the truth of the substantive theory. Of course, it could be falsified in the wrong direction, which means that as the power improves, the probability of a corroborative result approaches one-half. However, if the theory has no verisimilitude – such that we can imagine, so to speak, picking our empirical results randomly out of a directional hat apart from any theory – the probability of a refuting by getting a significant difference in the wrong direction also approaches one-half.  Obviously, this is quite unlike the situation desired from either a Bayesian, a Popperian, or a commonsense scientific standpoint.  (Meehl, 1978, p. 822).

Meehl gets even more pointed (p. 823):

I am not a statistician, and I am not making a statistical complaint. I am making a philosophical complaint or, if you prefer, a complaint in the domain of scientific method. I suggest that when a reviewer tries to “make theoretical sense” out of such a table of favorable and adverse significance test results, what the reviewer is actually engaged in, willy-nilly or unwittingly, is meaningless substantive constructions on the properties of the statistical power function, and almost nothing else.

Thus, I am not sure that this appeal to directionality with the binary outcome from NHST (i.e., a statistically significant versus not statistically significant result according to some arbitrary alpha criterion) helps make the above argument persuasive.  Ultimately, I believe researchers should think about how strongly the results of a study corroborate a particular theoretical idea.  I think effect sizes are more useful for this purpose than the p-value.  You have to use something – why not use the most direct indicator of magnitude?

A somewhat more informed researcher might tell us to go read Wainer (1999) as a way to defend the virtues of NHST.  This paper is called “One Cheer for Null Hypothesis Significance Testing” and appeared in Psychological Methods in 1999.  Wainer suggests 6 cases in which a binary decision would be valuable.  His example from psychology is testing the hypothesis that the mean human intelligence score at time t is different from the mean score at time t+1.

However, Wainer also seems to find merit in effect sizes.  He writes: “Once again, it would be more valuable to estimate the direction and rate of change, but just being able to state that intelligence is changing would be an important contribution” (p. 213). He also concludes that “Scientific investigations only rarely must end with a simple reject-not reject decision, although they often include such decisions as part of their beginnings” (p. 213).  So in the end, I am not sure that any appeal to NHST over effect size estimation and interpretation works very well.  Relying exclusively on NHST seems way worse than relying on effect sizes.

2.  Assume this person is making a claim about the limited value of generalizing results from a controlled lab study to the real world.

One advantage of the lab is the ability to generate a strong experimental manipulation.  The downside is that any effect size estimate from such a study may not represent typical world dynamics and thus risks misleading uninformed (or unthinking) readers.  For example, if we wanted to test the idea that drinking regular soda makes rats fat, we could give half of our rats the equivalent of 20 cans of coke a day whereas the other half could get 20 cans of diet coke per day.  Let’s say we did this experiment and the difference was statistically significant (p < .0001) and we get a d = 2.0.  The coke exposed rats were heavier than the diet coke exposed rats.

What would the effect size mean?  Drawing attention to what seems like a huge effect might be misleading because most rats do not drink 20 cans of coke a day.  The effect size would presumably fluctuate with a weaker or stronger manipulation.  We might get ridiculed by the soda lobby if we did not exercise caution in portraying the finding to the media.

This scenario raises an important point about the interpretation of the effect sizes but I am not sure it negates the need to calculate and consider effect sizes.  The effect size from any study should be viewed as an estimate of a population value and thus one should think carefully about defining the population value.  Furthermore, the rat obesity expert presumably knows about other effect sizes in the literature and can therefore place this new result in context for readers.  What effect sizes do we see when we compare sedentary rats to those who run 2 miles per day?  What effect sizes do we see when we compare genetically modified “fat” rats to “skinny” rats?  That kind of information helps the researcher interpret both the theoretical and practical importance of the coke findings.

What Else?

There are probably other ways of being more charitable to the focal argument. Unfortunately, I need to work on some other things and think harder about this issue. I am interested to see if this post generates comments.  However, I should say that I am skeptical that there is much to admire about this perspective on research.  I have yet to read a study where I wished the authors omitted the effect size estimate.

Effect sizes matter for at least two other reasons beyond interpreting results.  First, we need to think about effect sizes when we plan our studies.  Otherwise, we are just being stupid and wasteful.  Indeed, it is wasteful and even potentially unethical to expend resources conducting underpowered studies (see Rosenthal, 1994).  Second, we need to evaluate effect sizes when reviewing the literature and conducting meta-analyses.  We synthesize effect sizes, not p values.  Thus, effect sizes matter for planning studies, interpreting studies, and making sense of an overall literature.

[Snarky aside, skip if you are sensitive]

I will close with a snarky observation that I hope does not detract from my post. Some of the people making the above argument about effect sizes get testy about the low power of failed replication studies of their own findings.   I could fail to replicate hundreds (or more) important effects in the literature by running a bunch of 20 person studies. This should surprise no one. However, a concern about power only makes sense in the context of an underlying population effect size.  I just don’t see how you can complain about the power of failed replications and dismiss effect sizes.

Post Script (6 August 2013):

Daniel Simons has written several good pieces on this topic.  These influenced my thinking and I should have linked to them.  Here they are:

Likewise, David Funder talked about similar issues (see also the comments):

And of course, Lee Jussim (via Brent Roberts)…

The Life Goals of Kids These Days

The folks at the Language Log did a nice job of considering some recent claims about the narcissism and delusions of today’s young people. I want to piggy-back on that post with an illustration from another dataset based on work I have done with some colleagues.

We considered a JPSP paper by a group I will just refer to as Drs. Smith and colleagues. Smith et al. used data from the Monitoring the Future Study from 1976 to 2008 to evaluate possible changes in the life goals of high school seniors. They classified high school seniors from 1976 to 1978 as Baby Boomers (N = 10,167) and those from 2000 to 2008 as Millennials (N = 20,684). Those in-between were Gen Xers but I will not talk about them in the interest of simplifying the presentation.

Students were asked about 14 goals and could answer on a 1 to 4 point scale (1=Not Important to 4=Extremely Important). Smith et al. used a centering procedure to report the goals but I think the raw numbers are as enlightening.  Below are the 14 goals ranked by the average level of endorsement for the Millennials.

Table columns: Mean Level; % Reporting Extremely Important

1. Having a good marriage and family life
2. Being able to find steady work
3. Having strong friendships
4. Being able to give my children better opportunities than I’ve had
5. Being successful in my line of work
6. Finding purpose and meaning in my life
7. Having plenty of time for recreation and hobbies
8. Having lots of money
9. Making a contribution to society
10. Discovering new ways to experience things
11. Living close to parents and relatives
12. Being a leader in my community
13. Working to correct social and economic inequalities
14. Getting away from this area of the country

Overall Goal Rating

What do I make of this?  Not surprisingly, I see more similarities than big differences.  Marriage and family life are important to students as is having a steady job. So high school students want it all – success in love and work.  I do not see “alarming” trends in these results but this is my subjective interpretation.

As I said, Smith et al. used a centering approach with the data.  I think they computed a grand mean across the 14 goals for each respondent and then centered each individual’s response to the 14 goals around that grand mean.  Such a strategy might be a fine approach but it seems to make things look “worse” for the Millennials in comparison to Boomers.  I will let others judge as to which analytic approach is better but I do worry about researcher degrees of freedom here.  I also just like raw descriptive statistics.
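The centering procedure, as I understand it, can be sketched like this. It is a minimal sketch of within-person centering under my reading of the method, not Smith et al.'s actual code. The function name and the two respondents are hypothetical, and only three goals are shown for brevity.

```python
def center_within_person(ratings):
    """Subtract a respondent's own grand mean from each of their goal ratings."""
    grand_mean = sum(ratings) / len(ratings)
    return [rating - grand_mean for rating in ratings]


# Hypothetical respondents rating three goals on the 1 to 4 scale.
# Both give the first goal a raw rating of 4.
selective_rater = [4, 3, 2]
enthusiastic_rater = [4, 4, 3]

centered_selective = center_within_person(selective_rater)
centered_enthusiastic = center_within_person(enthusiastic_rater)
print(centered_selective[0], round(centered_enthusiastic[0], 2))
```

The same raw rating of 4 ends up as a smaller centered score for the respondent who rates everything highly. So if one cohort endorses most goals more strongly overall, centered scores can shift even where raw means barely differ, which is why I like to see the raw descriptive statistics as well.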

[The Monitoring the Future Data are available through ICPSR. My standard $20 contribution to the charity of choice for the first person who emails me with any reporting errors holds.  I really do hope others look at the data themselves.]

Your arguments only make sense if you say them very fast…

“Although well-meaning, many of the suggestions only make sense if you say them very fast.”   -Howard Wainer (2011, p. 7) from Uneducated Guesses

I love this phrase: Your ideas only make sense if you say them very fast. I find myself wanting to invoke this idea anytime I hear some of the counterarguments to methodological reform. For example, I think this line applies to NS’s comment about climate change skeptics.

Anyways, I am about 90% done with the articles in the November 2012 special issue of Perspectives on Psychological Science.  I enjoyed reading most of the articles and it is a good resource for thinking about reform in psychological research. It should probably be required reading in graduate seminars. So far, the article that generated the strongest initial reaction was the Galak and Meyvis (2012; hereafter G & M) reply to Francis (2012).  I think they basically made his point for him.  [I should disclose that I think the basic idea pursued by G and M seems plausible and I think their reply was written in a constructive fashion. I just did not find their arguments very convincing.]

Francis (2012) suggests that the 8 studies in their package are less compelling when viewed in the aggregate because the “hit” rate is much higher than one would expect given the sample sizes and effect size in question. The implication is that there is probably publication bias. [Note: People sometimes quibble over how Francis calculates his effect size estimate but that is a topic for another blog post.]

I happen to like the “Francis” tables because readers get to see effect size estimates and the sample sizes stripped clean of narrative baggage.  Usually the effect sizes are large and the sample sizes are small. This general pattern would seem to characterize the G and M results.  (Moreover, the correlation between effect size estimates and sample sizes for the G and M set of results was something like -.88.  Ouch!).

G and M acknowledge that they had studies in their file drawer:  “We reported eight successful demonstrations of this phenomenon in our paper, but we also conducted five additional studies whose results either did not reach conventional levels of significance or did reach significance but ended up being rhetorically redundant” (G & M, 2012, p. 595). So there was selective reporting. Case closed in my book. Game over. As an aside, I am not sure I can distinguish between those desperate effect sizes who are reaching toward the p < .05 promised land from those who are fleeing from it. Can you?  It probably takes ESP.

G and M calculated their overall effect size as a g* of .38 (95% CI .25 to .51) with all studies in the mix whereas Francis reported the average g* from the published work as .57.  So it seems to me that their extra data brings down the overall effect size estimate.  Is this a hint of the so-called decline effect?  G and M seem to want to argue that because the g* estimate is bigger than zero that there is no real issue at stake. I disagree. Scientific judgment is rarely a yes/no decision about the existence of an effect. It is more often about the magnitude of the effect.  I worry that the G and M approach distorts effect size estimates and possibly even perpetuates small n studies in the literature.

G and M also stake a position that I fail to understand:  “However, as is the case for many papers in experimental psychology, the goal was never to assess the exact size of the effect, but rather to test between competing theoretical predictions” (G & M, 2012, p.  595). People use this or a similar argument to dismiss current concerns about the paucity of exact replications and the proliferation of small sample sizes in the literature.  What I do understand about this argument makes me skeptical. Let’s quote from the bible of Jacob Cohen (1990, p. 1309):

“In retrospect, it seems to me simultaneously quite understandable yet also ridiculous to try to develop theories about human behavior with p values from Fisherian hypothesis testing and no more than a primitive sense of effect size.”

So effect sizes matter for theories. Effect sizes tell us something about the magnitude of the associations in question (causal or otherwise) and I happen to think this is critical information for evaluating the truthiness of a theoretical idea. Indeed, I think the field of psychology would be healthier if we focused on getting more precise estimates of particular effects rather than playing this game of collecting “hits” from a bunch of underpowered “conceptual” extensions of a general idea.  I actually think G and M should have written this statement: “As is the case for many papers in psychology, our goal was to present as much evidence as possible for our preferred theoretical orientation.”

This strategy seems common but I believe it ultimately produced a JPSP paper on ESP.  So perhaps it is time to discard this approach and try something else for a change. Heck, even a 1-year recess might be worth it. That moratorium on NHST worked, right?

What’s the First Rule about John Bargh’s Data?

Answer: You do not talk about John Bargh’s data.

I went on hiatus with back to school events and letter of recommendation writing.  However, I think this is a good story that raises lots of issues. I need to say upfront that these opinions are mine and do not necessarily reflect anyone else’s views. I might also be making a big enemy with this post, but I probably already have a few of those out there. To quote the Dark Knight: I’m not afraid, I’m angry.

Background: Bargh and Shalev (2012) published an article in Emotion where they predicted that trait loneliness would be “positively associated with the frequency, duration, and preferred water temperatures” of showers and baths (p. 156). The correlation between self-reported loneliness and self-reported “physical warmth extraction” from baths/showers was .57 in Study 1a (51 undergrads) and .37 in Study 1b (41 community members). This package received media attention and was discussed in a Psychology Today blog post with the title: “Feeling lonely? Take a warm bath.”

We failed to replicate this effect three times using three different kinds of samples. Our combined sample size was 925 and the overall estimate was -.02. We also used Bayesian estimation techniques and got similar results (the mean estimate was -.02 and 70% of the credible estimates were below zero). Again, the opinions expressed in this blog post are mine and only mine but the research was a collaborative effort with Rich Lucas and Joe Cesario.
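To put the precision of these estimates in perspective, here is a rough sketch of approximate 95% confidence intervals using the Fisher z transformation. These intervals are illustrative only and are not taken from either paper. Treating the combined 925 participants as a single sample is a simplifying assumption, and the function name is hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z = atanh(r)
    standard_error = 1 / sqrt(n - 3)
    critical = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return tanh(z - critical * standard_error), tanh(z + critical * standard_error)


original_ci = fisher_ci(0.57, 51)        # Study 1a in Bargh and Shalev (2012)
replication_ci = fisher_ci(-0.02, 925)   # combined replication estimate
print([round(bound, 2) for bound in original_ci])
print([round(bound, 2) for bound in replication_ci])
```

Under these assumptions the interval around the original .57 is several times wider than the interval around the replication estimate, and the two do not overlap.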

[As an aside, John Kruschke gave a workshop at MSU this past weekend about Bayesian estimation. It was engaging and informative. This link will take you to his in press paper at JEP: General about the Bayesian t Test. It is well worth your time to read his paper.]

We just sent our paper off to get trashed in, er, undergo the peer review process.  However, the point that I want to raise is more important than our findings. Bargh let Joe Cesario look at his data but he forbids us from talking about what Joe observed. So a gag order is in place.

I think this is bull****. There is no reason why there should be a veil of secrecy around raw data. How can we have an open and transparent science if researchers are not allowed to make observations about the underlying data used to make published claims?

I doubt very much that there is even a moderate association between trait loneliness and showering habits. It might not be zero, but it is hard to believe the population value is anything around .50. Consider Figure 1 in Richard, Bond, and Stokes-Zoota (2003, p. 336). This is a summary of 474 meta-analytic effect sizes in the r-metric across social psychology. Richard et al. noted that 5.28% of the effect sizes they summarized were greater than .50. Viewed against this distribution, the .57 from Bargh and Shalev’s Study 1a is unusual. A .57 correlation is something I might expect to see when calculating the correlation between two measures of very similar constructs using self-report scales.

So before more data are collected on this topic, I would hold off on making any recommendations about taking warm baths/showers to lonely people. To quote Uli Schimmack: “In the real world, effect sizes matter.” I think replication and transparency matter as well.

Coverage of the Bargh and Shalev (2012) Study: