Warm Water and Loneliness

Our paper on bathing/showering habits and loneliness has been accepted (Donnellan, Lucas, & Cesario, in press).  The current package has 9 studies evaluating the correlation between trait loneliness and a preference for warm showers and baths as inspired by Studies 1a and 1b in Bargh and Shalev (2012; hereafter B & S).  In the end, we collected data from over 3,000 people and got effect size estimates that were considerably smaller than the original report.  Below are some random reflections on the results and the process. As I understand the next steps, B & S will have an opportunity to respond to our package (if they want) and then we have the option of writing a brief rejoinder.

1. I blogged about our inability to talk about original B & S data in the Fall of 2012.  I think this has been one of my most viewed blog entries (pathetic, I know).  My crew can apparently talk about these issues now so I will briefly outline a big concern.

Essentially, I thought the data from their Study 1a were strange. We learned that 46 of the 51 participants (90%) reported taking less than one shower or bath per week.  I can see that college students might report taking less than 1 bath per week, but showers?  The modal response in each of our 9 studies drawn from college students, internet panelists, and mTurk workers was always “once a day” and we never observed more than 1% of any sample telling us that they take less than one shower/bath per week.  So I think this distribution in the original Study 1a has to be considered unusual on both intuitive and empirical grounds.

The water temperature variable was also odd given that 24 out of 51 participants selected “cold” (47%) and 18 selected “lukewarm” (35%).   My own intuition is that people like warm to hot water when bathing/showering.  The modal response in each of our 9 samples was “very warm” and it was extremely rare to ever observe a “cold” response.

My view is that the data from Study 1a should be discarded from the literature. The distributions from 1a are just too weird.  This would then leave the field with Study 1b from the original B & S package based on 41 community members versus our 9 samples with over 3,000 people.

2.  My best meta-analytic estimate is that the correlation between trait loneliness and the water temperature variable is .026 (95% CI: -.018 to .069, p = .245).  This is based on a random effects model using the 11 studies in the local literature (i.e., our 9 studies plus Studies 1a and 1b – I included 1a to avoid controversy).  Researchers can debate about the magnitude of correlations but this one seems trivial to me especially because we are talking about two self-reported variables. We are not talking about aspirin and a life or death outcome or the impact of a subtle intervention designed to boost GPA.  Small effects can be important but sometimes very small correlations are practically and theoretically meaningless.

3. None of the original B & S studies had adequate power to detect something like the average .21 correlational effect size found across many social psychological studies (see Richard et al., 2003).  Researchers need around 175 participants with power set to .80 for the r = .21 expectation. If one takes sample size as an implicit statement about researcher expectations about the underlying effect sizes, it would seem like the original researchers thought the effects they were evaluating were fairly substantial.  Our work suggests that the effects in question are probably not.
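As a rough check on the "around 175" figure, here is a minimal sketch that uses the Fisher z approximation for a two-sided test of a correlation at alpha = .05. The function name is illustrative and the approximation is an assumption, not necessarily the method behind the number quoted above; exact power routines can differ by a participant or two.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r, power=0.80, alpha=0.05):
    """Approximate N needed to detect a population correlation r (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    # Fisher z: atanh(r) has a standard error of roughly 1 / sqrt(N - 3)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)

print(n_for_correlation(0.21))  # lands in the mid-170s
print(n_for_correlation(0.57))  # far fewer people are needed if r really were .57
```

Under this approximation, samples of 51 and 41 would only be well powered if the population correlation were much larger than .21.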

In the end, I am glad this paper is going to see the light of day.  I am not sure all the effort was worth it but I hope our paper makes people think twice about the size of the connection between loneliness and warm showers/baths.

25 Jan 2014:  Corrected some typos.


I don’t care about effect sizes — I only care about the direction of the results when I conduct my experiments

This claim (or some variant) has been invoked by a few researchers when they take a position on issues of replication and the general purpose of research.  For example, I have heard this platitude from some quarters when they were explaining why they are unconcerned when an original finding with a d of 1.2 reduces to a d of .12 upon exact replications. Someone recently asked me for advice on how to respond to someone making the above claim and I struggled a bit.  My first response was to dig up these two quotes and call it a day.

Cohen (1990): “Next, I have learned and taught that the primary product of research inquiry is one or more measures of effect size, not p values.” (p. 1310).

Abelson (1995): “However, as social scientists move gradually away from reliance on single studies and obsession with null hypothesis testing, effect size measures will become more and more popular” (p. 47).

But I decided to try a bit harder so here are my random thoughts at trying to respond to the above claim.

1.  Assume this person is making a claim about the utility of NHST. 

One retort is to ask how the researcher judges the outcome of their experiments.  They need a method to distinguish the “chance” directional hit from the “real” directional hit.  Often the preferred tool is NHST such that the researcher will judge that their experiment produced evidence consistent with their theory (or it failed to refute their theory) if the direction of the difference/association was consistent with their prediction and the p value was statistically significant at some level (say an alpha of .05).  Unfortunately, the beloved p-value is determined, in part, by the effect size.

To quote from Rosenthal and Rosnow (2008, p. 55):

Because a complete account of “the results of a study” requires that the researcher report not just the p value but also the effect size, it is important to understand the relationship between these two quantities.  The general relationship…is…Significance test = Size of effect * Size of study.

So if you care about the p value, you should care (at least somewhat) about the effect size.  Why? The researcher gets to pick the size of the study so the critical unknown variable is the effect size.  It is well known that given a large enough N, any trivial difference or non-zero correlation will attain significance (see Cohen, 1994, p. 1000 under the heading “The Nil Hypothesis”). Cohen notes that this point was understood as far back as 1938.  Social psychologists can look to Abelson (1995) for a discussion of this point as well (see p. 40).
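To make the "Significance test = Size of effect * Size of study" relationship concrete, here is a small sketch for a two-group comparison with equal group sizes. It uses a normal approximation to the t distribution (an assumption that is reasonable only for large samples), and the d of 0.05 is an arbitrary illustrative value, not an estimate from any study.

```python
from math import sqrt
from statistics import NormalDist

def approx_p_two_groups(d, n_per_group):
    """Approximate two-sided p for a standardized mean difference d (equal n per group)."""
    # Test statistic = size of effect * size of study
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hold a trivial effect fixed and let the study grow.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(approx_p_two_groups(0.05, n), 4))
```

The effect never changes in this loop; only N does, and the p value eventually drops below any alpha you care to pick.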

To further understand the inherent limitations of this NHST-bound approach, we can (and should) quote from the book of Paul Meehl (Chapter 1978).

Putting it crudely, if you have enough cases and your measures are not totally unreliable, the null hypothesis will always be falsified, regardless of the truth of the substantive theory. Of course, it could be falsified in the wrong direction, which means that as the power improves, the probability of a corroborative result approaches one-half. However, if the theory has no verisimilitude – such that we can imagine, so to speak, picking our empirical results randomly out of a directional hat apart from any theory – the probability of a refuting by getting a significant difference in the wrong direction also approaches one-half.  Obviously, this is quite unlike the situation desired from either a Bayesian, a Popperian, or a commonsense scientific standpoint.  (Meehl, 1978, p. 822).

Meehl gets even more pointed (p. 823):

I am not a statistician, and I am not making a statistical complaint. I am making a philosophical complaint or, if you prefer, a complaint in the domain of scientific method. I suggest that when a reviewer tries to “make theoretical sense” out of such a table of favorable and adverse significance test results, what the reviewer is actually engaged in, willy-nilly or unwittingly, is meaningless substantive constructions on the properties of the statistical power function, and almost nothing else.

Thus, I am not sure that this appeal to directionality with the binary outcome from NHST (i.e., a statistically significant versus not statistically significant result according to some arbitrary alpha criterion) helps make the above argument persuasive.  Ultimately, I believe researchers should think about how strongly the results of a study corroborate a particular theoretical idea.  I think effect sizes are more useful for this purpose than the p-value.  You have to use something – why not use the most direct indicator of magnitude?

A somewhat more informed researcher might tell us to go read Wainer (1999) as a way to defend the virtues of NHST.  This paper is called “One Cheer for Null Hypothesis Significance Testing” and appeared in Psychological Methods in 1999.  Wainer suggests 6 cases in which a binary decision would be valuable.  His example from psychology is testing the hypothesis that the mean human intelligence score at time t is different from the mean score at time t+1.

However, Wainer also seems to find merit in effect sizes.  He writes, “Once again, it would be more valuable to estimate the direction and rate of change, but just being able to state that intelligence is changing would be an important contribution” (p. 213). He also concludes that “Scientific investigations only rarely must end with a simple reject-not reject decision, although they often include such decisions as part of their beginnings” (p. 213).  So in the end, I am not sure that any appeal to NHST over effect size estimation and interpretation works very well.  Relying exclusively on NHST seems way worse than relying on effect sizes.

2.  Assume this person is making a claim about the limited value of generalizing results from a controlled lab study to the real world.

One advantage of the lab is the ability to generate a strong experimental manipulation.  The downside is that any effect size estimate from such a study may not represent typical world dynamics and thus risks misleading uninformed (or unthinking) readers.  For example, if we wanted to test the idea that drinking regular soda makes rats fat, we could give half of our rats the equivalent of 20 cans of coke a day whereas the other half could get 20 cans of diet coke per day.  Let’s say we did this experiment and the difference was statistically significant (p < .0001) and we get a d = 2.0.  The coke exposed rats were heavier than the diet coke exposed rats.

What would the effect size mean?  Drawing attention to what seems like a huge effect might be misleading because most rats do not drink 20 cans of coke a day.  The effect size would presumably fluctuate with a weaker or stronger manipulation.  We might get ridiculed by the soda lobby if we did not exercise caution in portraying the finding to the media.

This scenario raises an important point about the interpretation of the effect sizes but I am not sure it negates the need to calculate and consider effect sizes.  The effect size from any study should be viewed as an estimate of a population value and thus one should think carefully about defining the population value.  Furthermore, the rat obesity expert presumably knows about other effect sizes in the literature and can therefore place this new result in context for readers.  What effect sizes do we see when we compare sedentary rats to those who run 2 miles per day?  What effect sizes do we see when we compare genetically modified “fat” rats to “skinny” rats?  That kind of information helps the researcher interpret both the theoretical and practical importance of the coke findings.

What Else?

There are probably other ways of being more charitable to the focal argument. Unfortunately, I need to work on some other things and think harder about this issue. I am interested to see if this post generates comments.  However, I should say that I am skeptical that there is much to admire about this perspective on research.  I have yet to read a study where I wished the authors omitted the effect size estimate.

Effect sizes matter for at least two other reasons beyond interpreting results.  First, we need to think about effect sizes when we plan our studies.  Otherwise, we are just being stupid and wasteful.  Indeed, it is wasteful and even potentially unethical to expend resources conducting underpowered studies (see Rosenthal, 1994).  Second, we need to evaluate effect sizes when reviewing the literature and conducting meta-analyses.  We synthesize effect sizes, not p values.  Thus, effect sizes matter for planning studies, interpreting studies, and making sense of an overall literature.

[Snarky aside, skip if you are sensitive]

I will close with a snarky observation that I hope does not detract from my post. Some of the people making the above argument about effect sizes get testy about the low power of failed replication studies of their own findings.   I could fail to replicate hundreds (or more) important effects in the literature by running a bunch of 20 person studies. This should surprise no one. However, a concern about power only makes sense in the context of an underlying population effect size.  I just don’t see how you can complain about the power of failed replications and dismiss effect sizes.
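The dependence of power on an assumed effect size can be sketched in a few lines. This is a normal-approximation sketch for a two-group design with equal group sizes; the function name and the d values are illustrative assumptions, and a proper calculation would use the noncentral t distribution.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test for a population effect d."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected test statistic under the assumed d
    # Probability of landing beyond either critical value
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

# A 20-person study (10 per group) across a range of assumed effects
for d in (0.0, 0.2, 0.5, 0.8):
    print(d, round(approx_power(d, 10), 2))
```

There is no way to call this function without supplying d, which is the point: a complaint about power is a statement about effect size.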

Post Script (6 August 2013):

Daniel Simons has written several good pieces on this topic.  These influenced my thinking and I should have linked to them.  Here they are:



Likewise, David Funder talked about similar issues (see also the comments):



And of course, Lee Jussim (via Brent Roberts)…


One for the File Drawer?

I once read about an experiment in which college kids held either a cold pack or a warm pack and then reported about their levels of so-called trait loneliness. We just tried a close replication of this experiment involving the same short form loneliness scale used by the original authors. I won’t out my collaborators but I want to acknowledge their help.

The original effect size estimate was pretty substantial (d = .61, t = 2.12, df = 49) but we used 261 students so we could have more than adequate power. Our attempt yielded a much smaller effect size than the original (d = -.01, t = 0.111, df = 259, p = .912).  The mean of the cold group (2.10) was darn near the same as the warm group (2.11; pooled SD = .61).  (We also get null results if you restrict the analyses to just those who reported that they believed the entire cover story: d = -.17.  The direction is counter to predictions, however.)
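For readers who want to check how the d values line up with the reported t statistics, here is a minimal sketch of the usual conversion for an independent-samples t test. It assumes roughly equal group sizes (an assumption, since the group sizes are not given here) and works with magnitudes only.

```python
from math import sqrt

def d_from_t(t, df):
    """Cohen's d from an independent-samples t, assuming roughly equal group sizes."""
    return 2 * t / sqrt(df)

# t and df values as reported in the text above
print(round(d_from_t(2.12, 49), 2))    # original study
print(round(d_from_t(0.111, 259), 2))  # replication (magnitude only)
```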

Failures to replicate are a natural part of science so I am not going to make any bold claims in this post. I do want to point out that the reporting in the original is flawed. (The original authors used a no-pack control condition and found no evidence of a difference between the warm pack and the no-pack condition so we just focused on the warm versus cold comparison for our replication study).  The sample size was reported as 75 participants. The F value for the one-way ANOVA was reported as 3.80 and the degrees of freedom were reported as 2, 74.  The numerator degrees of freedom for the reference F distribution should be k - 1 (where k is the number of conditions) so the 2 was correct.  However, the denominator degrees of freedom were reported as 74 when they should be N - k or 72 (75 - 3).   Things get even weirder when you try to figure out the sample sizes for the 3 groups based on the degrees of freedom reported for each of the three follow-up t-tests.
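The degrees-of-freedom bookkeeping is simple enough to write down as a sanity check. This is only a sketch of the standard one-way ANOVA formulas; the function name is made up for illustration.

```python
def one_way_anova_df(n_total, k_groups):
    """Return (numerator df, denominator df) for a one-way between-subjects ANOVA."""
    return k_groups - 1, n_total - k_groups

# 75 participants in 3 conditions, as reported in the original study
print(one_way_anova_df(75, 3))
```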

We found indications that holding a cold pack did do something to participants.  Both the original study and our replication involved a cover story about product evaluation. Participants answered three yes/no questions and these responses varied by condition.

Percentage answering “Yes” to the Pleasant Question:

Warm: 96%     Cold: 80%

Percentage answering “Yes” to the Effective Question:

Warm: 98%     Cold: 88%

Percentage answering “Yes” to the Recommending to a Friend Question:

Warm: 95%   Cold: 85%

Apparently, the cold packs were not evaluated as positively as the warm packs.  I can foresee all sorts of criticism coming our way. I bet one thread is that we are not “skilled” enough to get the effect to work and a second thread is that we are biased against the original authors (either explicitly or implicitly). I’ll just note these as potential limitations and call it good.  Fair enough?

Update 7 February 2014:  We decided to write this up for a journal article. In the process of preparing the manuscript and files for posting, Jessica noticed that I did not drop a participant with an ID we use for testing the survey system.  Thus, the actual sample size should be 260 NOT 261.  Fortunately, this did not change any of the conclusions.  The t statistic was -0.006 (df = 258), p = .995 and the effect size was d = -.01.  We also conducted a number of supplementary analyses to see if removing participants who expressed suspicion or had questionable values on the manipulation check variable (rating the temperature of the cold pack) impacted results.  Nothing we could do influenced the bottom line null result.

I caught my own mistake so I donated $20 to a charity I support – the American Cancer Society.

The Life Goals of Kids These Days

The folks at the Language Log did a nice job of considering some recent claims about the narcissism and delusions of today’s young people. I want to piggy-back on that post with an illustration from another dataset based on work I have done with some colleagues.

We considered a JPSP paper by a group I will just refer to as Drs. Smith and colleagues. Smith et al. used data from the Monitoring the Future Study from 1976 to 2008 to evaluate possible changes in the life goals of high school seniors. They classified high school seniors from 1976 to 1978 as Baby Boomers (N = 10,167) and those from 2000 to 2008 as Millennials (N = 20,684). Those in-between were Gen Xers but I will not talk about them in the interest of simplifying the presentation.

Students were asked about 14 goals and could answer on a 1 to 4 point scale (1=Not Important to 4=Extremely Important). Smith et al. used a centering procedure to report the goals but I think the raw numbers are just as enlightening.  Below are the 14 goals ranked by the average level of endorsement for the Millennials.

[Table: Mean Level and % Reporting Extremely Important for each goal, with rows in the order below]

1. Having a good marriage and family life
2. Being able to find steady work
3. Having strong friendships
4. Being able to give my children better opportunities than I’ve had
5. Being successful in my line of work
6. Finding purpose and meaning in my life
7. Having plenty of time for recreation and hobbies
8. Having lots of money
9. Making a contribution to society
10. Discovering new ways to experience things
11. Living close to parents and relatives
12. Being a leader in my community
13. Working to correct social and economic inequalities
14. Getting away from this area of the country

Overall Goal Rating

What do I make of this?  Not surprisingly, I see more similarities than big differences.  Marriage and family life are important to students as is having a steady job. So high school students want it all – success in love and work.  I do not see “alarming” trends in these results but this is my subjective interpretation.

As I said, Smith et al. used a centering approach with the data.  I think they computed a grand mean across the 14 goals for each respondent and then centered each individual’s response to the 14 goals around that grand mean.  Such a strategy might be a fine approach but it seems to make things look “worse” for the Millennials in comparison to Boomers.  I will let others judge as to which analytic approach is better but I do worry about researcher degrees of freedom here.  I also just like raw descriptive statistics.
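Here is a minimal sketch of the centering approach as I described it above, which is itself my guess at what Smith et al. did rather than their documented procedure. The two respondents and their ratings are hypothetical, and index 7 stands in for the "lots of money" item.

```python
from statistics import mean

def center_within_person(ratings):
    """Subtract a respondent's own mean across the goals from each of their ratings."""
    person_mean = mean(ratings)
    return [r - person_mean for r in ratings]

# Hypothetical respondents on the 1-4 scale, 14 goals each
high_rater = [4] * 7 + [3] * 7
low_rater = [3, 3, 3, 3, 2, 2, 2, 3, 2, 2, 2, 1, 1, 1]

MONEY = 7  # hypothetical position of the money item
print(high_rater[MONEY], low_rater[MONEY])            # identical raw ratings
print(center_within_person(high_rater)[MONEY],
      center_within_person(low_rater)[MONEY])         # different centered scores
```

Two people who give the same raw rating can end up on opposite sides of zero after centering, so a cohort that rates everything a bit differently overall can look quite different on a specific goal.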

[The Monitoring the Future Data are available through ICPSR. My standard $20 contribution to the charity of choice for the first person who emails me with any reporting errors holds.  I really do hope others look at the data themselves.]

Two Types of Researchers?

Last winter I gave a quick brown bag where I speculated about the possibility of two distinct types of researchers. I drew from a number of sources to construct my prototypes. To be clear, I do not suspect that all researchers will fall neatly into one of these two types. I suspect these are so-called “fuzzy” types. I also know that at least one of my colleagues hates this idea. Thus, I apologize in advance.

Regardless, I think there is something to my working taxonomy and I would love to get data on these issues. Absent data, this will have to remain purely hypothetical. There is of course a degree of hyperbole mixed in here as well. Enjoy (or not)!

Approach I | Approach II
Ioannidis (2008) Label: Aggressive Discoverer | Reflective Replicator
Abelson (1995) Label: Brash/Liberal | Stuffy/Conservative
Tetlock (2005) or Berlin (1953) Label: Hedgehogs | Foxes
Focus: Discovery | Finding Sturdy Effects
Preference: Novelty | Definitiveness
Research Materials: Private possessions | Public goods
Ideal Reporting Standard: Interesting findings only | Everything
Analytic Approach: Find results to support view | Concerned about sensitivity
Favorite Sections of Papers: Introduction & Discussion | Method & Results
Favorite Kind of Article: Splashy reports that get media coverage | Meta-Analyses
View on Confidence Intervals: Unnecessary clutter | The smaller the better
Stand on the NHST Controversy: What controversy? | Jacob Cohen was a god
View on TED Talks: Yes. Please pick me. | Meh!
Greatest Fear: Getting scooped | Having findings fail to replicate
Orientation in the Field: Advocacy | Skepticism
Error Risk: Type I | Type II

What’s the First Rule about John Bargh’s Data?

Answer: You do not talk about John Bargh’s data.

I went on hiatus with back to school events and letter of recommendation writing.  However, I think this is a good story that raises lots of issues. I need to say upfront that these opinions are mine and do not necessarily reflect anyone else’s views. I might also be making a big enemy with this post, but I probably already have a few of those out there. To quote the Dark Knight: I’m not afraid, I’m angry.

Background: Bargh and Shalev (2012) published an article in Emotion where they predicted that trait loneliness would be “positively associated with the frequency, duration, and preferred water temperatures” of showers and baths (p. 156). The correlation between self-reported loneliness and self-reported “physical warmth extraction” from baths/showers was .57 in Study 1a (51 undergrads) and .37 in Study 1b (41 community members). This package received media attention and was discussed in a Psychology Today blog post with the title: “Feeling lonely? Take a warm bath.”

We failed to replicate this effect three times using three different kinds of samples. Our combined sample size was 925 and the overall estimate was -.02. We also used Bayesian estimation techniques and got similar results (the mean estimate was -.02 and 70% of the credible estimates were below zero). Again, the opinions expressed in this blog post are mine and only mine but the research was a collaborative effort with Rich Lucas and Joe Cesario.

[As an aside, John Kruschke gave a workshop at MSU this past weekend about Bayesian estimation. It was engaging and informative. This link will take you to his in press paper at JEP: General about the Bayesian t Test. It is well worth your time to read his paper.]

We just sent our paper off to undergo the peer review process.  However, the point that I want to raise is more important than our findings. Bargh let Joe Cesario look at his data but he forbids us from talking about what Joe observed. So a gag order is in place.

I think this is bull****. There is no reason why there should be a veil of secrecy around raw data. How can we have an open and transparent science if researchers are not allowed to make observations about the underlying data used to make published claims?

I doubt very much that there is even a moderate association between trait loneliness and showering habits. It might not be zero, but it is hard to believe the population value is anything around .50. Consider Figure 1 in Richard, Bond, and Stokes-Zoota (2003, p. 336). This is a summary of 474 meta-analytic effect sizes in the r-metric across social psychology. Richard et al. noted that 5.28% of the effect sizes they summarized were greater than .50. Viewed against this distribution, the .57 from Bargh and Shalev’s Study 1a is unusual. A .57 correlation is something I might expect to see when calculating the correlation between two measures of very similar constructs using self-report scales.
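One way to see how much uncertainty surrounds a correlation from a sample of 51 is to compute an approximate confidence interval with the Fisher z transformation. This is a sketch only: the interval it produces is not reported by Bargh and Shalev or in our paper, and the function name is illustrative.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def r_confidence_interval(r, n, level=0.95):
    """Approximate confidence interval for a correlation via the Fisher z transformation."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = 1 / sqrt(n - 3)   # standard error on the z scale
    z = atanh(r)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

# Correlations and sample sizes as reported for Studies 1a and 1b
print(r_confidence_interval(0.57, 51))
print(r_confidence_interval(0.37, 41))
```

The intervals are wide, which is another reason to be cautious about point estimates from small samples.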

So before more data are collected on this topic, I would hold off on making any recommendations about taking warm baths/showers to lonely people. To quote Uli Schimmack: “In the real world, effect sizes matter.” I think replication and transparency matter as well.

Coverage of the Bargh and Shalev (2012) Study:



Free Advice about the Subject Pool

Around here, the Fall semester starts in just a few weeks. This means the MSU subject pool will soon be teeming with “volunteers” eager to earn their research participation credits. Like many of my colleagues, I have often wondered about the pros and cons of relying so heavily on college sophomores in the laboratory (e.g., Sears, 1986, 2008). Regardless of your take on these issues, it is hard to imagine that subject pools will go away in the near future. Thus, I think it is important to try to learn more about the characteristics of participants in these subject pools and to think more carefully about issues that may impact the generalizability of these types of studies. I still think college student subject pools generate convenience samples even if a certain researcher disagrees.

I did a paper with my former graduate student Edward Witt and our undergraduate assistant (Matthew Orlando) about differences in the characteristics of subject pool members who chose to participate at different points in the semester (Witt, Donnellan, & Orlando, 2011). We also tested for selection effects in the chosen mode of participation by offering an online and in-person version of the same study (participants were only allowed to participate through one mode).  We conducted that study in the Spring of 2010 with a total sample size of 512 participants.

In the original report, we found evidence that more extraverted students selected the in-person version of the study (as opposed to the online version) and that average levels of Conscientiousness were lower at the end of the semester compared to the beginning. In other words, individuals with relatively lower scores on this personality attribute were more likely to show up at the end of term. We also found that we had a greater proportion of men at the end of the term compared to the start. To be clear, the effect sizes were small and some might even say trivial. Nonetheless, our results suggested to us that participants at the start of the semester are likely to be different than participants at the end of the term in some ways. This result is probably unsurprising to anyone who has taught a college course and/or collected data from a student sample (sometimes naïve theories are credible).

We repeated the study in the Fall semester of 2010 but never bothered to publish the results (Max. N with usable data = 594). (We try to replicate our results when we can.) It is reassuring to note that the major results were replicated in the sense of obtaining similar effect size estimates and levels of statistical significance. We used the same personality measure (John Johnson’s 120-item IPIP approximation of the NEO PI-R) and the same design. Individuals who self-selected into the online version of the study were less extraverted than those who selected into the in-person version (d = -.18, t = 2.072, df = 592, p = .039; Witt et al., 2011: d = -.26).   This effect held controlling for the week of the semester and gender. Likewise, we had a greater proportion of men at the end of the term compared to the start (e.g., roughly 15% of the participants were men in September versus 43% in December).

The more interesting result (to me) was that average levels of Conscientiousness were also lower at the end of the semester rather than at the beginning (standardized regression coefficient for week = -.12, p = .005; model also includes gender). Again, the effect sizes were small and some might say trivial.  However, a different way to understand this effect is to standardize Conscientiousness within-gender (women self-report higher scores) and then plot average scores by week of data collection.

The average for the first two weeks of data collection (September of 2010) was .29 (SD = 1.04) whereas the average for the last three weeks (December of 2010) was -.18 (SD = 1.00).  Viewed in this light, the difference between the beginning of the semester and the end of the semester starts to look a bit more substantial.
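For anyone who wants to try this with their own subject pool data, the within-gender standardization and weekly averaging can be sketched as follows. The field names and the six rows of data are hypothetical and exist only to make the example runnable; they are not our data.

```python
from collections import defaultdict
from statistics import mean, stdev

def standardize_within_group(rows, group_key, score_key):
    """Add a z score to each row, computed within the row's own group."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_key]].append(row[score_key])
    stats = {g: (mean(v), stdev(v)) for g, v in by_group.items()}
    return [
        {**row, "z": (row[score_key] - stats[row[group_key]][0]) / stats[row[group_key]][1]}
        for row in rows
    ]

def mean_z_by_week(rows):
    """Average the standardized scores within each week of data collection."""
    by_week = defaultdict(list)
    for row in rows:
        by_week[row["week"]].append(row["z"])
    return {week: mean(zs) for week, zs in sorted(by_week.items())}

# Hypothetical participants: gender, week of participation, Conscientiousness score
rows = [
    {"gender": "F", "week": 1, "c": 4.2},
    {"gender": "F", "week": 1, "c": 3.9},
    {"gender": "F", "week": 14, "c": 3.5},
    {"gender": "M", "week": 1, "c": 3.8},
    {"gender": "M", "week": 14, "c": 3.1},
    {"gender": "M", "week": 14, "c": 3.3},
]
z_rows = standardize_within_group(rows, "gender", "c")
weekly = mean_z_by_week(z_rows)
print(weekly)
```

Standardizing within gender first keeps the shifting gender mix across the semester from masquerading as a change in Conscientiousness.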

So here is my free advice:  If you want more conscientious participants, be ready to run early in the term.  If you want to have an easier time recruiting men, wait till the end of the term. (Controlling for C does not wipe out the gender effect).

I would post the data but I am going to push Ed to write this up. We have a few other interesting variables that tried to pick up on careless responding that we need to think through.

Note: Edward Witt helped me prepare this entry.