Silly Questions to Ask Children

I have been working on a project designed to measure a certain individual difference in children as early as 5 years of age. There are a number of concerns about the use of self-reports with young children so this has been an overarching concern in this project. To partially address this issue, we came up with a handful of items that would be useful for detecting unusual responses in children. These items might be used to identify children who did not understand how to use the response scale or flag children who were giving responses that would be considered invalid.  There is a cottage industry of these kinds of scales for adult personality inventories but fewer options for kids.  (And yes I know about those controversies in the literature over these kinds of scales.)

Truth be told, I like writing items and I think this is true for many researchers. I am curious about how people respond to all sorts of questions, especially silly ones.  It is even better if the silly ones tap something interesting about personality or ask participants about dinosaurs.

Here are a few sample items:

1. How do you feel about getting shots from the doctor?

2. How do you feel about getting presents for your birthday?

And my favorite item ever….

3. How would you feel about being eaten by a T-Rex?

The fact that we have asked over 800 kids this last question is sort of ridiculous but it makes me happy. I predicted that kids should report negative responses for this one. This was true for the most part but 11.3% of the sample registered a positive response. In fact, the T-Rex item sparked a heated conversation in my household this morning. My spouse (AD) is a former school teacher and AD thought some kids might think it was cool to see a T-Rex. She thought it was a bad item. My youngest child (SD) thought it would be bad to be eaten by said T-Rex even if it was cool to see one in person. I think SD was on my side.

I have had enough controversy over the past few weeks so I wanted to move on from this breakfast conversation. Thus, I did what any sensible academic would do – I equivocated. I acknowledged that items usually reflect multiple sources of variance and all have some degree of error. I also conceded that this item might pick up on sensation seeking tendencies. There could be some kids who might find it thrilling to be eaten by a T-Rex. Then I took SD to school and cried over a large cup of coffee.

But I still like this item and I think most people would think it would suck to be eaten by a T-Rex. It might also be fun to crowd source the writing of additional items. Feel free to make suggestions.

PS: I want to acknowledge my two collaborators on this project – Michelle Harris and Kali Trzesniewski. They did all of the hard work collecting these data.


There has been a lot of commentary about the tone of my 11 December 2013 blog post. I’ve tried to keep a relatively low profile during the events of the last week.  It has been one of the strangest weeks of my professional life. However, it seems appropriate to make a formal apology.

1. I apologize for the title.  I intended it as a jokey reference to the need to conduct high-power replication studies. It was ill-advised.

2. I apologize for the now infamous “epic fail” remark (“We gave it our best shot and pretty much encountered an epic fail as my 10 year would say”). It was poor form and contributed to hurt feelings. I should have been more thoughtful.

I will do better to make sure that I uphold the virtues of civility in future blog postings.

-brent donnellan

Random Reflections on Ceiling Effects and Replication Studies

In a blog post from December of 2013, I  described our attempts to replicate two studies testing the claim that priming cleanliness makes participants less judgmental on a series of 6 moral vignettes. My original post has recently received criticism for my timing and my tone. In terms of timing, I blogged about a paper that was accepted for publication and there was no embargo on the work. In terms of tone, I tried to ground everything I wrote with data but I also editorialized a bit.  It can be hard to know what might be taken as offensive when you are describing an unsuccessful replication attempt. The title (“Go Big or Go Home – A Recent Replication Attempt”) might have been off putting in hindsight. In the grand scope of discourse in the real world, however, I think my original blog post was fairly tame.

Most importantly: I was explicit in the original post about the need for more research. I will state again for the record: I don’t think this matter has been settled and more research is needed. We also said this in the Social Psychology paper.  It should be widely understood that no single study is ever definitive.

As noted in the current news article for Science about the special issue of Social Psychology, there is some debate about ceiling effects with our replication studies. We discuss this issue at some length in our rejoinder to the commentary. I will provide some additional context and observations in this post.  Readers just interested in gory details can read #4. This is a long and tedious post so I apologize in advance.

1. The original studies had relatively small sample sizes. There were 40 total participants in the original scrambled sentence study (Study 1) and 43 total participants in the original hand washing study (Study 2). It takes 26 participants per cell to have an approximately 80% chance to detect a d of .80 with alpha set to .05 using a two-tailed significance test.  A d of .80 would be considered a large effect size in many areas of psychology.
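For readers who want to check the power figure in point 1, here is a minimal sketch. It assumes a normal approximation to the two-group t-test, so it runs slightly optimistic compared with an exact noncentral-t calculation; the function name `approx_power` is made up for illustration. The 20-per-cell value matches the control-condition n reported for the original Study 1.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_cell, alpha=0.05):
    """Approximate power of a two-tailed, two-group comparison.

    Uses a normal approximation (an assumption of this sketch),
    not the exact noncentral t distribution.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect is d
    ncp = d * sqrt(n_per_cell / 2)
    # Probability of falling beyond either critical value
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


if __name__ == "__main__":
    for n in (20, 26):
        print(f"n per cell = {n}: approximate power = {approx_power(0.80, n):.2f}")
```

With roughly 20 participants per cell, the approximate power for d = .80 falls noticeably short of the conventional .80 target, which is the concern raised above.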

2. The overall composite did not attain statistical significance using the conventional alpha level of .05 with a two-tailed test in the original Study 1 (p = .064).  (I have no special love for NHST but many people in the literature rely on this tool for drawing inferences).  Only one of the six vignettes attained statistical significance at the p < .05 level in the original Study 1 (Kitten). Two different vignettes attained statistical significance in the original Study 2 (Trolley and Wallet).  The kitten vignette did not. Effect size estimates for these contrasts are in our report.  Given the sample sizes, these estimates were large but they had wide confidence intervals.

3. The dependent variables were based on moral vignettes created for a different study originally conducted at the University of Virginia. These measures were originally pilot tested with 8 participants according to a PSPB paper (Schnall, Haidt, Clore, & Jordan, 2008, p. 1100). College students from the United States were used to develop the measures that served as the dependent variables. There was no a priori reason to think the measures would “not work” for college students from Michigan. We registered our replication plan and Dr. Schnall was a reviewer on the proposal.  No special concerns were raised about our procedures or the nature of our sample. Our sample sizes provided over .99 power to detect the original effect size estimates.

4. The composite DVs were calculated by averaging across the six vignettes and those variables had fairly normal distributions in our studies.  In Study 1, the mean for our control condition was 6.48 (SD = 1.13, Median = 6.67, Skewness = -.55, Kurtosis = -.24, n = 102) whereas it was 5.81 in the original paper (SD = 1.47, Median = 5.67, Skewness = -.33, Kurtosis = -.44, n = 20).   The average was higher in our sample but the scores theoretically range from 0 to 9.  We found no evidence of a priming effect using the composites in Study 1.   In Study 2, the mean for our control condition was 5.65 (SD = 0.59, Median = 5.67, Skewness = -.31, Kurtosis = -.19, n = 68) whereas it was 5.43 in the original paper (SD = 0.69, Median = 5.67, Skewness = -1.58, Kurtosis = 3.45, n = 22).  The scores theoretically range from 1 to 7.  We found no hand washing effect using the composites in Study 2.  These descriptive statistics provide additional context for the discussion about ceiling effects.  The raw data are posted and critical readers can and should verify these numbers.  I have a standing policy to donate $20 to the charity of choice for the first person who notes a significant (!) statistical mistake in my blog posts.

Schnall (2014) undertook a fairly intense screening of our data.  This is healthy for the field and the open science framework facilitated this inquiry because we were required to post the data. Dr. Schnall noted that the responses to the individual moral vignettes tended toward the extreme in our samples.  I think the underlying claim is that students in our samples were so moralistic that any cleanliness priming effects could not have overpowered their pre-existing moral convictions.  This is what the ceiling effect argument translates to in real world terms: The experiments could not have worked in Michigan because the samples tended to have a particular mindset.

It might be helpful to be a little more concrete about the distributions.  For many of the individual vignettes, the “Extremely Wrong” option was a common response.  Below is a summary of the six vignettes and some descriptive information about the data from the control conditions of Study 1 across the two studies (ours and the original).  I think readers will have to judge for themselves as to what kinds of distributions to expect from samples of college students.  Depending on your level of self-righteousness, these results could be viewed positively or negatively.   Remember, we used their original materials.

  • Dog (53% versus 30%):  Morality of eating a pet dog that was just killed in a car accident.
  • Trolley (2% versus 5%):  Morality of killing one person in the classic trolley dilemma.
  • Wallet (44% versus 20%): Morality of keeping cash from a wallet found on the street.
  • Plane (43% versus 30%): Morality of killing an injured boy to save yourself and another person from starving after a plane crash.
  • Resume (29% versus 15%):  Morality of enhancing qualifications on a resume.
  • Kitten (56% versus 70%): Morality of using a kitten for sexual gratification.

Note: All comparisons are from the Control conditions for our replication Study 1 compared to Study 1 in Schnall et al. (2008).  Percentages reflect the proportion of the sample selecting the “extremely wrong” option (i.e., selecting the “9” on the original 0 to 9 scale).  For example, 53% of our participants thought it was extremely wrong for Frank to eat his dead dog for dinner whereas 30% of the participants in the original study provided that response.

To recap, we did not find evidence for the predicted effects and we basically concluded more research was necessary.  Variable distributions are useful pieces of information and non-parametric tests were consistent with the standard t-tests we used in the paper. Moreover, their kitten distribution was at least as extreme as ours, and yet they found the predicted result on this particular vignette in Study 1. Thus, I worry that any ceiling argument only applies when the results are counter to the original predictions. 

One reading of our null results is that there are unknown moderators of the cleanliness priming effects. We have tested for some moderators (e.g., private body consciousness, political orientation) in our replication report and rejoinder, but there could be other possibilities. For example, sample characteristics can make it difficult to find the predicted cleanliness priming results with particular measures.  If researchers have a sample of excessively moralistic/judgmental students who think using kittens for sexual gratification is extremely wrong, then cleaning primes may not be terribly effective at modulating their views. Perhaps a different set of vignettes that are more morally ambiguous (say more in line with the classic trolley problem) will show the predicted effects.  This is something to be tested in future research.

The bottom line for me is that we followed through on our research proposal and we reported our results.  The raw data were posted.  We have no control over the distributions. At the very least, researchers might need to worry about using this particular measure in the future based on our replication efforts. In short, the field may have learned something about how to test these ideas in the future.  In the end, I come full circle to the original conclusion in the December blog post– More research is needed.  


I am sure reactions to our work and the respective back-and-forth will break on partisan grounds.  The “everything is fine” crew will believe that Dr. Schnall demolished our work whereas the “replication is important” crew will think we raised good points.  This is all fine and good as it relates to the inside baseball and sort of political theater that exists in our world.  However, I hope these pieces do not just leave a bad taste in people’s mouths.  I feel bad that this single paper and exchange have diverted attention from the important example of reform taken by Lakens and Nosek.  They are helping to shape the broader narrative about how to do things differently in psychological science.


Quick Update on Timelines (23 May 2014)

David sent Dr. Schnall the paper we submitted to the editors on 28 October 2013 with a link to the raw materials. He wrote “I’ve attached the replication manuscript we submitted to Social Psychology based on our results to give you a heads up on what we found.”  He added: “If you have time, we feel it would be helpful to hear your opinions on our replication attempt, to shed some light on what kind of hidden moderators or other variables might be at play here.”

Dr. Schnall emailed back on 28 October 2013 asking for 2 weeks to review the material before we proceeded. David emailed back on 31 October 2013 apologizing for any miscommunication and explaining that we had already submitted the paper. He added we were still interested in her thoughts.

That was the end of our exchanges. We learned about the ceiling effect concern when we received the commentary in early March of 2014.

Warm Water and Loneliness Again?!?!

Call me Captain Ahab…

This is a dead horse but I got around to writing up some  useful new data in this saga.  Researchers at the University of Texas, Austin tried to replicate the basic survey findings in a large Introductory Psychology course back in the Fall of 2013.  They emailed me the results back in November and they were consistent with the general null effects we had been getting in our work.  I asked them if I could write it up for the Psychology File Drawer and they were amenable.  Here is a link to a more complete description of the results and here is a link to the PFD reference.

The basic details…

There was no evidence for an association between loneliness (M = 2.56; SD = .80, alpha = .85) and the Physical Warmth Index (r = -.03, p = .535, n = 365; 95% CI = -.14 to .07).  Moreover, the hypothesis relevant correlation between the water temperature item and the loneliness scale was not statistically distinguishable from zero (r = -.08, p = .141, n = 365, 95% CI = -.18 to .03).
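The confidence intervals above can be roughly checked from r and n alone. The sketch below assumes the standard Fisher z transformation, which may not be the exact method used for the reported intervals; because the correlations are rounded to two decimals, the limits can differ from the reported ones in the second decimal place. The helper name `fisher_ci` is hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z transformation."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    center = atanh(r)                 # r on the z scale
    half_width = z_crit / sqrt(n - 3)  # standard error is 1 / sqrt(n - 3)
    # Transform the limits back to the r scale
    return tanh(center - half_width), tanh(center + half_width)


if __name__ == "__main__":
    for r in (-0.03, -0.08):
        lo, hi = fisher_ci(r, 365)
        print(f"r = {r:+.2f}, n = 365: 95% CI = {lo:.2f} to {hi:.2f}")
```

Both intervals span zero, which is what "not statistically distinguishable from zero" means here.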

One possible issue is that the U of T crew used a short 3 item measure of loneliness developed for large scale survey work whereas the other studies have used longer measures.  Fortunately, other research suggests this short measure is correlated above .80 with the parent instrument so I do not think this is a major limitation. But I can see others holding a different view.

One of the reviewers of the Emotion paper seemed concerned about our motivations.  The nice thing about these data is that we had nothing to do with the data collection so this criticism is not terribly valid.  Other parties can try this study too — the U of T folks figured out a way to study this issue with 6 items!


Things that make me skeptical…

Simine Vazire crafted a thought provoking blog post about how some in the field respond to counter-intuitive findings.  One common reaction among critics of this kind of research is to claim that the results are unbelievable.   This reaction seems to fit with the maxim that extraordinary claims should require extraordinary evidence (AKA the Sagan doctrine).  For example, the standard of evidence needed to support the claim that a high-calorie/low nutrient diet coupled with a sedentary life style is negatively associated with morbidity might be different than the standard of proof needed to support the claim that attending class is positively associated with exam performance.  One claim seems far more extraordinary than the other.  Put another way: Prior subjective beliefs about the truthiness of these claims might differ and thus the research evidence needed to modify these pre-existing beliefs should be different.

I like the Sagan doctrine but I think we can all appreciate the difficulties that arise when trying to determine standards of evidence needed to justify a particular research claim.  There are no easy answers except for the tried and true response that all scientific claims should be thoroughly evaluated by multiple teams using strong methods and multiple operational definitions of the underlying constructs.  But this is a “long term” perspective and provides little guidance when trying to interpret any single study or package of studies.  Except that it does, sort of.  A long term perspective means that most findings should be viewed with a big grain of salt, at least initially.  Skepticism is a virtue (and I think this is one of the overarching themes of Simine’s blog posts thus far).   However, skepticism does not preclude publication and even some initial excitement about an idea.  It simply precludes making bold and definitive statements based on initial results with unknown generality.  More research is needed because of the inherent uncertainty of scientific claims. To quote a lesser known U2 lyric – “Uncertainty can be a guiding light”.

Anyways, I will admit to having the “unbelievable” reaction to a number of research studies.  However, my reaction usually springs from a different set of concerns rather than just a suspicion that a particular claim is counter to my own intuitions.  I am fairly skeptical of my own intuitions. I am also fairly skeptical of the intuitions of others.  And I still find lots of studies literally unbelievable.

Here is a partial list of the reasons for my skepticism. (Note: These points cover well worn ground so feel free to ignore if it sounds like I am beating a dead horse!)

1.  Large effect sizes coupled with small sample sizes.  Believe it or not, there is guidance in the literature to help generate an expected value for research findings in “soft” psychology.  A reasonable number of effects are between .20 and .30 in the r metric and relatively few are above .50 (see Hemphill, 2003; Richard et al., 2003).   Accordingly, when I read studies that generate “largish” effect size estimates (i.e., |r| ≥ .40), I tend to be skeptical.  I think an effect size estimate of .50 is in fact an extraordinary claim.

My skepticism gets compounded when the sample sizes are small and thus the confidence intervals are wide.  This means that the published findings are consistent with a wide range of plausible effect sizes so that any inference about the underlying effect size is not terribly constrained.  The point estimates are not precise. Authors might be excited about the .50 correlation but the 95% CI suggests that the data are actually consistent with anything from a tiny effect to a massive effect.  Frankly, I also hate it when the lower bound of the CI falls just slightly above 0 and thus the p value is just slightly below .05.  It makes me suspect p-hacking was involved.   (Sorry, I said it!)
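To make the point about wide intervals concrete, here is a small sketch showing how the 95% CI around an observed r = .50 changes with sample size. The sample sizes are hypothetical values chosen for illustration, and the interval uses the Fisher z approximation (an assumption of this sketch).

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def r_interval(r, n, level=0.95):
    """Fisher z confidence interval for a correlation of r based on n cases."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_crit / sqrt(n - 3)
    return tanh(atanh(r) - half_width), tanh(atanh(r) + half_width)


if __name__ == "__main__":
    # Hypothetical sample sizes, from small-lab scale to large-survey scale
    for n in (20, 50, 200, 1000):
        lo, hi = r_interval(0.50, n)
        print(f"n = {n:4d}: 95% CI = {lo:.2f} to {hi:.2f} (width {hi - lo:.2f})")
```

At n = 20 the interval runs from a lower bound barely above zero to a very large correlation, which is exactly the situation described above.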

2. Conceptual replications but no direct replications.  The multi-study package common to such prestigious outlets as PS or JPSP has drawn critical attention in the last 3 or so years.  Although these packages seem persuasive on the surface, they often show hints of publication bias on closer inspection.   The worry is that the original researchers actually conducted a number of related studies and only those that worked were published.   Thus, the published package reflects a biased sampling of the entire body of studies.  The ones that failed to support the general idea were left to languish in the proverbial file drawer.  This generates inflated effect size estimates and makes the case for an effect seem far more compelling than it should be in light of all of the evidence.  Given these issues, I tend to want to see a package of studies that reports both direct and conceptual replications.  If I see only conceptual replications, I get skeptical.  This is compounded when each study itself has a modest sample size with a relatively large effect size estimate that produces a 95% CI that gets quite close to 0 (see Point #1).

3. Breathless press releases.  Members of some of my least favorite crews in psychology seem to create press releases for every paper they publish.  (Of course, my perceptions could be biased!).  At any rate, press releases are designed by the university PR office to get media attention.  The PR office is filled with smart people trained to draw positive attention to the university using the popular media.  I do not have a problem with this objective per se.  However, I do not think this should be the primary mission of the social scientist.  Sometimes good science is only interesting to the scientific community.  I get skeptical when the press release makes the paper seem like it was the most groundbreaking research in all of psychology.  I also get skeptical when the press release draws strong real world implications from fairly constrained lab studies.  It makes me think the researchers overlooked the thorny issues with generalized causal inference.

I worry about saying this but I will put it out there – I suspect that some press releases were envisioned before the research was even conducted.  This is probably an unfair reaction to many press releases but at least I am being honest.  So I get skeptical when there is a big disconnect between the press release and the underlying research like when sweeping claims are made on a study of say 37 kids.  Or big claims about money and happiness are drawn from priming studies involving pictures of money.

I would be interested to hear what makes others skeptical of published claims.


A little background tangential to the main points of this post:

One way to generate press excitement is to quote the researcher(s) as being shocked by the results.  Unfortunately, I often think some of the shock and awe expressed in these press releases is disingenuous.  Why?  Researchers designed the studies to test specific predictions in the first place.  So they had some expectations as to what they would find.  Alternatively, if someone did obtain a shocking initial result, they should conduct multiple direct replications to make sure the original result was not simply a false positive.  This kind of narrative is not usually part of the press release.

I also hate to read press releases that generalize the underlying results well beyond the initial design and purpose of the research.  Sometimes the real world implications of experiments are just not clear.  In fact, not all research is designed to have real world implications.  If we take the classic Mook reading at face value, lots of experimental research in psychology has no clear real world implications.   This is perfectly OK but it might make the findings less interesting to the general public.  Or at least it probably requires more background knowledge to make the implications interesting.  Such background is beyond the scope of the press release.


More Null Results in Psychological Science — Comments on McDonald et al. (2014) and Crisp and Birtel (2014)

Full Disclosure:  I am second author on the McDonald et al. (2014) commentary.

Some of you may have seen that Psychological Science published our commentary on the Birtel and Crisp (2012) paper.  Essentially we tried to replicate two of their studies with larger sample sizes (29 versus 240 and 32 versus 175, respectively) and obtained much lower effect size estimates. It is exciting that Psychological Science published our work and I think this is a hint of positive changes for the field.  Hopefully nothing I write in this post undercuts that overarching message.

I read the Crisp and Birtel response and I had a set of responses (shocking, I know!). I think it is fair that they get the last word in print but I had some reactions that I wanted to share.  Thus, I will air a few in this blog post. Before diving into issues, I want to reiterate the basic take home message of McDonald et al. (2014):

“Failures to replicate add important information to the literature and should be a normal part of the scientific enterprise. The current study suggests that more work is needed before Birtel and Crisp’s procedures are widely implemented. Interventions for treating prejudice may require more precise manipulations along with rigorous evaluation using large sample sizes.” (p. xx)

1.  Can we get a mulligan on our title? We might want to revise the title of our commentary to make it clear that our efforts applied to only two specific findings in the original Birtel and Crisp (2012) paper. I think we were fairly circumscribed in the text itself but the title might have opened the door for how Crisp and Birtel (2014) responded.  They basically thanked us for our efforts and pointed out that our two difficulties say nothing about the entire imagined contact hypothesis.  They even argued that we “overgeneralized” our findings to the entire imagined contact literature.  To be frank, I do not think they were being charitable to our piece with this criticism because we did not make this claim in the text.  But titles are important and our title might have suggested some sort of overgeneralization.  I will let readers make their own judgments.  Regardless, I wish we had made the title more focused.

2.  If you really believe the d is somewhere around .35, why were the sample sizes so small in the first place?  A major substantive point in the Crisp and Birtel (2014) response is that the overall d for the imagined contact literature is somewhere around .35 based on a recent Miles and Crisp (2014) meta-analysis.  That is a reasonable point but I think it actually undercuts the Birtel and Crisp (2012) paper and makes our take home point for us (i.e., the importance of using larger sample sizes in this literature).  None of the original Birtel and Crisp (2012) studies had anywhere near the power to detect a population d of .35.  If we take the simple two-group independent t-test design, the power requirements for .80 suggest the need for about 260 participants (130 in each group).   The largest sample size in Birtel and Crisp (2012) was 32.
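The sample size figure in point 2 can be approximated with a short calculation. This is a sketch that assumes a normal approximation plus a commonly used small-sample correction term for the t-test; dedicated power software based on the noncentral t distribution is the more exact route. The function name `n_per_group` is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate n per group for a two-tailed, two-group t-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # Normal-approximation sample size per group
    n_normal = 2 * (z_alpha + z_beta) ** 2 / d ** 2
    # Small correction for using t rather than z (an assumption of this sketch)
    return ceil(n_normal + z_alpha ** 2 / 4)


if __name__ == "__main__":
    for d in (0.35, 0.80):
        n = n_per_group(d)
        print(f"d = {d:.2f}: about {n} per group, {2 * n} in total")
```

Under these assumptions, d = .35 calls for about 130 per group (260 in total), far more than the 32 participants in the largest original study.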

3. What about the ManyLabs paper?  The now famous ManyLabs paper of Klein et al. (in press) reports a replication attempt of an imagined contact study (Study 1 in Husnu & Crisp, 2010).  The ManyLabs effort yielded a much lower effect size estimate (d = .13, N = 6,336) than the original report (d = .86 or .84 as reported in Miles & Crisp, 2014; N = 33).  This is quite similar to the pattern we found in our work.  Thus, I think there is something of a decline effect in operation.  There is a big difference in interpretation between a d of .80 and a d around .15.  This should be worrisome to the field especially when researchers begin to think of the applied implications of this kind of work.

4. What about the Miles and Crisp Meta-Analysis (2014)? I took a serious look at the Miles and Crisp meta-analysis and I basically came away with the sinking feeling that much more research needs to be done to establish the magnitude of the imagined contact effects.  Many of the studies used in the meta-analysis were grossly underpowered.  There were 71 studies and only 2 had sample sizes above 260 (the threshold for having a good chance to detect a d = .35 effect using the standard between-participants design).  Those two large studies yielded basically null effects for the imagined contact hypothesis (d = .02 and .05, ns = 508 and 488, respectively). The average sample size of the studies in the meta-analysis was 81 (81.27 to be precise) and the median was 61 (Min. = 23 and Max. = 508).  A sample size of 123 was in the 90th percentile (i.e., 90% of the samples were below 123) and nearly 80% of the studies had sample sizes below 100.

Miles and Crisp (2014) were worried about sample size but perhaps not in the ways that I might have liked.   Here is what they wrote: “However, we observed that two studies had a sample size over 6 times the average (Chen & Mackie, 2013; Lai et al., 2013). To ensure that these studies did not contribute disproportionately to the summary effect size, we capped their sample size at 180 (the size of the next largest study) when computing the standard error variable used to weight each effect size.” (p. 13).  Others can weigh in about this strategy but I tend to want to let the sample sizes “speak for themselves” in the analyses, especially when using a random-effects meta-analysis model.

What’s it all mean?

Not to bring out the cliché but I think much more work needs to be done here.  As it stands, I think the d = .35 imagined contact effect size estimate is probably upwardly biased.  Indeed, Miles and Crisp (2014) found evidence of publication bias such that unpublished studies yielded a smaller overall effect size estimate than published studies (but the unpublished studies still produce an estimate that is reliably larger than zero).  However this shakes out, researchers are well advised to use much larger sample sizes than tend to characterize this literature based on my summary of the sample sizes in Miles and Crisp (2014).  I also think more work needs to be done to evaluate the specific Birtel and Crisp (2012) effects.  We now have collected two more unpublished studies with even bigger sample sizes and we have yet to get effect sizes that approximate the original report.

I want to close by trying to clarify my position.  I am not saying that the effect sizes in question are zero or that this is an unimportant research area.  On the contrary, I think this is an incredibly important topic and thus it requires even greater attention to statistical power and precision.


Updated 26 Feb 2014: I corrected the sample size for Study 1 from 204 to 240.

Warm Water and Loneliness

Our paper on bathing/showering habits and loneliness has been accepted (Donnellan, Lucas, & Cesario, in press).  The current package has 9 studies evaluating the correlation between trait loneliness and a preference for warm showers and baths as inspired by Studies 1a and 1b in Bargh and Shalev (2012; hereafter B & S).  In the end, we collected data from over 3,000 people and got effect size estimates that were considerably smaller than the original report.  Below are some random reflections on the results and the process. As I understand the next steps, B & S will have an opportunity to respond to our package (if they want) and then we have the option of writing a brief rejoinder.

1. I blogged about our inability to talk about original B & S data in the Fall of 2012.  I think this has been one of my most viewed blog entries (pathetic, I know).  My crew can apparently talk about these issues now so I will briefly outline a big concern.

Essentially, I thought the data from their Study 1a were strange. We learned that 46 of the 51 participants (90%) reported taking less than one shower or bath per week.  I can see that college students might report taking less than 1 bath per week, but showers?  The modal response in each of our 9 studies drawn from college students, internet panelists, and mTurk workers was always “once a day” and we never observed more than 1% of any sample telling us that they take less than one shower/bath per week.  So I think this distribution in the original Study 1a has to be considered unusual on both intuitive and empirical grounds.

The water temperature variable was also odd given that 24 out of 51 participants selected “cold” (47%) and 18 selected “lukewarm” (35%).   My own intuition is that people like warm to hot water when bathing/showering.  The modal response in each of our 9 samples was “very warm” and it was extremely rare to ever observe a “cold” response.

My view is that the data from Study 1a should be discarded from the literature. The distributions from 1a are just too weird.  This would then leave the field with Study 1b from the original B & S package based on 41 community members versus our 9 samples with over 3,000 people.

2.  My best meta-analytic estimate is that the correlation between trait loneliness and the water temperature variable is .026 (95% CI: -.018 to .069, p = .245).  This is based on a random effects model using the 11 studies in the local literature (i.e., our 9 studies plus Studies 1a and 1b – I included 1a to avoid controversy).  Researchers can debate about the magnitude of correlations but this one seems trivial to me especially because we are talking about two self-reported variables. We are not talking about aspirin and a life or death outcome or the impact of a subtle intervention designed to boost GPA.  Small effects can be important but sometimes very small correlations are practically and theoretically meaningless.

3. None of the original B & S studies had adequate power to detect something like the average .21 correlational effect size found across many social psychological studies (see Richard et al., 2003).  Researchers need around 175 participants with power set to .80 for the r = .21 expectation. If one takes sample size as an implicit statement about researcher expectations about the underlying effect sizes, it would seem like the original researchers thought the effects they were evaluating were fairly substantial.  Our work suggests that the effects in question are probably not.
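The "around 175" figure in point 3 can be recovered with a short calculation. The sketch below assumes the usual Fisher z approximation for testing a correlation against zero (two-tailed, alpha = .05); the function name `n_for_r` is hypothetical, and exact power software may give a slightly different number.

```python
from math import atanh, ceil
from statistics import NormalDist


def n_for_r(r, power=0.80, alpha=0.05):
    """Approximate sample size needed to detect a correlation of r."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # On the Fisher z scale the standard error is 1 / sqrt(n - 3)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)


if __name__ == "__main__":
    print("Participants needed for r = .21:", n_for_r(0.21))
```

The result lands in the mid-170s, well above the 51 and 41 participants in the original Studies 1a and 1b.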

In the end, I am glad this paper is going to see the light of day.  I am not sure all the effort was worth it but I hope our paper makes people think twice about the size of the connection between loneliness and warm showers/baths.

25 Jan 2014:  Corrected some typos.