One for the File Drawer?

I once read about an experiment in which college kids held either a cold pack or a warm pack and then reported their levels of so-called trait loneliness. We just tried a close replication of this experiment using the same short-form loneliness scale as the original authors. I won't out my collaborators, but I want to acknowledge their help.

The original effect size estimate was pretty substantial (d = .61, t = 2.12, df = 49), but we used 261 students so we could have more than adequate power. Our attempt yielded a much smaller effect size than the original (d = -.01, t = 0.111, df = 259, p = .912). The mean of the cold group (2.10) was darn near the same as the warm group (2.11; pooled SD = .61). (We also get null results if we restrict the analyses to just those who reported that they believed the entire cover story: d = -.17. The direction is counter to predictions, however.)
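As a quick sanity check on these numbers, a reported t can be converted to a d via d = t·√(1/n1 + 1/n2). Here is a minimal sketch; the per-group sizes are my assumptions, inferred from the reported dfs (neither study reports the exact split, so I split them as evenly as possible):

```python
import math

# Convert a reported t statistic to Cohen's d for two independent groups:
# d = t * sqrt(1/n1 + 1/n2).
# Per-group ns below are assumptions inferred from the reported dfs
# (df = 49 -> N = 51; df = 259 -> N = 261), split as evenly as possible.
def t_to_d(t, n1, n2):
    return t * math.sqrt(1 / n1 + 1 / n2)

d_original = t_to_d(2.12, 25, 26)        # original study: d around .59
d_replication = t_to_d(0.111, 130, 131)  # replication: d around .01
```

Both values land in the same ballpark as the reported ds (.61 and -.01), so the headline numbers are at least internally consistent.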

Failures to replicate are a natural part of science, so I am not going to make any bold claims in this post. I do want to point out that the reporting in the original is flawed. (The original authors used a no-pack control condition and found no evidence of a difference between the warm-pack and no-pack conditions, so we just focused on the warm versus cold comparison for our replication study.) The sample size was reported as 75 participants. The F value for the one-way ANOVA was reported as 3.80 and the degrees of freedom were reported as 2, 74. The numerator for the reference F distribution should be k − 1 (where k is the number of conditions), so the 2 was correct. However, the denominator was reported as 74 when it should be N − k, or 72 (75 − 3). Things get even weirder when you try to figure out the sample sizes for the three groups based on the degrees of freedom reported for each of the three follow-up t-tests.
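The df bookkeeping is simple enough to check mechanically. For a one-way ANOVA with N participants in k groups:

```python
# Degrees of freedom for a one-way ANOVA: (k - 1, N - k).
def anova_df(n_total, k_groups):
    return k_groups - 1, n_total - k_groups

print(anova_df(75, 3))  # → (2, 72), not the reported (2, 74)
```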

We found indications that holding a cold pack did do something to participants.  Both the original study and our replication involved a cover story about product evaluation. Participants answered three yes/no questions and these responses varied by condition.

Percentage answering “Yes” to the Pleasant Question:

Warm: 96%     Cold: 80%

Percentage answering “Yes” to the Effective Question:

Warm: 98%     Cold: 88%

Percentage answering “Yes” to the Recommending to a Friend Question:

Warm: 95%   Cold: 85%

Apparently, the cold packs were not evaluated as positively as the warm packs. I can foresee all sorts of criticism coming our way. I bet one thread is that we are not "skilled" enough to get the effect to work and a second thread is that we are biased against the original authors (either explicitly or implicitly). I'll just note these as potential limitations and call it good. Fair enough?

Update 7 February 2014: We decided to write this up for a journal article. In the process of preparing the manuscript and files for posting, Jessica noticed that I did not drop a participant with an ID we use for testing the survey system. Thus, the actual sample size should be 260, not 261. Fortunately, this did not change any of the conclusions. The t statistic was -0.006 (df = 258), p = .995, and the effect size was d = -.01. We also conducted a number of supplementary analyses to see if removing participants who expressed suspicion or had questionable values on the manipulation check variable (rating the temperature of the cold pack) impacted results. Nothing we could do influenced the bottom-line null result.

I caught my own mistake so I donated $20 to a charity I support – the American Cancer Society.


Author: mbdonnellan

Professor of Social and Personality Psychology, Texas A&M University

13 thoughts on “One for the File Drawer?”

  1. This is important work in terms of knocking down yet another silly pseudo-finding, but… honestly my first reaction was something like "you have too much to offer psychology to be spending time on this sort of drivel!"

    Seriously, the study with adequate power and appropriate analyses goes in the file drawer whereas the one with fishy degrees of freedom gets published? I realize personality psychology has a stake in the outcome of the social psych hangover, but when I see smart folks spend time on this kind of “research” I wonder about the work of value that could be getting done instead. Luckily MBD is productive enough to keep more than one ball in the air at once.

    But on some level shouldn't the big lesson be that social psych needs to come around to the generalizable methods (and straightforward, informative titles, while I'm ranting) of individual differences rather than the other way around?

    1. If smart, accomplished people don't spend time on this kind of thing, who will? We have the participant pool resources in the Big 10 to tackle this kind of nonsense, so why not? It's not time wasted if it saves one student from trying to do a master's or a dissertation on the topic.

  2. Any interest in subjecting the data to a Bayesian analysis? 261 students in two conditions is probably on the low side of producing precise enough estimates to be able to confirm the null, but with such a small mean difference I wonder if a large portion of the probabilistic mean differences would fall into a region of practical equivalence.

  3. Good point David!

    I just ran BEST using the Kruschke scripts. The mean difference was .01 and the 95% HDI was -.140 to .159 (based on 20,000 samples). In effect size terms, the package gave a mean d-metric estimate of .02 with a 95% HDI ranging from -.224 to .272. So these values are certainly (!) outside of a conventional ROPE. A larger sample size would be nice.

    I also like the underlying dig at precision here. To be sure, this sample size of 261 produces confidence intervals that might be too large for most people's stomachs (see, e.g., the HDI for the d above). I was too lazy last night to deal with the complexities of getting an approximate 95% CI for the d using the "traditional" methods. A back-of-the-envelope computation suggests a 95% CI for the d of somewhere between -.23 and .26.
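For the curious, that back-of-the-envelope CI can be sketched with the common large-sample standard error for d (group sizes of 130/131 are an assumption; the post reports only the total N of 261):

```python
import math

# Approximate 95% CI for Cohen's d using the standard large-sample
# variance approximation: SE^2 = (n1 + n2)/(n1 * n2) + d^2 / (2 * (n1 + n2)).
# Group ns of 130/131 are assumed; only the total N of 261 is reported.
def d_ci(d, n1, n2, z=1.96):
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

lo, hi = d_ci(-0.01, 130, 131)  # roughly (-0.25, 0.23)
```

The half-width of about .24 lines up with the traditional CI quoted above; the small discrepancies plausibly come from rounding and the sign of d.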

  4. I think it is important to do such replications (particularly, as you did, the high-powered ones), but I'd prefer to see this go through a peer review stage, so we can have questions answered (and, preferably, work with the original authors to see how methods may differ or not).

    The recent special issue in Social Psychology is a nice example, and one can easily pre-register replications at the OSF. Typically, I do not tend to detect mood effects (or pleasantness effects) for warm vs. cold. I'm not saying there are none, but I'd be curious about the temperature differences between the packs (I've seen talks where mood effects are obtained once the temperature differences are substantial, in particular toward the lower end). In addition, as with the original study, we need to find out far more about context effects of our labs. So, the original person may not have done the study ably, and the replicator may not have either, but we only find that out after a number of studies are published and we can compare the different effect sizes.

    I hope to see more of them and also that we don't conclude just on one non-replication that it "knocks down a silly pseudo-finding". We only know things after multiple replications/non-replications. Other labs have found effects of warmth on memory or behavior (as have other folks, such as Johan Karremans). I am hesitant to believe the trait findings, simply because I find it hard to believe one can move those around, but only through a number of studies could we find this out. Curious to hear more though, and I hope you send it in somewhere.

  5. There is an opportunity cost involved in preparing results for a peer-reviewed publication. Not only is there time and effort needed to write and polish text, there is an investment of emotional energy in having to deal with editors and reviewers who might be unpleasant. I think a paper on this topic should have more data. At the same time, I think Chris makes a good point about how I am investing my time. So I am torn. I also think David Funder has made a number of useful observations about what you have to deal with when you report a failure to replicate.

    For example, my earlier post about some of our showering replication studies generated accusations of me being incompetent, very junior, a plagiarist, someone motivated by the wrong reasons, etc. (I do take some responsibility for these reactions given the tone of my post but some of these were extreme). The 7-study paper about that showering result is now under review and it will be interesting to see how it is received.

    1. I get why it is important to do the work. To be frank, as a person who has spent a good chunk of time the last few years working on the DSM, I am very much living in a glass house. My calculus is: this is the manual that is used to diagnose real people, organize clinicians' thinking about prognosis and treatment, and dictate policy on a number of important fronts, including at NIH. So the time is worth it, even if it doesn't always feel like "real science". Just like social psychology is worth saving, as is the poor grad student who sets out on a thesis about this thinking they are going to find a d > .50.

      But during the DSM-5 process, I have also come to appreciate more clearly how little the manual contributes to actual clinical practice, in the sense that the same treatments tend to work for different disorders, clinically important differences between people are not well captured by the medical model, etc. And in the end the DSM is only sort of responsive to research anyway. It is also heavily political and reductionistic, psychology journals are mostly ignored, and there is even an "all of your numbers don't really matter in the face of my clinical experience" vibe. I begin to wonder whether or not the field of clinical psychology would be better off if we simply ignored the DSM (I have to credit Kristian Markon here for planting that seed).

      So I wonder the same thing here – maybe the next time you see a study in which the author seems to have spent more time on their title than their analyses, the most effective reaction in the long run will be to find something else to read. Why reinforce attention on poorly conducted studies about trivial issues when, in fact, basic psychological research has a fair amount of substance to offer the world, which needs it?

      All that said, I do like these take-down posts, so I am not at all averse, from a hedonistic perspective, to your keeping 'em coming, Brent!

  6. You can often save yourself a lot of time by looking for bias in the reported results. The following analysis took me about 20 minutes. From what I can gather, we are talking about

    Bargh & Shalev (2011). The Substitutability of Physical and Social Warmth in Daily Life. Emotion.

    The following is a back-of-the-envelope calculation of post-hoc power for a single main result from each experiment. The study reported multiple results for each experiment, so the probability of getting all of those effects to be significant is actually much lower than the post-hoc power values (how much depends on how the variables are correlated, which was not reported).

    Exp. 1a (duration of bath or shower) n=51, r=.29, p=0.038, power= 0.5425864
    Exp. 1b (duration of bath or shower) n=41, r=.33, p=0.035, power= 0.5603408
    Exp. 2 (cold-pack vs. warm pack) n1=25, n2=26, t=2.12, p= 0.0391, power= 0.5267013
    Exp. 3 (emotion regulation, main effect of temperature) F(2, 175)=3.11, p=0.047 power= 0.53
    Exp. 4 (this seems to be a case of getting an opposite result and spinning it as something they expected; at any rate it does not deter them from their conclusion)

    Product of power values is 0.085. So the set of experiments appears inconsistent. The theoretical conclusion might have been more believable if Bargh & Shalev (2011) had reported Experiment 4 as a failure.
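For anyone who wants to reproduce this kind of calculation, here is a sketch of post-hoc power for the two-sample case (Exp. 2), treating the observed t as the noncentrality parameter; exact values will vary a bit with the software and assumptions used, so this should land near, not exactly on, the figure quoted above:

```python
from scipy import stats

# Post-hoc ("observed") power for a two-sample t-test, plugging in the
# observed t as the noncentrality parameter -- a strong assumption, since
# the observed effect is only an estimate of the true one.
def posthoc_power(t_obs, n1, n2, alpha=0.05):
    df = n1 + n2 - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)       # two-sided critical value
    nc = abs(t_obs)                               # noncentrality parameter
    return (1 - stats.nct.cdf(t_crit, df, nc)) + stats.nct.cdf(-t_crit, df, nc)

power_exp2 = posthoc_power(2.12, 25, 26)  # in the neighborhood of 0.53
```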

    After you do this for a while, you can just look at the p values and get a feel for the presence of bias. For a set of 4 or 5 experiments to all reject the null and be bias free, the p values need to be quite small. Here they are all above 0.03.

    The presence of bias does not necessarily mean that any individual experiment in Bargh & Shalev (2011) is invalid, but personally I would not feel compelled to replicate a study from this set without some motivation other than this paper.

    If you should publish the null result, I think you should refrain from pooling your data with theirs in a meta-analysis. Their data _might_ be valid, but it might also have been produced with questionable research practices, and the latter could make the measurements invalid. It’s better to be safe than sorry.

    1. One more demonstration that the inconsistency test or incredibility index predicts failed replications. So much for Simonsohn's assumption that the null hypothesis of bias is true and Greg capitalizes on chance. Still waiting for an incredible article that reports a replicable finding. A science on steroids (John et al.) finally has a doping test and everybody can use it (if you know power). May the force (power) be with you!
