What’s the First Rule about John Bargh’s Data?

Answer: You do not talk about John Bargh’s data.

I went on hiatus to deal with back-to-school events and letter-of-recommendation writing. However, I think this is a good story that raises lots of issues. I need to say upfront that these opinions are mine and do not necessarily reflect anyone else's views. I might also be making a big enemy with this post, but I probably already have a few of those out there. To quote the Dark Knight: I'm not afraid, I'm angry.

Background: Bargh and Shalev (2012) published an article in Emotion where they predicted that trait loneliness would be “positively associated with the frequency, duration, and preferred water temperatures” of showers and baths (p. 156). The correlation between self-reported loneliness and self-reported “physical warmth extraction” from baths/showers was .57 in Study 1a (51 undergrads) and .37 in Study 1b (41 community members). This package received media attention and was discussed in a Psychology Today blog post with the title: “Feeling lonely? Take a warm bath.”

We failed to replicate this effect three times using three different kinds of samples. Our combined sample size was 925 and the overall estimate was -.02. We also used Bayesian estimation techniques and got similar results (the mean estimate was -.02 and 70% of the credible estimates were below zero). Again, the opinions expressed in this blog post are mine and only mine but the research was a collaborative effort with Rich Lucas and Joe Cesario.
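For readers curious how a single overall estimate is computed from several samples, a standard approach is fixed-effect pooling of correlations via Fisher's r-to-z transform. This is a minimal sketch; the sample-level correlations and ns below are hypothetical illustrations, not the actual estimates from our three studies.

```python
import math

def combine_correlations(rs, ns):
    """Fixed-effect combination of correlations via Fisher's r-to-z.

    Each r is transformed with atanh, weighted by n - 3 (the inverse of
    the z-transform's sampling variance), averaged, and back-transformed
    with tanh.
    """
    zs = [math.atanh(r) for r in rs]
    weights = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(z_bar)

# Hypothetical illustration (NOT the actual study estimates): three
# samples totaling 925 participants with near-zero correlations.
overall = combine_correlations([-0.05, 0.01, -0.02], [300, 325, 300])
```

With near-zero inputs like these, the pooled estimate stays near zero, which is the basic logic behind reporting a single combined correlation across replication samples.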

[As an aside, John Kruschke gave a workshop at MSU this past weekend about Bayesian estimation. It was engaging and informative. This link will take you to his in-press paper at JEP: General about the Bayesian t Test. It is well worth your time to read his paper.]

We just sent our paper off to undergo the peer review process. However, the point that I want to raise is more important than our findings. Bargh let Joe Cesario look at his data but he forbade us from talking about what Joe observed. So a gag order is in place.

I think this is bull****. There is no reason why there should be a veil of secrecy around raw data. How can we have an open and transparent science if researchers are not allowed to make observations about the underlying data used to make published claims?

I doubt very much that there is even a moderate association between trait loneliness and showering habits. It might not be zero, but it is hard to believe the population value is anywhere near .50. Consider Figure 1 in Richard, Bond, and Stokes-Zoota (2003, p. 336). This is a summary of 474 meta-analytic effect sizes in the r-metric across social psychology. Richard et al. noted that only 5.28% of the effect sizes they summarized were greater than .50. Viewed against this distribution, the .57 from Bargh and Shalev's Study 1a is unusual. A .57 correlation is something I might expect to see when calculating the correlation between two measures of very similar constructs using self-report scales.

So before more data are collected on this topic, I would hold off on making any recommendations about taking warm baths/showers to lonely people. To quote Uli Schimmack: “In the real world, effect sizes matter.” I think replication and transparency matter as well.


Free Advice about the Subject Pool

Around here, the Fall semester starts in just a few weeks. This means the MSU subject pool will soon be teeming with “volunteers” eager to earn their research participation credits. Like many of my colleagues, I have often wondered about the pros and cons of relying so heavily on college sophomores in the laboratory (e.g., Sears, 1986, 2008). Regardless of your take on these issues, it is hard to imagine that subject pools will go away in the near future. Thus, I think it is important to try to learn more about the characteristics of participants in these subject pools and to think more carefully about issues that may impact the generalizability of these types of studies. I still think college student subject pools generate convenience samples even if a certain researcher disagrees.

I did a paper with my former graduate student Edward Witt and our undergraduate assistant (Matthew Orlando) about differences in the characteristics of subject pool members who chose to participate at different points in the semester (Witt, Donnellan, & Orlando, 2011). We also tested for selection effects in the chosen mode of participation by offering an online and in-person version of the same study (participants were only allowed to participate through one mode).  We conducted that study in the Spring of 2010 with a total sample size of 512 participants.

In the original report, we found evidence that more extraverted students selected the in-person version of the study (as opposed to the online version) and that average levels of Conscientiousness were lower at the end of the semester compared to the beginning. In other words, individuals with relatively lower scores on this personality attribute were more likely to show up at the end of term. We also found that we had a greater proportion of men at the end of the term compared to the start. To be clear, the effect sizes were small and some might even say trivial. Nonetheless, our results suggested to us that participants at the start of the semester are likely to be different than participants at the end of the term in some ways. This result is probably unsurprising to anyone who has taught a college course and/or collected data from a student sample (sometimes naïve theories are credible).

We repeated the study in the Fall semester of 2010 but never bothered to publish the results (Max. N with usable data = 594). (We try to replicate our results when we can.) It is reassuring to note that the major results were replicated in the sense of obtaining similar effect size estimates and levels of statistical significance. We used the same personality measure (John Johnson’s 120-item IPIP approximation of the NEO PI-R) and the same design. Individuals who self-selected into the online version of the study were less extraverted than those who selected into the in-person version (d = -.18, t = 2.072, df = 592, p = .039; Witt et al., 2011: d = -.26).   This effect held controlling for the week of the semester and gender. Likewise, we had a greater proportion of men at the end of the term compared to the start (e.g., roughly 15% of the participants were men in September versus 43% in December).

The more interesting result (to me) was that average levels of Conscientiousness were also lower at the end of the semester than at the beginning (standardized regression coefficient for week = -.12, p = .005; model also includes gender). Again, the effect sizes were small and some might say trivial. However, a different way to understand this effect is to standardize Conscientiousness within-gender (women self-report higher scores) and then plot average scores by week of data collection.

The average for the first two weeks of data collection (September of 2010) was .29 (SD = 1.04) whereas the average for the last three weeks (December of 2010) was -.18 (SD = 1.00).  Viewed in this light, the difference between the beginning of the semester and the end of the semester starts to look a bit more substantial.
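The within-gender standardization step described above can be sketched in a few lines. This is a toy illustration with made-up scores and labels, not our actual data; the point is just that each score is z-scored against its own gender's mean and SD, so the known gender difference in Conscientiousness does not contaminate the week-of-semester comparison.

```python
from statistics import mean, stdev

def standardize_within_group(scores, groups):
    """Z-score each observation against the mean/SD of its own group,
    so group-level mean differences (here, gender) are removed before
    comparing averages across weeks of data collection."""
    stats = {}
    for g in set(groups):
        vals = [s for s, gg in zip(scores, groups) if gg == g]
        stats[g] = (mean(vals), stdev(vals))
    return [(s - stats[g][0]) / stats[g][1] for s, g in zip(scores, groups)]

# Hypothetical toy data: Conscientiousness scores and gender labels.
scores = [3.8, 4.1, 3.5, 3.9, 3.2, 3.6, 3.0, 3.4]
gender = ["F", "F", "F", "F", "M", "M", "M", "M"]
z = standardize_within_group(scores, gender)
```

After this step, averaging the z-scores within each week of data collection gives the .29 versus -.18 style comparison reported above.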

So here is my free advice: If you want more conscientious participants, be ready to run early in the term. If you want to have an easier time recruiting men, wait till the end of the term. (Controlling for C does not wipe out the gender effect.)

I would post the data but I am going to push Ed to write this up. We have a few other interesting variables designed to pick up on careless responding that we need to think through.

Note: Edward Witt helped me prepare this entry.

Replicability as a Publication Criterion

I wanted to re-read Cronbach (1957) and I stumbled across a 1957 letter to the American Psychologist from Ardie Lubin with the same title as this post: Replicability as a publication criterion.

Just a quick excerpt: “Replicability and generalizability, of course, are not new criteria, and assuredly all editors employ them now in judging the soundness of an article. The only novelty here is the weight which would be placed on the question of whether the results are replicable. Every author would be asked to show some attempt in this direction. Articles using replication designs which are not satisfactory to the editor could be given lowest publication priority. Articles with no attempt at replication would be rejected.”

An Incredible Paper (and I mean that in the best way possible)

Ulrich Schimmack has a paper in press at Psychological Methods that should be required reading for anyone producing or consuming research in soft psychology (Title: “The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles”).  Sadly, I doubt this paper will get much attention in the popular press.  Uli argues that issues of statistical power are critical for evaluating a package of studies and his approach also fits very nicely with recent papers by Gregory Francis.  I am excited because it seems as if applied researchers are beginning to have access to a set of relatively easy to use tools to evaluate published papers.

(I would add that Uli’s discussion of power fits perfectly well with broader concerns about the importance of study informativeness as emphasized by Geoff Cumming in his recent monograph.)

Uli makes a number of recommendations that have the potential to change the ratio of fiction to non-fiction in our journals. His first recommendation is to use power to explicitly evaluate manuscripts. I think this is a compelling recommendation. He suggests that authors need to justify the sample sizes in their manuscripts. There are too many times when I read papers and I have no clue why authors have used such small sample sizes. Such concerns do not lend themselves to positive impressions of the work.

Playing around with power calculations or power programs leads to sobering conclusions.  If you expect a d-metric effect size of .60 for a simple two independent-groups study, you need 45 participants in each group (N=90) to have 80% power. The sample requirements only go up if the d is smaller (e.g., 200 total if d = .40 and 788 total if d = .20) or if you want better than 80% power.  Given the expected value of most effect sizes in soft psychology, it seems to me that sample sizes are going to have to increase if the literature is going to get more believable.  Somewhere, Jacob Cohen is smiling. If you hate NHST and want to think in terms of informativeness, that is fine as well.  Bigger samples yield tighter confidence intervals. Who can argue with calls for more precision?
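The numbers in that paragraph can be checked with a few lines of code. A minimal sketch using the standard normal-approximation sample-size formula for a two-tailed independent-groups t test, plus the common small-sample correction of adding z²/4 (due to Guenther, 1981), reproduces the 45-per-group, 200-total, and 788-total figures:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Per-group sample size for a two-tailed independent-groups t test.

    Uses the normal approximation 2 * ((z_alpha/2 + z_power) / d)^2
    plus Guenther's (1981) correction of z_alpha/2^2 / 4, which brings
    the result in line with exact t-based calculations.
    """
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(power)
    return ceil(2 * ((z_a + z_b) / d) ** 2 + z_a ** 2 / 4)
```

Running `n_per_group(0.60)` gives 45 per group (N = 90), `n_per_group(0.40)` gives 100 per group (N = 200), and `n_per_group(0.20)` gives 394 per group (N = 788), matching the figures above.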

Uli discusses other strategies for improving research practices such as the value of publishing null results and the importance of rewarding the total effort that goes into a paper rather than the number of statistically significant p-values.   It is also worth rewarding individuals and teams who are developing techniques to evaluate the credibility of the literature, actively replicating results, and making sure published findings are legitimate.  Some want to dismiss them as witch hunters.  I prefer to call them scientists.

Preliminary Thoughts about Guidelines and Recommendations for Exact Replications

Thanks to Chris Fraley and Fred Oswald for earlier comments on these ideas.

After the embarrassing methodological travesties of the last two years (e.g., Bem's publication of the ESP study in JPSP; the Big Three Fraudsters, Stapel, Smeesters, and Sanna; Bargh's Psychology Today rants), there is increased interest in replication studies. This is a great development but there are some nuts and bolts issues that are important for conducting informative replications. If the true population effect size is small and your replication study has a very small sample size, the replication attempt will not be very informative.

Thus, I started to think about a set of guidelines for designing exact (or near-exact) replication studies that might produce meaningful data.  I let this material sit on my desktop for months but I decided to post it here.

Three big issues have occurred to me:

A. What counts as a replication? A directional hit such that the new result is in the same direction as the original paper and statistically significant at p < .05 (or should it be .01 or .001)? Or an effect size estimate that is in the ballpark of the original? Some friends/colleagues of mine think the first outcome counts as a replication but I am not convinced. Why? A trivial effect size will reach significance at p < .05 with a large enough sample size. Let's consider a real-life example. Bargh's original walking study (experiment 2a) generated a d estimate of around 1.08 (N = 30) in the published paper (computed from the reported t of 2.86 with df = 28; the mean difference between the two conditions was .98 seconds). What is remarkable about Bargh et al. (1996) is probably the size of the effect. (How many ds > 1.00 do you see in your work?) If I redo his study with 10,000 participants per condition and get a d-metric effect size estimate of .10 (p < .05), did I reproduce his results? I don't have the best answer for this question but I would prefer to count a replication as any study that obtains an effect size in the ballpark of the original study (to be arbitrary – say the 95% CIs overlap?). This perspective leads to the next issue…
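The conversion from a reported t to a d, and the confidence interval behind the CI-overlap criterion, can be sketched as follows. This is a rough illustration assuming equal group sizes and the usual large-sample standard error for d; the t = 2.86 and df = 28 come from the published paper as noted above.

```python
from math import sqrt

def d_from_t(t, df):
    """Convert a reported independent-groups t to Cohen's d,
    assuming equal group sizes (so n per group = df / 2 + 1)."""
    return 2 * t / sqrt(df)

def d_ci95(d, n1, n2):
    """Approximate 95% CI for d using the standard large-sample SE."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - 1.96 * se, d + 1.96 * se

d = d_from_t(2.86, 28)       # about 1.08, as reported above
lo, hi = d_ci95(d, 15, 15)   # 15 per condition, N = 30
```

With only 30 participants, the interval is very wide (roughly .3 to 1.8), which is exactly why a "ballpark" comparison of effect sizes, rather than a bare p < .05 cutoff, seems like the more sensible replication criterion.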

B. What kind of effect size estimate should researchers expect when planning the replication study?  I think Ioannidis is a tremendously smart person (e.g., 2008; Epidemiology) so I trust him when he argues that most discovered effect sizes are inflated.  Thus, I think researchers should expect some “shrinkage” in effect size estimates upon replication.  This unpleasant reality has consequences for study design.  Ultimately, I think a replication study should have a sample size that is equal to the original and preferably much larger.  A much smaller sample size than the original is not a good attribute of a replication study.

C. Do you address obvious flaws in the original?  Nearly all studies have flaws and sometimes researchers make inexplicable choices.  Do you try to fix these when conducting the replication study?  Say a group of researchers investigated the correlation between loneliness and taking warm showers/baths (don’t ask) and they decided to use only 10 out of 20 items on a well-established loneliness measure.  What do you do?  Use only their 10 items (if you could figure those out from the published report) or use the whole scale? My view is that you should use the full measure but that might mean that my new study is only a near-exact replication.  Fortunately, I can extract the 10 items from the 20 items so things are fine in this case.  Other examples with different IV/DVs might not be so easy to handle.

In light of those issues, I came up with these quick and dirty recommendations for simple experiments or correlational studies (replication studies when it is easy to identify a population correlation or mean-difference of interest).

1. Read the original study thoroughly and calculate effect size estimates if none are presented. Get a little worried if the original effect size seems large relative to other similar effect size estimates in the literature. If you are clueless about expected effect sizes, get educated. (Cluelessness about expected effect sizes strikes me as a major indicator of a poor psychological researcher.) Richard et al. (2003; Review of General Psychology) offer a catalogue of effect sizes in social psychology (the expected value might be around a d of .40 or a correlation of .20 if I recall correctly). Other sources are Meyer et al. (2001; American Psychologist) or Wetzels et al. (2011; Perspectives on Psychological Science – thanks to Tim Pleskac for the recommendation). Wetzels et al. summarize more experimental research in cognitive psychology.

2. In line with the above discussion and the apparent prevalence of questionable research practices/researcher degrees of freedom, expect that the published effect size estimate is positively biased from the true population value. Thus, you should attempt to collect a larger sample size for your replication study. Do a series of simple power calculations assuming the population effect size is 90%, 75%, 50%, 25%, and 10% of the published value. Use those values to decide on the new sample size. When in doubt, go large. There is a point at which an effect is too small to care about, but this is hard to know and it depends on a number of factors. Think about the confidence interval around the parameter estimate of interest. Smaller is better and a larger N is the royal road to smaller confidence intervals.
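The shrinkage-based planning in step 2 can be sketched with a standard normal-approximation power formula (with the usual z²/4 small-sample correction). The published d of .60 below is a hypothetical example, not a value from any particular paper:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    # Normal-approximation per-group n for a two-tailed independent-groups
    # t test, plus the common z^2/4 small-sample correction.
    z = NormalDist()
    z_a, z_b = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    return ceil(2 * ((z_a + z_b) / d) ** 2 + z_a ** 2 / 4)

published_d = 0.60  # hypothetical published estimate
for fraction in (0.90, 0.75, 0.50, 0.25, 0.10):
    true_d = published_d * fraction
    print(f"if true d = {true_d:.2f}, need {n_per_group(true_d)} per group")
```

The required n climbs steeply as the assumed shrinkage grows, which is the concrete reason "when in doubt, go large" is good advice.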

3. Consider contacting the original authors for their materials and procedures. Hopefully they are agreeable and send you everything.  If not, get nervous but do the best you can to use their exact measures from the published write-up. ***Controversial:  Note in the write-up if they ignored your good faith attempts to obtain their materials. If there was a stated reason for not helping you, inform readers of their reasons.  I think the community needs to know who is willing to facilitate replications and who is not. ***

4. Conduct the study with care.

5. Analyze the data thoroughly. Compute effect size estimates. Compare with the original.  Plan to share your dataset with the original authors so keep good documentation and careful notes.  (Actually you should plan to share your dataset with the entire scientific community, see Wicherts & Bakker [2012, Intelligence]).

6. Write up the results.  Try to strike an even-handed tone if you fail to replicate the published effect size estimate.  Chance is lumpy (Abelson) and no one knows the true population value.  Write as if you will send the paper to the original authors for comments.

7. Try to publish the replication or send it to the Psych File Drawer website (http://www.psychfiledrawer.org/).  The field has got to keep track of these things.

8. Take pride in doing something scientifically important even if other people don’t give a damn.  Replication is a critical scientific activity (Kline, 2004, p. 247) and it is time that replication studies are valued.