Preliminary Thoughts about Guidelines and Recommendations for Exact Replications

Thanks to Chris Fraley and Fred Oswald for earlier comments on these ideas.

After the embarrassing methodological travesties of the last two years (e.g., Bem’s publication of the ESP study in JPSP; the Big Three Fraudsters: Stapel, Smeesters, and Sanna; Bargh’s Psychology Today rants), there is increased interest in replication studies. This is a great development, but there are some nuts-and-bolts issues that matter for conducting informative replications. If the true population effect size is small and your replication study has a very small sample size, the replication attempt will not be very informative.

Thus, I started to think about a set of guidelines for designing exact (or near-exact) replication studies that might produce meaningful data.  I let this material sit on my desktop for months but I decided to post it here.

Three big issues have occurred to me:

A. What counts as a replication?  A directional hit, such that the new result is in the same direction as the original paper and statistically significant at p < .05 (or should it be .01 or .001)?  Or an effect size estimate that is in the ballpark of the original?  Some friends/colleagues of mine think the first outcome counts as a replication, but I am not convinced.  Why? A trivial effect size will reach significance at p < .05 with a large enough sample size.  Let’s consider a real-life example. Bargh’s original walking study (experiment 2a) generated a d estimate of around 1.08 (N = 30) in the published paper (computed from the reported t of 2.86 with df = 28; the mean difference between the two conditions was .98 seconds).   What is remarkable about Bargh et al. (1996) is probably the size of the effect.  (How many ds > 1.00 do you see in your work?) If I redo his study with 10,000 participants per condition and get a d-metric effect size estimate of .10 (p < .05), did I reproduce his results?  I don’t have the best answer to this question, but I would prefer to count as a replication any study that obtains an effect size in the ballpark of the original study (to be arbitrary: say, the 95% CIs overlap?).  This perspective leads to the next issue…
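
As a quick illustration, here is a minimal sketch in Python of how one might recover d from the reported t and apply the CI-overlap criterion. The formulas are common large-sample approximations (d = 2t/√df for an equal-n design, and the usual approximate variance of d), and the mega-replication numbers are the hypothetical ones from the paragraph above:

```python
import math

def d_from_t(t, df):
    """Approximate Cohen's d from an equal-n, independent-samples t-test:
    d = 2t / sqrt(df)."""
    return 2 * t / math.sqrt(df)

def ci_for_d(d, n1, n2, z=1.96):
    """Approximate 95% CI for d using the common large-sample variance:
    (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return (d - z * se, d + z * se)

# Original study: t(28) = 2.86, N = 30 (15 per condition)
d_orig = d_from_t(2.86, 28)                # ~1.08, matching the published estimate
lo1, hi1 = ci_for_d(d_orig, 15, 15)        # ~(0.31, 1.85)

# Hypothetical mega-replication: d = .10 with 10,000 per condition
lo2, hi2 = ci_for_d(0.10, 10_000, 10_000)  # ~(0.07, 0.13)

print(f"original:    d = {d_orig:.2f}, 95% CI [{lo1:.2f}, {hi1:.2f}]")
print(f"replication: d = 0.10, 95% CI [{lo2:.2f}, {hi2:.2f}]")
print("CIs overlap:", lo2 <= hi1 and lo1 <= hi2)
```

By the overlap criterion, the d = .10 result would not count as a replication of the d ≈ 1.08 original, statistically significant or not.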

B. What kind of effect size estimate should researchers expect when planning the replication study?  I think Ioannidis is a tremendously smart person (e.g., Ioannidis, 2008, Epidemiology), so I trust him when he argues that most discovered effect sizes are inflated.  Thus, I think researchers should expect some “shrinkage” in effect size estimates upon replication.  This unpleasant reality has consequences for study design.  Ultimately, I think a replication study should have a sample size that is at least equal to the original and preferably much larger.  A much smaller sample size than the original is not a good attribute of a replication study.

C. Do you address obvious flaws in the original?  Nearly all studies have flaws, and sometimes researchers make inexplicable choices.  Do you try to fix these when conducting the replication study?  Say a group of researchers investigated the correlation between loneliness and taking warm showers/baths (don’t ask) and decided to use only 10 out of 20 items on a well-established loneliness measure.  What do you do?  Use only their 10 items (if you could figure those out from the published report) or use the whole scale? My view is that you should use the full measure, but that might mean the new study is only a near-exact replication.  Fortunately, I can extract the 10 items from the 20 items, so things are fine in this case.  Other examples with different IVs/DVs might not be so easy to handle.

In light of those issues, I came up with these quick-and-dirty recommendations for simple experiments or correlational studies (i.e., replication studies where it is easy to identify a population correlation or mean difference of interest).

1. Read the original study thoroughly and calculate effect size estimates if none are presented.   Get a little worried if the original effect size seems large relative to other similar effect size estimates in the literature.  If you are clueless about expected effect sizes, get educated. (Cluelessness about expected effect sizes strikes me as a major indicator of a poor psychological researcher.)  Richard et al. (2003; Review of General Psychology) offer a catalogue of effect sizes in social psychology (the expected value might be around a d of .40 or a correlation of .20, if I recall correctly). Other sources are Meyer et al. (2001; American Psychologist) and Wetzels et al. (2011; Perspectives on Psychological Science; thanks to Tim Pleskac for the recommendation). Wetzels et al. summarize more experimental research in cognitive psychology.

2. In line with the above discussion and the apparent prevalence of questionable research practices/researcher degrees of freedom, expect that the published effect size estimate is positively biased relative to the true population value. Thus, you should attempt to collect a larger sample size for your replication study.  Do a series of simple power calculations assuming the population effect size is 90%, 75%, 50%, 25%, and 10% of the published value (see the sketch below).  Use those values to decide on the new sample size.  When in doubt, go large.  There is a point at which an effect is too small to care about, but this is hard to know and it depends on a number of factors.  Think about the confidence interval around the parameter estimate of interest.  Smaller is better, and a larger N is the royal road to smaller confidence intervals.
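
Here is a minimal sketch of those calculations (assuming the Python statsmodels package; the published d of 1.08 is borrowed from the walking-study example above, with power set to .80 and alpha to .05):

```python
# A sketch of the recommended "shrinkage" power calculations.
from math import ceil
from statsmodels.stats.power import TTestIndPower

published_d = 1.08  # the walking-study estimate from the example above
solver = TTestIndPower()
for fraction in (0.90, 0.75, 0.50, 0.25, 0.10):
    shrunk_d = fraction * published_d
    n = solver.solve_power(effect_size=shrunk_d, power=0.80, alpha=0.05)
    print(f"true d = {fraction:.0%} of published ({shrunk_d:.2f}): "
          f"n = {ceil(n)} per group")
```

Even modest shrinkage pushes the required n well past the original 15 per condition, which is the basic argument for going large.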

3. Consider contacting the original authors for their materials and procedures. Hopefully they are agreeable and send you everything.  If not, get nervous, but do the best you can to use their exact measures from the published write-up. ***Controversial:  Note in the write-up if they ignored your good-faith attempts to obtain their materials. If there was a stated reason for not helping you, inform readers of that reason.  I think the community needs to know who is willing to facilitate replications and who is not.***

4. Conduct the study with care.

5. Analyze the data thoroughly. Compute effect size estimates (a sketch follows below). Compare them with the original.  Plan to share your dataset with the original authors, so keep good documentation and careful notes.  (Actually, you should plan to share your dataset with the entire scientific community; see Wicherts & Bakker [2012, Intelligence].)
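
For the effect size computation in step 5, a minimal sketch (the summary statistics here are entirely invented for illustration):

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d from group means and SDs, using the pooled SD."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

# Hypothetical replication data: means, SDs, and ns for the two conditions
d_rep = cohens_d(m1=7.8, sd1=1.9, n1=120, m2=7.3, sd2=2.0, n2=120)
print(f"replication d = {d_rep:.2f}")  # compare with the original estimate and its CI
```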

6. Write up the results.  Try to strike an even-handed tone if you fail to replicate the published effect size estimate.  Chance is lumpy (Abelson) and no one knows the true population value.  Write as if you will send the paper to the original authors for comments.

7. Try to publish the replication or send it to the Psych File Drawer website (http://www.psychfiledrawer.org/).  The field has got to keep track of these things.

8. Take pride in doing something scientifically important even if other people don’t give a damn.  Replication is a critical scientific activity (Kline, 2004, p. 247) and it is time that replication studies are valued.


Author: mbdonnellan

Professor, Social and Personality Psychology, Texas A&M University

8 thoughts on “Preliminary Thoughts about Guidelines and Recommendations for Exact Replications”

  1. I have some unusual perspectives on these issues. (At least some of my Illinois colleagues consider them unusual.) Specifically, I do not think “replication” is something for which we should be striving as an end in and of itself. The replication mindset places too much value on individual studies. I would prefer to approach these issues using the lens of meta-analysis.

    To clarify the distinction, consider the following thought experiment:

    Research Team A conducts a study on priming and walking speed (the parameter of interest is d). Based on a sample size of 40, they estimate d. They then conduct the same study again (with an n of 40) because they have read some valuable things about direct replications.

    Research Team B conducts the same study using a sample size of 100.

    Which team has done the better research? If you value “direct replication,” then the answer is obvious: Research Team A has done the higher quality research. They designed and ran a study, estimated a parameter, and then did an exact replication in which they estimated the parameter again. Research Team B, in contrast, failed to conduct a replication study.

    If you think like a meta-analyst, however, you would give a slight hat tip to Research Team B. Why? Team B’s estimate is based on data from 100 individuals whereas Team A’s research involved 40 + 40 individuals. Team B didn’t conduct a direct replication study in the traditional sense of the term. Nonetheless, their estimate of the parameter of interest will be more precise.
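
    A quick sketch of that precision claim (using the standard large-sample variance of d, evaluated at d = 0 for simplicity, and fixed-effect inverse-variance pooling; the ns come from the thought experiment above):

    ```python
    import math

    def var_d(n1, n2, d=0.0):
        """Large-sample sampling variance of Cohen's d."""
        return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

    # Team A: two studies of N = 40 (20 per condition), combined by
    # inverse-variance weighting, as in a fixed-effect meta-analysis
    v_study = var_d(20, 20)
    v_team_a = 1 / (1 / v_study + 1 / v_study)

    # Team B: one study of N = 100 (50 per condition)
    v_team_b = var_d(50, 50)

    print(f"SE for Team A (40 + 40): {math.sqrt(v_team_a):.3f}")  # ~0.224
    print(f"SE for Team B (100):     {math.sqrt(v_team_b):.3f}")  # ~0.200
    ```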

    My preference is for studies (and research literatures) in which parameters are estimated with a high degree of precision. I’d hate to see a study based on 400 people dismissed because it hasn’t been replicated while a series of n = 20 studies is celebrated because replications were conducted.

    A meta-analytic perspective suggests some alternative recommendations for designing replication studies. Should a replication study use the same sample size as the original study? The question only makes sense if we place the original study on a special pedestal. From a meta-analytic perspective, the recommendation should be to use the largest sample size possible, given your current resources, your personal or theoretical tolerance for sampling error, and whatever trade-offs are involved. Should you use a shortened version of the original measures when the full version is available? Not if your goal is to estimate d as well as possible.

    There is nothing special about the “original” study other than the fact that it was done first. The original study is potentially valuable because it alerts researchers to a theoretically or intuitively interesting parameter. The fact that the study was “first,” however, doesn’t mean that it was in a unique position to answer the question better than studies that might follow. We shouldn’t discourage improvements to research methods for the sake of sanctifying the original study.

    1. Being the other faculty member at the U of I who tends to find some of Dr. Fraley’s positions unusual, let me weigh in… I don’t find Chris’s points very compelling, because a meta-analyst would be interested in all of the data, not just the one study with the bigger N.

      The point that Chris and I agree on, and which is described very nicely in Geoff Cumming’s new book (http://www.routledge.com/books/details/9780415879682/), is that we should be aspiring to make point estimates with increasing precision. Period. This perspective nullifies many issues about how to conduct replications, since any new findings, especially those from a study that directly replicates the original, will provide valuable new information, even if the new study is underpowered compared to the original.

      We worry too much about the nature of the replication, in part because we still rely solely on NHST to evaluate our studies. If we did not care about statistical significance, then most of the points raised by Donnellan would be moot.

      Finally, replication is easy. Redo the original study using the same materials and methods (as closely as possible), without any concern about the findings. Replication is a method issue, not a statistical significance issue, or even an effect size issue–at least in the case of a specific study.

  2. Interesting post, Brent. Chris, Uli Schimmack has a paper coming out in Psych Methods that makes many similar points to the ones you raise here: “The ironic effect of significant results on the credibility of multiple-study articles.” It’s a good read.

  3. Chris, I wholeheartedly agree with you that one large study is better than many small ones, but that doesn’t mean that replications (especially independent replications) are not still valuable. Well-done, direct replications are one of the only ways to call bull on a published finding. We shouldn’t put the original findings on a pedestal, but because the finding is published (and must therefore be “true”), researchers tend to do so. So if Bargh or whoever runs a study with n = 30, gets a significant effect (perhaps by chance), publishes it, and self-cites the crap out of it until people accept it as canon, then the only way to get the field to consider discounting the finding is to directly replicate it and show that the effect is trivially small or non-existent. I agree that the way to do this is with as large a sample as possible, ideally much larger than the original sample size, in order to give the best estimate of the true effect size. Looking forward to more blog posts, Brent 🙂

  4. One of the things that sometimes gets left out of these discussions is the importance of independent replication. That’s implicit in Brent’s guidelines (since he is talking about how to replicate somebody else’s work), but I think it matters enough to be brought to the forefront.

    To take Chris’s example: suppose that Team A did the initial N=40 study, but Team C did the N=40 replication. Now, which body of evidence is more convincing: the N=40+40 1-2 combo of Teams A and C, or the single N=100 study of Team B?

    There’s no right answer. But all else held equal, the more independent Team C is from Team A (no personnel in common, one PI isn’t the former student of the other, no reason to suspect shared allegiance effects or the like), the more convincing I would find the 1-2 combo.

    When a replication is independent, it becomes more than just a test of abstract statistical properties (sampling error etc.). It’s also a test of how thoroughly the original researchers reported their methods. It’s a test of whether the effect depends on local conditions that are assumed not to matter (like the personnel conducting the study, the historical and cultural milieu in which the lab is doing its research, the color of the paint in the lab room, whatever). And it’s a chance for another research team to obtain and review the stimuli and measures, which often aren’t part of the published paper (and they should be, but that’s a different problem).

    This isn’t a contradiction of the meta-analytic mindset that Chris is proposing, but it is an expansion of it. In a meta-analysis, you can test for heterogeneity of effect sizes; you can code things like geographical location, publication year, etc. and test them as moderators; you can code for allegiance effects and test them as moderators too; etc. But short of a proper meta-analysis, the small-scale analog is that when an effect replicates robustly across conditions that aren’t supposed to matter, you trust it more; and that’s more likely when the replication is independent.
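
    For what it’s worth, the basic heterogeneity check is easy to compute. A sketch with invented effect sizes and sampling variances (Cochran’s Q and the I^2 statistic):

    ```python
    import numpy as np

    # Invented d estimates and sampling variances from k hypothetical labs
    d = np.array([0.45, 0.10, 0.62, 0.05])
    v = np.array([0.05, 0.02, 0.08, 0.01])

    w = 1 / v                               # inverse-variance weights
    d_pooled = np.sum(w * d) / np.sum(w)    # fixed-effect pooled estimate
    Q = np.sum(w * (d - d_pooled) ** 2)     # Cochran's Q (heterogeneity)
    df = len(d) - 1
    I2 = max(0.0, float(Q - df) / float(Q)) * 100  # % variability beyond sampling error

    print(f"pooled d = {d_pooled:.2f}, Q = {Q:.2f} on {df} df, I^2 = {I2:.0f}%")
    ```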

  5. These are cool comments! I agree with Chris about the importance of focusing on the precision of parameter estimates and ultimately about the value of meta-analytic thinking. I love large sample sizes and I hoped that idea came across in step #2. One of my favorite J. Cohen quotes is this one: “I have so heavily emphasized the desirability of working with few variables and large sample sizes that some of my students have spread the rumor that my idea of the perfect study is one with 10,000 cases and no variables. They go too far” (1990, p. 1305).

    Incidentally, I just finished reading Geoff Cumming’s (2012) book Understanding the New Statistics and he stresses estimation very heavily. He develops the idea that researchers should care about the informativeness of a study rather than power in the traditional sense (consistent with his anti-NHST perspective). The idea is that an informative study provides useful information about the world and I think it mostly boils down to the precision of the parameter estimate for him. Increasing the sample size and using “good” measures are the things researchers can do to improve the “information yield” of a study.

    (I might post a quick review of the Cumming book in the future but I think this is a great textbook for graduate students. Although I knew most of the material, the presentation was clear and engaging.)

    I also agree with Sanjay and Katie about the importance of independent replication by outsiders. This approach makes it less likely that the experimental effect depends on some weird package of unreported manipulations (e.g., the effect is obtained only with attractive opposite-sex lab assistants dressed in lab coats while carrying clipboards). I don’t find effects that can ONLY be obtained by one lab or “crew” to be very interesting. This is one of my concerns with the Bargh elderly-prime studies. Why does Pashler apparently struggle to replicate the effect using what strikes me as a more precise method of measuring walking speed (a digital timing system)? Is there some additional ingredient necessary to get the effect to work? What is it, and what does that mean for the underlying theory? And why doesn’t Bargh cite the successful independent replications in his blog posts?

    As to Katie’s point, I worry a lot about those “effects” that have only been demonstrated once (or only by one lab) and then take on a life of their own in the literature. I love the David Lykken line: “As Mark Twain put it…it is not so much what we don’t know that hurts us, as those things we do know that aren’t so” (Lykken, 1991, p. 8). I think there are a number of findings in the literature that are exaggerations and probably fall under this quotation. For example, the effect size estimates Pashler reported in a table in this paper strike me as implausible. (Thanks to Chris for sending me the link.)

    http://laplab.ucsd.edu/articles/Pashler_etal_2012.pdf

    One thing I love about the Greg Francis papers is the tables showing the effect size estimates and sample sizes across a published article. It can be striking to see how small the sample sizes are and how much the estimates flop around from study to study. Few individual studies seem to have enough power (power = .80) to detect a moderate effect size of, say, d = .40 (roughly 99 per group at α = .05, two-tailed). So what does that say about the field? The precision of the estimates cannot be very good.
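
    To give a flavor of the logic behind those tables (this is only a sketch in the spirit of Francis’s approach, not his exact procedure; it assumes the statsmodels package, and the per-study d and n values are invented): estimate each study’s power and ask how probable it is that every study in the article came out significant.

    ```python
    from statsmodels.stats.power import TTestIndPower

    solver = TTestIndPower()
    # Invented (d, n per group) pairs from a hypothetical multi-study article
    studies = [(0.45, 20), (0.50, 18), (0.40, 25), (0.55, 16)]

    p_all_significant = 1.0
    for d, n_per_group in studies:
        power = solver.power(effect_size=d, nobs1=n_per_group, alpha=0.05)
        p_all_significant *= power

    # If every reported study is significant but this probability is low,
    # the package of results looks too good to be true
    print(f"P(all {len(studies)} studies significant) = {p_all_significant:.3f}")
    ```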

    All of these ideas are exciting because they seem to be pathways toward a cumulative “soft” psychology. That’s hopefully the point.

  6. This is a great discussion and I’ve learned a lot from reading everyone’s perspective on these issues. (With the exception of Brent R’s, of course!) Thanks for breaking the ice, Brent D.

    Jennifer: Uli sent me the paper in question. It is fantastic! Thanks for the recommendation.

    Here is a nice quote from the paper:

    “The IC-index [an index that is related to the index that G. Francis uses] can be helpful in putting pressure on editors and journals to curb the proliferation of false-positive results because it can be used to evaluate editors and journals in terms of the credibility of the results that are published in these journals. As everybody knows, the value of a brand rests on trust, and it is easy to destroy this value when consumers lose that trust.”

    Ever since the publication of the Bem paper and the aftermath that followed, one of the things that has concerned me is that it will be impossible to change the standards in our field without creating change or modifying incentives at the highest editorial/journal levels. Indeed, I’ve felt a bit hopeless at times given how resistant the top journals have been to acknowledging the basic problems and proposing constructive solutions.

    Uli is essentially proposing a method that would allow this change to happen in a grassroots fashion. If outsiders started evaluating the quality of journals and their publications with respect to certain values (e.g., replicability of findings, quality of methods, precision of estimates, transparency of methods, materials, and data), the journals/editors would either have to adapt to improve their reputation or explicitly take the stance of not valuing these qualities. What we might need, in other words, is a formal “consumer reports” for our leading journals. (I bet that, across journals, it would correlate negatively with a measure of the media attention given to published articles. At least in the short run. But, in the long run, I bet it would nicely predict variation in impact, because scientists would want to build their work on designs and findings that they can trust.)
