Things that make me skeptical…

Simine Vazire crafted a thought-provoking blog post about how some in the field respond to counter-intuitive findings.  One common reaction among critics of this kind of research is to claim that the results are unbelievable.  This reaction seems to fit with the maxim that extraordinary claims require extraordinary evidence (a.k.a. the Sagan doctrine).  For example, the standard of evidence needed to support the claim that a high-calorie/low-nutrient diet coupled with a sedentary lifestyle is associated with higher morbidity might be different than the standard of proof needed to support the claim that attending class is positively associated with exam performance.  One claim seems far more extraordinary than the other.  Put another way: Prior subjective beliefs about the truthiness of these claims might differ, and thus the research evidence needed to modify these pre-existing beliefs should be different.

I like the Sagan doctrine, but I think we can all appreciate the difficulties that arise when trying to determine the standards of evidence needed to justify a particular research claim.  There are no easy answers except for the tried-and-true response that all scientific claims should be thoroughly evaluated by multiple teams using strong methods and multiple operational definitions of the underlying constructs.  But this is a "long term" perspective and it provides little guidance when trying to interpret any single study or package of studies.  Except that it does, sort of.  A long-term perspective means that most findings should be viewed with a big grain of salt, at least initially.  Skepticism is a virtue (and I think this is one of the overarching themes of Simine's blog posts thus far).  However, skepticism does not preclude publication or even some initial excitement about an idea.  It simply precludes making bold and definitive statements based on initial results with unknown generality.  More research is needed because of the inherent uncertainty of scientific claims.  To quote a lesser-known U2 lyric – "Uncertainty can be a guiding light."

Anyways, I will admit to having the “unbelievable” reaction to a number of research studies.  However, my reaction usually springs from a different set of concerns rather than just a suspicion that a particular claim is counter to my own intuitions.  I am fairly skeptical of my own intuitions. I am also fairly skeptical of the intuitions of others.  And I still find lots of studies literally unbelievable.

Here is a partial list of the reasons for my skepticism. (Note: These points cover well-worn ground, so feel free to ignore them if it sounds like I am beating a dead horse!)

1.  Large effect sizes coupled with small sample sizes.  Believe it or not, there is guidance in the literature to help generate an expected value for research findings in "soft" psychology.  A reasonable number of effects are between .20 and .30 in the r metric and relatively few are above .50 (see Hemphill, 2003; Richard et al., 2003).  Accordingly, when I read studies that generate "largish" effect size estimates (i.e., |r| ≥ .40), I tend to be skeptical.  I think an effect size estimate of .50 is in fact an extraordinary claim.

My skepticism gets compounded when the sample sizes are small and thus the confidence intervals are wide.  This means that the published findings are consistent with a wide range of plausible effect sizes so that any inference about the underlying effect size is not terribly constrained.  The point estimates are not precise. Authors might be excited about the .50 correlation but the 95% CI suggests that the data are actually consistent with anything from a tiny effect to a massive effect.  Frankly, I also hate it when the lower bound of the CI falls just slightly above 0 and thus the p value is just slightly below .05.  It makes me suspect p-hacking was involved.   (Sorry, I said it!)
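To make the precision point concrete, here is a minimal sketch (my own illustration, not taken from any particular study) of how wide the 95% CI around a correlation is when the sample is small; the r and N values are hypothetical.

```python
# Approximate 95% CI for a Pearson r via the Fisher z transformation.
# The r and N values below are hypothetical, chosen only to illustrate precision.
import numpy as np
from scipy import stats

def r_confidence_interval(r, n, level=0.95):
    """CI for a correlation using the Fisher z approximation."""
    z = np.arctanh(r)                          # Fisher z of the observed r
    se = 1.0 / np.sqrt(n - 3)                  # standard error of z
    crit = stats.norm.ppf(1 - (1 - level) / 2)
    return np.tanh(z - crit * se), np.tanh(z + crit * se)

print(r_confidence_interval(0.50, 20))    # roughly (.07, .77): tiny to massive
print(r_confidence_interval(0.50, 200))   # roughly (.39, .60): far more informative
```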

2. Conceptual replications but no direct replications.  The multi-study package common to such prestigious outlets as Psychological Science or JPSP has drawn critical attention in the last three or so years.  Although these packages seem persuasive on the surface, they often show hints of publication bias on closer inspection.  The worry is that the original researchers actually conducted a number of related studies and only those that worked were published.  Thus, the published package reflects a biased sampling of the entire body of studies.  The ones that failed to support the general idea were left to languish in the proverbial file drawer.  This inflates effect size estimates and makes the case for an effect seem far more compelling than it should be in light of all of the evidence.  Given these issues, I tend to want to see a package of studies that reports both direct and conceptual replications.  If I see only conceptual replications, I get skeptical.  This is compounded when each study has a modest sample size and a relatively large effect size estimate that produces a 95% CI that gets quite close to 0 (see Point #1).

3. Breathless press releases.  Members of some of my least favorite crews in psychology seem to create press releases for every paper they publish.  (Of course, my perceptions could be biased!).  At any rate, press releases are designed by the university PR office to get media attention.  The PR office is filled with smart people trained to draw positive attention to the university using the popular media.  I do not have a problem with this objective per se.  However, I do not think this should be the primary mission of the social scientist.  Sometimes good science is only interesting to the scientific community.  I get skeptical when the press release makes the paper seem like it was the most groundbreaking research in all of psychology.  I also get skeptical when the press release draws strong real world implications from fairly constrained lab studies.  It makes me think the researchers overlooked the thorny issues with generalized causal inference.

I worry about saying this but I will put it out there: I suspect that some press releases were envisioned before the research was even conducted.  This is probably an unfair reaction to many press releases but at least I am being honest.  So I get skeptical when there is a big disconnect between the press release and the underlying research, like when sweeping claims are made on the basis of a study of, say, 37 kids.  Or when big claims about money and happiness are drawn from priming studies involving pictures of money.

I would be interested to hear what makes others skeptical of published claims.

 

A little background tangential to the main points of this post:

One way to generate press excitement is to quote the researcher(s) as being shocked by the results.  Unfortunately, I often think some of the shock and awe expressed in these press releases is disingenuous.  Why?  Researchers designed the studies to test specific predictions in the first place, so they had some expectations about what they would find.  Alternatively, if someone did obtain a shocking initial result, they should conduct multiple direct replications to make sure the original result was not simply a false positive.  That kind of narrative is not usually part of the press release.

I also hate to read press releases that generalize the underlying results well beyond the initial design and purpose of the research.  Sometimes the real world implications of experiments are just not clear.  In fact, not all research is designed to have real world implications.  If we take the classic Mook reading at face value, lots of experimental research in psychology has no clear real world implications.   This is perfectly OK but it might make the findings less interesting to the general public.  Or at least it probably requires more background knowledge to make the implications interesting.  Such background is beyond the scope of the press release.

 

More Null Results in Psychological Science — Comments on McDonald et al. (2014) and Crisp and Birtel (2014)

Full Disclosure:  I am second author on the McDonald et al. (2014) commentary.

Some of you may have seen that Psychological Science published our commentary on the Birtel and Crisp (2012) paper.  Essentially we tried to replicate two of their studies with larger sample sizes (29 versus 240 and 32 versus 175, respectively) and obtained much lower effect size estimates. It is exciting that Psychological Science published our work and I think this is a hint of positive changes for the field.  Hopefully nothing I write in this post undercuts that overarching message.

I read the Crisp and Birtel response and I had a set of responses (shocking, I know!).  I think it is fair that they get the last word in print, but I had some reactions that I wanted to share.  Thus, I will air a few of them in this blog post.  Before diving into the issues, I want to reiterate the basic take-home message of McDonald et al. (2014):

“Failures to replicate add important information to the literature and should be a normal part of the scientific enterprise. The current study suggests that more work is needed before Birtel and Crisp’s procedures are widely implemented. Interventions for treating prejudice may require more precise manipulations along with rigorous evaluation using large sample sizes.” (p. xx)

1.  Can we get a mulligan on our title? We might want to revise the title of our commentary to make it clear that our efforts applied to only two specific findings in the original Birtel and Crisp (2012) paper. I think we were fairly circumscribed in the text itself but the title might have opened the door for how Crisp and Birtel (2014) responded.  They basically thanked us for our efforts and pointed out that our two difficulties say nothing about the entire imagined contact hypothesis.  They even argued that we “overgeneralized” our findings to the entire imagined contact literature.  To be frank, I do not think they were being charitable to our piece with this criticism because we did not make this claim in the text.  But titles are important and our title might have suggested some sort of overgeneralization.  I will let readers make their own judgments.  Regardless, I wish we had made the title more focused.

2.  If you really believe the d is somewhere around .35, why were the sample sizes so small in the first place?  A major substantive point in the Crisp and Birtel (2014) response is that the overall d for the imagined contact literature is somewhere around .35 based on a recent Miles and Crisp (2014) meta-analysis.  That is a reasonable point but I think it actually undercuts the Birtel and Crisp (2012) paper and makes our take home point for us (i.e., the importance of using larger sample sizes in this literature).  None of the original Birtel and Crisp (2012) studies had anywhere near the power to detect a population d of .35.  If we take the simple two-group independent t-test design, the power requirements for .80 suggest the need for about 260 participants (130 in each group).   The largest sample size in Birtel and Crisp (2012) was 32.
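As a sanity check on that figure, here is a quick power calculation (my own sketch, not from either paper) using statsmodels; it lands at roughly 130 per group for d = .35 at 80% power.

```python
# Required N per group to detect d = .35 with 80% power, alpha = .05, two-tailed.
import math
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.35, alpha=0.05,
                                          power=0.80, alternative='two-sided')
print(math.ceil(n_per_group))   # ~130 per group, so ~260 participants in total
```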

3. What about the ManyLabs paper?  The now famous ManyLabs paper of Klein et al. (in press) reports a replication attempt of an imagined contact study (Study 1 in Husnu & Crisp, 2010).  The ManyLabs effort yielded a much lower effect size estimate (d = .13, N = 6,336) than the original report (d = .86 or .84 as reported in Miles & Crisp, 2014; N = 33).  This is quite similar to the pattern we found in our work.  Thus, I think there is something of a decline effect in operation.  There is a big difference in interpretation between a d of .80 and a d around .15.  This should be worrisome to the field especially when researchers begin to think of the applied implications of this kind of work.

4. What about the Miles and Crisp Meta-Analysis (2014)? I took a serious look at the Miles and Crisp meta-analysis and I basically came away with the sinking feeling that much more research needs to be done to establish the magnitude of the imagined contact effects.  Many of the studies used in the meta-analysis were grossly underpowered.  There were 71 studies and only 2 had sample sizes above 260 (the threshold for having a good chance to detect a d = .35 effect using the standard between-participants design).  Those two large studies yielded basically null effects for the imagined contact hypothesis (d = .02 and .05, ns = 508 and 488, respectively). The average sample size of the studies in the meta-analysis was 81 (81.27 to be precise) and the median was 61 (Min. = 23 and Max. = 508).  A sample size of 123 was in the 90th percentile (i.e., 90% of the samples were below 123) and nearly 80% of the studies had sample sizes below 100.

Miles and Crisp (2014) were worried about sample size but perhaps not in the ways that I might have liked.   Here is what they wrote: “However, we observed that two studies had a sample size over 6 times the average (Chen & Mackie, 2013; Lai et al., 2013). To ensure that these studies did not contribute disproportionately to the summary effect size, we capped their sample size at 180 (the size of the next largest study) when computing the standard error variable used to weight each effect size.” (p. 13).  Others can weigh in about this strategy but I tend to want to let the sample sizes “speak for themselves” in the analyses, especially when using a random-effects meta-analysis model.

 What’s it all mean?

Not to bring out the cliché, but I think much more work needs to be done here.  As it stands, I think the d = .35 imagined contact effect size estimate is probably upwardly biased.  Indeed, Miles and Crisp (2014) found evidence of publication bias such that unpublished studies yielded a smaller overall effect size estimate than published studies (although the unpublished studies still produced an estimate that is reliably larger than zero).  However this shakes out, researchers are well advised to use much larger sample sizes than those that currently characterize this literature, based on my summary of the sample sizes in Miles and Crisp (2014).  I also think more work needs to be done to evaluate the specific Birtel and Crisp (2012) effects.  We have now collected two more unpublished studies with even bigger sample sizes and we have yet to get effect sizes that approximate the original report.

I want to close by trying to clarify my position.  I am not saying that the effect sizes in question are zero or that this is an unimportant research area.  On the contrary, I think this is an incredibly important topic and thus it requires even greater attention to statistical power and precision.

 

Updated 26 Feb 2014: I corrected the sample size for Study 1 from 204 to 240.

Warm Water and Loneliness

Our paper on bathing/showering habits and loneliness has been accepted (Donnellan, Lucas, & Cesario, in press).  The current package has 9 studies evaluating the correlation between trait loneliness and a preference for warm showers and baths as inspired by Studies 1a and 1b in Bargh and Shalev (2012; hereafter B & S).  In the end, we collected data from over 3,000 people and got effect size estimates that were considerably smaller than the original report.  Below are some random reflections on the results and the process. As I understand the next steps, B & S will have an opportunity to respond to our package (if they want) and then we have the option of writing a brief rejoinder.

1. I blogged about our inability to talk about the original B & S data in the Fall of 2012.  I think this has been one of my most viewed blog entries (pathetic, I know).  My crew can apparently talk about these issues now, so I will briefly outline a big concern.

Essentially, I thought the data from their Study 1a were strange. We learned that 46 of the 51 participants (90%) reported taking less than one shower or bath per week.  I can see that college students might report taking less than 1 bath per week, but showers?  The modal response in each of our 9 studies drawn from college students, internet panelists, and mTurk workers was always “once a day” and we never observed more than 1% of any sample telling us that they take less than one shower/bath per week.  So I think this distribution in the original Study 1a has to be considered unusual on both intuitive and empirical grounds.

The water temperature variable was also odd given that 24 out of 51 participants selected “cold” (47%) and 18 selected “lukewarm” (35%).   My own intuition is that people like warm to hot water when bathing/showering.  The modal response in each of our 9 samples was “very warm” and it was extremely rare to ever observe a “cold” response.

My view is that the data from Study 1a should be discarded from the literature. The distributions from 1a are just too weird.  This would then leave the field with Study 1b from the original B & S package based on 41 community members versus our 9 samples with over 3,000 people.

2.  My best meta-analytic estimate is that the correlation between trait loneliness and the water temperature variable is .026 (95% CI: -.018 to .069, p = .245).  This is based on a random effects model using the 11 studies in the local literature (i.e., our 9 studies plus Studies 1a and 1b – I included 1a to avoid controversy).  Researchers can debate about the magnitude of correlations but this one seems trivial to me especially because we are talking about two self-reported variables. We are not talking about aspirin and a life or death outcome or the impact of a subtle intervention designed to boost GPA.  Small effects can be important but sometimes very small correlations are practically and theoretically meaningless.

3. None of the original B and S studies had adequate power to detect something like the average .21 correlational effect size found across many social psychological studies (see Richard et al., 2003).  Researchers need around 175 participants with power set to .80 for the r = .21 expectation. If one takes sample size as an implicit statement about researcher expectations about the underlying effect sizes, it would seem like the original researchers thought the effects they were evaluating were fairly substantial.  Our work suggests that the effects in question are probably not.
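For readers who want to check the ~175 figure, here is a rough calculation (mine, not from the paper) based on the Fisher z approximation for testing a correlation against zero.

```python
# Approximate N needed to detect r = .21 with 80% power at alpha = .05 (two-tailed).
import numpy as np
from scipy import stats

r, alpha, power = 0.21, 0.05, 0.80
z_alpha = stats.norm.ppf(1 - alpha / 2)     # 1.96 for a two-tailed test
z_beta = stats.norm.ppf(power)              # 0.84
n = ((z_alpha + z_beta) / np.arctanh(r)) ** 2 + 3
print(round(n))                             # ~176 participants
```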

In the end, I am glad this paper is going to see the light of day.  I am not sure all the effort was worth it but I hope our paper makes people think twice about the size of the connection between loneliness and warm showers/baths.

25 Jan 2014:  Corrected some typos.

Go Big or Go Home – A Recent Replication Attempt

I decided to adopt my 5% suggestion to dedicate a relatively small percentage of one's research time to replication efforts.  In other words, I spent time this summer and fall working on several replication studies.  Truth be told, the work amounted to more than 5% of my time, but it has been fruitful in terms of papers.  Moreover, these replication attempts were collaborative efforts with graduate students (and some undergraduates).  My major role was often as data analyst.  Indeed, I like to independently analyze the data to make sure that I come up with the same results.  I also find data analysis inherently enjoyable and far more exciting than writing papers, editing papers, or dealing with committees.  So basically I got to have fun and perhaps make a few modest contributions to the literature.

One replication effort concerned whether we could duplicate the results of a 2008 Psychological Science paper about the impact of cleanliness on moral judgments.  The original finding was that participants primed with cleanliness were less harsh in their moral judgments than control participants.  The first two authors on this paper are outstanding graduate students in my department (Johnson, Cheung, & Donnellan, in press).  For those not interested in the gory details: We obtained much smaller estimates than the original paper.

Basic design:

Prime/induce cleanliness in one group and use a second group as a control condition.  Compare ratings on a series of six moral vignettes that are aggregated into a composite rating.  In sum, the IV is cleanliness/no cleanliness and the DV is ratings of wrongness on the moral vignettes.  The scale for the moral ratings is such that higher scores reflect harsher judgments.  Negative effect size estimates indicate that those in the clean group were less harsh (on average) than those in the control group.  Materials were obtained from the original researchers and we tried to follow the original procedures as closely as possible.  All participants were college students.

Study 1:

Cleanliness was primed with a scrambled-sentence task in the original study (N = 40).  The priming effect on the composite DV was close to the magic p < .05 threshold (d = -.60, 95% CI = -1.23 to .04): those in the cleanliness prime condition rated the vignettes as less wrong than the control group.  It was hard to tell from the published paper that the overall effect was p = .064 because the authors reported the dreadful p-rep statistic, so we computed the standard p-value for our report.  We used the same materials with 208 participants and got an effect size estimate very close to zero (d = -.01, 95% CI = -.28 to .26).  One caveat is that an additional study on the PsychFileDrawer website found an effect consistent with the original using a one-tailed significance test (estimated d = -.47, N = 60).

Study 2:

Cleanliness was induced with hand washing in the original study (N = 43).  Participants were shown a disgusting video clip and then assigned to condition.  Participants completed ratings of the same vignettes used in Study 1 but with a slightly different response scale.  The effect on the overall composite passed the p < .05 threshold in the original publication (d = -.85, 95% CI = -1.47 to -.22) but not in our replication study with 126 participants (d = .01, 95% CI = -.34 to .36).

We conducted both studies in person, like the original authors, and the package is now in press for the special issue of Social Psychology on replication studies.  This means that everything was preregistered and all of the data, materials, and the proposal are posted here: http://osf.io/project/zwrxc/.

Study 1 Redux (Unpublished):

After finishing the package, we decided to conduct a much larger replication attempt of Study 1, this time using the internet to facilitate data collection.  I like narrow confidence intervals so I wanted to try for a very large sample size.  A larger sample size would also facilitate tests of moderation.  Using the internet would make data collection easier but it might impair data quality.  (I have ideas about dealing with that issue but those will appear in another paper, hopefully!)  We excluded anyone who said they made up their responses or did not answer honestly.  Our sample size for this replication attempt was 731 college students.  As with our in press study, the effect of the sentence unscrambling task on the composite DV was not statistically detectable (t = 0.566, df = 729, p = .578).  The d was quite small (d = .04, 95% CI = -.10 to .19) and the entire 95% CI fell below the "magic" |.20| threshold for a so-called small effect.  The most interesting finding (to me) was that the Honesty-Humility scale of the HEXACO (Lee & Ashton, 2004) was the best predictor of the moral composite ratings across conditions (r = .356, 95% CI = .291 to .418, p < .05, N = 730) out of the four individual difference measures we included at the end of the study (the other three being a disgust scale, a bodily self-consciousness scale, and a single-item liberalism-conservatism scale).  No individual difference we included moderated the null effect of condition (same for gender).  So we tried to find moderators, honestly.

Summary:
We have struck out twice now to find the original Study 1 effect.  I did a quick and dirty random effects meta-analysis using all four attempts to duplicate Study 1 (the original, the one on the PsychFile drawer, and our two studies).  The point estimate for the d was -.130 (95% CI = -.377 to .117, p = .303) and the estimate was even closer to zero using a common effect (or fixed effect) model (d = -.022, 95% CI = -.144 to .099, p = .718).  I will provide an excerpt from the in press paper as I think the message is useful to ward off the critics who will blast our research skills and accuse me of some sort of motivated bias…

The current studies suggest that the effect sizes surrounding the impact of cleanliness on moral judgments are probably smaller than the estimates provided by the original studies…It is critical that our work is not considered the last word on the original results…. More broadly, we hope that researchers will continue to evaluate the emotional factors that contribute to moral judgments.
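For anyone who wants to check the quick and dirty meta-analysis mentioned above, here is a sketch of the computation (my reconstruction, not the code we actually ran). It applies a DerSimonian-Laird random-effects estimator to the four d values quoted in this post and assumes equal group sizes within each study.

```python
# Random-effects (DerSimonian-Laird) and common-effect meta-analysis of the
# four attempts at Study 1: original, PsychFileDrawer, and our two studies.
import numpy as np

d = np.array([-0.60, -0.47, -0.01, 0.04])   # effect size estimates quoted above
N = np.array([40, 60, 208, 731])            # total sample sizes

var = 4 / N + d**2 / (2 * N)                # approx. sampling variance of d (equal groups assumed)
w = 1 / var                                 # inverse-variance (common-effect) weights
d_fixed = np.sum(w * d) / np.sum(w)         # common-effect estimate (about -.02)

Q = np.sum(w * (d - d_fixed) ** 2)          # heterogeneity statistic
C = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - (len(d) - 1)) / C)     # DerSimonian-Laird tau-squared

w_re = 1 / (var + tau2)                     # random-effects weights
d_random = np.sum(w_re * d) / np.sum(w_re)  # random-effects estimate (about -.13)
se_re = np.sqrt(1 / np.sum(w_re))
print(d_random, d_random - 1.96 * se_re, d_random + 1.96 * se_re)
```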

 

So more research is probably needed to better understand this effect [don't you just love Mom and Apple Pie statements!].  However, others can dedicate their time and resources to this effect.  We gave it our best shot and pretty much encountered an epic fail, as my 10-year-old would say.  My free piece of advice is that others should use very large samples and plan for small effect sizes.

Note: Thanks to David Johnson and Felix Cheung.  I deserve any criticism and they deserve any credit.

Update (2 Jan 2014).  David placed the write-up on the Psych File Drawer website. Plus the data are available as well.

David Johnson, Felix Cheung, Brent Donnellan. Cleanliness primes do not influence moral judgment. (2014, January 01). Retrieved 10:05, January 02, 2014 from http://www.PsychFileDrawer.org/replication.php?attempt=MTcy

The Life Goals of Kids These Days Part II

This is a follow-up to my January 16 blog post with some new data!  Some of my former students and now colleagues have launched a longitudinal study of college students. In the Fall of 2013 we gave a large sample of first year students the Monitoring the Future goal items.  I thought it would be fun to see what these data looked like and how these goals were correlated with certain measures of personality.  These data are from a school in the Southwest and are drawn from all incoming first-year students.

Students were asked about 14 goals and could answer on a 1 to 4 scale (1 = "Not Important," 4 = "Extremely Important").  Descriptive data for the 14 goals, in order of average level of endorsement, are reported below.  I also included the rankings for Millennials as reported in Arnett, Trzesniewski, and Donnellan (2013) and described in my older post.

Table 1: Goals for First Year Students (Unnamed School in the Southwest) using the Monitoring the Future Goal Items

| Goal | Rank in MTF for Millennials | M | SD | % Reporting Extremely Important |
|---|---|---|---|---|
| Having a good marriage and family life | 1 | 3.54 | .80 | 69.7 |
| Being successful in my line of work | 5 | 3.54 | .64 | 61.3 |
| Having strong friendships | 3 | 3.52 | .68 | 61.6 |
| Being able to find steady work | 2 | 3.51 | .65 | 58.3 |
| Finding a purpose and meaning in my life | 6 | 3.35 | .84 | 55.0 |
| Being able to give my children better opportunities than I’ve had | 4 | 3.32 | .87 | 53.8 |
| Having plenty of time for recreation and hobbies | 7 | 3.11 | .81 | 36.7 |
| Making a contribution to society | 9 | 3.11 | .87 | 39.4 |
| Discovering new ways to experience things | 10 | 2.89 | .91 | 28.3 |
| Having lots of money | 8 | 2.67 | .91 | 21.3 |
| Living close to parents and relatives | 11 | 2.50 | 1.03 | 21.2 |
| Working to correct social and economic inequalities | 13 | 2.41 | .99 | 17.3 |
| Being a leader in my community | 12 | 2.35 | 1.01 | 17.0 |
| Getting away from this area of the country | 14 | 1.83 | 1.01 | 10.1 |

Note: N = 1,245 to 1,254

As before, marriage and friendships were highly valued, as were being successful and finding steady work.  So these first-year college students want it all – success in love and work.  Damn these kids — who do they think they are?

I was then able to correlate the goal responses with measures of self-esteem, narcissism, and the Big Five. Below is a table showing the relevant correlations.

Table 2: Correlations between Goal Items and Measures of Self-Esteem, Narcissism, Extraversion, and Agreeableness

| Goal | Self-Esteem | NPI Total | NPI-EE | PDQ-NPD | Extraversion | Agreeableness |
|---|---|---|---|---|---|---|
| Having a good marriage and family life | .17 | .05 | -.09 | -.07 | .17 | **.29** |
| Being successful in my line of work | .18 | .18 | -.01 | .04 | .19 | .19 |
| Having strong friendships | .16 | .08 | -.08 | -.05 | **.26** | **.25** |
| Being able to find steady work | .15 | .09 | -.03 | -.02 | .14 | **.20** |
| Finding a purpose and meaning in my life | .04 | .10 | -.03 | .00 | .17 | .15 |
| Being able to give my children better opportunities than I’ve had | .11 | .11 | -.06 | .03 | **.20** | **.25** |
| Having plenty of time for recreation and hobbies | .07 | .18 | .08 | .09 | .15 | .07 |
| Making a contribution to society | .14 | .18 | -.03 | .02 | **.25** | **.20** |
| Discovering new ways to experience things | .15 | **.26** | .05 | .11 | **.27** | .12 |
| Having lots of money | .08 | **.34** | **.26** | **.21** | .18 | .03 |
| Living close to parents and relatives | .12 | .11 | .01 | .04 | .16 | **.24** |
| Working to correct social and economic inequalities | .08 | .19 | .03 | .05 | .19 | .14 |
| Being a leader in my community | .13 | **.36** | .12 | .16 | **.35** | .18 |
| Getting away from this area of the country | -.09 | .19 | .18 | .18 | .04 | -.13 |

Note: Correlations ≥ |.06| are statistically significant at p < .05.  Correlations ≥ |.20| are bolded.  Self-esteem was measured with the Rosenberg (1989) scale.  The NPI (Raskin & Terry, 1988) was used so that we could compute the NPI-EE (Entitlement/Exploitativeness) subscale (see Ackerman et al., 2011) and even the total score (yuck!).  The PDQ-NPD column is the Narcissistic Personality Disorder subscale of the Personality Diagnostic Questionnaire-4 (Hyler, 1994).  Extraversion and Agreeableness were measured using the Big Five Inventory (John et al., 1991).

What do I make of these results?  On the face of it, I do not see a major cause for alarm or worry.  These college students seem to want it all, and it will be fascinating to track the development of these goals over the course of their college careers.  I also think Table 2 provides some reason for caution against using goal-change studies as evidence of increases in narcissism, but I am probably biased.  At any rate, I do not think there is compelling evidence that the most strongly endorsed goals are strongly positively related to measures of narcissism.  This is especially true when considering the NPI-EE and PDQ-NPD correlations.

Thanks to Drs. Robert Ackerman, Katherine Corker, and Edward Witt.

I don’t care about effect sizes — I only care about the direction of the results when I conduct my experiments

This claim (or some variant of it) has been invoked by a few researchers when they take a position on issues of replication and the general purpose of research.  For example, I have heard this platitude offered from some quarters as an explanation for why they are unconcerned when an original finding with a d of 1.2 shrinks to a d of .12 upon exact replication.  Someone recently asked me for advice on how to respond to someone making the above claim and I struggled a bit.  My first response was to dig up these two quotes and call it a day.

Cohen (1990): "Next, I have learned and taught that the primary product of a research inquiry is one or more measures of effect size, not p values." (p. 1310).

Abelson (1995): “However, as social scientists move gradually away from reliance on single studies and obsession with null hypothesis testing, effect size measures will become more and more popular” (p. 47).

But I decided to try a bit harder so here are my random thoughts at trying to respond to the above claim.

1.  Assume this person is making a claim about the utility of NHST. 

One retort is to ask how the researcher judges the outcome of their experiments.  They need a method to distinguish the “chance” directional hit from the “real” directional hit.  Often the preferred tool is NHST such that the researcher will judge that their experiment produced evidence consistent with their theory (or it failed to refute their theory) if the direction of the difference/association was consistent with their prediction and the p value was statistically significant at some level (say an alpha of .05).  Unfortunately, the beloved p-value is determined, in part, by the effect size.

To quote from Rosenthal and Rosnow (2008, p. 55):

Because a complete account of "the results of a study" requires that the researcher report not just the p value but also the effect size, it is important to understand the relationship between these two quantities.  The general relationship…is…Significance test = Size of effect × Size of study.

So if you care about the p value, you should care (at least somewhat) about the effect size.  Why? The researcher gets to pick the size of the study so the critical unknown variable is the effect size.  It is well known that given a large enough N, any trivial difference or non-zero correlation will attain significance (see Cohen, 1994, p. 1000 under the heading “The Nil Hypothesis”). Cohen notes that this point was understood as far back as 1938.  Social psychologists can look to Abelson (1995) for a discussion of this point as well (see p. 40).
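A toy calculation (my own, not from any of the cited papers) makes the point: hold a trivial correlation fixed and watch the p value collapse as N grows.

```python
# p values for a fixed, trivial correlation (r = .05) at increasing sample sizes,
# using the usual t test for H0: rho = 0.
import numpy as np
from scipy import stats

r = 0.05
for n in (100, 1_000, 10_000):
    t = r * np.sqrt((n - 2) / (1 - r**2))   # t statistic for the correlation
    p = 2 * stats.t.sf(abs(t), df=n - 2)    # two-tailed p value
    print(f"N = {n:>6}: t = {t:.2f}, p = {p:.4g}")
```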

To further understand the inherent limitations of this NHST-bound approach, we can (and should) quote from the book of Paul Meehl (1978).

“Putting it crudely, if you have enough cases and your measures are not totally unreliable, the null hypothesis will always be falsified, regardless of the truth of the substantive theory. Of course, it could be falsified in the wrong direction, which means that as the power improves, the probability of a corroborative result approaches one-half. However, if the theory has no verisimilitude – such that we can imagine, so to speak, picking our empirical results randomly out of a directional hat apart from any theory – the probability of refuting by getting a significant difference in the wrong direction also approaches one-half.  Obviously, this is quite unlike the situation desired from either a Bayesian, a Popperian, or a commonsense scientific standpoint.”  (Meehl, 1978, p. 822).

Meehl gets even more pointed (p. 823):

I am not a statistician, and I am not making a statistical complaint. I am making a philosophical complaint or, if you prefer, a complaint in the domain of scientific method. I suggest that when a reviewer tries to “make theoretical sense” out of such a table of favorable and adverse significance test results, what the reviewer is actually engaged in, willy-nilly or unwittingly, is meaningless substantive constructions on the properties of the statistical power function, and almost nothing else.

Thus, I am not sure that this appeal to directionality with the binary outcome from NHST (i.e., a statistically significant versus not statistically significant result according to some arbitrary alpha criterion) helps make the above argument persuasive.  Ultimately, I believe researchers should think about how strongly the results of a study corroborate a particular theoretical idea.  I think effect sizes are more useful for this purpose than the p-value.  You have to use something – why not use the most direct indicator of magnitude?

A somewhat more informed researcher might tell us to go read Wainer (1999) as a way to defend the virtues of NHST.  This paper is called “One Cheer for Null Hypothesis Significance Testing” and appeared in Psychological Methods in 1999.  Wainer suggests 6 cases in which a binary decision would be valuable.  His example from psychology is testing the hypothesis that the mean human intelligence score at time t is different from the mean score at time t+1.

However, Wainer also seems to find merit in effect sizes.  He writes: “Once again, it would be more valuable to estimate the direction and rate of change, but just being able to state that intelligence is changing would be an important contribution” (p. 213). He also concludes that “Scientific investigations only rarely must end with a simple reject-not reject decision, although they often include such decisions as part of their beginnings” (p. 213).  So in the end, I am not sure that any appeal to NHST over effect size estimation and interpretation works very well.  Relying exclusively on NHST seems way worse than relying on effect sizes.

2.  Assume this person is making a claim about the limited value of generalizing results from a controlled lab study to the real world.

One advantage of the lab is the ability to generate a strong experimental manipulation.  The downside is that any effect size estimate from such a study may not represent typical real-world dynamics and thus risks misleading uninformed (or unthinking) readers.  For example, if we wanted to test the idea that drinking regular soda makes rats fat, we could give half of our rats the equivalent of 20 cans of coke a day whereas the other half could get 20 cans of diet coke per day.  Let’s say we did this experiment, the difference was statistically significant (p < .0001), and we got a d = 2.0.  The coke-exposed rats were heavier than the diet-coke-exposed rats.

What would the effect size mean?  Drawing attention to what seems like a huge effect might be misleading because most rats do not drink 20 cans of coke a day.  The effect size would presumably fluctuate with a weaker or stronger manipulation.  We might get ridiculed by the soda lobby if we did not exercise caution in portraying the finding to the media.

This scenario raises an important point about the interpretation of the effect sizes but I am not sure it negates the need to calculate and consider effect sizes.  The effect size from any study should be viewed as an estimate of a population value and thus one should think carefully about defining the population value.  Furthermore, the rat obesity expert presumably knows about other effect sizes in the literature and can therefore place this new result in context for readers.  What effect sizes do we see when we compare sedentary rats to those who run 2 miles per day?  What effect sizes do we see when we compare genetically modified “fat” rats to “skinny” rats?  That kind of information helps the researcher interpret both the theoretical and practical importance of the coke findings.

What Else?

There are probably other ways of being more charitable to the focal argument. Unfortunately, I need to work on some other things and think harder about this issue. I am interested to see if this post generates comments.  However, I should say that I am skeptical that there is much to admire about this perspective on research.  I have yet to read a study where I wished the authors omitted the effect size estimate.

Effect sizes matter for at least two other reasons beyond interpreting results.  First, we need to think about effect sizes when we plan our studies.  Otherwise, we are just being stupid and wasteful.  Indeed, it is wasteful and even potentially unethical to expend resources conducting underpowered studies (see Rosenthal, 1994).  Second, we need to evaluate effect sizes when reviewing the literature and conducting meta-analyses.  We synthesize effect sizes, not p values.  Thus, effect sizes matter for planning studies, interpreting studies, and making sense of an overall literature.

[Snarky aside, skip if you are sensitive]

I will close with a snarky observation that I hope does not detract from my post. Some of the people making the above argument about effect sizes get testy about the low power of failed replication studies of their own findings.   I could fail to replicate hundreds (or more) important effects in the literature by running a bunch of 20 person studies. This should surprise no one. However, a concern about power only makes sense in the context of an underlying population effect size.  I just don’t see how you can complain about the power of failed replications and dismiss effect sizes.

Post Script (6 August 2013):

Daniel Simons has written several good pieces on this topic.  These influenced my thinking and I should have linked to them.  Here they are:

http://blog.dansimons.com/2013/03/what-effect-size-would-you-expect.html

http://blog.dansimons.com/2013/03/a-further-thought-experiment-on.html

Likewise, David Funder talked about similar issues (see also the comments):

http://funderstorms.wordpress.com/2013/02/01/does-effect-size-matter/

http://funderstorms.wordpress.com/2013/02/09/how-high-is-the-sky-well-higher-than-the-ground/

And of course, Lee Jussim (via Brent Roberts)…

http://pigee.wordpress.com/2013/02/23/when-effect-sizes-matter-the-internal-incoherence-of-much-of-social-psychology/

Six Principles and Six Summer Readings

I helped contribute a short piece to a divisional newsletter about methodological reform issues.  I did this with three much smarter colleagues (Lucas, Fraley, and Roisman), which is something I highly recommend.  However, I take all responsibility for the ideas in this post.

Anyways, this turned out to be an interesting chance to write about methodological reform issues and provide a “Summer Reading” list of current pieces.  We tried to take a “friendly” approach by laying out the issues so individual researchers could read more and make informed decisions on their own.  Here is a link to the ReformPrimer in draft form.  I posted the greatest hits below.

Six Principles and Practices to Consider Adopting in Your Own Work

1. Commit to Total Scientific Honesty.  See Lykken (1991) and Feynman (1985).

2. Be Aware of the Impact of Researcher Degrees of Freedom.  See Simmons et al. (2011).

3. Focus on Effect Size Estimation Rather than Statistical Significance.  See Cumming (2012) or Fraley and Marks (2007) or Kline (2013).

4. Understand that Sample Size and Statistical Power Matter.  See Cohen (1962) and Ioannidis (2005) and, well, a whole bunch of other stuff like Francis (2013) and Schimmack (2012).

5. Review Papers for Methodological Completeness, Accuracy, and Plausibility.  See, for example, Kashy et al. (2009).  Sometimes effect sizes can just be too large, you know.  Standard errors are not the same thing as standard deviations…

6. Focus on Reproducibility and Replication. See, for example, Asendorpf et al. (2013).

Six Recent Readings

These are the perfect readings to take to the beach or local pool.  Who needs a thriller loosely tied to Dante’s Inferno or one of the books in the Song of Ice and Fire?

Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J. A., Fiedler, K., et al. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27, 108-119. DOI: 10.1002/per.1919  [Blame me for the horrible title of the Lucas and Donnellan comment.]

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524–532. DOI: 10.1177/0956797611430953

 LeBel, E. & Peters, K. R. (2011). Fearing the future of empirical psychology: Bem’s (2011) evidence of psi as a case study of deficiencies in model research practice. Review of General Psychology, 15, 371-379. DOI: 10.1037/a0025172

Pashler, H., & Wagenmakers, E-J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528-530. DOI: 10.1177/1745691612465253  [The whole issue is worth reading but we highlighted the opening piece.]

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551-566. DOI: 10.1037/a0029487

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366. DOI: 10.1177/0956797611417632

 Enjoy!