Just Do It!

I want to chime in about the exciting new section in Perspectives on Psychological Science dedicated to replication.  (Note: Sanjay and David have more insightful takes!). This is an important development and I hope other journals follow with similar policies and guidelines.  I have had many conversations about methodological issues with colleagues over the last several years and I am constantly reminded about how academic types can talk themselves into inaction at the drop of a hat. The fact that something this big is actually happening in a high profile outlet is breathtaking (but in a good way!).

Beyond the shout out to Perspectives, I want to make a modest proposal:  Donate 5 to 10% of your time to replication efforts.  This might sound like a heavy burden but I think it is a worthy goal. It is also easier to achieve with some creative multitasking.   Steer a few of those undergraduate honors projects toward a meaningful replication study or have first year graduate students pick a study and try to replicate it during their first semester on campus.  Then make sure to take an active role in the process to make these efforts worthwhile for the scientific community.  Beyond that, let yourself be curious!  If you read about an interesting study, try to replicate it.  Just do it.

I also want to make an additional plug for a point Richard Lucas and I make in an upcoming comment (the title of our piece is my fault):  Support those journals that value replications by reviewing for them and providing them with content (i.e., submissions) and (gasp!) consider refusing to support journals that do not support replication studies or endorse sound methodological practices. Just do it (or not).

I will end with some shameless self-promotion and perhaps a useful reminder about reporting practices. Debby Kashy and I were kind of prescient in our 2009 paper about research practices in PSPB (along with Robert Ackerman and Daniel Russell).  Here is what we wrote (see p. 1139):

“All in all, we hope that researchers strive to find replicable effects, the building blocks of a cumulative science. Indeed, Steiger (1990) noted, “An ounce of replication is worth a ton of inferential statistics” (p. 176). As we have emphasized throughout, clear and transparent reporting is vital to this aim. Providing enough details in the Method and Results sections allows other researchers to make meaningful attempts to replicate the findings. A useful heuristic is for authors to consider whether the draft of their paper includes enough information so that another researcher could collect similar data and replicate their statistical analyses.”

One for the File Drawer?

I once read about an experiment in which college kids held either a cold pack or a warm pack and then reported their levels of so-called trait loneliness. We just tried a close replication of this experiment using the same short form loneliness scale used by the original authors. I won’t out my collaborators but I want to acknowledge their help.

The original effect size estimate was pretty substantial (d = .61, t = 2.12, df = 49) but we used 261 students so we could have more than adequate power. Our attempt yielded a much smaller effect size than the original (d = -.01, t = 0.111, df = 259, p = .912).  The mean of the cold group (2.10) was darn near the same as the mean of the warm group (2.11; pooled SD = .61).  (We also get null results if you restrict the analyses to just those who reported that they believed the entire cover story: d = -.17.  The direction is counter to predictions, however.)
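If you want to check the arithmetic yourself, here is a minimal Python sketch. It assumes two groups of roughly equal size, in which case d is approximately 2t divided by the square root of the degrees of freedom. The function names are made up for this illustration and the inputs are the rounded values reported above.

```python
import math


def d_from_t(t_value, df):
    """Approximate Cohen's d from an independent-samples t (assumes equal group sizes)."""
    return 2 * t_value / math.sqrt(df)


def d_from_means(mean_1, mean_2, pooled_sd):
    """Cohen's d as the mean difference divided by the pooled standard deviation."""
    return (mean_1 - mean_2) / pooled_sd


# Original study: t = 2.12 with df = 49.
print(round(d_from_t(2.12, 49), 2))            # 0.61

# Replication: cold mean 2.10, warm mean 2.11, pooled SD .61.
print(round(d_from_means(2.10, 2.11, 0.61), 2))  # -0.02
```

The rounded means give a d of about -.02 rather than the reported -.01. That small gap is plausibly just rounding in the reported means.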

Failures to replicate are a natural part of science so I am not going to make any bold claims in this post. I do want to point out that the reporting in the original is flawed. (The original authors used a no-pack control condition and found no evidence of a difference between the warm pack and the no-pack condition so we just focused on the warm versus cold comparison for our replication study.)  The sample size was reported as 75 participants. The F value for the one-way ANOVA was reported as 3.80 and the degrees of freedom were reported as 2, 74.  The numerator degrees of freedom for the reference F distribution should be k - 1 (where k is the number of conditions) so the 2 was correct.  However, the denominator degrees of freedom were reported as 74 when they should be N - k or 72 (75 - 3).   Things get even weirder when you try to figure out the sample sizes for the 3 groups based on the degrees of freedom reported for each of the three follow-up t-tests.
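The degrees-of-freedom bookkeeping is easy to script. This is just a sketch with hypothetical helper names, and the t-test helper assumes a standard pooled-variance independent-samples test (df = n1 + n2 - 2).

```python
def one_way_anova_df(n_total, k_groups):
    """Return (numerator df, denominator df) for a one-way between-subjects ANOVA."""
    return k_groups - 1, n_total - k_groups


def total_n_from_t_df(df):
    """Combined size of the two groups implied by an independent-samples t test's df."""
    return df + 2


# Reported: N = 75 participants in 3 conditions.
print(one_way_anova_df(75, 3))  # (2, 72)

# The warm versus cold t test was reported with df = 49.
print(total_n_from_t_df(49))    # 51
```

So the warm versus cold test implies 51 participants across those two conditions, which would leave 24 for the no-pack condition if the reported N of 75 is right.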

We found indications that holding a cold pack did do something to participants.  Both the original study and our replication involved a cover story about product evaluation. Participants answered three yes/no questions and these responses varied by condition.

Percentage answering “Yes” to the Pleasant Question:

Warm: 96%     Cold: 80%

Percentage answering “Yes” to the Effective Question:

Warm: 98%     Cold: 88%

Percentage answering “Yes” to the Recommending to a Friend Question:

Warm: 95%   Cold: 85%

Apparently, the cold packs were not evaluated as positively as the warm packs.  I can foresee all sorts of criticism coming our way. I bet one thread is that we are not “skilled” enough to get the effect to work and a second thread is that we are biased against the original authors (either explicitly or implicitly). I’ll just note these as potential limitations and call it good.  Fair enough?

The Life Goals of Kids These Days

The folks at the Language Log did a nice job of considering some recent claims about the narcissism and delusions of today’s young people. I want to piggy-back on that post with an illustration from another dataset based on work I have done with some colleagues.

We considered a JPSP paper by a group I will just refer to as Drs. Smith and colleagues. Smith et al. used data from the Monitoring the Future Study from 1976 to 2008 to evaluate possible changes in the life goals of high school seniors. They classified high school seniors from 1976 to 1978 as Baby Boomers (N = 10,167) and those from 2000 to 2008 as Millennials (N = 20,684). Those in-between were Gen Xers but I will not talk about them in the interest of simplifying the presentation.

Students were asked about 14 goals and answered on a 4-point scale (1 = Not Important to 4 = Extremely Important). Smith et al. used a centering procedure to report the goals but I think the raw numbers are just as enlightening.  Below are the 14 goals ranked by the average level of endorsement for the Millennials.

                                                                   Mean Level                   % Reporting Extremely Important
Goal                                                               Millennials  Boomers   SD    Millennials  Boomers
Having a good marriage and family life                                    3.64     3.57   .76         76.1%    73.3%
Being able to find steady work                                            3.59     3.54   .66         67.2%    63.4%
Having strong friendships                                                 3.57     3.49   .70         66.5%    60.8%
Being able to give my children better opportunities than I've had         3.54     3.30   .78         66.7%    50.5%
Being successful in my line of work                                       3.53     3.40   .72         63.5%    54.2%
Finding purpose and meaning in my life                                    3.41     3.52   .80         59.8%    64.3%
Having plenty of time for recreation and hobbies                          3.10     2.88   .79         33.3%    24.5%
Having lots of money                                                      2.83     2.54   .89         25.9%    16.5%
Making a contribution to society                                          2.81     2.63   .87         24.0%    18.0%
Discovering new ways to experience things                                 2.80     2.70   .88         24.0%    20.0%
Living close to parents and relatives                                     2.50     2.04   .97         17.5%     8.3%
Being a leader in my community                                            2.38     1.91   .98         15.7%     6.8%
Working to correct social and economic inequalities                       2.30     2.22   .92         12.4%    10.0%
Getting away from this area of the country                                1.98     1.80  1.08         14.5%    11.4%
Overall Goal Rating                                                       3.00     2.82   .40

What do I make of this?  Not surprisingly, I see more similarities than big differences.  Marriage and family life are important to students as is having a steady job. So high school students want it all – success in love and work.  I do not see “alarming” trends in these results but this is my subjective interpretation.

As I said, Smith et al. used a centering approach with the data.  I think they computed a grand mean across the 14 goals for each respondent and then centered each individual’s response to the 14 goals around that grand mean.  Such a strategy might be a fine approach but it seems to make things look “worse” for the Millennials in comparison to Boomers.  I will let others judge as to which analytic approach is better but I do worry about researcher degrees of freedom here.  I also just like raw descriptive statistics.
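Here is a small Python sketch of what that kind of centering does. The two respondents are hypothetical. The group-level part uses means from the table and assumes that every respondent answered all 14 goals and that the Overall Goal Rating is the average of the person-level means, so that the mean of the centered scores for a goal is simply the goal mean minus the overall mean. I cannot confirm that this is exactly what Smith et al. did.

```python
from statistics import mean


def center_within_person(ratings):
    """Subtract one respondent's own mean rating from each of their ratings."""
    person_mean = mean(ratings)
    return [rating - person_mean for rating in ratings]


# Two hypothetical respondents with the same profile at different overall levels.
print(center_within_person([4, 4, 3]))
print(center_within_person([3, 3, 2]))

# Group means copied from the table above (rounded to two decimals).
OVERALL_MEAN = {"Millennials": 3.00, "Boomers": 2.82}
GOAL_MEANS = {
    "Making a contribution to society": {"Millennials": 2.81, "Boomers": 2.63},
    "Finding purpose and meaning in my life": {"Millennials": 3.41, "Boomers": 3.52},
}


def raw_gap(goal):
    """Millennial mean minus Boomer mean on the raw 1 to 4 scale."""
    means = GOAL_MEANS[goal]
    return means["Millennials"] - means["Boomers"]


def centered_gap(goal):
    """The same gap after subtracting each generation's overall goal rating."""
    means = GOAL_MEANS[goal]
    return ((means["Millennials"] - OVERALL_MEAN["Millennials"])
            - (means["Boomers"] - OVERALL_MEAN["Boomers"]))


for goal in GOAL_MEANS:
    print(f"{goal}: raw gap = {raw_gap(goal):+.2f}, centered gap = {centered_gap(goal):+.2f}")
```

Because Millennials rated goals higher overall (3.00 versus 2.82), centering subtracts more from their scores. Under these assumptions the raw .18 advantage for Millennials on contributing to society shrinks to about zero, and the raw -.11 gap on finding purpose and meaning grows to about -.29.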

[The Monitoring the Future Data are available through ICPSR. My standard $20 contribution to the charity of choice for the first person who emails me with any reporting errors holds.  I really do hope others look at the data themselves.]

Your arguments only make sense if you say them very fast…

“Although well-meaning, many of the suggestions only make sense if you say them very fast.”   -Howard Wainer (2011, p. 7) from Uneducated Guesses

I love this phrase: Your ideas only make sense if you say them very fast. I find myself wanting to invoke this idea anytime I hear some of the counterarguments to methodological reform. For example, I think this line applies to NS’s comment about climate change skeptics.

Anyways, I am about 90% done with the articles in the November 2012 special issue of Perspectives on Psychological Science.  I enjoyed reading most of the articles and it is a good resource for thinking about reform in psychological research. It should probably be required reading in graduate seminars. So far, the article that generated the strongest initial reaction was the Galak and Meyvis (2012; hereafter G & M) reply to Francis (2012).  I think they basically made his point for him.  [I should disclose that I think the basic idea pursued by G and M seems plausible and I think their reply was written in a constructive fashion. I just did not find their arguments very convincing.]

Francis (2012) suggests that the 8 studies in their package are less compelling when viewed in the aggregate because the “hit” rate is much higher than one would expect given the sample sizes and effect size in question. The implication is that there is probably publication bias. [Note: People sometimes quibble over how Francis calculates his effect size estimate but that is a topic for another blog post.]

I happen to like the “Francis” tables because readers get to see effect size estimates and the sample sizes stripped clean of narrative baggage.  Usually the effect sizes are large and the sample sizes are small. This general pattern would seem to characterize the G and M results.  (Moreover, the correlation between effect size estimates and sample sizes for the G and M set of results was something like -.88.  Ouch!).

G and M acknowledge that they had studies in their file drawer:  “We reported eight successful demonstrations of this phenomenon in our paper, but we also conducted five additional studies whose results either did not reach conventional levels of significance or did reach significance but ended up being rhetorically redundant” (G & M, 2012, p. 595). So there was selective reporting. Case closed in my book. Game over. As an aside, I am not sure I can distinguish between those desperate effect sizes who are reaching toward the p < .05 promised land from those who are fleeing from it. Can you?  It probably takes ESP.

G and M calculated their overall effect size as a g* of .38 (95% CI .25 to .51) with all studies in the mix whereas Francis reported the average g* from the published work as .57.  So it seems to me that their extra data bring down the overall effect size estimate.  Is this a hint of the so-called decline effect?  G and M seem to want to argue that because the g* estimate is bigger than zero there is no real issue at stake. I disagree. Scientific judgment is rarely a yes/no decision about the existence of an effect. It is more often about the magnitude of the effect.  I worry that the G and M approach distorts effect size estimates and possibly even perpetuates small n studies in the literature.
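To get a feel for why the choice between .38 and .57 matters for the "too many hits" argument, here is a rough Python sketch of the underlying logic. It is not Francis's actual procedure. It assumes simple two-group designs, uses a normal approximation to power, treats the studies as independent, and plugs in a hypothetical 20 participants per cell for each of the 8 studies (the real sample sizes are in the papers).

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-group comparison."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def prob_all_significant(effect_size, sample_sizes):
    """Chance that every study rejects the null, assuming independent studies."""
    prob = 1.0
    for n_per_group in sample_sizes:
        prob *= approx_power(effect_size, n_per_group)
    return prob


# Hypothetical cell sizes, NOT the actual G and M sample sizes.
HYPOTHETICAL_NS = [20] * 8

for g in (0.57, 0.38):
    power = approx_power(g, 20)
    all_hits = prob_all_significant(g, HYPOTHETICAL_NS)
    print(f"g = {g:.2f}: power per study = {power:.2f}, "
          f"chance of 8 hits in 8 tries = {all_hits:.6f}")
```

With these made-up sample sizes, 8 significant results in 8 attempts would be unlikely under either effect size and far less likely under the smaller one. That is the intuition behind treating a perfect hit rate as a warning sign.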

G and M also stake a position that I fail to understand:  “However, as is the case for many papers in experimental psychology, the goal was never to assess the exact size of the effect, but rather to test between competing theoretical predictions” (G & M, 2012, p.  595). People use this or a similar argument to dismiss current concerns about the paucity of exact replications and the proliferation of small sample sizes in the literature.  What I do understand about this argument makes me skeptical. Let’s quote from the bible of Jacob Cohen (1990, p. 1309):

“In retrospect, it seems to me simultaneously quite understandable yet also ridiculous to try to develop theories about human behavior with p values from Fisherian hypothesis testing and no more than a primitive sense of effect size.”

So effect sizes matter for theories. Effect sizes tell us something about the magnitude of the associations in question (causal or otherwise) and I happen to think this is critical information for evaluating the truthiness of a theoretical idea. Indeed, I think the field of psychology would be healthier if we focused on getting more precise estimates of particular effects rather than playing this game of collecting “hits” from a bunch of underpowered “conceptual” extensions of a general idea.  I actually think G and M should have written this statement: “As is the case for many papers in psychology, our goal was to present as much evidence as possible for our preferred theoretical orientation.”

This strategy seems common but I believe it ultimately produced a JPSP paper on ESP.  So perhaps it is time to discard this approach and try something else for a change. Heck, even a 1-year recess might be worth it. That moratorium on NHST worked, right?

“At any given time we know what we are doing….”

Disclaimer: Both Robert MacCallum (e.g., 2003) and George Box (e.g., 1979) have written extensively about the value of models and I will basically steal (er, parrot) their ideas in this post. Moreover, I did not sleep much last night…

Let the postmortem on the 2012 election begin! One story will likely involve the accuracy of well-conducted polls and the success of Nate Silver’s methods over “gut-based” methods favored by pundits and campaign workers.  Not surprisingly, I like much of this story, which is nicely summed up by this cartoon (thanks to Skip G. for posting this one on “the” Facebook).

But what if Nate Silver was wrong? What if we woke up today and Romney won 303 electoral votes and Obama lost the election?  I think Mr. Silver would have been in a much better position than his “gut-based” critics are in today. The reason boils down to the advantages of models.  It is really useful to have a formalized recipe for prediction. To quote Box (1979): “The great advantage of the model-based over the ad hoc approach, it seems to me, is that at any given time we know what we are doing.”

I happen to like the model for science depicted by George Box (see Figure A1 in his 1976 Science and Statistics paper).  The basic idea is that errors drive the accumulation of knowledge in an iterative cycle. Learning is produced when there is “a discrepancy between what tentative theory suggests should be so and what practice says is so” (p. 791).  In other words, there is something to be gained when predictions from models and empirical facts disagree.  These errors lead to better models, at least in the ideal case.

So if Mr. Silver’s model had been gravely wrong, he could have spent the next days and weeks figuring out where it went wrong.  He would have his predicted values and the actual values.  He could test alternative models to find ones that outperform his original model. He would be in a good position to learn something.  Compare his plight with that of the gut-based pundit.  How are they going to figure out why their predictions failed?  What are they going to learn?

Moral of my story: Models rule.  I think there might be a bigger lesson in here for “soft” psychology but I am too tired to express it properly.

Politics and Marital Quality: Or How I Wasted My Morning

I had been wondering if political orientation or discrepancies in political orientation might be related to relationship quality. I think this is an interesting question in light of a close presidential election. Fortunately, I had access to some data on these variables from around 330 heterosexual married couples. I conducted some preliminary analyses this morning and the short story is a bunch of null findings.

Measures: Political orientation was measured on the “traditional seven-point scale” where 1 = extremely liberal to 7 = extremely conservative (see Knight, 1999). Marital quality was measured using five items from the quality of marriage index (Norton, 1983).  The internal consistencies were typical of this measure (alphas ≥ .90 for wives and husbands).

Descriptive Results: Husbands were slightly more conservative than wives (Husband Mean = 4.63, Wife Mean = 4.33, Pooled SD = 1.36; d = .22). Husbands and wives did not differ in terms of marital quality (Husband Mean = 4.26, Wife Mean = 4.25, Pooled SD = .83, d = .01). There was evidence of spousal similarity for political orientation (ICC = .54) and marital quality (ICC = .62). None of the zero-order correlations involving political orientation and marital quality were impressive or statistically significant (largest r = -.05).

Actor Partner Interdependence Model (APIM) Results: I squared the difference between political orientation scores from husbands and wives and used that score in a very basic dyadic model.  I specified the APIM for interchangeable dyads with the exception of allowing for a mean-level difference in political orientation between wives and husbands.  None of the relevant effects were statistically different from zero:  Actor effect: .008 (SE = .023); Partner Effect: -.015 (SE = .023); Discrepancy Effect: -.013 (SE = .016). Thus, political orientation did not seem to matter for the individual’s report of marital quality or for her/his partner’s report of marital quality.  The discrepancy did not seem to matter either.
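A quick way to see why none of these effects impress is to divide each estimate by its standard error. The Python sketch below does that with the numbers reported above. It assumes a large-sample normal reference distribution, which is my simplification (the software may have used a t reference, but with roughly 330 couples the difference would be tiny). The helper name is made up.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

# (estimate, standard error) pairs as reported in the text.
APIM_ESTIMATES = {
    "actor": (0.008, 0.023),
    "partner": (-0.015, 0.023),
    "discrepancy": (-0.013, 0.016),
}


def wald_summary(estimate, se, level=0.95):
    """Return the z ratio, two-sided p value, and confidence interval."""
    z_ratio = estimate / se
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z_ratio)))
    crit = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return z_ratio, p_value, (estimate - crit * se, estimate + crit * se)


for name, (estimate, se) in APIM_ESTIMATES.items():
    z_ratio, p_value, (lower, upper) = wald_summary(estimate, se)
    print(f"{name:12s} z = {z_ratio:5.2f}  p = {p_value:.2f}  "
          f"95% CI [{lower:.3f}, {upper:.3f}]")
```

All three ratios are smaller than 1 in absolute value and every interval comfortably includes zero.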

Weaknesses include the single-item measure of political orientation and the fact that these couples had been together for a period of time (Average age of husbands was around 37 years versus 35 years for wives).  Nonetheless, these initial results were not compelling to me.  Darn! It would have made an interesting story.  If anyone else has better data on this issue or more convincing results, let me know.

Two Types of Researchers?

Last winter I gave a quick brown bag where I speculated about the possibility of two distinct types of researchers. I drew from a number of sources to construct my prototypes. To be clear, I do not suspect that all researchers will fall neatly into one of these two types. I suspect these are so-called “fuzzy” types. I also know that at least one of my colleagues hates this idea. Thus, I apologize in advance.

Regardless, I think there is something to my working taxonomy and I would love to get data on these issues. Absent data, this will have to remain purely hypothetical. There is of course a degree of hyperbole mixed in here as well. Enjoy (or not)!

                                        Approach I                               Approach II
Ioannidis (2008) Label:                 Aggressive Discoverer                    Reflective Replicator
Abelson (1995) Label:                   Brash/Liberal                            Stuffy/Conservative
Tetlock (2005) or Berlin (1953) Label:  Hedgehogs                                Foxes
Focus:                                  Discovery                                Finding Sturdy Effects
Preference:                             Novelty                                  Definitiveness
Research Materials:                     Private possessions                      Public goods
Ideal Reporting Standard:               Interesting findings only                Everything
Analytic Approach:                      Find results to support view             Concerned about sensitivity
Favorite Sections of Papers:            Introduction & Discussion                Method & Results
Favorite Kind of Article:               Splashy reports that get media coverage  Meta-Analyses
View on Confidence Intervals:           Unnecessary clutter                      The smaller the better
Stand on the NHST Controversy:          What controversy?                        Jacob Cohen was a god
View on TED Talks:                      Yes. Please pick me.                     Meh!
Greatest Fear:                          Getting scooped                          Having findings fail to replicate
Orientation in the Field:               Advocacy                                 Skepticism
Error Risk:                             Type I                                   Type II