Careless Responders and Factor Structures

Warning: This post will bore most people. Read at your own risk. I also linked to some articles behind paywalls. Sorry!

I have a couple of research obsessions that interest me more than they should. This post is about two in particular: 1) the factor structure of the Rosenberg Self-Esteem Scale (RSE); and 2) the impact that careless responding can have on the psychometric properties of measures.  Like I said, this is a boring post.

I worked at the same institution as Neal Schmitt for about a decade, and he wrote a paper in 1985 (with Daniel Stults) illustrating how careless respondents can contribute to “artifact” factors defined by negatively keyed items (see also Woods, 2006).  One implication of Neal’s paper is that careless responders (e.g., people who mark a “1” for all items regardless of the content) confound the evaluation of the dimensionality of scales that include both positively and negatively keyed items.  This matters for empirical research concerning the factor structure of the RSE.  The RSE is perfectly balanced (it has 5 positively keyed items and 5 negatively keyed items), so careless responders might contribute to method artifacts when evaluating the structure of the RSE.

This raises a critical question: how do you identify careless responders? There is an entire literature on this subject (see, e.g., Meade & Craig, 2012) that is well worth reading. One option is to sprinkle directed-response items throughout a survey (i.e., “Please mark 4 for quality control purposes”). The trick is that too many of these frustrate participants, so they have to be used judiciously. A second option is to include scales developed explicitly to identify careless responders (see, e.g., Marjanovic, Struthers, Cribbie, & Greenglass, 2014).  These are good strategies for new data collections, but they are not suitable for identifying careless respondents in existing datasets (see Marjanovic, Holden, Struthers, Cribbie, & Greenglass, 2015).  This could be a concern, as Meade and Craig found that between 10% and 12% of undergraduate participants in a long survey could be flagged as careless responders using a cool latent profile technique. My takeaway from their paper is that many datasets might have some degree of contamination.  Yikes!

Several years ago, for a conference talk, I experimented with different ad hoc methods for detecting careless responders.  One approach took advantage of the fact that the RSE is a balanced scale. Thus, I computed absolute-value discrepancy scores between the positively and negatively keyed items.  [I’m sure someone had the idea before me and that I read about it but simply forgot the source. I also know that some people believe that positively and negatively keyed items reflect different constructs. I’m kind of skeptical of that argument.]

For example, imagine Dr. Evil responds with a “1” to all 10 of the RSE items assessed on a 5-point Likert-type scale.  Given that half of the RSE items are reverse scored, 5 of Dr. Evil’s 1s will be transformed to 5s.  Her/his average for the positively keyed items will be 1, whereas the average for the negatively keyed items will be 5.  This generates a value of 4 on the discrepancy index (the maximum in this example).  I basically found that selecting people with smaller discrepancy scores cleaned up the evaluation of the factor structure of the RSE.  I dropped the 10% of the sample with the highest discrepancy scores, but this cutoff was chosen on a post hoc basis.
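[For the curious, here is a minimal Python sketch of the discrepancy index. The item labels are hypothetical placeholders rather than the variable names in my actual dataset, and the 10% cutoff is just the post hoc rule described above.]

```python
import pandas as pd

# Hypothetical labels for the 10 RSE items (1-5 response scale), split by keying.
POS = ["rse1", "rse3", "rse4", "rse7", "rse10"]  # positively keyed items
NEG = ["rse2", "rse5", "rse6", "rse8", "rse9"]   # negatively keyed items

def discrepancy_index(df: pd.DataFrame) -> pd.Series:
    """Absolute difference between the positively keyed item mean and the
    reverse-scored negatively keyed item mean."""
    neg_reversed = 6 - df[NEG]  # reverse score on a 1-5 scale
    return (df[POS].mean(axis=1) - neg_reversed.mean(axis=1)).abs()

def flag_top_decile(df: pd.DataFrame) -> pd.Series:
    """Flag the 10% of respondents with the largest discrepancy scores."""
    d = discrepancy_index(df)
    return d > d.quantile(0.90)

# Dr. Evil marks "1" for every item and earns the maximum discrepancy of 4.
dr_evil = pd.DataFrame([{item: 1 for item in POS + NEG}])
print(discrepancy_index(dr_evil).iloc[0])  # 4.0
```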

[I know there are all sorts of limitations and assumptions with this approach. For example, one obvious limitation is that Dr. Super Evil who responds a 3 to all items, regardless of her/his true feelings, earns a discrepancy score of 0 and is retained in the analysis. Dr. Super Evil is a real problem. I suspect she/he is friends with the Hamburglar.]

Marjanovic et al. (2015) recently published an interesting approach for detecting careless responding.  They propose calculating, for each person, the standard deviation of the set of items designed to assess the same construct (called the inter-item standard deviation or ISD).  Here the items all need to be keyed in the same direction, and I suspect this approach works best for scales with a mix of positively and negatively keyed items given issues of rectangular responding. [Note: Others have used the inter-item standard deviation as an indicator of substantive constructs, but these authors are using this index as a methodological tool.]

Marjanovic et al. (2015) had a dataset with responses to the Marjanovic et al. (2014) Conscientious Responders Scale (CRS) as well as responses to Big Five scales.  A composite based on the average of the ISDs for each of the Big Five scales was strongly negatively correlated with responses to the CRS (r = -.81, n = 284). Things looked promising based on the initial study. They also showed how to use a random number generator to develop empirical benchmarks for the ISD.  Indeed, I got a better understanding of the ISD when I simulated a dataset of 1,000 responses to 10 hypothetical items in which item responses were independent and drawn from a distribution in which each of the five response options has a .20 proportion in the population.  [I also computed the ISD when preparing my talk back in the day, but I focused on the discrepancy index – I just used the ISD to identify the people who gave all 3s to the RSE items by selecting mean = 3 and ISD = 0.  There remains an issue with separating those who have “neutral” feelings about the self from people like Dr. Super Evil.]
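[Here is a small Python sketch of the ISD and the random-data benchmark idea, roughly in the spirit of the simulation just described. It is not code from Marjanovic et al. (2015); their paper gives the exact decision rule for flagging respondents.]

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2015)

def isd(df: pd.DataFrame) -> pd.Series:
    """Within-person standard deviation across items keyed in the same direction."""
    return df.std(axis=1)

# Benchmark: 1,000 simulated respondents answering 10 items purely at random,
# with each of the five response options (1-5) equally likely (.20 each).
random_data = pd.DataFrame(rng.integers(1, 6, size=(1000, 10)))
benchmark = isd(random_data)
print(benchmark.describe())

# Real respondents whose ISDs resemble these random benchmarks are candidates
# for flagging; a mean of 3 with an ISD of 0 picks out the all-3s responders
# (who may be Dr. Super Evil or may simply feel neutral about the self).
```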

Anyways, I used their approach and it worked well to help clean up analyses of the factor structure of the RSE.  I first drew a sample of 1,000 from a larger dataset of responses to the RSE (the same dataset I used for my conference presentation in 2009).  I only selected responses from European American students to avoid concerns about cultural differences.  The raw data and a brief description are available.  The ratio of the first to second eigenvalues was 3.13 (5.059 and 1.616), and the scree plot would suggest 2 factors. [I got these eigenvalues from Mplus, and they are based on the correlation matrix with 1.0s on the diagonal.  Some purists will kill me. I get it.]
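[If you want to check eigenvalue ratios yourself, the computation is simple. Here is a toy Python version that uses simulated one-factor data in place of the actual RSE responses.]

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy stand-in for 1,000 responses to 10 (appropriately keyed) items driven by one factor.
factor = rng.normal(size=(1000, 1))
items = pd.DataFrame(0.7 * factor + 0.5 * rng.normal(size=(1000, 10)))

# Eigenvalues of the correlation matrix with 1.0s on the diagonal.
eigvals = np.sort(np.linalg.eigvalsh(items.corr().to_numpy()))[::-1]
print(eigvals[:2], eigvals[0] / eigvals[1])  # first two eigenvalues and their ratio
```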

I then ran through a standard set of models for the RSE.  A single factor model was not terribly impressive (e.g., RMSEA = .169, TLI = .601, SRMR = .103) and I thought the best fit was a model with a single global factor and correlated residuals for the negatively and positively keyed items minus one correlation (RMSEA = .068, TLI = .836, SRMR = .029).  I computed the internal consistency coefficient (alpha = .887, average inter-item correlation = .449). Tables with fit indices, the Mplus syntax, and input data are available.

Using the Marjanovic et al. (2015) approach with random data, I flagged 15% of the sample as random responders (see their paper for details). The RSE structure looked more unidimensional in this subset of 850 non-careless responders. The ratio of the first to second eigenvalues was 6.22 (6.145 and 0.988), and the models tended to have stronger factor loadings and comparatively better fit (even adjusting for the smaller sample size).  Consider that the average loading for the single-factor model for all participants was .67, and this increased to .76 with the “clean” dataset. The single global model fit was still relatively unimpressive but better than before (RMSEA = .129, TLI = .852, SRMR = .055), and the single global model with correlated item residuals was still the best (RMSEA = .063, TLI = .964, SRMR = .019).  The alpha was even a bit better (.926, average inter-item correlation = .570).

So I think there is something to be said for trying to identify careless responders before undertaking analyses designed to evaluate the structure of the Rosenberg and other measures as well.  I also hope people continue to develop and evaluate simple ways for flagging potential careless responders for both new and existing datasets.  This might not be “sexy” work but it is important and useful.

 

Updates (1:30 CST; 2 June 2015): A few people sent/tweeted links to good papers.

Huang et al. (2012). Detecting and deterring insufficient effort responding to surveys.

Huang, Liu, & Bowling (2015). Insufficient effort responding: Examining an insidious confound in survey data.

Maniaci & Rogge (2014). Caring about carelessness: Participant inattention and its effects on research.

Reise & Widaman (1999). Assessing the fit of measurement models at the individual level: A comparison of item response theory and covariance structure approaches.

(1:00 CST; 3 June 2015): Even More Recommendations!  Sanjay rightly pointed out that my post was stupid. But the references and suggested readings are gold!  So even if my post wasted your time, the references should prove useful.

DeSimone, Harms, & DeSimone (2014).  Best practice recommendations for data screening.

Hankins (2008). The reliability of the twelve-item General Health Questionnaire (GHQ-12) under realistic assumptions.

See also: Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. {Good stuff pointing to limitations with alpha and alternatives}

Savalei & Falk (2014).  Recovering substantive factor loadings in the presence of acquiescence bias: A Comparison of three approaches.

 

 

Replication Project in Personality Psychology – Call for Submissions

Richard Lucas and I are editing a special issue of the Journal of Research in Personality dedicated to replication (Click here for complete details). This blog post describes the general process and a few of my random thoughts on the special issue. These are my thoughts and Rich may or may not share my views.  I also want to acknowledge that there are multiple ways of doing replication special issues and we have no illusions that our approach is ideal or uncontroversial.  These kinds of efforts are part of an evolving “conversation” in the field about replication efforts and experimentation should be tolerated.  I also want to make it clear that JRP has been open to replication studies for several years.  The point of the special issue is to actively encourage replication studies and try something new with a variant of pre-registration.

What is the General Process?

We modeled the call for papers on procedures others have used with replication special issues and registered reports (e.g., the special issue of Social Psychology, the Registered Replication Reports at PoPS).  Here is the gist:

  • Authors will submit proposals for replication studies by 1 July 2015. These extended abstracts will be screened for methodological rigor and the importance of the topic.
  • Authors of selected proposals will then be notified by 15 August 2015.
  • There is a deadline of 15 March 2016 to submit the finished manuscript.

We are looking to identify a set of well-designed replication studies that provide valuable information about findings in personality psychology (broadly construed). We hope to include a healthy mix of pre-registered direct replications involving new data collections (either by independent groups or adversarial collaborations) and replications using existing datasets for projects that are not amenable to new data collection (e.g., long-term longitudinal studies).  The specific outcome of the replication attempt will not be a factor in selection.  Indeed, we do not want proposals to describe the actual results!

Complete manuscripts will be subjected to peer review, but the relevant issues will be adherence to the proposed research plan, the quality of the data analysis, and the reasonableness of the interpretations.  For example, proposing to use a sample size of 800 but submitting a final manuscript with 80 participants will be solid grounds for outright rejection.  Finding a null result after a good-faith attempt that was clearly outlined before data collection will not be grounds for rejection.  Likewise, learning that a previously used measure had subpar psychometric properties in a new and larger sample is valuable information even if it might explain a failure to find predicted effects.  At the very least, such information about how measures perform in new samples provides important technical insights.

Why Do This?

Umm, replication is an important part of science?!?! But beyond that truism, I am excited to learn what happens when we try to organize a modest effort to replicate specific findings in personality psychology. Personality psychologists use a diverse set of methods beyond experiments such as diary and panel studies.  This creates special challenges and opportunities when it comes to replication efforts.  Thus, I see this special issue as a potential chance to learn how replication efforts can be adapted to the diverse kinds of studies conducted by personality researchers.

For example, multiple research groups might have broadly similar datasets that target similar constructs but with specific differences when it comes to the measures, timing of assessments, underlying populations, sample sizes, etc. This requires careful attention to methodological similarities and differences when it comes to interpreting whether particular findings converge across the different datasets.  It would be ideal if researchers paid some attention to these issues before the results of the investigations were known.  Otherwise, there might be a tendency to accentuate differences when results fail to converge. This is one of the reasons why we will entertain proposals that describe replication attempts using existing datasets.

I also think it is important to address a perception that Michael Inzlicht described in a recent blog post.  He suggested that some social psychologists believe that some personality psychologists are using current controversies in the field as a way to get payback for the person-situation debate.  In light of this perception, I think it is important for more personality researchers to engage in formal replication efforts of the sort that have been prominent in social psychology.  This can help counter perceptions that personality researchers are primarily interested in schadenfreude and criticizing our sibling discipline. Hopefully, the cold war is over.

[As an aside, I think the current handwringing about replication and scientific integrity transcends social and personality psychology.  Moreover, the fates of personality and social psychology are intertwined given the way many departments and journals are structured.  Social and personality psychology (to the extent that there is a difference) each benefit when the other field is vibrant, replicable, and methodologically rigorous.  Few outside of our world make big distinctions between social and personality researchers, so we all stand to lose if decision makers like funders and university administrators decide to discount the field over concerns about scientific rigor.]

What Kinds of Replication Studies Are Ideal?

In a nutshell: high-quality replications of interesting and important studies in personality psychology.  To offer a potentially self-serving case, the recent replication of the association between I-words and narcissism is a good example.  The original study was relatively well cited, but it was not particularly strong in terms of sample size.  There were few convincing replications in the literature, and it was often accepted as an article of faith that the finding was robust.  Thus, there was value in gaining more knowledge about the underlying effect size(s) and testing to see whether the basic finding was actually robust.  Studies like that one as well as more modest contributions are welcome.  Personally, I would like more information about how well interactions between personality attributes and experimental manipulations tend to replicate, especially when the original studies are seemingly underpowered.

What Don’t You Want to See?

I don’t want to single out too many specific topics or limit submissions but I can think of a few topics that are probably not going to be well received.  For instance, I am not sure we need to publish tons of replications showing there are 3 to 6 basic trait domains using data from college students.  Likewise, I am not sure we need more evidence that skilled factor analysts can find indications of a GFP (or general component) in a personality inventory.  Replications of well-worn and intensely studied topics are not good candidates for this special issue. The point is to get more data on interesting and understudied topics in personality psychology.

Final Thought

I hope we get a number of good submissions and the field learns something new in terms of specific findings. I also hope we gain insights about the advantages and disadvantages of different approaches to replication in personality psychology.

More Null Results in Psychological Science — Comments on McDonald et al. (2014) and Crisp and Birtel (2014)

Full Disclosure:  I am second author on the McDonald et al. (2014) commentary.

Some of you may have seen that Psychological Science published our commentary on the Birtel and Crisp (2012) paper.  Essentially we tried to replicate two of their studies with larger sample sizes (29 versus 240 and 32 versus 175, respectively) and obtained much lower effect size estimates. It is exciting that Psychological Science published our work and I think this is a hint of positive changes for the field.  Hopefully nothing I write in this post undercuts that overarching message.

I read the Crisp and Birtel response and I had a set of responses (shocking, I know!). I think it is fair that they get the last word in print, but I had some reactions that I wanted to share.  Thus, I will air a few in this blog post. Before diving into the issues, I want to reiterate the basic take-home message of McDonald et al. (2014):

“Failures to replicate add important information to the literature and should be a normal part of the scientific enterprise. The current study suggests that more work is needed before Birtel and Crisp’s procedures are widely implemented. Interventions for treating prejudice may require more precise manipulations along with rigorous evaluation using large sample sizes.” (p. xx)

1.  Can we get a mulligan on our title? We might want to revise the title of our commentary to make it clear that our efforts applied to only two specific findings in the original Birtel and Crisp (2012) paper. I think we were fairly circumscribed in the text itself but the title might have opened the door for how Crisp and Birtel (2014) responded.  They basically thanked us for our efforts and pointed out that our two difficulties say nothing about the entire imagined contact hypothesis.  They even argued that we “overgeneralized” our findings to the entire imagined contact literature.  To be frank, I do not think they were being charitable to our piece with this criticism because we did not make this claim in the text.  But titles are important and our title might have suggested some sort of overgeneralization.  I will let readers make their own judgments.  Regardless, I wish we had made the title more focused.

2.  If you really believe the d is somewhere around .35, why were the sample sizes so small in the first place?  A major substantive point in the Crisp and Birtel (2014) response is that the overall d for the imagined contact literature is somewhere around .35 based on a recent Miles and Crisp (2014) meta-analysis.  That is a reasonable point but I think it actually undercuts the Birtel and Crisp (2012) paper and makes our take home point for us (i.e., the importance of using larger sample sizes in this literature).  None of the original Birtel and Crisp (2012) studies had anywhere near the power to detect a population d of .35.  If we take the simple two-group independent t-test design, the power requirements for .80 suggest the need for about 260 participants (130 in each group).   The largest sample size in Birtel and Crisp (2012) was 32.
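(The power arithmetic is easy to verify. Here is a quick check in Python using statsmodels; this is my own sketch, not anything from Crisp and Birtel or from our commentary.)

```python
import math
from statsmodels.stats.power import TTestIndPower

# Two independent groups, alpha = .05, 80% power, population d = .35.
n_per_group = TTestIndPower().solve_power(effect_size=0.35, alpha=0.05, power=0.80)
print(math.ceil(n_per_group), 2 * math.ceil(n_per_group))  # about 130 per group, ~260 total
```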

3. What about the ManyLabs paper?  The now famous ManyLabs paper of Klein et al. (in press) reports a replication attempt of an imagined contact study (Study 1 in Husnu & Crisp, 2010).  The ManyLabs effort yielded a much lower effect size estimate (d = .13, N = 6,336) than the original report (d = .86 or .84 as reported in Miles & Crisp, 2014; N = 33).  This is quite similar to the pattern we found in our work.  Thus, I think there is something of a decline effect in operation.  There is a big difference in interpretation between a d of .80 and a d around .15.  This should be worrisome to the field especially when researchers begin to think of the applied implications of this kind of work.

4. What about the Miles and Crisp Meta-Analysis (2014)? I took a serious look at the Miles and Crisp meta-analysis and I basically came away with the sinking feeling that much more research needs to be done to establish the magnitude of the imagined contact effects.  Many of the studies used in the meta-analysis were grossly underpowered.  There were 71 studies and only 2 had sample sizes above 260 (the threshold for having a good chance to detect a d = .35 effect using the standard between-participants design).  Those two large studies yielded basically null effects for the imagined contact hypothesis (d = .02 and .05, ns = 508 and 488, respectively). The average sample size of the studies in the meta-analysis was 81 (81.27 to be precise) and the median was 61 (Min. = 23 and Max. = 508).  A sample size of 123 was in the 90th percentile (i.e., 90% of the samples were below 123) and nearly 80% of the studies had sample sizes below 100.

Miles and Crisp (2014) were worried about sample size but perhaps not in the ways that I might have liked.   Here is what they wrote: “However, we observed that two studies had a sample size over 6 times the average (Chen & Mackie, 2013; Lai et al., 2013). To ensure that these studies did not contribute disproportionately to the summary effect size, we capped their sample size at 180 (the size of the next largest study) when computing the standard error variable used to weight each effect size.” (p. 13).  Others can weigh in about this strategy but I tend to want to let the sample sizes “speak for themselves” in the analyses, especially when using a random-effects meta-analysis model.
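(To see why the capping matters, consider a generic inverse-variance weighting scheme. The sketch below uses the textbook large-sample variance of d and hypothetical equal group sizes; I am not claiming it reproduces Miles and Crisp's exact computations, only illustrating how much a capped N shrinks a large study's weight.)

```python
def var_d(d: float, n1: int, n2: int) -> float:
    """Large-sample sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

# A null-ish study with N = 508 (assumed 254 per group, d = .02), weighted by
# its actual sample size versus by a capped N of 180 (90 per group).
w_actual = 1 / var_d(0.02, 254, 254)
w_capped = 1 / var_d(0.02, 90, 90)
print(round(w_actual), round(w_capped))  # the capped weight is roughly a third as large
```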

 What’s it all mean?

Not to bring out the cliché but I think much more work needs to be done here.  As it stands, I think the d = .35 imagined contact effect size estimate is probably upwardly biased.  Indeed, Miles and Crisp (2014) found evidence of publication bias such that unpublished studies yielded a smaller overall effect size estimate than published studies (but the unpublished studies still produce an estimate that is reliably larger than zero).  However this shakes out, researchers are well advised to use much larger sample sizes than tends to characterize this literature based on my summary of the sample sizes in Miles and Crisp (2014).  I also think more work needs to be done to evaluate the specific Birtel and Crisp (2012) effects.  We now have collected two more unpublished studies with even bigger sample sizes and we have yet to get effect sizes that approximate the original report.

I want to close by trying to clarify my position.  I am not saying that the effect sizes in question are zero or that this is an unimportant research area.  On the contrary, I think this is an incredibly important topic and thus it requires even greater attention to statistical power and precision.

 

Updated 26 Feb 2014: I corrected the sample size from study 1 from 204 to 240.

The Life Goals of Kids These Days Part II

This is a follow-up to my January 16 blog post with some new data!  Some of my former students and now colleagues have launched a longitudinal study of college students. In the Fall of 2013 we gave a large sample of first year students the Monitoring the Future goal items.  I thought it would be fun to see what these data looked like and how these goals were correlated with certain measures of personality.  These data are from a school in the Southwest and are drawn from all incoming first-year students.

Students were asked about 14 goals and could answer on a 1 to 4 point scale (1 = “Not Important” to 4 = “Extremely Important”).  Descriptive data for the 14 goals, in order of the average level of endorsement, are reported below.  I also included the ranking for Millennials as reported in Arnett, Trzesniewski, and Donnellan (2013) and described in my older post.
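(For those following along at home, the descriptives in Table 1 are straightforward to compute from item-level data; here is a minimal pandas sketch with made-up column names.)

```python
import pandas as pd

def goal_descriptives(goals: pd.DataFrame) -> pd.DataFrame:
    """Mean, SD, and percent choosing 4 ("Extremely Important") for each 1-4 goal item."""
    return pd.DataFrame({
        "M": goals.mean().round(2),
        "SD": goals.std().round(2),
        "% Extremely Important": (goals.eq(4).mean() * 100).round(1),
    }).sort_values("M", ascending=False)

# Tiny demo with fabricated responses (the real data had an N of about 1,250).
demo = pd.DataFrame({"good_marriage": [4, 3, 4, 4], "lots_of_money": [2, 4, 1, 3]})
print(goal_descriptives(demo))
```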

Table 1: Goals for First Year Students (Unnamed School in the Southwest) using the Monitoring the Future Goal Items

| Goal | Rank in MTF for Millennials | M | SD | % Reporting Extremely Important |
| --- | --- | --- | --- | --- |
| Having a good marriage and family life | 1 | 3.54 | .80 | 69.7 |
| Being successful in my line of work | 5 | 3.54 | .64 | 61.3 |
| Having strong friendships | 3 | 3.52 | .68 | 61.6 |
| Being able to find steady work | 2 | 3.51 | .65 | 58.3 |
| Finding a purpose and meaning in my life | 6 | 3.35 | .84 | 55.0 |
| Being able to give my children better opportunities than I’ve had | 4 | 3.32 | .87 | 53.8 |
| Having plenty of time for recreation and hobbies | 7 | 3.11 | .81 | 36.7 |
| Making a contribution to society | 9 | 3.11 | .87 | 39.4 |
| Discovering new ways to experience things | 10 | 2.89 | .91 | 28.3 |
| Having lots of money | 8 | 2.67 | .91 | 21.3 |
| Living close to parents and relatives | 11 | 2.50 | 1.03 | 21.2 |
| Working to correct social and economic inequalities | 13 | 2.41 | .99 | 17.3 |
| Being a leader in my community | 12 | 2.35 | 1.01 | 17.0 |
| Getting away from this area of the country | 14 | 1.83 | 1.01 | 10.1 |

Note: N = 1,245 to 1,254

As before, marriage and friendships were seemingly highly valued, as were being successful and finding steady work. So these first-year college students want it all – success in love and work.  Damn these kids — who do they think they are?

I was then able to correlate the goal responses with measures of self-esteem, narcissism, and the Big Five. Below is a table showing the relevant correlations.
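(The correlations in Table 2 came from a simple correlation matrix; a sketch of the computation, with hypothetical variable names, is below.)

```python
import pandas as pd

# data is a hypothetical respondent-level DataFrame with the 14 goal items and
# composite scores for the correlates.
GOALS = ["good_marriage", "steady_work"]  # ...plus the other goal items
CORRELATES = ["self_esteem", "npi_total", "npi_ee",
              "pdq_npd", "extraversion", "agreeableness"]

def goal_correlations(data: pd.DataFrame) -> pd.DataFrame:
    """Goal items in rows, correlate measures in columns, rounded to two decimals."""
    return data[GOALS + CORRELATES].corr().loc[GOALS, CORRELATES].round(2)
```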

Table 2: Correlations between Goal Items and Measures of Self-Esteem, Narcissism, Extraversion, and Agreeableness

| Goal | Self-Esteem | NPI Total | NPI-EE | PDQ-NPD | Extraversion | Agreeableness |
| --- | --- | --- | --- | --- | --- | --- |
| Having a good marriage and family life | .17 | .05 | -.09 | -.07 | .17 | .29 |
| Being successful in my line of work | .18 | .18 | -.01 | .04 | .19 | .19 |
| Having strong friendships | .16 | .08 | -.08 | -.05 | .26 | .25 |
| Being able to find steady work | .15 | .09 | -.03 | -.02 | .14 | .20 |
| Finding a purpose and meaning in my life | .04 | .10 | -.03 | .00 | .17 | .15 |
| Being able to give my children better opportunities than I’ve had | .11 | .11 | -.06 | .03 | .20 | .25 |
| Having plenty of time for recreation and hobbies | .07 | .18 | .08 | .09 | .15 | .07 |
| Making a contribution to society | .14 | .18 | -.03 | .02 | .25 | .20 |
| Discovering new ways to experience things | .15 | .26 | .05 | .11 | .27 | .12 |
| Having lots of money | .08 | .34 | .26 | .21 | .18 | .03 |
| Living close to parents and relatives | .12 | .11 | .01 | .04 | .16 | .24 |
| Working to correct social and economic inequalities | .08 | .19 | .03 | .05 | .19 | .14 |
| Being a leader in my community | .13 | .36 | .12 | .16 | .35 | .18 |
| Getting away from this area of the country | -.09 | .19 | .18 | .18 | .04 | -.13 |

Note: Correlations ≥ |.06| are statistically significant at p < .05.  Correlations ≥ |.20| are bolded. Self-Esteem was measured with the Rosenberg (1989) scale. The NPI (Raskin & Terry, 1988) was used so that we could compute the NPI-EE (Entitlement/Exploitativeness) subscale (see Ackerman et al., 2011) and even the total score (yuck!). The PDQ-NPD column is the Narcissistic Personality Disorder subscale of the Personality Diagnostic Questionnaire-4 (Hyler, 1994).  Extraversion and Agreeableness were measured using the Big Five Inventory (John et al., 1991).

What do I make of these results?  On the face of it, I do not see a major cause for alarm or worry.  These college students seem to want it all and it will be fascinating to track the development of these goals over the course of their college careers.  I also think Table 2 provides some reason to caution against using goal change studies as evidence of increases in narcissism but I am probably biased.  However, I do not think there is compelling evidence that the most strongly endorsed goals are strongly positively related to measures of narcissism.  This is especially true when considering the NPI-EE and PDQ correlations.

Thanks to Drs. Robert Ackerman, Katherine Corker, and Edward Witt.

Six Principles and Six Summer Readings

I helped contribute a short piece to a divisional newsletter about methodological reform issues.  I did this with three much smarter colleagues (Lucas, Fraley, and Roisman), which is something I highly recommend.  However, I take all responsibility for the ideas in this post.

Anyways, this turned out to be an interesting chance to write about methodological reform issues and provide a “Summer Reading” list of current pieces. We tried to take a “friendly” approach by laying out the issues so individual researchers could read more and make  informed decisions on their own.  Here is a link to the ReformPrimer in draft form.  I posted the greatest hits below.

Six Principles and Practices to Consider Adopting in Your Own Work

1. Commit to Total Scientific Honesty.  See Lykken (1991) and Feynman (1985).

2. Be Aware of the Impact of Researcher Degrees of Freedom.  See Simmons et al. (2011).

3. Focus on Effect Size Estimation Rather than Statistical Significance.  See Cumming (2012) or Fraley and Marks (2007) or Kline (2013).

4. Understand that Sample Size and Statistical Power Matter.  See Cohen (1962) and Ioannidis (2005) and, well, a whole bunch of stuff like Francis (2013) and Schimmack (2012).

5. Review Papers for Methodological Completeness, Accuracy, and Plausibility.  See, for example, Kashy et al. (2009).  Sometimes effect sizes can just be too large, you know.  Standard errors are not the same thing as standard deviations…

6. Focus on Reproducibility and Replication. See, for example, Asendorpf et al. (2013).

Six Recent Readings

These are the perfect readings to take to the beach or local pool.  Who needs a thriller loosely tied to Dante’s Inferno or one of the books in the Song of Ice and Fire?

Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J. A., Fiedler, K., et al. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27, 108-119. DOI: 10.1002/per.1919  [Blame me for the horrible title of the Lucas and Donnellan comment.]

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524–532. DOI: 10.1177/0956797611430953

LeBel, E. P., & Peters, K. R. (2011). Fearing the future of empirical psychology: Bem’s (2011) evidence of psi as a case study of deficiencies in modal research practice. Review of General Psychology, 15, 371-379. DOI: 10.1037/a0025172

Pashler, H., & Wagenmakers, E-J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528-530. DOI: 10.1177/1745691612465253  [The whole issue is worth reading but we highlighted the opening piece.]

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551-566. DOI: 10.1037/a0029487

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366. DOI: 10.1177/0956797611417632

 Enjoy!

Just Do It!

I want to chime in about the exciting new section in Perspectives on Psychological Science dedicated to replication.  (Note: Sanjay and David have more insightful takes!) This is an important development, and I hope other journals follow with similar policies and guidelines.  I have had many conversations about methodological issues with colleagues over the last several years, and I am constantly reminded about how academic types can talk themselves into inaction at the drop of a hat. The fact that something this big is actually happening in a high-profile outlet is breathtaking (but in a good way!).

Beyond the shout out to Perspectives, I want to make a modest proposal:  Donate 5 to 10% of your time to replication efforts.  This might sound like a heavy burden but I think it is a worthy goal. It is also easier to achieve with some creative multitasking.   Steer a few of those undergraduate honors projects toward a meaningful replication study or have first year graduate students pick a study and try to replicate it during their first semester on campus.  Then make sure to take an active role in the process to make these efforts worthwhile for the scientific community.  Beyond that, let yourself be curious!  If you read about an interesting study, try to replicate it.  Just do it.

I also want to make an additional plug for a point Richard Lucas and I make in an upcoming comment (the title of our piece is my fault):  support the journals that value replications by reviewing for them and providing them with content (i.e., submissions), and (gasp!) consider refusing to support journals that do not support replication studies or endorse sound methodological practices. Just do it (or not).

I will end with some shameless self-promotion and perhaps a useful reminder about reporting practices. Debby Kashy and I were kind of prescient in our 2009 paper about research practices in PSPB (along with Robert Ackerman and Daniel Russell).  Here is what we wrote (see p. 1139):

“All in all, we hope that researchers strive to find replicable effects, the building blocks of a cumulative science. Indeed, Steiger (1990) noted, “An ounce of replication is worth a ton of inferential statistics” (p. 176). As we have emphasized throughout, clear and transparent reporting is vital to this aim. Providing enough details in the Method and Results sections allows other researchers to make meaningful attempts to replicate the findings. A useful heuristic is for authors to consider whether the draft of their paper includes enough information so that another researcher could collect similar data and replicate their statistical analyses.”

An Incredible Paper (and I mean that in the best way possible)

Ulrich Schimmack has a paper in press at Psychological Methods that should be required reading for anyone producing or consuming research in soft psychology (Title: “The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles”).  Sadly, I doubt this paper will get much attention in the popular press.  Uli argues that issues of statistical power are critical for evaluating a package of studies and his approach also fits very nicely with recent papers by Gregory Francis.  I am excited because it seems as if applied researchers are beginning to have access to a set of relatively easy to use tools to evaluate published papers.

(I would add that Uli’s discussion of power fits perfectly well with broader concerns about the importance of study informativeness as emphasized by Geoff Cumming in his recent monograph.)

Uli makes a number of recommendations that have the potential to change the ratio of fiction to non-fiction in our journals.  His first recommendation is to use power to explicitly evaluate manuscripts.  I think this is a compelling recommendation.  He suggests that authors need to justify the sample sizes in their manuscripts. There are too many times when I read papers and I have no clue why authors have used such small sample sizes.  Such concerns do not lend themselves to positive impressions of the work.

Playing around with power calculations or power programs leads to sobering conclusions.  If you expect a d-metric effect size of .60 for a simple two independent-groups study, you need 45 participants in each group (N=90) to have 80% power. The sample requirements only go up if the d is smaller (e.g., 200 total if d = .40 and 788 total if d = .20) or if you want better than 80% power.  Given the expected value of most effect sizes in soft psychology, it seems to me that sample sizes are going to have to increase if the literature is going to get more believable.  Somewhere, Jacob Cohen is smiling. If you hate NHST and want to think in terms of informativeness, that is fine as well.  Bigger samples yield tighter confidence intervals. Who can argue with calls for more precision?
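(Here is the same arithmetic done with statsmodels rather than a standalone power program; this is my own sketch, using a two-sided alpha of .05.)

```python
import math
from statsmodels.stats.power import TTestIndPower

for d in (0.60, 0.40, 0.20):
    n_per_group = math.ceil(TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80))
    print(f"d = {d:.2f}: {n_per_group} per group, {2 * n_per_group} total")
# d = .60 -> 45 per group (N = 90); d = .40 -> ~100 per group (N = 200);
# d = .20 -> ~394 per group (N = 788)
```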

Uli discusses other strategies for improving research practices such as the value of publishing null results and the importance of rewarding the total effort that goes into a paper rather than the number of statistically significant p-values.   It is also worth rewarding individuals and teams who are developing techniques to evaluate the credibility of the literature, actively replicating results, and making sure published findings are legitimate.  Some want to dismiss them as witch hunters.  I prefer to call them scientists.