Alpha and Correlated Item Residuals

Subtitle: Is alpha lame?

My last post was kind of stupid as Sanjay tweeted. [Sanjay was actually much more diplomatic when he pointed out the circularity in my approach.] I selected the non-careless responders in a way that guarantees a more unidimensional result. A potentially better approach is to use a different set of scales to identify the non-careless responders and repeat the analyses. This flaw aside, I think my broader points still stand. It is useful to look for ways to screen existing datasets given the literature that: a) suggests careless responders are present in many datasets; and b) careless responders often distort substantive results (see the references and additional recommendations to the original post).

Another interesting criticism arose from my offhand reporting of alpha coefficients. Matthew Hankins (via twitter) rightly pointed out that it is a mistake to compute alpha in light of the structural analyses I conducted. I favored a particular model for the structure of the RSE that specifies a large number of correlated item residuals between the negatively-keyed and positively-keyed items. In the presence of correlated residuals, alpha can either underestimate or overestimate reliability/internal consistency (see Raykov, 2001, building on Zimmerman, 1972).

[Note: I knew reporting alpha was a technical mistake but I thought it was one of those minor methodological sins akin to dropping an f-bomb every now and then in real life. Moreover, I am aware of the alpha criticism literature (and the alternatives like omega). I was relying on the “alpha is a lower bound” heuristic when blogging but this heuristic does not hold in the presence of correlated residuals (see again Raykov, 2001).]
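To make the Raykov (2001) point a bit more concrete, here is a minimal Python sketch that compares alpha with a model-based reliability for a sum score at the population level. Everything in it is an assumption for illustration: the six items, the loadings of .60, the residual variances, and the residual covariances of .25 are made up, and the reliability formula assumes a one-factor model with the factor variance fixed at 1. It is not the Mplus syntax mentioned in this post.

```python
import numpy as np

def cronbach_alpha(cov):
    """Coefficient alpha computed from an item covariance matrix."""
    k = cov.shape[0]
    return (k / (k - 1)) * (1.0 - np.trace(cov) / cov.sum())

def model_based_reliability(loadings, theta):
    """Sum-score reliability under a one-factor model (factor variance = 1).

    theta is the full residual covariance matrix, so any correlated
    residuals count as error variance in the denominator.
    """
    true_var = loadings.sum() ** 2
    return true_var / (true_var + theta.sum())

def implied_cov(loadings, theta):
    """Model-implied item covariance matrix."""
    return np.outer(loadings, loadings) + theta

# Hypothetical population values (not estimates from the RSE data).
k = 6
loadings = np.full(k, 0.6)
theta_clean = np.diag(np.full(k, 0.64))

# Same model, plus positive residual covariances among the first three items.
theta_corr = theta_clean.copy()
for i in range(3):
    for j in range(3):
        if i != j:
            theta_corr[i, j] = 0.25

alpha_clean = cronbach_alpha(implied_cov(loadings, theta_clean))
rel_clean = model_based_reliability(loadings, theta_clean)
alpha_corr = cronbach_alpha(implied_cov(loadings, theta_corr))
rel_corr = model_based_reliability(loadings, theta_corr)

print(f"no correlated residuals: alpha={alpha_clean:.3f}, reliability={rel_clean:.3f}")
print(f"correlated residuals:    alpha={alpha_corr:.3f}, reliability={rel_corr:.3f}")
```

With equal loadings and no correlated residuals the two values agree. Once the positive residual covariances are added, alpha lands near .81 while the model-based value is near .71, so alpha overstates reliability in this toy setup. Flipping the sign of the residual covariances pushes alpha in the other direction.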

Hankins illustrated issues with alpha and the GHQ-12 in a paper he recommended (Hankins, 2008). The upshot of his paper is that alpha often makes the GHQ-12 appear to be a more reliable instrument than other methods of computing reliability based on more appropriate factor structures (say like .90 versus .75).  Depending on how reliability estimates are used, this could be a big deal.

Accordingly, I modified some Mplus syntax using Brown (2015) and Raykov (2001) as a template to compute a more appropriate reliability estimate for the RSE for my preferred model.  Output that includes the syntax is here. [I did this quickly so I might have made a mistake!]  Using this approach, I estimated reliability for my sample of 1,000 to be .699 for my preferred model.  This is compared to the .887 estimate I got with alpha. If you want a way to contextualize this drop, you can think about how this difference would impact the Standard Error of Measurement when considering the precision of estimates for individual scores.  The SD for the mean scores was .724.
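The Standard Error of Measurement comparison can be sketched in a few lines of Python using the textbook formula SEM = SD × sqrt(1 − reliability). The three numbers come from this post; the function name is just a label I made up for the example.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

sd_mean_scores = 0.724  # SD of the RSE mean scores reported above

sem_alpha = standard_error_of_measurement(sd_mean_scores, 0.887)
sem_model = standard_error_of_measurement(sd_mean_scores, 0.699)

print(f"SEM using alpha (.887):          {sem_alpha:.2f}")
print(f"SEM using model estimate (.699): {sem_model:.2f}")
```

The SEM goes from roughly .24 to roughly .40 on the 1 to 5 response metric, so the band of uncertainty around any one person's score gets noticeably wider.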

I go back and forth about whether I think alpha is lame or if all of the criticism of alpha is much ado about nothing. Today I am leaning towards the alpha is lame pole of my thinking.  Correlated residuals are a reality for the scales that I typically use in research. Yikes!

Thanks to people who tweeted and criticized my last post.

Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd edition).

Hankins (2008). The reliability of the twelve-item general health questionnaire (GHQ-12) under realistic assumptions.

Raykov (2001). Bias of coefficient α for fixed congeneric measures with correlated errors.

Zimmerman (1972). Test reliability and the Kuder-Richardson formulas: Derivation from probability theory.



Careless Responders and Factor Structures

Warning: This post will bore most people. Read at your own risk. I also linked to some articles behind paywalls. Sorry!

I have a couple of research obsessions that interest me more than they should. This post is about two in particular: 1) the factor structure of the Rosenberg Self-Esteem Scale (RSE); and 2) the impact that careless responding can have on the psychometric properties of measures.  Like I said, this is a boring post.

I worked at the same institution as Neal Schmitt for about a decade and he once wrote a paper in 1985 (with Daniel Stults) illustrating how careless respondents can contribute to “artifact” factors defined by negatively keyed items (see also Woods, 2006). One implication of Neal’s paper is that careless responders (e.g., people who mark a “1” for all items regardless of the content) confound the evaluation of the dimensionality of scales that include both positively and negatively keyed items. This matters for empirical research concerning the factor structure of the RSE. The RSE is perfectly balanced (it has 5 positively-keyed items and 5 negatively-keyed items). Careless responders might contribute to method artifacts when evaluating the structure of the RSE.

This then raises a critical question: how do you identify careless responders? There is an entire literature on this subject (see e.g., Meade & Craig, 2012) that is well worth reading. One option is to sprinkle directed response items throughout a survey (e.g., “Please mark 4 for quality control purposes”). The trick is that participants can be frustrated by too many of these so these items have to be used judiciously. A second option is to include scales developed explicitly to identify careless responders (see e.g., Marjanovic, Struthers, Cribbie, & Greenglass, 2014). These are good strategies for new data collections. They are not suitable for identifying careless respondents from existing datasets (see Marjanovic, Holden, Struthers, Cribbie, & Greenglass, 2015). This could be a concern as Meade and Craig found that between 10% and 12% of undergraduate participants in a long survey could be flagged as careless responders using a cool latent profile technique. My takeaway from their paper is that many datasets might have some degree of contamination. Yikes!

I experimented with different methods for detecting careless responders on an ad-hoc basis several years ago for a conference talk.  One approach took advantage of the fact that the RSE is a balanced scale. Thus, I computed absolute value discrepancy scores between the positively and negatively keyed items.  [I’m sure someone had the idea before me and that I read about it but simply forgot the source. I also know that some people believe that positively and negatively keyed items reflect different constructs. I’m kind of skeptical of that argument.]

For example, imagine Dr. Evil responds with a “1” to all 10 of the RSE items assessed on a 5-point Likert-type scale. Given that half of the RSE items are reverse scored, 5 of Dr. Evil’s 1s will be transformed to 5s. Her/his average for the positively keyed items will be 1 whereas the average for the negatively keyed items will be a 5. This generates a value of 4 on the discrepancy index (the maximum in this example). I basically found that selecting people with smaller discrepancy scores cleaned up the evaluation of the factor structure of the RSE. I dropped the 10% of the sample with the highest discrepancy scores but this decision was made on a post hoc basis.

[I know there are all sorts of limitations and assumptions with this approach. For example, one obvious limitation is that Dr. Super Evil who responds a 3 to all items, regardless of her/his true feelings, earns a discrepancy score of 0 and is retained in the analysis. Dr. Super Evil is a real problem. I suspect she/he is friends with the Hamburglar.]
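For anyone who wants to try the discrepancy index, here is a rough Python sketch of the calculation described above. The column positions of the negatively keyed items are an assumption for the example (I simply put them in the last five columns), and the function name and the three respondents are made up.

```python
import numpy as np

def discrepancy_index(responses, negative_items, scale_min=1, scale_max=5):
    """Absolute difference between the mean of the positively keyed items
    and the mean of the reverse-scored negatively keyed items, per person."""
    responses = np.asarray(responses, dtype=float)
    neg_mask = np.zeros(responses.shape[1], dtype=bool)
    neg_mask[list(negative_items)] = True

    pos_mean = responses[:, ~neg_mask].mean(axis=1)
    # Reverse scoring on a 1-5 scale maps 1 -> 5, 2 -> 4, and so on.
    neg_reversed = scale_min + scale_max - responses[:, neg_mask]
    neg_mean = neg_reversed.mean(axis=1)
    return np.abs(pos_mean - neg_mean)

# Hypothetical layout: columns 0-4 positively keyed, columns 5-9 negatively keyed.
negative_items = [5, 6, 7, 8, 9]

raw = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # Dr. Evil: all 1s
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # Dr. Super Evil: all 3s
    [4, 4, 4, 4, 4, 2, 2, 2, 2, 2],  # a consistent respondent
])

scores = discrepancy_index(raw, negative_items)
print(scores)
```

Dr. Evil gets the maximum of 4, while Dr. Super Evil and the consistent respondent both get 0, which is exactly the limitation noted above. Dropping the top 10% would then be a matter of comparing each score against something like `np.quantile(scores, 0.90)` in a real dataset.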

Marjanovic et al. (2015) recently published an interesting approach for detecting careless responding. They propose calculating the standard deviation of the set of items designed to assess the same construct for each person (called the inter-item standard deviation or ISD). Here the items all need to be keyed in the correct direction and I suspect this approach works best for scales with a mix of positively and negatively keyed items given issues of rectangular responding. [Note: Others have used the inter-item standard deviation as an indicator of substantive constructs but these authors are using this index as a methodological tool.]
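The ISD itself is easy to compute. Here is a minimal Python sketch, assuming the items have already been scored in the same direction. Using the n − 1 denominator (`ddof=1`) is my assumption; check the Marjanovic et al. (2015) paper for the exact definition they use. The two respondents are made up.

```python
import numpy as np

def inter_item_sd(scored_items):
    """Per-person standard deviation across items that measure one construct.

    Expects a people-by-items array with all items keyed in the same
    direction. ddof=1 (sample SD) is an assumption for this sketch.
    """
    scored_items = np.asarray(scored_items, dtype=float)
    return scored_items.std(axis=1, ddof=1)

# Hypothetical responses after reverse scoring the negatively keyed items.
scored = np.array([
    [1, 1, 1, 1, 1, 5, 5, 5, 5, 5],  # straight-lined 1s on a balanced scale
    [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],  # consistent after reverse scoring
])

isd = inter_item_sd(scored)
print(isd)
```

The straight-liner ends up with a large ISD (a bit above 2) because reverse scoring turns half of the 1s into 5s, whereas the consistent respondent has an ISD of 0. That is why a balanced scale helps here.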

Marjanovic et al. (2015) had a dataset with responses to the Marjanovic et al. (2014) Conscientious Responders Scale (CRS) as well as responses to Big Five scales. A composite based on the average of the ISDs for each of the Big Five scales was strongly negatively correlated with responses to the CRS (r = -.81, n = 284). Things looked promising based on the initial study. They also showed how to use a random number generator to develop empirical benchmarks for the ISD. Indeed, I got a better understanding of the ISD when I simulated a dataset of 1,000 responses to 10 hypothetical items in which item responses were independent and drawn from a distribution whereby each of the five response options has a .20 proportion in the population. [I also computed the ISD when preparing my talk back in the day but I focused on the discrepancy index – I just used the ISD to identify the people who gave all 3s to the RSE items by selecting mean = 3 and ISD = 0. There remains an issue with separating those who have “neutral” feelings about the self from people like Dr. Super Evil.]
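The random-data simulation can be sketched in Python along the following lines. The seed, the function name, and the choice of summary statistics are mine, and the sketch does not reproduce the specific cutoff rule in Marjanovic et al. (2015); it only shows what the ISD distribution looks like for purely random responding.

```python
import numpy as np

def simulate_random_isd(n_people=1000, n_items=10, n_options=5, seed=2015):
    """Simulate independent, uniformly distributed item responses and
    return the raw data plus each simulated person's ISD."""
    rng = np.random.default_rng(seed)
    # Each response option (1..n_options) has equal probability.
    data = rng.integers(1, n_options + 1, size=(n_people, n_items))
    isd = data.std(axis=1, ddof=1)  # sample SD; an assumption for this sketch
    return data, isd

random_data, random_isd = simulate_random_isd()

print("mean ISD under random responding:", round(float(random_isd.mean()), 2))
print("5th and 95th percentiles:", np.percentile(random_isd, [5, 95]).round(2))
```

The population SD of a uniform 1 to 5 response is about 1.41, so random responders tend to have ISDs in that neighborhood. Real respondents answering a coherent scale should usually fall well below that, which is the intuition behind using simulated data as a benchmark.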

Anyways, I used their approach and it works well to help clean up analyses of the factor structure of the RSE.  I first drew a sample of 1,000 from a larger dataset of responses to the RSE (the same dataset I used for my conference presentation in 2009).  I only selected responses from European American students to avoid concerns about cultural differences.  The raw data and a  brief description are available.  The ratio of the first to second eigenvalues was 3.13 (5.059 and 1.616) and the scree plot would suggest 2 factors. [I got these eigenvalues from Mplus and this is based on the correlation matrix with 1.0s on the diagonal.  Some purists will kill me. I get it.]

I then ran through a standard set of models for the RSE.  A single factor model was not terribly impressive (e.g., RMSEA = .169, TLI = .601, SRMR = .103) and I thought the best fit was a model with a single global factor and correlated residuals for the negatively and positively keyed items minus one correlation (RMSEA = .068, TLI = .836, SRMR = .029).  I computed the internal consistency coefficient (alpha = .887, average inter-item correlation = .449). Tables with fit indices, the Mplus syntax, and input data are available.
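As a quick sanity check on the alpha value, the standardized version of alpha can be computed directly from the number of items and the average inter-item correlation. A small Python sketch follows; note that the alpha I reported may be the raw (covariance-based) version, so the two numbers need not match exactly.

```python
def standardized_alpha(n_items, mean_r):
    """Standardized alpha from the average inter-item correlation:
    k * r / (1 + (k - 1) * r)."""
    return (n_items * mean_r) / (1.0 + (n_items - 1) * mean_r)

alpha_std = standardized_alpha(n_items=10, mean_r=0.449)
print(f"standardized alpha: {alpha_std:.3f}")
```

This gives about .891, which is close to the .887 reported above. The formula also makes it clear why a 10-item scale can have a high alpha even when the items correlate only moderately with one another.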

Using the Marjanovic et al. (2015) approach with random data, I identified 15% of the sample that could be flagged as random responders (see their paper for details). The RSE structure looked more unidimensional with this subset of 850 non-careless responders. The ratio of the first to second eigenvalues was 6.22 (6.145 and 0.988) and the models tended to have stronger factor loadings and comparatively better fit (even adjusting for the smaller sample size). Consider that the average loading for the single factor model for all participants was .67 and this increased to .76 with the “clean” dataset. The single global model fit was still relatively unimpressive but better than before (RMSEA = .129, TLI = .852, SRMR = .055) and the single global model with correlated item residuals was still the best (RMSEA = .063, TLI = .964, SRMR = .019). The alpha was even a bit better (.926, average inter-item correlation = .570).

So I think there is something to be said for trying to identify careless responders before undertaking analyses designed to evaluate the structure of the Rosenberg and other measures as well.  I also hope people continue to develop and evaluate simple ways for flagging potential careless responders for both new and existing datasets.  This might not be “sexy” work but it is important and useful.


Updates (1:30 CST; 2 June 2015): A few people sent/tweeted links to good papers.

Huang et al. (2012). Detecting and deterring insufficient effort responding to surveys.

Huang, Liu, & Bowling (2015). Insufficient effort responding: Examining an insidious confound in survey data.

Maniaci & Rogge (2014). Caring about carelessness: Participant inattention and its effects on research.

Reise & Widaman (1999). Assessing the fit of measurement models at the individual level: A comparison of item response theory and covariance structure approaches.

(1:00 CST; 3 June 2015): Even More Recommendations!  Sanjay rightly pointed out that my post was stupid. But the references and suggested readings are gold!  So even if my post wasted your time, the references should prove useful.

DeSimone, Harms, & DeSimone (2014).  Best practice recommendations for data screening.

Hankins (2008). The reliability of the twelve-item General Health Questionnaire (GHQ-12) under realistic assumptions.

See also: Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. {Good stuff pointing to limitations with alpha and alternatives}

Savalei & Falk (2014).  Recovering substantive factor loadings in the presence of acquiescence bias: A Comparison of three approaches.



Is Obama a Narcissist?

Warning: For educational purposes only. I am a personality researcher not a political scientist!

Short Answer: Probably Not.

Longer Answer: There has been a fair bit of discussion about narcissism and the current president (see here for example). Some of this stemmed from recent claims about his use of first person pronouns (i.e., a purported use of greater “I-talk”). A big problem with that line of reasoning is that the empirical evidence linking narcissism with I-talk is surprisingly shaky.  Thus, Obama’s use of pronouns is probably not very useful when it comes to making inferences about his levels of narcissism.

Perhaps a better way to gauge Obama’s level of narcissism is to see how well his personality profile matches a profile typical of someone with Narcissistic Personality Disorder (NPD).  The good news is that we have such a personality profile for NPD thanks to Lynam and Widiger (2001).  Those researchers asked 12 experts to describe the prototype case of NPD in terms of the facets of the Five-Factor Model (FFM). In general, they found that someone with NPD could be characterized as having the following characteristics…

High Levels: Assertiveness, Excitement Seeking, Hostility, and Openness to Actions (i.e., a willingness to try new things)

Low Levels: Agreeableness (all aspects), Self-Consciousness, Warmth, Openness to Feelings (i.e., a lack of awareness of one’s emotional state and some elements of empathy)

The trickier issue is finding good data on Obama’s actual personality. My former students Edward Witt and Robert Ackerman did some research on this topic that can be used as a starting point.  They had 86 college students (51 liberals and 35 conservatives) rate Obama’s personality using the same dimensions Lynam and Widiger used to generate the NPD profile.  We can use the ratings of Obama averaged across the 86 different students as an informant report of his personality.

Note: I know this approach is far from perfect and it would be ideal to have non-partisan expert raters of Obama’s personality (specifically the 30 facets of the FFM). If you have such a dataset, send it my way (self-reported data from the POTUS would be welcome too)! Moreover, Witt and Ackerman found that liberals and conservatives had some differences when it came to rating Obama’s personality.  For example, conservatives saw him higher in hostility and lower in warmth than liberals.  Thus, the profile I am using might tend to have a rosier view of Obama’s personality than a profile generated from another sample with more conservatives (send me such a dataset if you have it!). An extremely liberal sample might generate an even more positive profile than what they obtained.

With those caveats out of the way, the next step is simple: Calculate the Intraclass Correlation Coefficient (ICC) between his informant-rated profile and the profile of the prototypic person with NPD. The answer is basically zero (ICC = -.08; Pearson’s r = .06).  In short, I don’t think Obama fits the bill of the prototypical narcissist. More data are always welcome but I would be somewhat surprised if Obama’s profile matched well with the profile of a quintessential narcissist in another dataset.
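For readers who want to try this kind of profile matching, here is a minimal Python sketch. I am assuming a double-entry intraclass correlation here, which is one common choice for profile similarity; I did not say above which ICC I used, so treat that as an assumption. The five-facet profiles are invented for illustration and are not the Obama or NPD data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two profiles (shape only)."""
    return float(np.corrcoef(x, y)[0, 1])

def double_entry_icc(x, y):
    """Double-entry intraclass correlation: enter each pair of facet
    scores twice (x, y) and (y, x), then correlate the stacked columns.
    Sensitive to differences in elevation as well as shape."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    first = np.concatenate([x, y])
    second = np.concatenate([y, x])
    return float(np.corrcoef(first, second)[0, 1])

# Hypothetical five-facet profiles (made-up numbers).
prototype = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
same_shape_higher = prototype + 2.0  # same shape, higher elevation

r_shape = pearson_r(prototype, same_shape_higher)
icc_de = double_entry_icc(prototype, same_shape_higher)

print(f"Pearson r:        {r_shape:.2f}")
print(f"Double-entry ICC: {icc_de:.2f}")
```

In this toy case Pearson's r is 1 because the two profiles have the same shape, but the double-entry ICC is only about .33 because the second profile sits two points higher across the board. That difference is one reason the two indices are reported side by side.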

As an aside, Ashley Watts and colleagues evaluated levels of narcissism in the first 43 presidents and they used historical experts to rate presidential personalities. Their paper is extremely interesting and well worth reading. They found these five presidents had personalities with the highest relative approximation to the prototype of NPD: LBJ, Nixon, Jackson, Johnson, and Arthur.  The five lowest presidents were Lincoln, Fillmore, Grant, McKinley, and Monroe. (See Table 4 in their report).

Using data from the Watts et al. paper, I computed standardized scores for the estimates of Obama’s grandiose and vulnerable narcissism levels from the Witt and Ackerman profile. These scores indicated Obama was below average by over .50 SDs for both dimensions (Grandiose: -.70; Vulnerable: -.63).   The big caveat here is that the personality ratings for Obama were provided by undergrads and the Watts et al. data were from experts.  Again, however, there were no indications that Obama is especially narcissistic compared to the other presidents.

Thanks to Robert Ackerman, Matthias Mehl, Rich Slatcher, Ashley Watts, and Edward Witt for insights that helped with this post.

Postscript 1: This is a light-hearted post. However, the procedures I used could make for a fun classroom project for Personality Psychology 101. Have the students rate a focal individual such as Obama or a character from TV, movies, etc. and then compare the consensus profile to the PD profiles. I have all of the materials to do this if you want them. The variance in the ratings across students is also potentially interesting.

Postscript 2: Using this same general procedure, Edward Witt, Christopher Hopwood, and I concluded that Anakin Skywalker did not strongly match the profile of someone with BPD and neither did Darth Vader (counter to these speculations).  They were more like successful psychopaths.  But that is a blog post for another day!

Silly Questions to Ask Children

I have been working on a project designed to measure a certain individual difference in children as early as 5 years of age. There are a number of concerns about the use of self-reports with young children so this has been an overarching concern in this project. To partially address this issue, we came up with a handful of items that would be useful for detecting unusual responses in children. These items might be used to identify children who did not understand how to use the response scale or flag children who were giving responses that would be considered invalid.  There is a cottage industry of these kinds of scales for adult personality inventories but fewer options for kids.  (And yes I know about those controversies in the literature over these kinds of scales.)

Truth be told, I like writing items and I think this is true for many researchers. I am curious about how people respond to all sorts of questions especially silly ones.  It is even better if the silly ones tap something interesting about personality or ask participants about dinosaurs.

Here are a few sample items:

1. How do you feel about getting shots from the doctor?

2. How do you feel about getting presents for your birthday?

And my favorite item ever….

3. How would you feel about being eaten by a T-Rex?

The fact that we have asked over 800 kids this last question is sort of ridiculous but it makes me happy. I predicted that kids should report negative responses for this one. This was true for the most part but 11.3% of the sample registered a positive response. In fact, the T-Rex item sparked a heated conversation in my household this morning. My spouse (AD) is a former school teacher and AD thought some kids might think it was cool to see a T-Rex. She thought it was a bad item. My youngest child (SD) thought it would be bad to be eaten by said T-Rex even if it was cool to see one in person. I think SD was on my side.

I have had enough controversy over the past few weeks so I wanted to move on from this breakfast conversation. Thus, I did what any sensible academic would do – I equivocated. I acknowledged that items usually reflect multiple sources of variance and all have some degree of error. I also conceded that this item might pick up on sensation seeking tendencies. There could be some kids who might find it thrilling to be eaten by a T-Rex. Then I took SD to school and cried over a large cup of coffee.

But I still like this item and I think most people would think it would suck to be eaten by a T-Rex. It might also be fun to crowd source the writing of additional items. Feel free to make suggestions.

PS: I want to acknowledge my two collaborators on this project – Michelle Harris and Kali Trzesniewski. They did all of the hard work collecting these data.