Alpha and Correlated Item Residuals

Subtitle: Is alpha lame?

My last post was kind of stupid, as Sanjay tweeted. [Sanjay was actually much more diplomatic when he pointed out the circularity in my approach.] I selected the non-careless responders in a way that guarantees a more unidimensional result. A potentially better approach would be to identify the non-careless responders using a different set of scales and then repeat the analyses. This flaw aside, I think my broader points still stand. It is useful to look for ways to screen existing datasets given a literature suggesting that: a) careless responders are present in many datasets; and b) careless responders often distort substantive results (see the references and additional recommendations appended to the original post).

Another interesting criticism arose from my offhand reporting of alpha coefficients. Matthew Hankins (via Twitter) rightly pointed out that it is a mistake to compute alpha in light of the structural analyses I conducted. I favored a model for the structure of the RSE that specifies a large number of correlated item residuals among the negatively keyed and positively keyed items. In the presence of correlated residuals, alpha can either underestimate or overestimate reliability/internal consistency (see Raykov, 2001, building on Zimmerman, 1972).

[Note: I knew reporting alpha was a technical mistake, but I thought it was one of those minor methodological sins akin to dropping an f-bomb every now and then in real life. Moreover, I am aware of the alpha criticism literature (and the alternatives like omega). When blogging, I leaned on the "alpha is a lower bound" heuristic, but that heuristic does not hold in the presence of correlated residuals (see again Raykov, 2001).]
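To make the under/overestimate point concrete, here is a minimal numerical sketch (my own toy numbers, not the RSE data or the Mplus model discussed below): with one positively correlated residual pair, alpha computed from the model-implied covariance matrix comes out higher than the model-based reliability of the unit-weighted composite.

```python
import numpy as np

# Hypothetical 4-item congeneric scale with one correlated residual pair.
k = 4
lam = np.array([0.7, 0.7, 0.7, 0.7])        # factor loadings
psi = 1.0                                    # factor variance
theta = np.diag([0.51, 0.51, 0.51, 0.51])    # residual variances
theta[0, 1] = theta[1, 0] = 0.20             # one correlated residual pair

# Model-implied covariance matrix of the items
sigma = psi * np.outer(lam, lam) + theta
total_var = sigma.sum()

# Cronbach's alpha computed from that covariance matrix
alpha = (k / (k - 1)) * (1 - np.trace(sigma) / total_var)

# Model-based reliability: true-score variance of the composite over its total variance,
# with the residual covariance included in the denominator
omega = (lam.sum() ** 2 * psi) / total_var

print(f"alpha = {alpha:.3f}, model-based reliability = {omega:.3f}")
# ~.81 versus ~.76: the shared error gets counted by alpha as if it were
# true-score covariance.
```

Flip the sign of that residual covariance and the direction of the bias flips, which is why you cannot know a priori whether alpha is an underestimate or an overestimate.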

Hankins illustrated issues with alpha and the GHQ-12 in a paper he recommended (Hankins, 2008). The upshot of his paper is that alpha makes the GHQ-12 appear more reliable than it does under reliability estimates based on more appropriate factor structures (roughly .90 versus .75). Depending on how reliability estimates are used, this could be a big deal.

Accordingly, I modified some Mplus syntax, using Brown (2015) and Raykov (2001) as a template, to compute a more appropriate reliability estimate for the RSE under my preferred model. Output that includes the syntax is here. [I did this quickly, so I might have made a mistake!] Using this approach, I estimated the reliability for my sample of 1,000 to be .699 for my preferred model, compared to the .887 estimate I got with alpha. One way to contextualize this drop is to consider how the difference affects the standard error of measurement, and thus the precision of estimates for individual scores. The SD of the mean scores was .724.
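For what it is worth, here is the quick arithmetic behind that contextualization, using the usual formula SEM = SD × sqrt(1 − reliability) and the numbers reported above:

```python
# Standard error of measurement under the two reliability estimates
# reported above (SD of the mean scores = .724).
sd = 0.724
for label, rel in [("alpha", 0.887), ("model-based", 0.699)]:
    sem = sd * (1 - rel) ** 0.5
    print(f"{label:>11}: reliability = {rel:.3f}, SEM = {sem:.3f}")
# SEM is roughly .24 under the alpha estimate versus roughly .40 under the
# model-based estimate.
```

So the SEM moves from roughly .24 to roughly .40 scale points, a substantial loss of precision when interpreting individual scores.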

I go back and forth about whether I think alpha is lame or whether all of the criticism of alpha is much ado about nothing. Today I am leaning toward the "alpha is lame" pole of my thinking. Correlated residuals are a reality for the scales I typically use in research. Yikes!

Thanks to people who tweeted and criticized my last post.

Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd edition).

Hankins, M. (2008). The reliability of the twelve-item General Health Questionnaire (GHQ-12) under realistic assumptions.

Raykov, T. (2001). Bias of coefficient α for fixed congeneric measures with correlated errors.

Zimmerman, D. W. (1972). Test reliability and the Kuder-Richardson formulas: Derivation from probability theory.

 

Careless Responders and Factor Structures

Warning: This post will bore most people. Read at your own risk. I also linked to some articles behind paywalls. Sorry!

I have a couple of research obsessions that interest me more than they should. This post is about two in particular: 1) the factor structure of the Rosenberg Self-Esteem Scale (RSE); and 2) the impact that careless responding can have on the psychometric properties of measures.  Like I said, this is a boring post.

I worked at the same institution as Neal Schmitt for about a decade, and he once wrote a paper in 1985 (with Daniel Stults) illustrating how careless respondents can contribute to “artifact” factors defined by negatively keyed items (see also Woods, 2006). One implication of Neal’s paper is that careless responders (e.g., people who mark a “1” for all items regardless of content) confound the evaluation of the dimensionality of scales that include both positively and negatively keyed items. This matters for empirical research concerning the factor structure of the RSE. The RSE is perfectly balanced (it has 5 positively keyed items and 5 negatively keyed items), so careless responders might contribute to method artifacts when evaluating its structure.

This then raises a critical question — how do you identify careless responders? There is an entire literature on this subject (see, e.g., Meade & Craig, 2012) that is well worth reading. One option is to sprinkle directed-response items throughout a survey (e.g., “Please mark 4 for quality control purposes”). The trick is that participants can be frustrated by too many of these, so such items have to be used judiciously. A second option is to include scales developed explicitly to identify careless responders (see, e.g., Marjanovic, Struthers, Cribbie, & Greenglass, 2014). These are good strategies for new data collections, but they are not suitable for identifying careless respondents in existing datasets (see Marjanovic, Holden, Struthers, Cribbie, & Greenglass, 2015). This could be a concern, as Meade and Craig found that between 10% and 12% of undergraduate participants completing a long survey could be flagged as careless responders using a cool latent profile technique. My takeaway from their paper is that many datasets might have some degree of contamination. Yikes!

I experimented with different methods for detecting careless responders on an ad hoc basis several years ago for a conference talk. One approach took advantage of the fact that the RSE is a balanced scale: I computed absolute-value discrepancy scores between the means of the positively keyed and (reverse-scored) negatively keyed items. [I’m sure someone had the idea before me and that I read about it but simply forgot the source. I also know that some people believe positively and negatively keyed items reflect different constructs. I’m kind of skeptical of that argument.]

For example, imagine Dr. Evil responds with a “1” to all 10 of the RSE items assessed on a 5-point Likert-type scale. Given that half of the RSE items are reverse scored, 5 of Dr. Evil’s 1s will be transformed to 5s. Her/his average for the positively keyed items will be 1, whereas the average for the (reverse-scored) negatively keyed items will be 5. This generates a value of 4 on the discrepancy index (the maximum in this example). I basically found that selecting people with smaller discrepancy scores cleaned up the evaluation of the factor structure of the RSE. I dropped the 10% of the sample with the highest discrepancy scores, but that cutoff was chosen on a post hoc basis.

[I know there are all sorts of limitations and assumptions with this approach. For example, one obvious limitation is that Dr. Super Evil who responds a 3 to all items, regardless of her/his true feelings, earns a discrepancy score of 0 and is retained in the analysis. Dr. Super Evil is a real problem. I suspect she/he is friends with the Hamburglar.]
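Here is a minimal sketch of the discrepancy index as I have described it (the function and the positive/negative item ordering are mine, for illustration only):

```python
import numpy as np

def discrepancy_index(responses, pos_idx, neg_idx, scale_min=1, scale_max=5):
    """Absolute difference between the mean of the positively keyed items
    and the mean of the reverse-scored negatively keyed items."""
    responses = np.asarray(responses, dtype=float)
    pos_mean = responses[pos_idx].mean()
    neg_mean = (scale_max + scale_min - responses[neg_idx]).mean()  # reverse score
    return abs(pos_mean - neg_mean)

# Pretend the first five items are positively keyed and the last five negatively keyed.
pos_idx, neg_idx = list(range(5)), list(range(5, 10))

dr_evil = [1] * 10        # marks "1" for everything
dr_super_evil = [3] * 10  # marks "3" for everything

print(discrepancy_index(dr_evil, pos_idx, neg_idx))        # 4.0 -- flagged
print(discrepancy_index(dr_super_evil, pos_idx, neg_idx))  # 0.0 -- slips through
```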

Marjanovic et al. (2015) recently published an interesting approach for detecting careless responding. They propose calculating, for each person, the standard deviation of the set of items designed to assess the same construct (called the inter-item standard deviation, or ISD). Here all of the items need to be keyed in the correct direction, and I suspect this approach works best for scales with a mix of positively and negatively keyed items, given issues of rectangular responding. [Note: Others have used the inter-item standard deviation as an indicator of substantive constructs, but these authors are using the index as a methodological tool.]
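Computing the ISD is straightforward; the sketch below assumes a respondents-by-items array for a single scale with reverse scoring already applied (the example data and function name are mine, not Marjanovic et al.'s):

```python
import numpy as np

def inter_item_sd(item_scores):
    """Within-person SD across the items of one scale.
    item_scores: 2-D array, rows = respondents, columns = items
    (already reverse scored where applicable)."""
    return np.asarray(item_scores, dtype=float).std(axis=1, ddof=1)

# Three illustrative respondents answering a 10-item, 5-point scale
demo = np.array([
    [4, 4, 5, 4, 4, 5, 4, 4, 4, 5],   # consistent responder -> small ISD
    [1, 1, 1, 1, 1, 5, 5, 5, 5, 5],   # straight-line 1s after reverse scoring -> large ISD
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],   # straight-line 3s -> ISD of 0 (the Dr. Super Evil problem)
])
print(inter_item_sd(demo).round(2))
```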

Marjanovic et al. (2015) had a dataset with responses to the Marjanovic et al. (2014) Conscientious Responders Scale (CRS) as well as responses to Big Five scales. A composite based on the average of the ISDs for each of the Big Five scales was strongly negatively correlated with responses to the CRS (r = -.81, n = 284), so things looked promising based on that initial study. They also showed how to use a random number generator to develop empirical benchmarks for the ISD. Indeed, I got a better understanding of the ISD when I simulated a dataset of 1,000 responses to 10 hypothetical items in which item responses were independent and each of the five response options had a .20 probability in the population. [I also computed the ISD when preparing my talk back in the day, but I focused on the discrepancy index – I just used the ISD to identify the people who gave all 3s to the RSE items by selecting mean = 3 and ISD = 0. There remains an issue with separating those who have “neutral” feelings about the self from people like Dr. Super Evil.]
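Here is a minimal version of that kind of simulation (my own quick sketch of the general idea, not Marjanovic et al.'s exact procedure):

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# 1,000 simulated "pure random responders" answering 10 items on a 1-5 scale,
# each response option equally likely (.20), responses independent across items.
random_data = rng.integers(1, 6, size=(1000, 10))

isd = random_data.std(axis=1, ddof=1)   # within-person inter-item SD

print(f"mean ISD = {isd.mean():.2f}, SD of ISDs = {isd.std(ddof=1):.2f}")
# Purely random responding produces high ISDs (around 1.4 on a 1-5 scale), so
# real respondents whose ISDs look like draws from this distribution are
# candidates for flagging; Marjanovic et al. (2015) describe how to turn the
# simulated distribution into an actual cutoff.
```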

Anyways, I used their approach, and it works well to help clean up analyses of the factor structure of the RSE. I first drew a sample of 1,000 from a larger dataset of responses to the RSE (the same dataset I used for my conference presentation in 2009). I selected only responses from European American students to avoid concerns about cultural differences. The raw data and a brief description are available. The ratio of the first to second eigenvalues was 3.13 (5.059 and 1.616), and the scree plot would suggest 2 factors. [I got these eigenvalues from Mplus, and they are based on the correlation matrix with 1.0s on the diagonal. Some purists will kill me. I get it.]
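If you want to check that kind of eigenvalue ratio outside of Mplus, a quick sketch looks like this (the placeholder data below stand in for the real respondents-by-items matrix of RSE scores, reverse scored where needed):

```python
import numpy as np

# rse_items: 2-D array, rows = respondents, columns = the 10 RSE items.
# Placeholder random data here; substitute the actual item scores.
rng = np.random.default_rng(0)
rse_items = rng.integers(1, 6, size=(1000, 10))

corr = np.corrcoef(rse_items, rowvar=False)        # item correlation matrix, 1.0s on the diagonal
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

print(eigenvalues.round(3))
print(f"ratio of first to second eigenvalue = {eigenvalues[0] / eigenvalues[1]:.2f}")
```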

I then ran through a standard set of models for the RSE. A single-factor model was not terribly impressive (e.g., RMSEA = .169, TLI = .601, SRMR = .103), and I thought the best-fitting model was one with a single global factor plus correlated residuals among the negatively and positively keyed items, minus one correlation (RMSEA = .068, TLI = .836, SRMR = .029). I also computed the internal consistency coefficient (alpha = .887; average inter-item correlation = .449). Tables with fit indices, the Mplus syntax, and the input data are available.
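For completeness, the two statistics in that parenthetical can be computed directly from the item-level data; here is a small sketch (again assuming a respondents-by-items array with reverse scoring already applied, and my own function name):

```python
import numpy as np

def alpha_and_avg_r(item_scores):
    """Cronbach's alpha and the average inter-item correlation for one scale."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    cov = np.cov(x, rowvar=False)
    alpha = (k / (k - 1)) * (1 - np.trace(cov) / cov.sum())
    corr = np.corrcoef(x, rowvar=False)
    avg_r = corr[np.triu_indices(k, k=1)].mean()   # mean of the off-diagonal correlations
    return alpha, avg_r

# Usage: alpha, avg_r = alpha_and_avg_r(rse_items)
```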

Using the Marjanovic et al. (2015) approach with random data, I identified 15% of the sample that could be flagged as random responders (see their paper for details). The RSE structure looked more unidimensional with the remaining subset of 850 non-careless responders. The ratio of the first to second eigenvalues was 6.22 (6.145 and 0.988), and the models tended to have stronger factor loadings and comparatively better fit (even adjusting for the smaller sample size). Consider that the average loading for the single-factor model was .67 for all participants and increased to .76 with the “clean” dataset. The fit of the single global factor model was still relatively unimpressive but better than before (RMSEA = .129, TLI = .852, SRMR = .055), and the single global factor model with correlated item residuals was still the best (RMSEA = .063, TLI = .964, SRMR = .019). The alpha was even a bit better (.926; average inter-item correlation = .570).

So I think there is something to be said for trying to identify careless responders before undertaking analyses designed to evaluate the structure of the Rosenberg (and other measures as well). I also hope people continue to develop and evaluate simple ways of flagging potential careless responders in both new and existing datasets. This might not be “sexy” work, but it is important and useful.

 

Updates (1:30 CST; 2 June 2015): A few people sent/tweeted links to good papers.

Huang et al. (2012). Detecting and deterring insufficient effort responding to surveys.

Huang, Liu, & Bowling (2015). Insufficient effort responding: Examining an insidious confound in survey data.

Maniaci & Rogge (2014). Caring about carelessness: Participant inattention and its effects on research.

Reise & Widaman (1999). Assessing the fit of measurement models at the individual level: A comparison of item response theory and covariance structure approaches.

(1:00 CST; 3 June 2015): Even More Recommendations! Sanjay rightly pointed out that my post was stupid, but the references and suggested readings are gold! So even if my post wasted your time, the references should prove useful.

DeSimone, Harms, & DeSimone (2014).  Best practice recommendations for data screening.

Hankins (2008). The reliability of the twelve-item General Health Questionnaire (GHQ-12) under realistic assumptions.

See also: Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. [Good stuff pointing to limitations of alpha and the alternatives.]

Savalei & Falk (2014). Recovering substantive factor loadings in the presence of acquiescence bias: A comparison of three approaches.