Free Advice about the Subject Pool

Around here, the Fall semester starts in just a few weeks. This means the MSU subject pool will soon be teeming with “volunteers” eager to earn their research participation credits. Like many of my colleagues, I have often wondered about the pros and cons of relying so heavily on college sophomores in the laboratory (e.g., Sears, 1986, 2008). Regardless of your take on these issues, it is hard to imagine that subject pools will go away in the near future. Thus, I think it is important to try to learn more about the characteristics of participants in these subject pools and to think more carefully about issues that may impact the generalizability of these types of studies. I still think college student subject pools generate convenience samples even if a certain researcher disagrees.

I did a paper with my former graduate student Edward Witt and our undergraduate assistant Matthew Orlando about differences in the characteristics of subject pool members who chose to participate at different points in the semester (Witt, Donnellan, & Orlando, 2011). We also tested for selection effects in the chosen mode of participation by offering an online and an in-person version of the same study (participants were only allowed to participate through one mode). We conducted that study in the Spring of 2010 with a total sample size of 512 participants.

In the original report, we found evidence that more extraverted students selected the in-person version of the study (as opposed to the online version) and that average levels of Conscientiousness were lower at the end of the semester compared to the beginning. In other words, individuals with relatively lower scores on this personality attribute were more likely to show up at the end of the term. We also found that we had a greater proportion of men at the end of the term compared to the start. To be clear, the effect sizes were small and some might even say trivial. Nonetheless, our results suggested to us that participants at the start of the semester are likely to be different from participants at the end of the term in some ways. This result is probably unsurprising to anyone who has taught a college course and/or collected data from a student sample (sometimes naïve theories are credible).

We repeated the study in the Fall semester of 2010 but never bothered to publish the results (Max. N with usable data = 594). (We try to replicate our results when we can.) It is reassuring to note that the major results were replicated in the sense of obtaining similar effect size estimates and levels of statistical significance. We used the same personality measure (John Johnson’s 120-item IPIP approximation of the NEO PI-R) and the same design. Individuals who self-selected into the online version of the study were less extraverted than those who selected into the in-person version (d = -.18, t = 2.072, df = 592, p = .039; Witt et al., 2011: d = -.26).   This effect held controlling for the week of the semester and gender. Likewise, we had a greater proportion of men at the end of the term compared to the start (e.g., roughly 15% of the participants were men in September versus 43% in December).
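For readers curious about the mechanics, here is a minimal sketch of how a mode comparison like this can be computed. This is not our actual analysis script; the file and column names (subject_pool_fall2010.csv, mode, extraversion) are hypothetical stand-ins.

```python
# Illustrative sketch: independent-samples t-test plus a pooled-SD Cohen's d
# comparing extraversion across self-selected participation modes.
# File and column names are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("subject_pool_fall2010.csv")

online = df.loc[df["mode"] == "online", "extraversion"].dropna()
in_person = df.loc[df["mode"] == "in_person", "extraversion"].dropna()

# Equal-variance t-test; df = n1 + n2 - 2
t, p = stats.ttest_ind(online, in_person)

# Cohen's d using the pooled standard deviation
n1, n2 = len(online), len(in_person)
pooled_sd = np.sqrt(
    ((n1 - 1) * online.var(ddof=1) + (n2 - 1) * in_person.var(ddof=1)) / (n1 + n2 - 2)
)
d = (online.mean() - in_person.mean()) / pooled_sd

print(f"t({n1 + n2 - 2}) = {t:.3f}, p = {p:.3f}, d = {d:.2f}")
```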

The more interesting result (to me) was that average levels of Conscientiousness were also lower at the end of the semester than at the beginning (standardized regression coefficient for week = -.12, p = .005; the model also includes gender). Again, the effect sizes were small and some might say trivial. However, a different way to understand this effect is to standardize Conscientiousness within gender (women self-report higher scores) and then plot average scores by week of data collection.

The average for the first two weeks of data collection (September of 2010) was .29 (SD = 1.04) whereas the average for the last three weeks (December of 2010) was -.18 (SD = 1.00).  Viewed in this light, the difference between the beginning of the semester and the end of the semester starts to look a bit more substantial.
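A small sketch of the within-gender standardization described above is shown below, again with hypothetical file and column names (gender, conscientiousness, week).

```python
# Illustrative sketch: z-score Conscientiousness separately within each gender,
# then plot the weekly means across the semester. Names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("subject_pool_fall2010.csv")

# Standardize Conscientiousness within gender
df["c_z"] = df.groupby("gender")["conscientiousness"].transform(
    lambda x: (x - x.mean()) / x.std(ddof=1)
)

# Average standardized score for each week of data collection
weekly_means = df.groupby("week")["c_z"].mean()

weekly_means.plot(marker="o")
plt.xlabel("Week of data collection")
plt.ylabel("Mean Conscientiousness (z, within gender)")
plt.show()
```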

So here is my free advice: If you want more conscientious participants, be ready to run early in the term. If you want to have an easier time recruiting men, wait until the end of the term. (Controlling for Conscientiousness does not wipe out the gender effect.)

I would post the data but I am going to push Ed to write this up. We have a few other interesting variables designed to pick up on careless responding that we need to think through.

Note: Edward Witt helped me prepare this entry.


Replicability as a Publication Criterion

I wanted to re-read Cronbach (1957) and stumbled across this 1957 letter to the American Psychologist from Ardie Lubin with the title of this post: Replicability as a publication criterion.

Just a quick excerpt: “Replicability and generalizability, of course, are not new criteria, and assuredly all editors employ them now in judging the soundness of an article. The only novelty here is the weight which would be placed on the question of whether the results are replicable. Every author would be asked to show some attempt in this direction. Articles using replication designs which are not satisfactory to the editor could be given lowest publication priority. Articles with no attempt at replication would be rejected.”

An Incredible Paper (and I mean that in the best way possible)

Ulrich Schimmack has a paper in press at Psychological Methods that should be required reading for anyone producing or consuming research in soft psychology (Title: “The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles”). Sadly, I doubt this paper will get much attention in the popular press. Uli argues that issues of statistical power are critical for evaluating a package of studies, and his approach also fits very nicely with recent papers by Gregory Francis. I am excited because it seems as if applied researchers are beginning to have access to a set of relatively easy-to-use tools for evaluating published papers.

(I would add that Uli’s discussion of power fits perfectly well with broader concerns about the importance of study informativeness as emphasized by Geoff Cumming in his recent monograph.)

Uli makes a number of recommendations that have the potential to change the ratio of fiction to non-fiction in our journals. His first recommendation is to use power to explicitly evaluate manuscripts. I think this is a compelling recommendation. He suggests that authors need to justify the sample sizes in their manuscripts. There are too many times when I read papers and have no clue why the authors used such small sample sizes. Such concerns do not lend themselves to positive impressions of the work.

Playing around with power calculations or power programs leads to sobering conclusions.  If you expect a d-metric effect size of .60 for a simple two independent-groups study, you need 45 participants in each group (N=90) to have 80% power. The sample requirements only go up if the d is smaller (e.g., 200 total if d = .40 and 788 total if d = .20) or if you want better than 80% power.  Given the expected value of most effect sizes in soft psychology, it seems to me that sample sizes are going to have to increase if the literature is going to get more believable.  Somewhere, Jacob Cohen is smiling. If you hate NHST and want to think in terms of informativeness, that is fine as well.  Bigger samples yield tighter confidence intervals. Who can argue with calls for more precision?
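If you want to play around with these numbers yourself, a quick sketch using the power routines in statsmodels (assuming a two-sided test at alpha = .05 and 80% power) reproduces the figures in the paragraph above.

```python
# Illustrative check of the sample-size figures in the text for a
# two-sample t-test (alpha = .05, two-sided, 80% power).
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.60, 0.40, 0.20):
    n_per_group = math.ceil(analysis.solve_power(effect_size=d, alpha=0.05, power=0.80))
    print(f"d = {d:.2f}: {n_per_group} per group ({n_per_group * 2} total)")

# d = 0.60: 45 per group (90 total)
# d = 0.40: 100 per group (200 total)
# d = 0.20: 394 per group (788 total)
```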

Uli discusses other strategies for improving research practices such as the value of publishing null results and the importance of rewarding the total effort that goes into a paper rather than the number of statistically significant p-values.   It is also worth rewarding individuals and teams who are developing techniques to evaluate the credibility of the literature, actively replicating results, and making sure published findings are legitimate.  Some want to dismiss them as witch hunters.  I prefer to call them scientists.