Reviewing Papers in 2016

[Preface: I am a bit worried that this post might be taken the wrong way concerning my ratio of reject to total recommendations. I simply think it is useful information to know about myself. I also think that keeping more detailed records of my reviewing habits was educational and made the reviewing process even more interesting. I suspect others might have the same reaction.]

Happy 2017! I collected more detailed data on my reviewing habits in 2016. Previously, I had just kept track of the outlets and total number of reviews to report on annual evaluation documents.  In 2016, I started tracking my recommendations and the outcomes of the papers I reviewed. This was an interesting exercise and I plan to repeat it for 2017.  I also have some ideas for extensions that I will outline in this post.

Preliminary Data:

I provided 51 reviews from 1 Jan 2016 to 29 Dec 2016. Of these 51 reviews, 38 were first time submissions (74.5%) whereas 13 (25.5%) were revisions of papers that I had previously reviewed.  For the 38 first time submissions, I made the following recommendations to the Editor:

Decision    Frequency    Percentage
Accept          1           2.6%
R&R            13          34.2%
Reject         24          63.2%

Yikes! I don’t think of myself as a terribly harsh reviewer but it looks like I recommended “Reject” about 2 out of 3 times that I submitted reviews. (I score below the mean on measures of A so perhaps this is consistent?)  I was curious about my base rate tendencies and now I have data. I feel a little bit guilty.

I will say that my recommendation is tailored to the journal in terms of my perception of the selectivity of the outlet. I have higher expectations for papers published in one of the so-called top outlets and I might have a slight bias toward saying yes to those outlets more than to less selective outlets (I am going to track this data in 2017).  I should also note that I never say whether a paper should be accepted or not in my comments to the authors.  I know that can create an awkward situation for Editors (at least it does for me when I am placed in that role).

For the revisions, I made the following recommendations to the Editor:

Decision    Frequency    Percentage
Accept          9          69.2%
R&R             2          15.4%
Reject          2          15.4%

I had previously made reject recommendations on the initial submissions in the two cases above. My opinion was unchanged by the revisions.  I can say that the Editor ultimately rejected those two papers and that the initial letter was frank about the chances of those papers.  I know we all hate having revisions rejected.

I was most interested in how many times my initial recommendations predicted the ultimate outcome of a paper. Here is a crosstab for my reviews of first time submissions:

                     Ultimate Decision
My Recommendation    Accept    Reject    Unknown    Total
Accept                  1         0         0          1
R&R                     6         2         5         13
Reject                  4        18         2         24
Total                  11        20         7         38

Note: Unknown refers to decisions that were still in progress at the end of the 2016 calendar year.

This suggests that my reject recommendations are usually consistent with the ultimate outcome for that paper at that outlet. Of my 24 reject recommendations, 22 had known outcomes; my recommendation was inconsistent with the final decision in 4 of those 22 cases (18%) and concordant in the other 18.  (Yes, I know I should compute kappas here to deal with base rate differences but I am lazy.)
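Since I brought up kappa, here is a minimal sketch of that calculation, under a couple of simplifying assumptions: recommendations are collapsed to reject versus not-reject, a not-reject recommendation is scored as agreeing with an eventual accept, and the seven in-progress papers are dropped. The counts come straight from the crosstab above.

```python
# Counts from the crosstab above, dropping the 7 in-progress papers and
# collapsing Accept/R&R recommendations into "not reject".
# Rows = my recommendation (not reject, reject); columns = final decision (accept, reject).
table = [
    [7, 2],    # not-reject recommendation: 7 accepted, 2 rejected
    [4, 18],   # reject recommendation: 4 accepted, 18 rejected
]

n = sum(sum(row) for row in table)                    # 31 known cases
observed = sum(table[i][i] for i in range(2)) / n     # proportion of agreement

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

For these counts the script gives observed agreement of about .81 and a kappa of about .56.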

In the end, I think this was a good exercise as it has made me slightly more aware of my recommendations and helped me gauge agreement.  As noted above, I am going to add information to the 2017 iteration of this exercise.  Foremost, I plan to track how many review requests I decline in 2017 and note my personal reasons for declining.  Categories will include: Conflict of Interest; Too Many Existing Ad Hoc Reviews (X Number on My Desk); Outside of My Area of Expertise; Issue with the Journal (e.g., I won’t review for certain outlets because of their track record of publishing papers that I do not trust); Other.  I will also track whether the submission was blinded and the number of words in my review.

I try to accept as many reviews as I can but I sometimes feel overwhelmed by the workload. Indeed, I struggle with the right level of involvement in peer review. I believe reviewing is an important service to the field but it is time consuming. My intuition is that an academic should review a minimum of three to four times the number of papers they submitted for peer review per year. I want to make sure that I meet this standard moving forward.

Anyways, I think that was a fairly interesting exercise and I think others might think so as well.

My View on the Connection between Theory and Direct Replication

I loved Simine’s blog post on flukiness and I don’t want to hijack the comments section of her blog with my own diatribe. So here it goes…

I want to comment on the suggestion that researchers should propose an alternative theory to conduct a useful or meaningful close/exact/direct replication. In practice, I think most replicators draw on the same theory that original authors used for the original study.  Moreover, I worry that people making this argument (or even more extreme variants) sometimes get pretty darn close to equating a theory with a sort of religion.  As in, you have to truly believe (deep in your heart) the theory or else the attempt is not valid.  The point of a direct replication is to make sure the results of a particular method are robust and obtainable by independent researchers.

My take:

Original authors used Theory P to derive Prediction Q (If P then Q). This is the deep structure of the Introduction of their paper.  They then report evidence consistent with Q using a particular Method (M) in the Results section.

A replicator might find the theoretical reasoning more or less plausible but mostly just think it is a good idea to evaluate whether repeating M yields the same result (especially if the original study was underpowered).* The point of the replication is to redo M (and ideally improve on it using a larger N to generate more precise parameter estimates) to test Prediction Q.  Some people think this is a waste of time.  I do not.

I don’t see how what is inside the heads of the replicators, in terms of their stance on Theory P or some other Theory X, is relevant to this activity. However, I am totally into scenarios that approximate the notion of a critical test whereby we have two (or more) theories that make competing predictions about what should be observed.  I wish there were more cases like that to talk about.

* Yes, I know about the hair-splitting diatribes people go through to argue that you literally cannot duplicate the exact same M to test the same prediction Q in a replication study (i.e., the “replication is literally impossible” argument). I find that argument simply unsatisfying. I worry that this kind of argument slides into some postmodernist view of the world in which there is no point in doing empirical research (as I understand it).

How Do You Feel When Something Fails To Replicate?

Short Answer: I don’t know, I don’t care.

There is an ongoing discussion about the health of psychological science and the relative merits of different research practices that could improve research. This productive discussion occasionally spawns a parallel conversation about the “psychology of the replicators” or an extended meditation about their motives, emotions, and intentions. Unfortunately, I think that parallel conversation is largely counter-productive. Why? We have limited insight into what goes on inside the minds of others. More importantly, feelings have no bearing on the validity of any result. I am a big fan of this line from Kimble (1994, p. 257): “How you feel about a finding has no bearing on its truth.”

A few people seem to think that replicators are predisposed to feeling ebullient (gleeful?) when they encounter failures to replicate. This is not my reaction. My initial response is fairly geeky.  My impulse is to calculate the effect size estimate and precision of the new study to compare to the old study. I do not get too invested when a small N replication fails to duplicate a large N original study. I am more interested when a large N replication fails to duplicate a small N original study.

I then look to see whether the design was difficult to implement or fairly straightforward, to provide context for interpreting the new evidence. This helps to anticipate the reactions of people who will argue that the replicators lacked the skill and expertise to conduct the study or that their motivations influenced the outcome.  The often vague “lack of expertise” and “ill-intentioned” arguments are more persuasive when critics offer a plausible account of how these factors might have biased a particular replication effort.  This would be akin to offering an alternative theory of the crime in legal proceedings. In many cases, it seems unlikely that these factors are especially relevant. For example, a few people claimed that we lacked the expertise to conduct survey studies of showering and loneliness, but these critics failed to offer a well-defined explanation for our particular results beyond some low-level mud-slinging. A failure to detect an effect is not prima facie evidence of a lack of expertise.

After this largely intellectual exercise is concluded, I might experience a change in mood or some sort of emotional reaction. More often this amounts to feelings of disappointment about the quality of the initial study and some anxiety about the state of affairs in the field (especially if the original study was of the small N, big effect size variety). A larger N study holds more weight than a smaller N study.  Thus, my degree of worry scales with the sample size of the replication.  Of course, single studies are just data points that should end up as grist for the meta-analytic mill.  So there might be some anticipation about the outcome of future studies, to learn what happens in yet another replication attempt.

Other people might have different modal emotional reactions. But does it matter?  And does it have anything at all to do with the underlying science or the interpretation of the replication?  My answers are No, No, and No. I think the important issues are the relevant facts – the respective sample sizes, effect size estimates, and procedures.

(Hopefully) The Last Thing We Write About Warm Water and Loneliness

Our rejoinder to the Bargh and Shalev response to our replication studies has been accepted for publication after peer review. The Bargh and Shalev response is available here. A pdf of our rejoinder is available here. Here are the highlights of our piece:

  1. An inspection of the size of the correlations from their three new studies suggests their new effect size estimates are closer to our estimates than to those reported in their 2012 paper. The new studies all used larger sample sizes than the original studies.
  2. We have some concerns about the validity of the Physical Warmth Extraction Index and we believe the temperature item is the most direct test of their hypotheses. If you combine all available data and apply a random-effects meta-analytic model, the overall correlation is .017 (95% CI = -.02 to .06 based on 18 studies involving 5,285 participants). A bare-bones sketch of this kind of pooling appears after this list.
  3. We still have no idea why 90% of the participants in their Study 1a responded that they took less than 1 shower/bath per week. No other study using a sample from the United States even comes close to this distribution. Given this anomaly, we think results from Study 1a should be viewed with extreme caution.
  4. Acquiring additional data from outside labs is probably the most constructive step forward. Additional cross-cultural data would also be valuable.
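For readers who want the mechanics behind point 2, here is a stripped-down sketch of how correlations are typically pooled (Fisher z transform, inverse-variance weights). The study-level values below are placeholders rather than the actual 18 studies, and this sketch shows only the common-effect core; the model used in the rejoinder is a random-effects model, which additionally estimates between-study variance.

```python
import math

# Placeholder (r, n) pairs for illustration; the actual analysis pooled
# 18 studies with 5,285 participants in total.
studies = [(-0.03, 365), (0.05, 120), (0.01, 500)]

# Fisher z-transform each correlation; the sampling variance of z is 1/(n - 3),
# so n - 3 serves as the inverse-variance weight.
z = [math.atanh(r) for r, n in studies]
w = [n - 3 for r, n in studies]

z_pooled = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
se = math.sqrt(1 / sum(w))
lo, hi = math.tanh(z_pooled - 1.96 * se), math.tanh(z_pooled + 1.96 * se)
print(f"pooled r = {math.tanh(z_pooled):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")

# A random-effects model additionally estimates between-study variance (tau^2)
# and adds it to each study's sampling variance before weighting.
```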

This has been an interesting adventure and we have learned a lot about self-reported bathing/showering habits. What more could you ask for?

 

Apology

There has been a lot of commentary about the tone of my 11 December 2013 blog post. I’ve tried to keep a relatively low profile during the events of the last week.  It has been one of the strangest weeks of my professional life. However, it seems appropriate to make a formal apology.

1. I apologize for the title.  I intended it as a jokey reference to the need to conduct high-powered replication studies. It was ill advised.

2. I apologize for the now infamous “epic fail” remark (“We gave it our best shot and pretty much encountered an epic fail as my 10 year old would say”). It was poor form and contributed to hurt feelings. I should have been more thoughtful.

I will do better to make sure that I uphold the virtues of civility in future blog postings.

-brent donnellan

Warm Water and Loneliness Again?!?!

Call me Captain Ahab…

This is a dead horse, but I got around to writing up some useful new data in this saga.  Researchers at the University of Texas at Austin tried to replicate the basic survey findings in a large Introductory Psychology course back in the Fall of 2013.  They emailed me the results back in November and they were consistent with the general null effects we had been getting in our work.  I asked them if I could write it up for the Psychology File Drawer and they were amenable.  Here is a link to a more complete description of the results and here is a link to the PFD reference.

The basic details…

There was no evidence for an association between loneliness (M = 2.56, SD = .80, alpha = .85) and the Physical Warmth Index (r = -.03, p = .535, n = 365, 95% CI = -.14 to .07).  Moreover, the hypothesis-relevant correlation between the water temperature item and the loneliness scale was not statistically distinguishable from zero (r = -.08, p = .141, n = 365, 95% CI = -.18 to .03).
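For anyone who wants to check numbers like these, here is a minimal sketch of the standard test and Fisher z confidence interval for a single Pearson correlation. Small discrepancies from the reported p-values and intervals can arise from rounding of r.

```python
import math
from scipy import stats

def correlation_inference(r, n):
    """Two-tailed t-test and Fisher z 95% CI for a single Pearson correlation."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    ci = (math.tanh(z - 1.96 * se), math.tanh(z + 1.96 * se))
    return t, p, ci

# Example: the temperature item vs. the loneliness scale (r = -.08, n = 365).
t, p, ci = correlation_inference(-0.08, 365)
print(f"t = {t:.2f}, p = {p:.3f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```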

One possible issue is that the U of T crew used a short 3-item measure of loneliness developed for large-scale survey work whereas the other studies have used longer measures.  Fortunately, other research suggests this short measure correlates above .80 with the parent instrument, so I do not think this is a major limitation. But I can see others holding a different view.

One of the reviewers of the Emotion paper seemed concerned about our motivations.  The nice thing about these data is that we had nothing to do with the data collection, so this criticism is not terribly valid.  Other parties can try this study too; the U of T folks figured out a way to study this issue with 6 items!

 

Go Big or Go Home – A Recent Replication Attempt

I decided to adopt my 5% suggestion to dedicate a relatively small percentage of one’s research time to replication efforts.  In other words, I spent time this summer and fall working on several replication studies.  Truth be told, these efforts amounted to more than 5% of my time, but they have been fruitful in terms of papers. Moreover, these replication attempts were collaborative efforts with graduate students (and some undergraduates).  My major role was often as data analyst.  Indeed, I like to independently analyze the data to make sure that I come up with the same results. I also find data analysis inherently enjoyable and far more exciting than writing papers, editing papers, or dealing with committees. So basically I got to have fun and perhaps make a few modest contributions to the literature.

One replication effort concerned whether we could duplicate the results of a 2008 Psychological Science paper about the impact of cleanliness on moral judgments. The original finding was that participants primed with cleanliness were less harsh in their moral judgments than control participants. The first two authors on this paper are outstanding graduate students in my department (Johnson, Cheung, & Donnellan, in press). For those not interested in the gory details: we obtained much smaller estimates than the original paper.

Basic design:

Prime/induce cleanliness in one group and use a second group as a control condition. Compare ratings on a series of six moral vignettes that are aggregated into a composite rating.  In sum, the IV is cleanliness/no cleanliness and the DV is ratings of wrongness on the moral vignettes. The scale for the moral ratings is such that higher scores reflect harsher judgments. Negative effect size estimates indicate that those in the clean group were less harsh (on average) than those in the control group. Materials were obtained from the original researchers and we tried to follow their procedures as closely as possible. All participants were college students.
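To make that sign convention concrete, here is a minimal sketch of how a standardized mean difference (d) and its approximate 95% CI can be computed from the two conditions. The summary statistics in the example are made up for illustration and are not taken from the original studies or our replications.

```python
import math

def cohens_d(mean_clean, sd_clean, n_clean, mean_control, sd_control, n_control):
    """d = (clean - control) / pooled SD; negative values mean the clean group was less harsh."""
    pooled_var = ((n_clean - 1) * sd_clean ** 2 + (n_control - 1) * sd_control ** 2) / (n_clean + n_control - 2)
    d = (mean_clean - mean_control) / math.sqrt(pooled_var)
    se = math.sqrt((n_clean + n_control) / (n_clean * n_control) + d ** 2 / (2 * (n_clean + n_control)))
    return d, (d - 1.96 * se, d + 1.96 * se)

# Hypothetical composite ratings on an arbitrary scale, for illustration only.
d, ci = cohens_d(5.1, 1.4, 104, 5.4, 1.3, 104)
print(f"d = {d:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```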

Study 1:

Cleanliness was primed with a scrambled-sentence task in the original study (N = 40).  The cleanliness priming effect was close to the magic p < .05 threshold for the composite DV (d = -.60, 95% CI = -1.23 to .04).  It was hard to see that the overall effect was p = .064 in the published paper because the authors reported the dreadful p-rep statistic, but we computed the standard p-value for our report: those in the cleanliness prime condition rated the vignettes as less wrong than the control group at p = .064.  We used the same materials with 208 participants and got an effect size estimate very close to zero (d = -.01, 95% CI = -.28 to .26).  One caveat is that there is an additional study on the PsychFileDrawer website that found an effect consistent with the original using a one-tailed significance test (estimated d = -.47, N = 60).

Study 2:

Cleanliness was induced with hand washing in the original study (N = 43). Participants were shown a disgusting video clip and then assigned to condition.  Participants completed ratings of the same vignettes used in Study 1 but with a slightly different response scale.  The effect on the overall composite passed the p < .05 threshold in the original publication (d = -.85, 95% CI = -1.47 to -.22) but not in our replication study with 126 participants (d = .01, 95% CI = -.34 to .36).

We conducted both studies in person like the original authors, and the package is now in press for the special issue of Social Psychology on replication studies. This means that everything was preregistered and all of the data, materials, and proposal are posted here: http://osf.io/project/zwrxc/.

Study 1 Redux (Unpublished):

After finishing the package, we decided to conduct a much larger replication attempt of Study 1, using the internet to facilitate data collection. I like narrow confidence intervals, so I wanted to try for a very large sample size.  A larger sample size would also facilitate tests of moderation. Using the internet would make data collection easier but it might impair data quality. (I have ideas about dealing with that issue but those will appear in another paper, hopefully!) We excluded anyone who said they made up their responses or did not answer honestly. Our sample size for this replication attempt was 731 college students.  As with our in-press study, the effect of the sentence-unscrambling task on the composite DV was not statistically detectable (t = 0.566, df = 729, p = .578). The d was quite small (d = .04, 95% CI = -.10 to .19) and the 95% CI fell below the “magic” |.20| threshold for a so-called small effect.   The most interesting finding (to me) was that the Honesty-Humility scale of the HEXACO (Lee & Ashton, 2004) was the best predictor of the moral composite ratings across conditions (r = .356, 95% CI = .291 to .418, p < .05, N = 730) out of the four individual difference measures we included at the end of the study (the others being a disgust scale, a bodily self-consciousness scale, and a single-item liberalism-conservatism scale).  No individual difference we included moderated the null effect of condition (same for gender).   So we tried to find moderators, honestly.
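For what it is worth, the moderation checks boil down to adding a condition-by-trait interaction term to a regression model. Below is a sketch of that general approach using simulated stand-in data and hypothetical column names; it is not our actual analysis script (the actual data are available, per the update at the end of this post).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data with hypothetical column names; the real analysis
# used the actual composite ratings and HEXACO scores.
rng = np.random.default_rng(1)
n = 731
df = pd.DataFrame({
    "condition": rng.integers(0, 2, n),            # 0 = control, 1 = cleanliness prime
    "honesty_humility": rng.normal(3.5, 0.6, n),   # hypothetical 1-5 scale scores
})
df["moral_composite"] = 5 + 0.5 * df["honesty_humility"] + rng.normal(0, 1, n)

# Moderation check: a condition x trait interaction in an OLS model.
# "No evidence of moderation" corresponds to a non-significant interaction term.
model = smf.ols("moral_composite ~ condition * honesty_humility", data=df).fit()
print(model.summary().tables[1])
```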

Summary:
We have struck out twice now in trying to find the original Study 1 effect.  I did a quick-and-dirty random-effects meta-analysis using all four attempts to duplicate Study 1 (the original, the one on PsychFileDrawer, and our two studies).  The point estimate for d was -.130 (95% CI = -.377 to .117, p = .303) and the estimate was even closer to zero using a common-effect (or fixed-effect) model (d = -.022, 95% CI = -.144 to .099, p = .718).  A sketch of this quick pooling appears after the excerpt below.  I will provide an excerpt from the in-press paper as I think the message is useful to ward off the critics who will blast our research skills and accuse me of some sort of motivated bias…

The current studies suggest that the effect sizes surrounding the impact of cleanliness on moral judgments are probably smaller than the estimates provided by the original studies…It is critical that our work is not considered the last word on the original results…. More broadly, we hope that researchers will continue to evaluate the emotional factors that contribute to moral judgments.

 So more research is probably needed to better understand this effect [Don’t you just love Mom and Apple Pie statements!].  However, others can dedicate their time and resources to this effect.  We gave it our best shot and pretty much encountered an epic fail as my 10 year old would say. My free piece of advice is that others should use very large samples and plan for small effect sizes.  [Note: The epic fail comment was ill advised. I apologize.]
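Here is the quick-and-dirty pooling mentioned above as a runnable sketch. It uses the four d values and total Ns reported in this post and assumes equal group sizes within each study (a simplification), so treat it as an approximation; with these inputs, the common-effect and random-effects estimates land close to the values reported in the Summary.

```python
import math

# (d, total N) for the four Study 1 attempts described above; equal group
# sizes within each study are assumed here, which is a simplification.
attempts = [(-0.60, 40), (-0.47, 60), (-0.01, 208), (0.04, 731)]

def var_d(d, n_total):
    """Approximate sampling variance of d, assuming two equal-sized groups."""
    n1 = n2 = n_total / 2
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

ds = [d for d, n in attempts]
v = [var_d(d, n) for d, n in attempts]
w = [1 / vi for vi in v]

# Common-effect (fixed-effect) estimate: inverse-variance weighted mean.
d_fixed = sum(wi * di for wi, di in zip(w, ds)) / sum(w)

# DerSimonian-Laird random-effects estimate: add between-study variance (tau^2).
q = sum(wi * (di - d_fixed) ** 2 for wi, di in zip(w, ds))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(attempts) - 1)) / c)
w_re = [1 / (vi + tau2) for vi in v]
d_random = sum(wi * di for wi, di in zip(w_re, ds)) / sum(w_re)

print(f"common-effect d = {d_fixed:.3f}, random-effects d = {d_random:.3f}")
```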

Note: Thanks to David Johnson and Felix Cheung.  I deserve any criticism and they deserve any credit.

Update (2 Jan 2014).  David placed the write-up on the Psych File Drawer website. Plus the data are available as well.

Johnson, D., Cheung, F., & Donnellan, B. (2014, January 1). Cleanliness primes do not influence moral judgment. Retrieved 10:05, January 2, 2014, from http://www.PsychFileDrawer.org/replication.php?attempt=MTcy

Update (23 May 2014). Identified passages that were problematic