Reviewing Papers in 2016

[Preface: I am a bit worried that this post might be taken the wrong way concerning my ratio of reject to total recommendations. I simply think it is useful information to know about myself. I also think that keeping more detailed records of my reviewing habits was educational and made the reviewing process even more interesting. I suspect others might have the same reaction.]

Happy 2017! I collected more detailed data on my reviewing habits in 2016. Previously, I had just kept track of the outlets and total number of reviews to report on annual evaluation documents.  In 2016, I started tracking my recommendations and the outcomes of the papers I reviewed. This was an interesting exercise and I plan to repeat it for 2017.  I also have some ideas for extensions that I will outline in this post.

Preliminary Data:

I provided 51 reviews from 1 Jan 2016 to 29 Dec 2016. Of these 51 reviews, 38 were first time submissions (74.5%) whereas 13 (25.5%) were revisions of papers that I had previously reviewed.  For the 38 first time submissions, I made the following recommendations to the Editor:


Recommendation   Frequency   Percentage
Accept                   1         2.6%
R&R                     13        34.2%
Reject                  24        63.2%
Yikes! I don’t think of myself as a terribly harsh reviewer but it looks like I recommended “Reject” about 2 out of 3 times that I submitted reviews. (I score below the mean on measures of Agreeableness so perhaps this is consistent?)  I was curious about my base rate tendencies and now I have data. I feel a little bit guilty.

I will say that my recommendation is tailored to the journal in terms of my perception of the selectivity of an outlet. I might have high expectations for papers published in one of the so-called top outlets and I might have a slight bias to saying yes to those outlets more so than a less selective outlet (I am going to track this data in 2017).  I should also note that I never say whether a paper should be accepted or not in my comments to the authors.  I know that can create an awkward situation for Editors (at least it does for me when I am placed in that role).

For the revisions, I made the following recommendations to the Editor:


Recommendation   Frequency   Percentage
Accept                   9        69.2%
R&R                      2        15.4%
Reject                   2        15.4%

I had previously made reject recommendations on the initial submissions in the two cases above. My opinion was unchanged by the revision.  I can say that the Editor ultimately rejected those two papers and that the initial decision letter was frank about the chances of those papers.  I know we all hate having revisions rejected.

I was most interested in how many times my initial recommendations predicted the ultimate outcome of a paper. Here is a crosstab for my reviews of first time submissions:

                    Ultimate Decision
My Recommendation   Accept   Reject   Unknown   Total
Accept                   1        0         0       1
R&R                      6        2         5      13
Reject                   4       18         2      24
Total                   11       20         7      38


Note: Unknown refers to decisions that were in progress at the end of the calendar year for 2016.

This suggests that my reject recommendations are usually consistent with the ultimate outcome for a paper at that outlet. My reject recommendation was inconsistent with the ultimate outcome in 4 of the 22 known cases (18%), and in 18 of the 22 known cases it was concordant with the final decision.  (Yes, I know I should compute kappas here to deal with base rate differences but I am lazy.)
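Since I mentioned kappa, here is a quick sketch of what that computation looks like. For illustration I collapse my Accept and R&R recommendations into a single “not reject” category (so both the recommendation and the outcome use the same two categories) and drop the papers with unknown outcomes; the counts come from the crosstab above.

```python
# Cohen's kappa for agreement between my recommendation and the final
# decision, using only the first-time submissions with known outcomes.

def cohens_kappa(table):
    """table[i][j] = count of cases where rater 1 gave category i and
    rater 2 gave category j (categories listed in the same order)."""
    n = sum(sum(row) for row in table)
    p_obs = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_exp = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Rows: my recommendation (not reject, reject).
# Columns: ultimate decision (accept, reject).
known_cases = [[7, 2],
               [4, 18]]
kappa = cohens_kappa(known_cases)
print(round(kappa, 2))  # about .56: moderate chance-corrected agreement
```

So raw agreement (25/31 = .81) shrinks to roughly .56 once chance agreement from the skewed base rates is removed, which is exactly why kappa is worth computing here.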

In the end, I think this was a good exercise as it has made me slightly more aware of my recommendations and helped me gauge agreement.  As noted above, I am going to add information to the 2017 iteration of this exercise.  Foremost, I plan to track how many reviews I decline in 2017 and note my personal reasons for declining.  Categories will include: Conflict of Interest; Too Many Existing Ad Hoc Reviews (X Number on My Desk); Outside of My Area of Expertise; Issue with the Journal (e.g., I won’t review for certain outlets because of their track record of publishing papers that I don’t trust); Other.  I will also track whether the submission was blinded and the number of words in my review.

I try to accept as many reviews as I can but I sometimes feel overwhelmed by the workload. Indeed, I struggle with the right level of involvement in peer review. I believe reviewing is an important service to the field but it is time consuming. My intuition is that an academic should review a minimum of three to four times the number of papers they submitted for peer review per year. I want to make sure that I meet this standard moving forward.

Anyways, I think that was a fairly interesting exercise and I think others might think so as well.


Updating a Graduate Level Personality Psychology Course

Help! I am teaching graduate Personality Psychology in a few weeks and I want to update my syllabus. I last taught the course in Fall of 2013 so there are new readings and updates to be included. I have some ideas (e.g., the fourth law of behavior genetics piece) but I am suspicious of my ability to identify all of the relevant papers/chapters in the field.  In case you are interested, Brent Roberts maintains a repository of graduate syllabuses (or sittybes?). You can see my reading list from previous years at that location.

Here is a little contest…

1. Identify references to recent papers/chapters (publication date 2012 to current) that you think should be included in a graduate personality psychology course. I try to keep the course broad (it is not just traits 101) and I am interested in both substantive and methodological pieces. Preprints are fine if you provide me the complete reference.

2. Email suggestions by 11:59 pm on 12 January 2016 (see below).

3. I will enter the names of all recommenders into a random drawing and donate $25 to the charity of choice of one randomly drawn winner. Just sending one recommendation is enough to qualify for this fabulous prize!

4. I may or may not include suggestions in my formal course. I like shorter pieces and chapters that are accessible and likely to stimulate an interesting discussion during course meetings. I also like readings that show how personality psychology intersects with other areas such as clinical and industrial/organizational psychology. In case you are interested, I am thinking of a week on the intersections with political science for this term.

5. Regardless of #4, I will compile the suggestions and arrange them thematically as an addendum to my official course syllabus. I hope this is a good resource for graduate students and other instructors. I will then ask the other Brent to post my updated syllabus and the addendum on his repository. I will also link to it here.

I plan to blog about teaching personality psychology this term.

Thanks and Happy 2016! My email is mbdonnellan + that silly location sign + tamu + dot + edu

Alpha and Correlated Item Residuals

Subtitle: Is alpha lame?

My last post was kind of stupid as Sanjay tweeted. [Sanjay was actually much more diplomatic when he pointed out the circularity in my approach.] I selected the non-careless responders in a way that guarantees a more unidimensional result. A potentially better approach is to use a different set of scales to identify the non-careless responders and repeat the analyses. This flaw aside, I think my broader points still stand. It is useful to look for ways to screen existing datasets given the literature that: a) suggests careless responders are present in many datasets; and b) careless responders often distort substantive results (see the references and additional recommendations to the original post).

Another interesting criticism came about from my off-handed reporting of alpha coefficients. Matthew Hankins (via twitter) rightly pointed out that it is a mistake to compute alpha in light of the structural analyses I conducted. I favored a particular model for the structure of the RSE that specifies a large number of correlated item residuals between the negatively-keyed and positively-keyed items. In the presence of correlated residuals, alpha is either an underestimate or overestimate of reliability/internal consistency (see Raykov 2001 building on Zimmerman, 1972).
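To see why, it helps to write out the model-based (composite) reliability that this line of work uses. With the factor variance fixed at 1, reliability is the squared sum of loadings divided by the total variance of the unit-weighted composite, and residual covariances enter that denominator directly. The numbers below are made-up illustrations in the spirit of Raykov (2001), not actual RSE estimates.

```python
# Sketch of composite reliability for a congeneric scale with (optionally)
# correlated residuals. Factor variance is fixed at 1.0 for simplicity.

def composite_reliability(loadings, resid_vars, resid_covs=()):
    """(sum of loadings)^2 / total composite variance.
    resid_covs: covariances among item residuals (each pair counted once)."""
    true_var = sum(loadings) ** 2
    total_var = true_var + sum(resid_vars) + 2 * sum(resid_covs)
    return true_var / total_var

loadings = [0.7] * 10                        # hypothetical standardized loadings
resid_vars = [1 - l ** 2 for l in loadings]  # implied standardized residual variances

# Without correlated residuals this is just coefficient omega...
print(round(composite_reliability(loadings, resid_vars), 3))               # 0.906
# ...and positive residual covariances pull the model-based estimate down.
print(round(composite_reliability(loadings, resid_vars, [0.15] * 10), 3))  # 0.858
```

Alpha, by contrast, effectively treats all inter-item covariance as true-score variance, which is why correlated residuals can leave it on either side of the model-based value.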

[Note: I knew reporting alpha was a technical mistake but I thought it was one of those minor methodological sins akin to dropping an f-bomb every now and then in real life.  Moreover, I am aware of the alpha criticism literature (and the alternatives like omega). I assumed the usual heuristic that alpha is a lower bound for reliability when blogging, but this is not true in the presence of correlated residuals (see again Raykov, 2001).]

Hankins illustrated issues with alpha and the GHQ-12 in a paper he recommended (Hankins, 2008). The upshot of his paper is that alpha often makes the GHQ-12 appear to be a more reliable instrument than other methods of computing reliability based on more appropriate factor structures (say like .90 versus .75).  Depending on how reliability estimates are used, this could be a big deal.

Accordingly, I modified some Mplus syntax using Brown (2015) and Raykov (2001) as a template to compute a more appropriate reliability estimate for the RSE for my preferred model.  Output that includes the syntax is here. [I did this quickly so I might have made a mistake!]  Using this approach, I estimated reliability for my sample of 1,000 to be .699 for my preferred model.  This is compared to the .887 estimate I got with alpha. If you want a way to contextualize this drop, you can think about how this difference would impact the Standard Error of Measurement when considering the precision of estimates for individual scores.  The SD for the mean scores was .724.
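Here is that SEM comparison worked out, using the estimates reported above (SD = .724; reliability of .887 from alpha versus .699 from the model-based estimate):

```python
import math

def sem(sd, reliability):
    """Standard Error of Measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

print(round(sem(0.724, 0.887), 3))  # 0.243 using alpha
print(round(sem(0.724, 0.699), 3))  # 0.397 using the model-based estimate
```

On a 1-to-5 response format, the plausible band around an individual's score gets noticeably wider under the model-based estimate, which is the practical cost of the alpha overestimate.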

I go back and forth about whether I think alpha is lame or if all of the criticism of alpha is much ado about nothing. Today I am leaning towards the alpha is lame pole of my thinking.  Correlated residuals are a reality for the scales that I typically use in research. Yikes!

Thanks to people who tweeted and criticized my last post.

Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd edition).

Hankins, M. (2008). The reliability of the twelve-item General Health Questionnaire (GHQ-12) under realistic assumptions.

Raykov, T. (2001). Bias of coefficient α for fixed congeneric measures with correlated errors.

Zimmerman, D. W. (1972). Test reliability and the Kuder-Richardson formulas: Derivation from probability theory.


Careless Responders and Factor Structures

Warning: This post will bore most people.  Read at your own risk. I also linked to some articles behind paywalls. Sorry!

I have a couple of research obsessions that interest me more than they should. This post is about two in particular: 1) the factor structure of the Rosenberg Self-Esteem Scale (RSE); and 2) the impact that careless responding can have on the psychometric properties of measures.  Like I said, this is a boring post.

I worked at the same institution as Neal Schmitt for about a decade and he once wrote a paper in 1985 (with Daniel Stults) illustrating how careless respondents can contribute to “artifact” factors defined by negatively keyed items (see also Woods, 2006).  One implication of Neal’s paper is that careless responders (e.g., people who mark a “1” for all items regardless of the content) confound the evaluation of the dimensionality of scales that include both positively and negatively keyed items.  This matters for empirical research concerning the factor structure of the RSE.  The RSE is perfectly balanced (it has 5 positively-keyed items and 5 negatively-keyed items). Careless responders might contribute to method artifacts when evaluating the structure of the RSE.

This raises a critical question — how do you identify careless responders? There is an entire literature on this subject (see e.g., Meade & Craig, 2012) that is well worth reading. One option is to sprinkle directed response items throughout a survey (i.e., “Please mark 4 for quality control purposes”). The trick is that participants can be frustrated by too many of these so these items have to be used judiciously. A second option is to include scales developed explicitly to identify careless responders (see e.g., Marjanovic, Struthers, Cribbie, & Greenglass, 2014).  These are good strategies for new data collections. They are not suitable for identifying careless respondents from existing datasets (see Marjanovic, Holden, Struthers, Cribbie, & Greenglass, 2015).  This could be a concern as Meade and Craig found that between 10% and 12% of undergraduate participants in a long survey could be flagged as careless responders using a cool latent profile technique. My take away from their paper is that many datasets might have some degree of contamination.  Yikes!

I experimented with different methods for detecting careless responders on an ad-hoc basis several years ago for a conference talk.  One approach took advantage of the fact that the RSE is a balanced scale. Thus, I computed absolute value discrepancy scores between the positively and negatively keyed items.  [I’m sure someone had the idea before me and that I read about it but simply forgot the source. I also know that some people believe that positively and negatively keyed items reflect different constructs. I’m kind of skeptical of that argument.]

For example, imagine Dr. Evil responds a “1” to all 10 of the RSE items assessed on a 5 point Likert-type scale.  Given that half of the RSE items are reverse scored, 5 of Dr. Evil’s 1s will be transformed to 5s.  Her/his average for the positively keyed items will be 1 whereas the average for the negatively keyed items will be a 5.  This generates a value of 4 on the discrepancy index (the maximum in this example).  I basically found that selecting people with smaller discrepancy scores cleaned up the evaluation of the factor structure of the RSE.  I dropped the 10% of the sample with the highest discrepancy scores but this cutoff was chosen on a post hoc basis.
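A minimal sketch of that discrepancy index, assuming a balanced 10-item scale on a 1-to-5 response format (the item numbering and keying here are hypothetical):

```python
# Absolute discrepancy between the mean of the positively keyed items and
# the mean of the reverse-scored negatively keyed items.

def discrepancy(responses, pos_items, neg_items, low=1, high=5):
    """responses: dict mapping item number -> raw response (low..high)."""
    pos_mean = sum(responses[i] for i in pos_items) / len(pos_items)
    # reverse-score the negatively keyed items before averaging
    neg_mean = sum(low + high - responses[i] for i in neg_items) / len(neg_items)
    return abs(pos_mean - neg_mean)

POS, NEG = range(1, 6), range(6, 11)  # hypothetical keying

dr_evil = {i: 1 for i in range(1, 11)}        # straight-lines "1" on everything
print(discrepancy(dr_evil, POS, NEG))         # 4.0, the maximum

dr_super_evil = {i: 3 for i in range(1, 11)}  # straight-lines the midpoint
print(discrepancy(dr_super_evil, POS, NEG))   # 0.0, so this index misses them
```

The second case shows the limitation discussed below: midpoint straight-liners earn a discrepancy of zero and slip through.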

[I know there are all sorts of limitations and assumptions with this approach. For example, one obvious limitation is that Dr. Super Evil who responds a 3 to all items, regardless of her/his true feelings, earns a discrepancy score of 0 and is retained in the analysis. Dr. Super Evil is a real problem. I suspect she/he is friends with the Hamburglar.]

Marjanovic et al. (2015) recently published an interesting approach for detecting careless responding.  They propose calculating the standard deviation of the set of items designed to assess the same construct for each person (called the inter-item standard deviation or ISD).  Here the items all need to be keyed in the correct direction and I suspect this approach works best for scales with a mix of positively and negatively keyed items given issues of rectangular responding. [Note: Others have used the inter-item standard deviation as an indicator of substantive constructs but these authors are using this index as a methodological tool.]

Marjanovic et al. (2015) had a dataset with responses to the Marjanovic et al. (2014) Conscientious Responders Scale (CRS) as well as responses to Big Five scales.  A composite based on the average of the ISDs for each of the Big Five scales was strongly negatively correlated with responses to the CRS (r = -.81, n = 284). Things looked promising based on the initial study. They also showed how to use a random number generator to develop empirical benchmarks for the ISD.  Indeed, I got a better understanding of the ISD when I simulated a dataset of 1,000 responses to 10 hypothetical items in which item responses were independent and drawn from a distribution whereby each of the five response options has a .20 proportion in the population.  [I also computed the ISD when preparing my talk back in the day but I focused on the discrepancy index – I just used the ISD to identify the people who gave all 3s to the RSE items by selecting mean = 3 and ISD = 0.  There remains an issue with separating those who have “neutral” feelings about the self from people like Dr. Super Evil.]
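A rough re-creation of that benchmark simulation (my sketch, not their code): 1,000 simulated random responders, each answering 10 items where every response option 1 through 5 has probability .20, with the ISD computed per person.

```python
import random
import statistics

def isd(item_responses):
    """Inter-item standard deviation for one person's (correctly keyed)
    responses to a set of items measuring the same construct."""
    return statistics.stdev(item_responses)  # sample SD; pstdev is another choice

random.seed(2015)
simulated = [isd([random.randint(1, 5) for _ in range(10)])
             for _ in range(1000)]

# Purely random responding lands near the SD of a uniform 1..5 item (~1.41),
# so respondents with ISDs in this neighborhood are suspect.
print(round(statistics.mean(simulated), 2))
```

In practice one would take a percentile of this simulated distribution as an empirical cutoff; see their paper for the exact benchmarking procedure.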

Anyways, I used their approach and it works well to help clean up analyses of the factor structure of the RSE.  I first drew a sample of 1,000 from a larger dataset of responses to the RSE (the same dataset I used for my conference presentation in 2009).  I only selected responses from European American students to avoid concerns about cultural differences.  The raw data and a  brief description are available.  The ratio of the first to second eigenvalues was 3.13 (5.059 and 1.616) and the scree plot would suggest 2 factors. [I got these eigenvalues from Mplus and this is based on the correlation matrix with 1.0s on the diagonal.  Some purists will kill me. I get it.]

I then ran through a standard set of models for the RSE.  A single factor model was not terribly impressive (e.g., RMSEA = .169, TLI = .601, SRMR = .103) and I thought the best fit was a model with a single global factor and correlated residuals for the negatively and positively keyed items minus one correlation (RMSEA = .068, TLI = .836, SRMR = .029).  I computed the internal consistency coefficient (alpha = .887, average inter-item correlation = .449). Tables with fit indices, the Mplus syntax, and input data are available.
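As a sanity check on those numbers, standardized alpha can be recovered from the average inter-item correlation via the Spearman-Brown-style formula (it won't match raw alpha exactly, since raw alpha works from covariances rather than correlations):

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha for k items with average inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

print(round(standardized_alpha(10, 0.449), 3))  # 0.891, close to the .887 reported
```

The near-match is reassuring arithmetic, but as discussed in the alpha post above, neither figure is a trustworthy reliability estimate once the preferred model includes correlated item residuals.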

Using the Marjanovic et al (2015) approach with random data, I identified 15% of the sample that could be flagged as random responders (see their paper for details). The RSE structure looked more unidimensional with this subset of 850 non-careless responders. The ratio of the first to second eigenvalues was 6.22 (6.145 and 0.988) and the models tended to have stronger factor loadings and comparatively better fit (even adjusting for the smaller sample size).  Consider that the average loading for the single factor model for all participants was .67 and this increased to .76 with the “clean” dataset. The single global model fit was still relatively unimpressive but better than before (RMSEA = .129, TLI = .852, SRMR = .055) and the single global model with correlated item residuals was still the best (RMSEA = .063, TLI = .964, SRMR = .019).  The alpha was even a bit better (.926, average inter-item correlation = .570).

So I think there is something to be said for trying to identify careless responders before undertaking analyses designed to evaluate the structure of the Rosenberg and other measures as well.  I also hope people continue to develop and evaluate simple ways for flagging potential careless responders for both new and existing datasets.  This might not be “sexy” work but it is important and useful.


Updates (1:30 CST; 2 June 2015): A few people sent/tweeted links to good papers.

Huang et al. (2012). Detecting and deterring insufficient effort responding to surveys.

Huang, Liu, & Bowling (2015). Insufficient effort responding: Examining an insidious confound in survey data.

Maniaci & Rogge (2014). Caring about carelessness: Participant inattention and its effects on research.

Reise & Widaman (1999). Assessing the fit of measurement models at the individual level: A comparison of item response theory and covariance structure approaches.

(1:00 CST; 3 June 2015): Even More Recommendations!  Sanjay rightly pointed out that my post was stupid. But the references and suggested readings are gold!  So even if my post wasted your time, the references should prove useful.

DeSimone, Harms, & DeSimone (2014).  Best practice recommendations for data screening.

Hankins (2008). The reliability of the twelve-item General Health Questionnaire (GHQ-12) under realistic assumptions.

See also: Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. {Good stuff pointing to limitations with alpha and alternatives}

Savalei & Falk (2014).  Recovering substantive factor loadings in the presence of acquiescence bias: A comparison of three approaches.



A Partial Defense of the Pete Rose Rule

I tweeted this yesterday: Let’s adopt a Pete Rose Rule for fakers = banned for life.  Nothing questionable about fraud.  Jobs and funds are too scarce for 2nd chances.

My initial thought was that people who have been shown by a preponderance of the evidence to have passed faked datasets as legitimate should be banned from receiving grants and publishing papers for life.   [Pete Rose was a baseball player and manager in professional baseball who bet on games when he was a manager. This made him permanently ineligible to participate in the activities of professional baseball.]

Nick Brown didn’t like this suggestion and provided a thoughtful response on his blog.  My post is an attempt to defend my initial proposal. I don’t want to hijack his comments with a lengthy rejoinder. You can get banned for life from the Olympics for doping so I don’t think it is beyond the pale to make the same suggestion for science.  As always, I reserve the right to change my mind in the future!

At the outset, I agree with his suggestion that it is not 100% feasible given that there is no overall international governing body for scientific research like there is for professional sports or the Olympics. However, the research world is often surprisingly small and I think it would be possible to impose an informal ban that would stick. And I think it could be warranted because faking data is exceptionally damaging to science. I also think it is rare so perhaps it is not worth thinking about too much.

Fakers impose huge costs on the system.  First, they make journals and scientists look bad in the eyes of the public. This is unfortunate because the “public” ultimately pays for a considerable amount of scientific research.  Faked data undermine public confidence in scientists and this often bleeds over into discussions about unrelated issues such as climate change or whether vaccines cause autism.  Likewise, as Dr. Barr pointed out in a comment on Nick’s blog, there is a case to be made for taking legal action for fraud in some cases.

Second, it takes resources to investigate the fakers. At the risk of speaking in broad generalities, I suspect that huge amounts of time are invested when it comes to the investigation of faked data. It takes effort to evaluate the initial charge and then determine what was and was not faked for people with long CVs. Efforts also need to be expended to determine whether co-authors were innocent or co-conspirators.  This is time and money NOT spent on new research, teaching students, reviewing papers, etc.

Third, fakers impose costs on their peers.  Academics is a competitive enterprise.  We are judged by the quantity and quality of our work.  I suspect it is much easier to pump out papers based on fake data than real data.  This matters because there are limited numbers of positions and grant dollars.  A grad student faker who gets a paper in, say, Science will have a huge advantage on the job market.  There are far more qualified people than university positions.  Universities that have merit-based systems end up paying superstars more than mere mortals.  A superstar who faked her/his/their way to an impressive CV could easily have a higher salary than an honest peer who can’t compete with faked data.  Likewise, fakers cause their peers to waste limited resources when researchers attempt to extend (or just replicate) interesting results.

To my mind, faking data is the worst crime in science because it undermines the integrity of the system.  Thus, I believe that it warrants a serious punishment once it is established after a thorough judicial process or a confession.  You might think a lifetime ban is too severe but I am not so sure.

Moreover, let’s say the field decides to let a faker back in the “game” after some kind of rehabilitation.  Is this wise? I worry that it would impose additional and never-ending costs on the system.  The rehabilitated faker is going to continue to drain the system until retirement. For example, it would cost resources to double-check everything she or he does in the future.  How am I supposed to treat a journal submission from a known faker? It would require extra effort, additional reviews, and a lot of teeth gnashing. I would think a paper from a faker would need to be independently replicated before it was taken seriously (I think this is true of all papers, but that is a topic for another day).  Why should a known faker get grants when so many good proposals are not funded because of a lack of resources? Would you trust a rehabilitated faker to train grad students in your program?

So my solution is to kick the “convicted” faker out of the game forever.  There are lots of talented and bright people who can’t get into the game as it stands.  There are not enough resources to go around for deserving scientists who don’t cheat.  I know that I would personally never vote to hire a faker in my department.

But I am open-minded and I know it sounds harsh. I want to thank Nick for forcing me to think more about this. Comments are welcome!

Replication Project in Personality Psychology – Call for Submissions

Richard Lucas and I are editing a special issue of the Journal of Research in Personality dedicated to replication (Click here for complete details). This blog post describes the general process and a few of my random thoughts on the special issue. These are my thoughts and Rich may or may not share my views.  I also want to acknowledge that there are multiple ways of doing replication special issues and we have no illusions that our approach is ideal or uncontroversial.  These kinds of efforts are part of an evolving “conversation” in the field about replication efforts and experimentation should be tolerated.  I also want to make it clear that JRP has been open to replication studies for several years.  The point of the special issue is to actively encourage replication studies and try something new with a variant of pre-registration.

What is the General Process?

We modeled the call for papers on procedures others have used with replication special issues and registered reports (e.g., the special issue of Social Psychology, the Registered Replication Reports at PoPS).  Here is the gist:

  • Authors will submit proposals for replication studies by 1 July 2015. These extended abstracts will be screened for methodological rigor and the importance of the topic.
  • Authors of selected proposals will then be notified by 15 August 2015.
  • There is a deadline of 15 March 2016 to submit the finished manuscript.

We are looking to identify a set of well-designed replication studies that provide valuable information about findings in personality psychology (broadly construed). We hope to include a healthy mix of pre-registered direct replications involving new data collections (either by independent groups or adversarial collaborations) and replications using existing datasets for projects that are not amenable to new data collection (e.g., long-term longitudinal studies).  The specific outcome of the replication attempt will not be a factor in selection.  Indeed, we do not want proposals to describe the actual results!

Complete manuscripts will be subjected to peer review but the relevant issues will be adherence to the proposed research plan, the quality of the data analysis, and the reasonableness of the interpretations.  For example, proposing to use a sample size of 800 but submitting a final manuscript with 80 participants will be solid grounds for outright rejection.  Finding a null result after a good faith attempt that was clearly outlined before data collection will not be grounds for rejection.  Likewise, learning that a previously used measure had subpar psychometric properties in a new and larger sample is valuable information even if it might explain a failure to find predicted effects.  At the very least, such information about how measures perform in new samples provides important technical insights.

Why Do This?

Umm, replication is an important part of science?!?! But beyond that truism, I am excited to learn what happens when we try to organize a modest effort to replicate specific findings in personality psychology. Personality psychologists use a diverse set of methods beyond experiments such as diary and panel studies.  This creates special challenges and opportunities when it comes to replication efforts.  Thus, I see this special issue as a potential chance to learn how replication efforts can be adapted to the diverse kinds of studies conducted by personality researchers.

For example, multiple research groups might have broadly similar datasets that target similar constructs but with specific differences when it comes to the measures, timing of assessments, underlying populations, sample sizes, etc. This requires careful attention to methodological similarities and differences when it comes to interpreting whether particular findings converge across the different datasets.  It would be ideal if researchers paid some attention to these issues before the results of the investigations were known.  Otherwise, there might be a tendency to accentuate differences when results fail to converge. This is one of the reasons why we will entertain proposals that describe replication attempts using existing datasets.

I also think it is important to address a perception that Michael Inzlicht described in a recent blog post.  He suggested that some social psychologists believe that some personality psychologists are using current controversies in the field as a way to get payback for the person-situation debate.  In light of this perception, I think it is important for more personality researchers to engage in formal replication efforts of the sort that have been prominent in social psychology.  This can help counter perceptions that personality researchers are primarily interested in schadenfreude and criticizing our sibling discipline. Hopefully, the cold war is over.

[As an aside, I think the current handwringing about replication and scientific integrity transcends social and personality psychology.  Moreover, the fates of personality and social psychology are intertwined given the way many departments and journals are structured.  Social and personality psychology (to the extent that there is a difference) each benefit when the other field is vibrant, replicable, and methodologically rigorous.  Few outside of our world make big distinctions between social and personality researchers so we all stand to lose if decision makers like funders and university administrators decide to discount the field over concerns about scientific rigor.]

What Kinds of Replication Studies Are Ideal?

In a nutshell: high-quality replications of interesting and important studies in personality psychology.  To offer a potentially self-serving example, the recent replication of the association between I-words and narcissism is a good example.  The original study was relatively well-cited but it was not particularly strong in terms of sample size.  There were few convincing replications in the literature and it was often accepted as an article of faith that the finding was robust.  Thus, there was value in gaining more knowledge about the underlying effect size(s) and testing to see whether the basic finding was actually robust.  Studies like that one as well as more modest contributions are welcome.  Personally, I would like more information about how well interactions between personality attributes and experimental manipulations tend to replicate especially when the original studies are seemingly underpowered.

What Don’t You Want to See?

I don’t want to single out too many specific topics or limit submissions but I can think of a few topics that are probably not going to be well received.  For instance, I am not sure we need to publish tons of replications showing there are 3 to 6 basic trait domains using data from college students.  Likewise, I am not sure we need more evidence that skilled factor analysts can find indications of a GFP (or general component) in a personality inventory.  Replications of well-worn and intensely studied topics are not good candidates for this special issue. The point is to get more data on interesting and understudied topics in personality psychology.

Final Thought

I hope we get a number of good submissions and the field learns something new in terms of specific findings. I also hope we also gain insights about the advantages and disadvantages of different approaches to replication in personality psychology.

My View on the Connection between Theory and Direct Replication

I loved Simine’s blog post on flukiness and I don’t want to hijack the comments section of her blog with my own diatribe. So here it goes…

I want to comment on the suggestion that researchers should propose an alternative theory to conduct a useful or meaningful close/exact/direct replication. In practice, I think most replicators draw on the same theory that original authors used for the original study.  Moreover, I worry that people making this argument (or even more extreme variants) sometimes get pretty darn close to equating a theory with a sort of religion.  As in, you have to truly believe (deep in your heart) the theory or else the attempt is not valid.  The point of a direct replication is to make sure the results of a particular method are robust and obtainable by independent researchers.

My take:

Original authors used Theory P to derive Prediction Q (If P then Q). This is the deep structure of the Introduction of their paper.  They then report evidence consistent with Q using a particular Method (M) in the Results section.

A replicator might find the theoretical reasoning more or less plausible but mostly just think it is a good idea to evaluate whether repeating M yields the same result (especially if the original study was underpowered).* The point of the replication is to redo M (and ideally improve on it using a larger N to generate more precise parameter estimates) to test Prediction Q.  Some people think this is a waste of time.  I do not.

I don’t see how the replicators’ private stance toward Theory P (or some other Theory X) is relevant to this activity. However, I am totally into scenarios that approximate the notion of a critical test whereby we have two (or more) theories that make competing predictions about what should be observed.  I wish there were more cases like that to talk about.

* Yes, I know about the hair splitting diatribes people go through to argue that you literally cannot duplicate the exact same M to test the same prediction Q in a replication study (i.e., the replication is literally impossible argument). I find that argument simply unsatisfying. I worry that this kind of argument slides into some postmodernist view of the world  in which there is no point in doing empirical research (as I understand it).