Another reason why it is hard ‘to get those results’ when we replicate a study: the flexibility-ambiguity problem

Today I found an excellent article that explains what the flexibility-ambiguity problem is and how we can solve it with simple requirements for authors and guidelines for reviewers.

The core of the flexibility-ambiguity problem is what the authors call “researcher degrees of freedom”. As they explain:

In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

The most frequent and costly error that results from exploring various analytic alternatives in search of a combination that yields a significant p-value is a false positive.

I think that three important findings of this article are:

  • researchers who start by collecting 10 observations per condition and then test for significance after every new per-condition observation find a significant effect 22% of the time
  • it is wrong to think that if an effect is significant with a small sample size then it would necessarily be significant with a larger one
  • the false-positive rate if the researcher uses all of the common degrees of freedom is 61%: a researcher is more likely than not to falsely detect a significant effect just by using these four common researcher degrees of freedom (i.e. collecting multiple dependent variables, analyzing results while collecting data, controlling for covariates or interactions, and dropping, or not dropping, one of three conditions)
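
The first finding can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes two conditions drawn from the same normal distribution (so every significant result is a false positive), it uses a two-sided z-test with known variance in place of a t-test so that it runs on the standard library alone, and it caps data collection at 50 observations per condition. The function names, the cap and the number of simulated studies are my own illustrative choices.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
STD_NORMAL = NormalDist()


def p_value(sum_a, sum_b, n):
    """Two-sided z-test p-value for a difference in means (known sd = 1)."""
    z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def study_is_significant(rng, start_n, max_n, peek):
    """Simulate one study with no true effect; True if it reaches p < ALPHA.

    With peek=True the test runs at start_n and after every added
    per-condition observation; with peek=False it runs once, at max_n.
    """
    sum_a = sum(rng.gauss(0, 1) for _ in range(start_n))
    sum_b = sum(rng.gauss(0, 1) for _ in range(start_n))
    n = start_n
    while True:
        if (peek or n == max_n) and p_value(sum_a, sum_b, n) < ALPHA:
            return True
        if n == max_n:
            return False
        # Not significant yet: collect one more observation per condition.
        sum_a += rng.gauss(0, 1)
        sum_b += rng.gauss(0, 1)
        n += 1


def false_positive_rate(peek, n_studies=2000, start_n=10, max_n=50, seed=1):
    """Share of simulated null studies that end up 'significant'."""
    rng = random.Random(seed)
    hits = sum(study_is_significant(rng, start_n, max_n, peek)
               for _ in range(n_studies))
    return hits / n_studies


if __name__ == "__main__":
    print("fixed sample size:", false_positive_rate(peek=False))
    print("peeking after each observation:", false_positive_rate(peek=True))
```

Under these assumptions the fixed-sample rate should sit near the nominal 5%, while the peeking rate should come out several times higher. The exact figure will not match the article's 22%, because the test and the settings are not identical to theirs.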

As a solution to the flexibility-ambiguity problem, the authors propose the following:

The requirements for authors should be:
1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.
3. Authors must list all variables collected in a study.
4. Authors must report all experimental conditions, including failed manipulations.
5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.
6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

The guidelines for reviewers are:
1. Reviewers should ensure that authors follow the requirements.
2. Reviewers should be more tolerant of imperfections in results.
3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.

The best quote from this article is: “Our goal as scientists is not to publish as many articles as we can, but to discover and disseminate truth.” (p.1365)

Of course the “truth” is a problematic notion, and critical thinkers would argue against the very existence and knowability of the “truth”. Yet, I think that truth is (at least) honesty, and the pressure to publish as many articles as we can sometimes pushes scholars to be less honest.

The full reference of the article is: Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn (2011) “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”. Psychological Science 22(11) 1359-1366

A very good article about “The Extent and Consequences of P-Hacking in Science”

Yesterday I found this article about p-hacking in science. You can find the article here:

Let me just summarize two premises of the article here.

What is p-hacking?

P-hacking (also known as “inflation bias”, “selective reporting” … and “cherry-picking”) is the misreporting of true effect sizes. It occurs when researchers:

  • conduct analyses midway through experiments to decide whether to continue collecting data,
  • record many response variables and decide which to report postanalysis,
  • decide whether to include or drop outliers postanalyses,
  • exclude, combine, or split treatment groups postanalysis,
  • include or exclude covariates postanalysis,
  • stop data exploration if an analysis yields a significant p-value.
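
To get a feel for the second practice in the list (recording many response variables and deciding afterwards which to report), here is a back-of-the-envelope sketch. It assumes that the response variables are independent and that there is no true effect on any of them. Real outcomes are usually correlated, which makes the inflation smaller, so treat the numbers as illustrative only. The function name is mine, not from the article.

```python
def chance_of_any_false_positive(n_outcomes, alpha=0.05):
    """Probability that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_outcomes


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, round(chance_of_any_false_positive(k), 3))
```

With five independent outcomes the chance of at least one false positive is already about 23%, and with ten it is about 40%.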

Why should we care about p-hacking?

Meta-analyses are compromised if the studies being synthesized do not reflect the true distribution of effect sizes … and meta-analyses guide the application of medical treatments and policy decisions, and influence future research directions.

Answers may be: “Let him who is without sin cast the first stone” or “Most of us did some form of p-hacking because the system encourages us to do it”. However … we really need to find a way out.

This, for example, seems like a good idea:

Why collect the moderator after the manipulation?

I refer here to the same article that I mentioned in my first blog post: the 2006 article “The Great Satan Versus the Axis of Evil” published in the Personality and Social Psychology Bulletin.

In Study 2, 127 students from a university in the United States were randomly assigned to a Mortality Salience (MS), Terrorism Salience (TS) or control condition (intense physical pain), and then evaluated questionnaires, supposedly completed by a fellow student, concerning support for the use of extreme military force and support for the Patriot Act. As the authors write, “participants then completed a final page of the booklet soliciting demographic information; specifically, gender, ethnicity, religion, and political orientation (“How would you describe your political orientation?” on a 9-point scale; 1 = very conservative; 5 = moderate; 9 = very liberal).” (p.531).

The results showed that both the MS and TS manipulations had a large effect on both dependent variables among political conservatives but not among liberals (see figure below).


The authors explained the choice of using a moderator such as political orientation, which was collected after the MS manipulation, simply by writing: “an ANOVA revealed no effect of the priming manipulation on political orientation, F(2, 125) = 1.23, p = .30; consequently, political orientation was used as a predictor in the primary analyses.”

However, I think that the choice to use political orientation as a predictor in this study is still very problematic for at least two reasons:

  1. Wouldn’t it be better to conduct deeper analyses (or at least show more data, such as means and standard deviations) to convince readers that the manipulation did not affect political orientation at all? A 2011 article showed that MS actually had an effect on political orientation: this suggests that the authors were perhaps too quick to dismiss the risk of a manipulation effect on political orientation.
  2. The authors’ choice suggests a questionable research practice. While introducing Study 2 the authors stated: “We also addressed an additional question in Study 2: Does MS affect support for extreme military force among all people, or does it primarily affect those with political orientations or personality characteristics that are associated with support for such measures?” (p.530). If it really was a theoretically driven research question, why did they collect the moderator after the manipulation? Why did they risk letting MS affect the moderator instead of collecting this measure beforehand?

A realistic answer would be that the authors did not expect political orientation to moderate the manipulation effect when they designed the questionnaire, and that they collected it (perhaps together with a list of other measures) after the manipulation as a potential outcome variable. As the manipulation did not produce any effect, they decided to dig into the data until they found a meaningful result.

What do you think about it? Is this good research practice?

Does this sound right?

I believe that one of the most interesting findings of Terror Management Theory (TMT) research is that death anxiety increases out-group aggression (and potentially support for extremism and terrorism). One of the most cited articles in this regard was written in 2006 by Pyszczynski, Abdollahi, Solomon, Greenberg, Cohen and Weise (basically the academics who created TMT). The study was published in the Personality and Social Psychology Bulletin: you can find it here. The article has been cited 278 times (Google Scholar, 6/05/2015).

In Study 1, 40 undergraduates from two Iranian universities were randomly assigned to Mortality Salience (MS) or a control condition (dental pain), and were then asked to give their opinions about questionnaires supposedly completed by two fellow students, one supporting and one opposing martyrdom attacks. The authors found that students exposed to dental pain rated their willingness to join pro-martyrdom causes at about 1.5 out of 9, whereas students exposed to death reminders rated it at 6 out of 9 on average. Similarly, students in the control condition rated a person supporting martyrdom attacks at about 2 out of 9 on average, whereas students in the MS condition rated the same person at almost 7 out of 9 (see figure below).

The effect of the manipulation was impressive and led the authors to write “thoughts of death led young people in the Middle East who ordinarily preferred a person who took a pacifist stance to switch their allegiance to a person who advocated suicide bombings” (p.530).

This statement may be too bold, as it is based on a single study. But what makes it even more problematic are the design and the analyses.

The authors described the experimental procedure as follows:

  • “the design was a 2 × 2 mixed factorial.” (p.528)
  • “After reading each questionnaire (presented in counterbalanced order), participants indicated their impressions of the student” (p.529)

The analyses were presented as follows:

“A 2 (MS vs. control) × 2 (pro- vs. antimartyrdom) ANOVA yielded a significant main effect for MS, F(1, 38) = 19.86, p < .0001, and more important, a significant MS × Martyrdom Attitude interaction, F(1, 38) = 66.04, p < .0001. Pairwise comparisons revealed that although participants preferred the student who opposed martyrdom attacks over the one who supported martyrdom attacks in the dental pain control condition, t(38) = 5.47, p < .0001, MS led to a dramatic reversal of this pattern such that after being reminded of their mortality participants preferred the student who supported martyrdom attacks over the one who opposed them, t(38) = 6.02, p < .0001.” (p.529)

However, I find the following points unclear:

  • If the pro- and anti-martyrdom conditions were presented to ALL participants, why do the authors talk about a 2 × 2 design and analyses? How many people per cell did they consider, 10 or 20? The 2 × 2 design would suggest 10, but the analysis 20. Does the counterbalanced order mean anything in the analyses and, if yes, why is it unclear? Shouldn’t it be incumbent on authors and editors to be clear on details such as how many manipulations they used and how many people per cell they got?
  • Can we really ground a statement like “thoughts of death led young people in the Middle East to switch their allegiance to a person who advocated suicide bombings” on an experiment with 10/20 people per cell?
  • Who are the participants, what are their demographic, psychological, social and political characteristics? We only know that 14 were women and 26 men, mean age 22.46. Is this enough to describe the participants?
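
On the question of 10 versus 20 people per cell, the reported degrees of freedom give a hint. The sketch below assumes a textbook mixed ANOVA with one between-subjects factor (MS vs. control) and one within-subjects factor (pro- vs. anti-martyrdom student), and compares its error degrees of freedom with those of a fully between-subjects 2 × 2 design. The function names are hypothetical, and this is only my reading of the reported statistics, not something the authors state.

```python
def mixed_design_error_df(n_participants, between_levels, within_levels):
    """Error df in a mixed ANOVA: (between effect, within effect and interaction)."""
    df_between_error = n_participants - between_levels
    df_within_error = df_between_error * (within_levels - 1)
    return df_between_error, df_within_error


def between_design_error_df(n_participants, n_cells):
    """Error df when every cell holds different participants."""
    return n_participants - n_cells


if __name__ == "__main__":
    # 40 participants, MS vs. control between subjects, two students rated within subjects
    print(mixed_design_error_df(40, between_levels=2, within_levels=2))  # (38, 38)
    # 40 participants split over four separate cells of 10
    print(between_design_error_df(40, n_cells=4))  # 36
```

The reported F(1, 38) matches the mixed design, which would mean 20 participants per between-subjects condition, each rating both students. Even on this reading, the role of the counterbalanced order in the analyses remains unreported.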

What do you think about it?

Have you ever been in the situation in which …

Have you ever been in a situation in which you are testing a compelling and fascinating political, social or psychological theory, which has been tested multiple times … Everyone got large and positive results, but no matter how hard you try and how thoroughly you follow the procedures, you cannot replicate the results found in the literature?

If the answer is YES … this blog is for you !!!