What’s all the fuss about?
While replication has always been an important
aspect of psychological research (as with any science), the recent fervor was
energized by the uncovering of several major cases of fraud in social
psychology. In particular, the case of Diederik Stapel, who committed years of
scientific fraud, raised questions about our tendency to dismiss the importance
of direct replications and discard failures to replicate as uninformative.
These cases of fraud compounded other attempts to
call attention to the irreproducibility of research, including Ioannidis’s (2005)
exposition on research in biology. While psychology may not be any worse off than
other “harder” sciences, problems of reproducibility have implications for
attitudes about the utility and credibility of scientific research more
generally.
Why is it difficult to replicate research?
Data “cleaning” and Unethical Practices
In some cases, effects may be difficult to
replicate because researchers manipulated data in ways that were not fully
described in the literature. For example, researchers may find an effect when
outliers are deleted and no effect when everyone is included in analyses. If
the treatment of outliers is not mentioned in the original research article,
there is no way for other researchers to identify this as an important factor
in finding the effect even if the choice to delete outliers is valid. In the
worst cases, the effects might not replicate because of fraudulent practices on
the part of the researcher, as in the Stapel case.
The Nature of our Statistics
Most research in psychology uses the Null
Hypothesis Significance Testing (NHST) method of inferential statistics. One
serious downside of NHST is the tendency to think in terms of a dichotomy in
which effects exist (p < .05) or
effects don’t exist (p > .05),
ignoring the inherent uncertainty in psychological effects. It is tempting to
use p-values as indicators of the reliability of an effect, but the problems
with this kind of thinking are nicely illustrated in the Dance of the
p-values video from psychologist Geoff Cumming. Understanding the
variability in p-values suggests that unreliability in “significant”
statistical tests is not surprising, and psychologists benefit from thinking in
terms of estimation (e.g., confidence intervals) when trying to evaluate replication
attempts.
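The variability described above can be illustrated with a small simulation. The sketch below is not taken from Cumming's video; it assumes a hypothetical two-group design with normally distributed scores, a known standard deviation of 1, and a true effect of d = 0.5, and it uses a z-test rather than a t-test so that it runs on the Python standard library alone. The function names and default values are illustrative choices.

```python
import random
from statistics import NormalDist, mean


def simulate_p_value(rng, n_per_group, true_d):
    """Simulate one two-group study and return its two-sided p-value.

    Assumption: scores are normal with a known SD of 1, so a z-test
    is used in place of the usual t-test.
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treatment = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    diff = mean(treatment) - mean(control)
    se = (2.0 / n_per_group) ** 0.5  # standard error of the mean difference
    z = diff / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def dance_of_p_values(n_studies=25, n_per_group=32, true_d=0.5, seed=1):
    """Repeat the same study many times; only sampling error differs."""
    rng = random.Random(seed)
    return [simulate_p_value(rng, n_per_group, true_d) for _ in range(n_studies)]


p_values = dance_of_p_values()
for i, p in enumerate(p_values, start=1):
    label = "significant" if p < 0.05 else "not significant"
    print(f"study {i:2d}: p = {p:.3f} ({label})")
```

Every simulated study samples from the same population with the same true effect, so any change in the verdict from one run to the next reflects sampling error alone.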
Sample Size
Another pervasive problem in psychological
research is our generally small sample sizes. Many of our studies continue to be
very underpowered given the typically small-to-moderate effect sizes we find in
our research (Nelson et al., 2013). As such, we are less likely to replicate
findings in the literature. Small samples also distort our estimates of effect
sizes. Estimates from small samples are more variable, and because a small study
reaches significance only when its observed effect happens to be large, the
estimates that get reported tend to overestimate the true size of an effect in
the population. As a result, a finding may be much more difficult to replicate
than its reported effect size (if there is one) would suggest.
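A rough simulation can make the link between sample size, power, and inflated effect sizes concrete. The sketch below rests on stated assumptions rather than on any study cited here: a hypothetical true effect of d = 0.3, scores with a known standard deviation of 1 (so the mean difference is itself the standardized effect), and a z-test at alpha = .05. It estimates power as the share of simulated studies that reach significance, then averages the effect size among only those significant studies.

```python
import random
from statistics import NormalDist, mean


def significant_effect_inflation(n_per_group, true_d, n_studies=2000,
                                 alpha=0.05, seed=7):
    """Return (estimated power, mean effect estimate among significant studies).

    Assumption: known SD of 1 in both groups, so the raw mean difference
    equals the standardized effect and a z-test applies.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = (2.0 / n_per_group) ** 0.5
    significant = []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treatment = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        d_hat = mean(treatment) - mean(control)
        if abs(d_hat / se) > z_crit:
            significant.append(d_hat)  # keep only "publishable" results
    power = len(significant) / n_studies
    mean_significant = mean(significant) if significant else float("nan")
    return power, mean_significant


small = significant_effect_inflation(n_per_group=20, true_d=0.3)
large = significant_effect_inflation(n_per_group=200, true_d=0.3)
print("n = 20 per group:  power = %.2f, mean significant estimate = %.2f" % small)
print("n = 200 per group: power = %.2f, mean significant estimate = %.2f" % large)
```

Under these assumptions the small-sample studies are significant less often, and when they are, their average estimate sits well above the true effect of 0.3. A replicator who plans a sample size from such an estimate would therefore overestimate the power of the replication.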
What is a good replication?
The best replications begin with transparency and collaboration between the replicator and the original authors. With this in mind, Brandt et al. (2014) outlined important considerations that make for good replications. These authors argue that a good replication has five main ingredients:
1) Carefully defining the effects and methods that the researcher intends to replicate
2) Following the methods of the original study as exactly as possible
3) Having high statistical power
4) Making complete details about the replication available
5) Evaluating replication results and comparing them critically to the results of the original study
The article goes into greater detail in relation to each of
these “ingredients,” but the question of how to evaluate replications deserves
special attention. Brandt et al. (2014) recommend evaluating the replication of
effects in two ways: 1) reporting the size, direction and confidence interval
of the target effect (tells us whether the effect is different from the null)
and 2) testing whether the effect is different from the original effect. Another
approach to evaluating the success of replications is to apply a meta-analytic
aggregation of the replication and original study effects. There are many other
approaches to evaluating replications (Simonsohn, 2013), but it is clear that
evaluating the significance of results is insufficient.
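The three evaluation strategies mentioned above can be sketched in a few lines. This is an illustrative sketch, not the procedure from Brandt et al. (2014): it assumes effects are expressed as Cohen's d from two independent groups, uses the common large-sample approximation for the variance of d, relies on normal-theory intervals and tests, and pools the studies with a fixed-effect (inverse-variance) model. The effect sizes and sample sizes are made-up values for demonstration.

```python
from statistics import NormalDist


def d_variance(d, n1, n2):
    """Large-sample approximation to the sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2))


def confidence_interval(d, var, level=0.95):
    """Normal-theory confidence interval for an effect size."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * var ** 0.5
    return d - half_width, d + half_width


def difference_test(d_orig, var_orig, d_rep, var_rep):
    """z-test of whether the original and replication effects differ."""
    z = (d_orig - d_rep) / (var_orig + var_rep) ** 0.5
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


def fixed_effect_pool(effects, variances):
    """Inverse-variance weighted mean of the effects and its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return pooled, 1.0 / sum(weights)


# Hypothetical numbers: a small original study and a larger replication.
d_orig, var_orig = 0.60, d_variance(0.60, 20, 20)
d_rep, var_rep = 0.15, d_variance(0.15, 100, 100)

ci_rep = confidence_interval(d_rep, var_rep)
z_diff, p_diff = difference_test(d_orig, var_orig, d_rep, var_rep)
pooled, pooled_var = fixed_effect_pool([d_orig, d_rep], [var_orig, var_rep])

print("replication d = %.2f, 95%% CI [%.2f, %.2f]" % (d_rep, *ci_rep))
print("difference from original: z = %.2f, p = %.3f" % (z_diff, p_diff))
print("pooled d = %.2f, 95%% CI [%.2f, %.2f]"
      % (pooled, *confidence_interval(pooled, pooled_var)))
```

With these hypothetical inputs, the replication's interval includes zero, yet the test of the difference does not show that it conflicts with the original. Looking only at whether the replication reached significance would miss this.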
What can I do if I am interested in conducting replications?
If you are interested in conducting replications, the Open Science Framework’s Reproducibility
Project is seeking partners to conduct
replications of studies found in 2008 issues of Journal of Personality and
Social Psychology, Psychological Science, and Journal of Experimental
Psychology: Learning, Memory, and Cognition. In doing so, the OSF aims to learn
more about the overall reproducibility in the psychology literature. The OSF
also provides assistance with carrying out the replications and provides
workflow resources that can make it easier for others to replicate your own
research.
References
Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., ... & Van't Veer, A. (2014). The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, 50, 217-224.
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7(6), 531-536.
Simonsohn, U. (2013). Evaluating replication results. Available at SSRN: http://ssrn.com/abstract=2259879