Statistics is often called an “art” just as much as a
“science.” This maxim is perhaps most apt when it comes to power analysis.
Power analysis is tricky because it requires an estimate of the expected
effect size, so power analyses are often qualitative (meaning, based on
subjective estimates of effect sizes) in addition to being quantitative
(meaning, based on calculated estimates of effect sizes). For those who are
unfamiliar with the general concept of power analysis, one type (a priori power
analysis) essentially involves planning how much data you will need to collect
in order to have enough statistical power to detect an effect in your study, if
the effect truly does exist in the world. This must be done before data
collection begins. [The other type (post hoc power analysis) involves
determining the power or effect size of your study after data collection. This
article will focus on the former.] If you don’t collect enough data, you may
not have enough power to detect the true effect (meaning, achieve statistical
significance). Missing an effect that is actually present in the population is
a Type II error. By convention, 80% power is considered adequate, which means
that if an effect does truly exist, there is an 80% chance of finding it (and a
20% chance of missing it).
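To make the a priori calculation concrete, here is a minimal sketch of how a required sample size follows from an assumed effect size. It is not from the symposium: it assumes a two-sided comparison of two independent groups, a standardized effect size d (Cohen's d), and a normal approximation, which gives slightly smaller numbers than an exact t-test calculation would. The function name `n_per_group` is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2.
    An exact t-test calculation gives a slightly larger n.
    """
    if d <= 0:
        raise ValueError("d must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Conventional "small", "medium", and "large" values of d, used as illustrations.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

The point to notice is how quickly the required sample size grows as the assumed effect shrinks, which is why the effect size estimate matters so much.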
Earlier this year at the annual conference of the Society
for Personality and Social Psychology (SPSP 2013), Joseph Simmons, Uri Simonsohn, and
Leif Nelson (three leading experts on statistics and strong advocates for proper
research methodology) held a symposium and spoke about a variety of
topics including p-values, fraud detection, and direct replications. They also
included a piece of practical advice for researchers preparing to conduct
studies. It involved qualitative power analysis.
Their argument boiled down to a few points:
1) In psychology, there is a strong tendency for
researchers to emphasize the presence of an effect (whether it is significant
or not) rather than the size of the effect (small, medium, large). But effect
size estimates are needed to determine sample sizes.
2) Researchers often do not bother to conduct power analyses at all.
3) Instead, researchers often base their final
sample sizes on preliminary p-values computed from incomplete data, and stop
collecting data only when their key result passes the p < .05 threshold (this
practice, optional stopping, is one form of “p-hacking”). This results in a
lot of false positives (Type I errors).
4) One key reason that researchers don’t conduct
power analyses is that they are poor at estimating effect sizes. Even expert
researchers estimate effect sizes poorly, even when an effect is so obvious
that they know it will be significant.
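Point 3 is easy to check with a small simulation. The sketch below is my own illustration, not material from the talk. It generates studies in which there is no true effect, then compares testing once at a planned sample size against testing repeatedly and stopping at the first p < .05. The setup is an assumption chosen for simplicity: two groups, normally distributed data with a known SD of 1 (so a z-test can be used), and a peek after every 5 new observations per cell from n = 10 up to n = 50.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = .05


def significant(a, b):
    """Two-sample z-test, assuming a known SD of 1 in both groups."""
    z = (fmean(a) - fmean(b)) / sqrt(1 / len(a) + 1 / len(b))
    return abs(z) > Z_CRIT


def false_positive_rates(n_sims=2000, n_start=10, n_max=50, step=5, seed=1):
    """Simulate studies with NO true effect, analyzed two ways."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(n_max)]
        b = [rng.gauss(0, 1) for _ in range(n_max)]
        # Optional stopping: test after every `step` new observations per cell
        # and count the study as "significant" if any of those tests passes.
        if any(significant(a[:n], b[:n]) for n in range(n_start, n_max + 1, step)):
            peeking_hits += 1
        # Fixed plan: test once, at the sample size chosen in advance.
        if significant(a, b):
            fixed_hits += 1
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"False-positive rate with peeking: {peeking_rate:.3f}")
print(f"False-positive rate with fixed n: {fixed_rate:.3f}")
```

Testing once should produce a false-positive rate close to the nominal 5%, while peeking should produce a noticeably higher one, because every extra look is another chance for noise to cross the threshold.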
So what is the solution? Simmons, Simonsohn, and Nelson
offered a new tool that may make it easier for scientists to conduct better
power analyses. They first provided a few examples of effects (mean
differences) in the social/natural sciences that are painfully obvious even
without doing any statistical analyses. One example was gender and body size.
On average, men are physically larger (measured by height/weight) than women.
Another example (more relevant to psychological science) was political ideology
(liberal vs. conservative identification) and liking Barack Obama. On average,
liberals like Obama more than conservatives do. Even though these effects are obvious
(and large in size), any study that attempts to demonstrate them will need to
have enough statistical power. When people were asked to estimate how many data
points would be needed in order to detect these effects at a statistically
significant level, many gave estimates in the range of 15-30 per cell. But that
is wrong. It turns out that you need to collect 50 data points per cell to have
enough statistical power to detect whether liberals like Obama more than
conservatives, or whether men are physically larger than women.
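To get a feel for why guesses of 15-30 per cell fall short, here is a rough sketch of how power changes with sample size. The effect size used is not a figure from the talk. Instead, the code works backwards: under a normal approximation for a two-sided, two-group comparison, it finds the standardized effect size d that 50 per cell can detect with 80% power (roughly 0.56 under this approximation), then asks how much power smaller samples would have for that same effect. The function names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def power_two_sample(d, n, alpha=0.05):
    """Approximate power of a two-sided, two-group test with n per cell.

    Normal approximation; ignores the negligible chance of a significant
    result in the wrong direction.
    """
    return Z.cdf(d * sqrt(n / 2) - Z.inv_cdf(1 - alpha / 2))


def detectable_d(n, alpha=0.05, power=0.80):
    """Smallest standardized effect detectable with the given power at n per cell."""
    return (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) * sqrt(2 / n)


# Effect size implied by "50 per cell gives 80% power" (an assumption, see above).
d_50 = detectable_d(50)
print(f"n = 50 per cell gives 80% power for d of about {d_50:.2f}")
for n in (15, 30, 50):
    print(f"n = {n} per cell: power of about {power_two_sample(d_50, n):.2f}")
```

Under these assumptions, an effect that 50 per cell detects 80% of the time would be detected only about a third of the time with 15 per cell, and a bit under 60% of the time with 30 per cell.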
All of this culminated in a heuristic that they referred to
as “sample size estimates by analogy.” The argument is that if your
hypothesized effect is less intuitive/obvious than the fact that liberals like
Obama more than conservatives, or that men are physically larger than women,
you must collect at least 50 data points per cell in order to have adequate
statistical power. This doesn’t mean that collecting 51 data points per cell
guarantees enough power. It’s just a rule of thumb, and probably a good one,
since so many researchers collect samples with n < 30 per cell and (as we saw
before) do not conduct any kind of a priori power analysis at all. So, the next
time you’re planning a study, keep this heuristic in mind. Unless your
predicted effect is more obvious than the ones mentioned above, plan for a
sample with at least n = 50 per cell, if not more.
Happy holidays everyone!
Nelson, L., Simonsohn, U., & Simmons, J. (2013). False positive II: Effect sizes
too small, too large, or just right. Presented at the annual meeting of the
Society for Personality and Social Psychology, New Orleans, LA.
You may also enjoy checking out their blog: http://datacolada.org/