Saturday, December 28, 2013

Power analysis - "sample size by analogy"

Statistics is often called an “art” as much as a “science,” and that maxim is perhaps most apt when it comes to power analysis. Power analysis is tricky because it requires an estimate of the predicted effect size, so it is often qualitative (a subjective judgment about how big the effect will be) in addition to being quantitative (a calculation based on that estimate). For those unfamiliar with the general concept, one type of power analysis, conducted before data collection begins, involves planning how much data you will need so that your study has enough statistical power to detect the effect, if that effect truly exists in the world. [The other type involves determining the power/effect size of your study after data collection. This article will focus on the former.] If you don’t collect enough data, you may not have enough power to detect the true effect (meaning, achieve statistical significance). Assuming the effect is actually present in the population, this is one cause of a Type-2 error. Typically, adequate power is 80%, which means that if an effect does truly exist, there is an 80% chance of finding it (and a 20% chance of missing it).
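The idea of power as a long-run detection rate can be made concrete with a small simulation. The sketch below repeatedly simulates a two-group study with a true effect and counts how often an independent-samples t-test reaches p < .05; the effect size (d = 0.5) and sample size (n = 50 per cell) are my own illustrative choices, not numbers from the post.

```python
# Monte Carlo sketch of statistical power: simulate many two-group studies
# with a true effect of size d and count how often a t-test reaches p < .05.
import numpy as np
from scipy import stats

def simulated_power(n_per_cell, d, n_sims=5000, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_cell)
        treatment = rng.normal(d, 1.0, n_per_cell)  # true effect of size d
        _, p = stats.ttest_ind(treatment, control)
        if p < alpha:
            hits += 1
    return hits / n_sims  # proportion of simulated studies that detect the effect

print(simulated_power(50, 0.5))  # should land near 0.70: underpowered for d = 0.5
print(simulated_power(50, 0.0))  # should land near alpha (.05) when no effect exists
```

Note that when the null is true, the “detection” rate collapses to roughly the alpha level, which is exactly the Type-1 error rate the test is designed to cap.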

Earlier this year, at the annual conference of the Society for Personality and Social Psychology (SPSP 2013), Joseph Simmons, Uri Simonsohn, and Leif Nelson (three leading experts on statistics and strong advocates for sound research methodology) held a symposium covering a variety of topics, including p-values, fraud detection, and direct replications. They also offered a piece of practical advice for researchers preparing to run studies. It involved qualitative power analysis.

Their argument boiled down to a few points:
1)    In psychology, researchers tend to emphasize the mere presence of an effect (is it significant or not?) rather than its size (small, medium, or large). But effect size estimates are exactly what is needed to determine sample sizes.
2)    Often researchers do not bother to conduct power analyses at all.
3)    Instead, researchers often base their final sample sizes on preliminary p-values computed on incomplete data, and stop collecting data only once their key result passes the p < .05 threshold (one form of “p-hacking”). This produces a lot of false positives (Type-1 errors).
4)    One key reason researchers skip power analyses is that they are poor at estimating effect sizes. Even expert researchers estimate effect sizes poorly, even for effects they know will be significant because they are obvious.
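Point 3 is worth seeing in action. The sketch below simulates the “collect more data until p < .05” strategy under a true null: both groups come from the same distribution, yet repeatedly peeking at the p-value and stopping at the first significant result inflates the false-positive rate well above the nominal 5%. The peeking schedule (a test every 10 subjects, up to 100) is my assumption for illustration.

```python
# Sketch of optional stopping: run a t-test each time 10 more subjects
# arrive and stop as soon as p < .05. With NO true effect, the long-run
# false-positive rate climbs far above the nominal 5% level.
import numpy as np
from scipy import stats

def false_positive_rate(peek_at=range(10, 101, 10), n_sims=2000, seed=2):
    rng = np.random.default_rng(seed)
    max_n = max(peek_at)
    false_positives = 0
    for _ in range(n_sims):
        # Both groups drawn from the same distribution: the null is true.
        a = rng.normal(0, 1, max_n)
        b = rng.normal(0, 1, max_n)
        for n in peek_at:
            _, p = stats.ttest_ind(a[:n], b[:n])
            if p < 0.05:            # stop as soon as the test "works"
                false_positives += 1
                break
    return false_positives / n_sims

print(false_positive_rate())  # well above the nominal .05
```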

So what is the solution? Simmons, Simonsohn, and Nelson offered a new tool that may help scientists produce better power estimates. They first gave a few examples of effects (mean differences) in the social/natural sciences that are painfully obvious even without any statistical analysis. One example was gender and body size: on average, men are physically larger (as measured by height and weight) than women. Another example (more relevant to psychological science) was political ideology (liberal vs. conservative identification) and liking Barack Obama: on average, liberals like Obama more than conservatives do. Even though these effects are obvious (and large in size), any study that attempts to demonstrate them still needs enough statistical power. When audience members were asked how many data points would be needed to detect significance for these effects, many gave estimates in the 15–30 (per cell) range. That is wrong. It turns out you need to collect 50 data points per cell to have enough statistical power to detect whether liberals like Obama more than conservatives, or whether men are physically larger than women.
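A conventional a priori calculation shows where a number like 50 per cell can come from. The sketch below uses statsmodels to solve for the per-cell sample size that yields 80% power at alpha = .05 in a two-sample t-test; the effect size d = 0.57 is my illustrative assumption (roughly the standardized difference that works out to about 50 subjects per cell), not a figure from the talk.

```python
# A priori power calculation: sample size per cell needed for 80% power
# at alpha = .05 in a two-sample t-test, for an assumed effect size.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_cell = analysis.solve_power(effect_size=0.57, power=0.80, alpha=0.05)
print(round(n_per_cell))  # works out to roughly 50 subjects per cell
```

Larger, more obvious effects need fewer subjects (d = 0.8 needs about half as many), which is exactly why underestimating the subtlety of an effect leads to underpowered studies.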

All of this culminated in a heuristic that they referred to as “sample size estimates by analogy.” The argument: if your hypothesized effect is less intuitive/obvious than the fact that liberals like Obama more than conservatives, or that men are physically larger than women, you must collect at least 50 data points per cell in order to have adequate statistical power. This doesn’t mean that collecting 51 data points per cell guarantees enough power. It’s just a rule of thumb, and probably a good one, since so many researchers collect samples with n < 30 and (as we saw above) do no a priori power analysis at all. So, the next time you’re planning a study, keep this heuristic in mind: unless your predicted effect is more obvious than the ones mentioned above, plan for a sample of at least n = 50 per cell, if not more.

Happy holidays everyone!

Nelson, L., Simonsohn, U., & Simmons, J. (2013). False-positive II: Effect sizes too small, too large, or just right. Symposium presented at the annual meeting of the Society for Personality and Social Psychology, New Orleans, LA.

You may also enjoy checking out their blog: http://datacolada.org/
