Saturday, December 28, 2013

Power analysis - "sample size by analogy"

Statistics is often called an “art” just as much as a “science.” This maxim is perhaps most apt when it comes to power analysis. Power analysis is tricky because it requires an estimate of the predicted effect size, so it is often qualitative (relying on subjective judgments about effect sizes) as well as quantitative (relying on calculated estimates). For those unfamiliar with the general concept, one type of power analysis involves planning, before data collection begins, how much data you will need in order to have enough statistical power to detect an effect in your study, if that effect truly exists in the world. [The other type involves determining the power/effect size of your study after data collection. This article focuses on the former.] If you don’t collect enough data, you may not have enough power to detect the true effect (that is, to achieve statistical significance). This is one cause of a Type II error, assuming the effect is actually present in the population. Adequate power is typically set at 80%, which means that if an effect truly exists, there is an 80% chance of finding it (and a 20% chance of missing it).

Earlier this year, at the annual conference of the Society for Personality and Social Psychology (SPSP 2013), Joseph Simmons, Uri Simonsohn, and Leif Nelson (three leading experts on statistics and strong advocates for proper research methodology) held a symposium covering a variety of topics, including p-values, fraud detection, and direct replications. They also offered a piece of practical advice for researchers preparing to conduct studies. It involved qualitative power analysis.

Their argument boiled down to a few points:
1)    In psychology, there is a strong tendency for researchers to emphasize whether an effect is present (i.e., statistically significant) rather than how large it is (small, medium, large). But effect size estimates are exactly what is needed to determine sample sizes.
2)    Often researchers do not bother to conduct power analyses at all. 
3)    Researchers instead often base their final sample sizes on preliminary p-values computed on incomplete data, and stop collecting data only when their key result passes the p < .05 threshold (this is sometimes called “p-hacking”). This practice produces many false positives (Type I errors).
4)    One key reason researchers don’t conduct power analyses is that they are poor at estimating effect sizes. Even expert researchers estimate effect sizes poorly, even for effects they know will be significant because they are obvious.

So what is the solution? Simmons, Simonsohn, and Nelson offered a new tool that may make it easier for scientists to produce better power analysis estimates. They first provided a few examples of effects (mean differences) in the social/natural sciences that are painfully obvious even without any statistical analysis. One example was gender and body size: on average, men are physically larger (measured by height/weight) than women. Another example (more relevant to psychological science) was political ideology (liberal vs. conservative identification) and liking Barack Obama: on average, liberals like Obama more than conservatives do. Even though these effects are obvious (and large in size), any study that attempts to demonstrate them still needs enough statistical power. When audience members were asked to estimate how many data points would be needed to detect significance for these effects, many gave estimates in the 15-30 (per cell) range. But that is wrong. It turns out that you need to collect about 50 data points per cell to have enough statistical power to detect that liberals like Obama more than conservatives, or that men are physically larger than women.
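As a rough sanity check on the 50-per-cell figure, base R's power.t.test() can be used, assuming a simple two-group comparison and a standardized mean difference of about d = 0.57; that effect size is our assumption for illustration, not a number reported in the talk.

# Rough check of the "50 per cell" rule, assuming a two-sample t-test and
# a standardized mean difference of about d = 0.57 (an assumed value for
# illustration, not a figure from the talk)
power.t.test(delta = 0.57, sd = 1, sig.level = 0.05, power = 0.80)
# n comes out to roughly 49-50 per group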

All of this culminated in a heuristic that they referred to as “sample size estimates by analogy.” The argument is that if your hypothesized effect is less intuitive/obvious than the fact that liberals like Obama more than conservatives, or that men are physically larger than women, you must collect at least 50 data points per cell in order to have adequate statistical power. This doesn’t mean that if you collect 51 data points per cell you are guaranteed to have enough power. It’s just a rule of thumb, and probably a good one, since so many researchers collect samples with n < 30 and (as noted above) do no a priori power analysis at all. So, the next time you’re planning a study, keep this heuristic in mind. Unless your predicted effect is more obvious than the ones mentioned above, plan for a sample with at least n = 50 per cell, if not greater.

Happy holidays everyone!

Nelson, L., Simonsohn, U., & Simmons, J. (2013).  False positive II: Effect sizes too small, too large, or just right. Presented at the annual meeting of the Society for Personality and Social Psychology. New Orleans, LA.

You may also enjoy checking out their blog: http://datacolada.org/

Monday, December 23, 2013

Fitting Probability Distributions (in R)

What is Curve Fitting?

The purpose of curve fitting is to find parameter values for a function that best 'fit' a set of data points. One type of curve fitting you probably already know about, though perhaps not using these terms, is regression analysis.

For example, suppose we had one predictor and one criterion variable, and wanted to conduct a simple regression including an intercept term. Our regression equation would look like the following equation.

\[ Y_i= \beta_0 + \beta_1X_i + \epsilon_i \mbox{, where } \epsilon_i \stackrel{iid}{\sim} N(0,\sigma^2) \]

Regression determines \( b_0 \) and \( b_1 \), sample estimates for the population values \( \beta_0 \) and \( \beta_1 \), by minimizing the sum of squared residuals between \( Y_i \) and \( \hat{Y}_i \), where \( \hat{Y}_i=b_0 + b_1X_i \)
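As a minimal illustration, lm() performs exactly this least-squares fit. The data below are simulated, so the particular intercept and slope are assumptions chosen for the example, not values from the tutorial.

# Simulated data: true intercept 1, true slope 2
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

fit <- lm(y ~ x)   # least-squares estimates b0 and b1
coef(fit)          # should be close to 1 and 2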

Probability Distribution Fitting Example

The purpose of fitting probability distributions is to model the probability or frequency of a process. For this tutorial, we're going to model probability estimates elicited from subjects guessing the age of a man in a photograph. In other words, we're going to estimate a continuous subjective probability distribution using only a few observed estimates. It is important to note that while we are directly modeling probabilities in this example, one could fit probability distributions in a variety of contexts.

Scenario

Assume we have asked participants to estimate three probability intervals \( \{p_1,p_2,p_3\} \) for the age of a man in a photograph, which we'll represent with \( A \). These intervals are determined by two cutpoints, that is, two candidate ages: \( \{a_1,a_2\} \):

\[ \begin{align*} p_1&=P(A \leq a_1) \\ p_2&=P(A \leq a_2) \\ p_3&=P(A > a_2) \\ \end{align*} \]

Suppose these candidate ages were \( \{a_1=30, a_2=50\} \). Then, in words, we could elicit \( \{p_1,p_2,p_3\} \) with the following questions:

  • What is the probability that the man is less than 30 years old? (\( p_1 \))
  • What is the probability that the man is less than 50 years old? (\( p_2 \))
  • What is the probability that the man is more than 50 years old? (\( p_3 \))

Assume that our participant gave responses of .3, .8, and .2 for the three questions above. We can express our three observed probability estimates as two values of a cumulative distribution function \( F_A(a) \). For this example we're using the normal distribution to represent the estimates, though there are many different probability distributions one could use depending on the properties of the data you are fitting.

\[ \begin{align*} F_A(a_1)&=P(A\leq a_1)=p_1=.3 \\ F_A(a_2)&=P(A\leq a_2)=p_2=.8 \\ \end{align*} \]

Now suppose we wanted to model how old a participant believed the man in the photograph actually was. While we didn't ask this question directly, if we knew the true underlying distribution of \( A \), we could use its mean, \( E(A) \), as the subject's estimate (or its median, \( Q_A(.5) \), where \( Q_A \) is the inverse cumulative distribution function, or quantile function, for \( A \)).

While we don't know what the true distribution of \( A \) is, we can take a known distribution and find parameter values that provide the best fit between our observed values of \( A \) and the values predicted by a fitted distribution \( \hat{A} \). What constitutes 'best fit' is defined by whatever optimizing rule, or objective function, we choose to apply. A common rule is to minimize squared error between our observed and predicted values. Since in this case what we observed were probabilities, we will choose to minimize the squared error between the observed and predicted probabilities.

\[ \displaystyle\sum\limits_{i=1}^n (F_A(a_i)-F_\hat{A}(a_i,\theta))^2 \]

Where \( \theta \) is the vector of parameters for our fitted distribution. Since we're using a normal distribution in this example, \( \theta \) is the mean and variance.

\[ (F_A(a_1)-F_\hat{A}(a_1,\theta))^2+ (F_A(a_2)-F_\hat{A}(a_2,\theta))^2 \]

We already know \( F_A(a_1) \) and \( F_A(a_2) \) (.3 and .8, respectively), and we picked a normal distribution for \( \hat{A} \), so \( \hat{A} \sim N(\theta) \) with \( \theta = \{\hat{\mu},\hat{\sigma}^2\} \).

\[ (.3-F_\hat{A}(a_1;{\hat{\mu},\hat{\sigma}^2}))^2+ (.8-F_\hat{A}(a_2;{\hat{\mu},\hat{\sigma}^2}))^2 \]

Now that we have selected a distribution to fit and an objective function, we need to find the values of \( \hat{\mu} \) and \( \hat{\sigma}^2 \) that minimize our objective function. As the example R code below will demonstrate, the estimated mean will be approximately 37.7. Precisely how optimization routines work is beyond the scope of this tutorial, but generally, routines like those in R take some user-defined starting values for \( \theta \) and then repeatedly evaluate the objective function with different values of \( \theta \) until some stopping rule is reached, for example \( n \) iterations with no decrease in the objective function. Because these routines can get 'stuck' in local minima, one should fit the candidate distribution with a range of starting values for \( \theta \). For brevity's sake, we'll only use one set of starting values in this example.

Fitting One Subject's Judgements in R

For this example, assume we have asked one subject to estimate the probabilities that a man in a photograph is less than 30 years old, less than 50 years old, and older than 50 years old. The subject's observed responses are {.3, .8, .2}. We then need to convert these responses to cumulative probabilities.

# Assign quantile values to variable 'q'
q <- c(30,50)                   

# Assign subject's observed probability estimates to variable 'p' 
p <- c(.3,.8,.2)              

# Retain the first 2 cumulative probabilities;
# the last probability is redundant (it equals 1 - p[2])
p.cum <- p[1:2]

Now we need to pick a distribution to fit, starting parameter values for that distribution, and define our objective function. We'll use the normal for this example, and a mean of 40 and variance of 10 for our starting values. Remember, you'll typically want to use many different starting values in order to avoid local minima.

# Define the objective function to minimize
optimFn <- function(theta) sum((p.cum - pnorm(q, theta[1], theta[2]))^2)
# This is an R version of the objective function described in the math section:
# pnorm() is the cumulative distribution function of the normal distribution;
# theta[1] and theta[2] are the mean and standard deviation that will be
# passed to it

# Define starting values
mu <- 40
sigma <- sqrt(10)

# Fit the distribution
A.hat <- optim(c(mu, sigma), optimFn)

# Print the parameter values (mean and standard deviation)
A.hat$par
## [1] 37.68 14.64

So the mean of the distribution we just fit to our subject's responses is 37.6782, with a standard deviation of 14.6419. If you wanted to check the fit with a plot, you could do that as well. Since we are fitting a two-parameter distribution to two points, the curve should almost always pass exactly through the points.

fitted.mean <- A.hat$par[1]
fitted.sd <- A.hat$par[2]

# CDF
plot(x <- seq(fitted.mean - 3 * fitted.sd, fitted.mean + 3 * fitted.sd, by = 1), 
    pnorm(x, fitted.mean, fitted.sd), type = "l", col = 2, xlab = "Age of Man in Picture", 
    ylab = "Probability")
points(q, p.cum)
legend("topleft", col = c("red"), lty = 1, legend = c("Fitted Distribution"))

[Figure: fitted normal CDF with the two observed cumulative probabilities overlaid]
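Since the text above recommends trying multiple starting values to avoid local minima, here is a minimal sketch that refits with a small set of starts and keeps the best solution. The starting points are arbitrary choices for illustration, not values from the tutorial.

# Refit with several arbitrary starting values and keep the best solution
starts <- list(c(20, 5), c(40, sqrt(10)), c(60, 30))
fits   <- lapply(starts, function(s) optim(s, optimFn))
best   <- fits[[which.min(sapply(fits, function(f) f$value))]]
best$par   # should agree with the earlier estimates (~37.7 and ~14.6)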

Sunday, December 15, 2013

Comparing Models using BIC and AIC

As George Box once said, “All models are wrong, but some are useful.” No model perfectly reflects the data, but AIC and BIC can help us judge which models reflect the data better than others.

AIC & BIC are both measures of the relative quality of a statistical model. AIC & BIC numbers do not provide a test of a model in the sense of testing a null hypothesis. They can tell nothing about the quality of the model in an absolute sense. They will not give any indication of poor fitting models.

What are AIC & BIC numbers?

The Akaike Information Criterion (AIC) is an estimator of relative expected Kullback-Leibler distance, based on Fisher’s maximum likelihood. AIC estimates a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so a smaller AIC means a model is estimated to be closer to the true model.

Bayesian Information Criterion (BIC) is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model.

What’s the difference between AIC and BIC numbers?
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
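For reference, the standard formulas make the difference in penalties explicit. With \( \hat{L} \) the maximized likelihood, \( k \) the number of estimated parameters, and \( n \) the sample size:

\[ \mathrm{AIC} = 2k - 2\ln(\hat{L}), \qquad \mathrm{BIC} = k\ln(n) - 2\ln(\hat{L}) \]

The BIC penalty per parameter, \( \ln(n) \), exceeds the AIC penalty of 2 whenever \( n > e^2 \approx 7.4 \), so in virtually any realistic sample BIC favors more parsimonious models.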

Which one to use?
AIC and BIC are each approximately correct, but for different goals and under different sets of asymptotic assumptions. To compare them, we need to look closely at the nature of the data-generating model (for example, whether it has many tapering effects), whether the model set contains the generating model, and the sample sizes considered. It also depends on whether we want to select the true model or the best K-L approximating model. Thus, it is the (often unknown) context and objective of use (do we want the “true” model or the “best” model?) that is critical for deciding which measure is better.

In general, it might be best to use both AIC and BIC in model selection. AIC is better in situations when a false negative finding would be considered more misleading than a false positive, and BIC is better in situations where a false positive is as misleading as, or more misleading than, a false negative.
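In practice, both criteria are one function call away in base R. Here is a small sketch with simulated data; the variables and models below are made up purely for illustration.

# Compare two candidate models with AIC and BIC (simulated data)
set.seed(123)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- 1 + 0.5 * x1 + rnorm(100)   # x2 is irrelevant by construction

m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)

AIC(m1, m2)   # lower values indicate the preferred model
BIC(m1, m2)   # BIC penalizes the extra parameter in m2 more heavily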

References:
Ask a Methodologist: AIC vs. BIC (2007, Spring). Retrieved from http://methodology.psu.edu/eresources/ask/sp07
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multi-model inference: a practical information-theoretic approach. Springer.


Sunday, December 8, 2013

Aggregating Data to Higher Levels of Analysis

Are you interested in looking at relationships between variables at different levels of analysis? Perhaps you are interested in measuring emergent social phenomena, or you want to justify aggregating individual data to a higher level of analysis. In either case, aggregation statistics (rwg's and intraclass correlation coefficients, or ICCs) can be helpful, and they are commonly employed in fields concerned with multi-level theories, such as organizational and cross-cultural psychology. The following paragraphs (1) discuss the purpose and usefulness of aggregation statistics, (2) describe the most widely used aggregation statistics and the interpretive information they provide, and (3) point to a resource that further explains how these statistics are calculated.

Purpose and Usefulness

To begin, aggregation statistics can be used to test assumptions inherent in the definition of particular constructs of interest. They allow researchers to (1) bolster claims that a particular variable does in fact reflect a particular construct, and (2) justify the aggregation of data from a lower level to a higher level of analysis. As an example, cultural values are often theoretically defined as a set of beliefs, motivations, and norms that are highly shared by the individuals comprising a particular cultural group; consequently, individuals are either aware of or have internalized these values and can identify or exhibit them on measures designed to assess them. However, while these values depend on and exist within the minds of individuals, they cannot be altered by any single individual, and because they are widely shared among members of the group they exert a socially expected force on individual behavior. Therefore, cultural values practically and theoretically exist at a higher level of analysis, separate from the level of the individual mind.

However, despite the fact that cultural variables theoretically exist beyond the individual level, measurement of cultural variables necessarily takes place at the individual level, as researchers have to administer surveys or experiments to individual people. Thus, a common problem arises in such research: the divergence between the level of measurement and the level of analysis. So how do we bridge this gap and claim that our individual-level measurements can be aggregated to represent a higher-level, cultural construct? Aggregation statistics (rwg's and ICCs) are the tools that allow us to do so. But what are these aggregation statistics, and how do we interpret them?

Rwg (within-group agreement)

Rwg is a measure of within-group agreement and is calculated for each particular group of interest; in other words, a sample that consists of five groups would necessitate five separate Rwg’s, with some groups exhibiting high within-group agreement and others low within-group agreement. Rwg is calculated by comparing observed within-group variance to some expected distribution of random variance. Typically, this expected distribution is represented as a uniform (or rectangular) distribution (where each possible response is equally likely as any other); however, other distributions are also used for theoretical reasons, including resampling methods (Bliese, 2000). The typical cut-off point for claiming within-group agreement used in most of the methodological and experimental literature tends to be .70. Consequently, groups that have an Rwg over this amount are considered to exhibit within-group agreement on a particular variable of interest, justifying aggregation of data to a higher level of analysis. For example, it would justify the claim that Cultural Group A has a shared perception of collectivism, allowing a researcher to aggregate individual-level scores into an overall collectivism score for Cultural Group A as a whole. This may allow a researcher to not only make the claim that Group A exhibits a shared cultural value for some construct, but also test the effect that this shared value has on other individual level behaviors, cognitions, etc.
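For concreteness, here is a minimal sketch of the single-item rwg calculation, comparing observed within-group variance with the variance expected under a uniform null distribution. The ratings and scale width below are made up for illustration.

# Single-item rwg for one group (hypothetical ratings on a 7-point scale)
ratings   <- c(5, 6, 5, 7, 6, 6, 5)
A         <- 7                      # number of response options
sigma2.eu <- (A^2 - 1) / 12         # variance expected under a uniform distribution
rwg <- 1 - var(ratings) / sigma2.eu
rwg   # about .86 here; values above .70 are conventionally taken as agreement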

ICC1 (non-independence)

ICC1, an intraclass correlation coefficient, is typically used as a measure of non-independence on a DV of interest; in other words, ICC1 allows researchers to determine the degree to which variance on an outcome variable or DV of interest is due to group membership. Specifically, it indicates the extent to which group members are interchangeable. When used with a dependent variable, it informs the researcher that there are group differences, or high between-group variability, on some variable of interest. It should be noted that ICC1 values above .30 are extremely rare; values commonly range from .05 to .20, with a median of .12 (Bliese, 2000). Thus, a non-zero ICC1 value helps determine whether testing between-group differences (such as differences between groups in collectivism) is justified.

It is also important to note that ICC1 can be an indicator of reliability when used with an independent (rather than dependent) variable of interest.

ICC2 (reliability)

ICC2, another intraclass correlation coefficient, is used as an indicator of the reliability—or consistency—of group means. Due to the method of calculation, ICC2 values are much higher than ICC1 values, typically above .70 if reliability of the group mean is high.
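As a rough sketch of how both ICCs can be computed from a one-way ANOVA, following the formulas presented in Bliese (2000), here is an example with simulated, balanced groups; Bliese's multilevel package also provides ICC1() and ICC2() helpers that accept an aov object.

# ICC(1) and ICC(2) from a one-way ANOVA (simulated data, 20 groups of 10)
set.seed(7)
n.groups <- 20
k        <- 10                                   # members per group
group    <- factor(rep(1:n.groups, each = k))
y        <- rnorm(n.groups, sd = 0.5)[group] + rnorm(n.groups * k)

fit <- aov(y ~ group)
ms  <- summary(fit)[[1]][["Mean Sq"]]
MSB <- ms[1]; MSW <- ms[2]

ICC1 <- (MSB - MSW) / (MSB + (k - 1) * MSW)   # proportion of variance due to group
ICC2 <- (MSB - MSW) / MSB                     # reliability of the group means
c(ICC1 = ICC1, ICC2 = ICC2)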

Summary

In conclusion, the three aggregation statistics described above are distinct and often mutually reinforcing, each providing different and necessary information for justifying aggregation of individual-level data to a higher level. One may have a high ICC2 (reliable group means) but low agreement if individuals are proportionally consistent in their ratings on a measure (one person typically uses 1, 2, and 3, while another uses 5, 6, and 7) but their ratings do not converge on a particular shared score; the reverse is also possible. In all, Rwg and ICC2 together support the aggregation of individual scores to a higher-level construct of interest, typically when both are high. While ICC1 is not necessary for justifying aggregation per se, a low ICC1 (despite high reliability and high within-group agreement) indicates low between-group variability. This often undermines the most common research questions, which concern differences between groups on a particular construct. Aggregation may therefore become pointless from a practical standpoint, as a lack of variability makes it impossible to predict meaningful differences between groups (though this may not matter, depending on one's question of interest).

Finally, for the purposes of calculating these various aggregation statistics, and for a longer discussion of their usefulness and application, please refer to the chapter “Within-Group Agreement, Non-Independence, and Reliability: Implications for Data Aggregation and Analysis” by Bliese (2000).

References

Bliese, P. D. (2000). Within-Group Agreement, Non-Independence, and Reliability: Implications for Data Aggregation and Analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multi-level theory, research, and methods in organizations (pp. 349-381). San Francisco: Jossey-Bass.  

Wednesday, December 4, 2013

Polynomial Regression for Difference Scores

Many research questions are meant to be addressed with difference scores, such as:
  • Can patients provide self-evaluations of their psychological state that agree with psychology professionals’ ratings?
  • How well do the attributes of teachers fit the needs or desires of their students?
  • Do the values of an organization’s supervisors match the values of its subordinates?
  • How similar are the perceptions of stigma held by mentally ill patients and their caregivers?

We can start answering these questions by treating the fit, similarity, or agreement between two constructs as a predictor (Edwards, 1994). However, difference scores can be problematic if we do not handle them in the right way. For instance, difference scores are less reliable than either of their component measures. Moreover, using a difference score reduces the three-dimensional relationship between the two component measures and the outcome measure to two dimensions.

In order to measure the congruence between two constructs and how that congruence affects an outcome, Jeff Edwards (2002) proposed polynomial regression with three-dimensional response surface methodology as a solution to these problems. The polynomial regression approach includes both component measures along with higher-order terms, such as their quadratic terms and their interaction. The polynomial regression analysis also allows researchers to conceptualize the joint effects of the two constructs on an outcome as a three-dimensional surface.

Let us walk through how to construct the polynomial regression. There are a few steps to follow (an R sketch of the full sequence appears after the list):
  • Fit a linear model using the two construct measures (X and Y) as predictors and Z as the outcome. Calculate R-squared; if R-squared is not significant, we stop the procedure.



\[ Z = b_0 + b_1X + b_2Y + e \]

  • If R-squared is significant, we add the quadratic terms \( X^2 \) and \( Y^2 \) and the interaction term \( XY \) as a set. We then test the increment in R-squared; if the increment is not significant, we return to the linear model.


\[ Z = b_0 + b_1X + b_2Y + b_3X^2 + b_4XY + b_5Y^2 + e \]

  • If the increment in R-squared is significant, we add the cubic terms \( X^3 \), \( X^2Y \), \( XY^2 \), and \( Y^3 \). Again, we test the increment in R-squared; if it is not significant, we return to the quadratic model.


\[ Z = b_0 + b_1X + b_2Y + b_3X^2 + b_4XY + b_5Y^2 + b_6X^3 + b_7X^2Y + b_8XY^2 + b_9Y^3 + e \]
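Here is a minimal R sketch of this hierarchical sequence, with the increments in R-squared tested via anova(). The data are simulated, so the variable names and effect sizes are assumptions chosen purely for illustration.

# Simulated example: X and Y are the two component measures, Z the outcome
set.seed(1)
n <- 200
X <- rnorm(n)
Y <- rnorm(n)
Z <- 2 + 0.5 * X + 0.3 * Y + 0.2 * X * Y + rnorm(n)

m1 <- lm(Z ~ X + Y)                                                 # linear
m2 <- update(m1, . ~ . + I(X^2) + I(X * Y) + I(Y^2))                # + quadratic set
m3 <- update(m2, . ~ . + I(X^3) + I(X^2 * Y) + I(X * Y^2) + I(Y^3)) # + cubic set

summary(m1)    # step 1: the overall F test assesses the linear R-squared
anova(m1, m2)  # step 2: test the increment from the quadratic set
anova(m2, m3)  # step 3: test the increment from the cubic set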

After we have established our model, it is time to draw some fancy 3D response surface graphs. Jeff Edwards has made this process very easy: you can download his Excel file for plotting response surfaces and plug in the coefficients of your model, and the Excel file will generate the graph automatically.
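If you would rather stay in R than use the Excel sheet, a rough base-R alternative is to evaluate the fitted surface over a grid and draw it with persp(). The sketch below assumes the quadratic fit m2 from the simulated example above, so it is an illustration rather than the Edwards tool itself.

# Plot the fitted quadratic response surface with persp() (base R)
gx <- seq(-2, 2, length.out = 40)
gy <- seq(-2, 2, length.out = 40)
grid <- expand.grid(X = gx, Y = gy)
grid$Zhat <- predict(m2, newdata = grid)
persp(gx, gy, matrix(grid$Zhat, nrow = length(gx)),
      theta = 45, phi = 25,
      xlab = "X", ylab = "Y", zlab = "Predicted Z")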

Here are some examples of response surface graphs:
Linear: [response surface plot]
Quadratic: [response surface plot]

References:
Edwards, J. R. (1994). The study of congruence in organizational behavior research: Critique and a proposed alternative. Organizational Behavior and Human Decision Processes, 58, 51-100 (erratum, 58, 323-325).
Edwards, J. R. (2002). Alternatives to difference scores: Polynomial regression analysis and response surface methodology. In F. Drasgow & N. W. Schmitt (Eds.), Advances in measurement and data analysis (pp. 350-400). San Francisco: Jossey-Bass.