Tuesday, November 19, 2013

General monotone model for data analysis

Summary

Use gemmR! General Monotone Model in R, available on GitHub.

Background

Empirical data in the social sciences are rarely well-behaved. We often collect data that are skewed, have changing variance over the observed range, or are related to other observed variables non-linearly. Generally, the advice for dealing with these problems involves transforming data, dropping data, or ignoring violations of assumptions. In the first two cases, transformation and outlier deletion are justified by claiming that parametric statistics require well-behaved data that conform to their assumptions. In the last case, we are required to believe that parametric linear models are robust to violations of those assumptions. At a minimum, one of these assertions must be wrong.

It is not only our data that conform poorly to our tests; the hypotheses we test tend to be underspecified relative to the results of parametric statistics. When we examine the effect of an intervention, for example, we assume that the intervened-upon objects will respond with higher or lower measurements on average, relative to some unaltered group. In general, we predict the order of the means but not the distance between them (there are some exceptions to this, but they are rare in experimental social science research). We then proceed to test our hypotheses by using data to estimate the expected difference between our groups on some outcome. Only then do we reduce the level of our inference back to a direction (one group does more or less of a thing) relevant to our original research question.

Order statistics and non-parametric approximations

For simple cases, order statistics already exist. Kendall's \( \tau \) is the ordinal equivalent of the Pearson correlation, while the Mann-Whitney \( U \) and Kruskal-Wallis tests correspond to Student's \( t \)-test and the one-way analysis of variance, respectively. With multiple predictors, however, parametric approximations are required. These approximations tend to estimate additional parameters to govern the non-linearities in the relationship between our predictors and outcome variable (generalized additive models are one example). Especially for sparse data, this can be intractable.
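The distinction is easy to see numerically. Below is a small sketch (in Python rather than R, with both statistics computed from scratch): Kendall's \( \tau \) depends only on rank order, so a monotone transformation of the outcome changes the Pearson correlation but leaves \( \tau \) untouched.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from the textbook formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(x)
    sign = lambda a, b: (a > b) - (a < b)
    num = sum(sign(x[i], x[j]) * sign(y[i], y[j])
              for i in range(n) for j in range(i + 1, n))
    return num / (n * (n - 1) / 2)

x = list(range(1, 11))
y = [xi ** 3 for xi in x]         # monotone but non-linear in x
y_log = [math.log(v) for v in y]  # monotone transformation of y
```

Here `pearson_r(x, y)` and `pearson_r(x, y_log)` differ (and both fall short of 1), while `kendall_tau` returns exactly 1 in both cases.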

Dougherty and Thomas (2012) proposed the General Monotone Model (GeMM) as a solution for this problem. GeMM is a genetic search-based, all-subsets ordinal regression algorithm that maximizes the rank correspondence between a model and criterion variable. GeMM basically follows three steps:

  1. Produce a random set of regression weights.
  2. Calculate a penalized \( \tau \) between the resulting weighted composite of predictors and the criterion.
  3. Repeat many times, then select the set of weights that provides the best ordinal fit.
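The loop above can be sketched as follows. This is a toy stand-in, not gemmR's implementation: it uses plain Python and naive random search in place of the genetic algorithm, and a flat per-predictor penalty in place of GeMM's actual fit penalty.

```python
import random

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(x)
    sign = lambda a, b: (a > b) - (a < b)
    num = sum(sign(x[i], x[j]) * sign(y[i], y[j])
              for i in range(n) for j in range(i + 1, n))
    return num / (n * (n - 1) / 2)

def gemm_search(X, y, n_iter=2000, penalty=0.01, seed=1):
    """Random-search stand-in for GeMM's genetic algorithm:
    propose weights, score by penalized tau, keep the best."""
    rng = random.Random(seed)
    n_pred = len(X[0])
    best_score, best_w = float("-inf"), None
    for _ in range(n_iter):
        # a weight of zero drops a predictor entirely, mimicking
        # the all-subsets character of the search
        w = [rng.choice([0.0, rng.uniform(-1.0, 1.0)])
             for _ in range(n_pred)]
        pred = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
        k = sum(wi != 0.0 for wi in w)
        score = kendall_tau(pred, y) - penalty * k  # simplified penalty
        if score > best_score:
            best_score, best_w = score, w
    return best_score, best_w

# x1 determines the order of y; x2 is irrelevant noise
X = [[i, (i * 7) % 5] for i in range(1, 11)]
y = list(range(1, 11))
score, w = gemm_search(X, y)
```

With this toy data the search settles on a positive weight for the informative predictor and exactly zero for the irrelevant one, because a nonzero second weight cannot improve \( \tau \) enough to pay its penalty.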

The result is a set of coefficients that maximizes ordinal fit between some set of predictors and an outcome. Weights of zero indicate that a given predictor does not explain enough paired comparisons at the model level (essentially an ordinal equivalent to squared error) to offset the penalty for including that predictor. Notably, this procedure applies whether you thought the linear model was robust to assumption violations or you intended to transform your data to meet those assumptions. With GeMM, there are fewer decisions to make: the results are theoretically invariant to monotonic transformation. This is a relatively complex set of calculations to program from scratch, however, and the only previous implementation of GeMM required extensive familiarization with a MATLAB script. Ideally, GeMM would be usable by anyone, without extensive code interpretation.

Introducing gemmR

This led us to develop gemmR, a GeMM library for the R statistical language. gemmR improves on the previously available code in a number of ways:

  • gemmR is simple.
    • Fewer assumptions mean there are fewer diagnostic and cleanup operations to perform. Non-normal data and outliers are not problems to consider when using GeMM.
  • gemmR is standardized.
    • Calls to gemm use the existing R framework for specifying models. You no longer have to reshape your data into a special format in order to run GeMM. If you can run a linear model in R, you can run GeMM.
  • gemmR is fast.
    • Searching through candidate regression coefficients using Kendall's \( \tau \) is a computationally expensive process and necessarily takes some time. Our R package uses a faster implementation of the \( \tau \) calculation, written in C++, to speed this process. We also rely on R's existing parallelization structure for the potential to go even faster.
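For intuition about the speed-up: the naive \( \tau \) compares all \( O(n^2) \) pairs, while the standard fast approach sorts by one variable and counts inversions in the other with a merge sort, giving \( O(n \log n) \). Below is a plain-Python sketch of that idea, assuming no ties; gemmR's actual C++ internals may differ.

```python
def kendall_tau_fast(x, y):
    """Kendall's tau-a in O(n log n) via merge-sort inversion counting.
    Assumes no tied values in x or y."""
    n = len(x)
    # Sort observations by x, then count how far the y values
    # are from sorted order; each inversion is a discordant pair.
    order = sorted(range(n), key=lambda i: x[i])
    ys = [y[i] for i in order]

    def count_inversions(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, li = count_inversions(a[:mid])
        right, ri = count_inversions(a[mid:])
        merged, inv = [], li + ri
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                # right[j] precedes the remaining left elements:
                # one inversion per remaining left element
                merged.append(right[j]); j += 1
                inv += len(left) - i
        merged.extend(left[i:]); merged.extend(right[j:])
        return merged, inv

    _, inv = count_inversions(ys)
    total = n * (n - 1) // 2
    # tau = (concordant - discordant) / total, with
    # concordant = total - inv and discordant = inv
    return 1 - 2 * inv / total
```

For example, `kendall_tau_fast([1, 2, 3, 4, 5], [3, 1, 4, 5, 2])` has 4 discordant pairs out of 10, giving \( \tau = 0.2 \), and a fully reversed ordering gives \( \tau = -1 \).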

You can find gemmR and instructions for installation on GitHub. A more thorough vignette and set of examples are currently in the works, but the package maintainer would be happy to answer any questions you might have.
