One of the most common questions encountered by researchers is “how big does my sample need to be?” Unfortunately,
there are no hard and fast rules for determining an ideal sample size, although there
are some guidelines that can help researchers navigate this question.
Commonly,
researchers will conduct a “power analysis” to investigate how many
participants they would need to find a
relationship of a given expected size. Unless there is substantive prior
research in their area of interest, it can be difficult to know how large an
association between or among variables a researcher should expect. The size of the association -- or the effect size -- can vary from small (i.e., a subtle relationship,
which is common with many psychological and social processes) to large. Since
there is some guesswork as to the size of the effect of interest in the
population, power analyses necessarily only yield estimates of
the appropriate sample size. To the extent that a researcher overestimates the effect size in the population, they may recruit a sample that is too small to detect the effect of interest, wasting time and money on a study that cannot find what it is looking for. Conversely, a researcher who underestimates the effect size in the population may recruit a sample far larger than needed to detect the effect, wasting time and money on unnecessary participants. Despite these
drawbacks, power analysis can be a helpful guide to point researchers in the
right “ballpark” with respect to their sample size. To improve the usefulness of
power analysis, researchers should incorporate effect sizes calculated from
prior research into their estimate of effect size for their own research.
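As a concrete illustration, the sketch below runs a power analysis for an independent-samples t-test using Python's statsmodels library. The effect size (Cohen's d = 0.5), alpha level, and power target are illustrative assumptions, not recommendations.

```python
# Minimal power-analysis sketch; the effect size, alpha, and power
# values below are assumed for illustration only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Suppose prior research suggests a medium effect (Cohen's d = 0.5).
# Solve for the per-group sample size needed to detect it with 80%
# power at the conventional alpha of .05.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05,
                                   power=0.80, alternative='two-sided')
print(f"Participants needed per group: {n_per_group:.0f}")  # roughly 64
```

Rerunning the same calculation with a smaller assumed effect (d = 0.2) returns roughly 394 participants per group, which illustrates just how sensitive the answer is to the guessed effect size.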
The size of the population will also guide sample size. When a researcher is dealing with a small population (say, 50 people), a 30-person sample would provide very generalizable results. However, when a researcher is dealing with a much larger population (say, 5,000,000 people), a 30-person sample may be too small for adequate generalizability. Typically, then, the larger the population, the larger the sample a researcher might desire for the purposes of generalizability. The relationship between population size and sample size is not linear, however, so population size provides only a rough guideline for what sample size might be desirable.
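One standard way to see this nonlinearity is the finite population correction. The sketch below applies it to a baseline sample size; the baseline of roughly 384 comes from the conventional 95%-confidence, ±5%-margin calculation for a proportion and is used purely for illustration.

```python
# Sketch of the finite population correction, which shows why the
# required sample size does not grow linearly with population size.
def adjusted_sample_size(n0: float, population: int) -> float:
    """Shrink a baseline sample size n0 for a finite population."""
    return n0 / (1 + (n0 - 1) / population)

# Baseline n0 ~ 384 from the standard z^2 * p * (1 - p) / e^2 formula
# with z = 1.96 (95% confidence), p = 0.5, and e = 0.05 (5% margin).
n0 = (1.96**2 * 0.5 * 0.5) / 0.05**2

for N in (50, 500, 5_000, 5_000_000):
    print(f"Population {N:>9,}: sample of about {adjusted_sample_size(n0, N):.0f}")
```

Note how much more slowly the suggested sample grows than the population does: a hundred-thousand-fold increase in population, from 50 to 5,000,000, raises the suggested sample only from about 44 to about 384.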
When
determining sample size, a researcher must also consider
their research design and the associated costs of that design. Interviews and other qualitative research tend to be intensive in both time and money. Consequently, researchers
conducting qualitative research tend to work with smaller samples (30-100 participants), as
larger samples can be prohibitively expensive and difficult to acquire. Laboratory
experiments employing quantitative data can be somewhat time-consuming, as
participants must schedule times to come into the lab, so laboratory
experiments may generally employ mid-sized samples (100-250 participants). Finally,
survey research is very inexpensive to implement and takes very little time.
Consequently, large samples of over 500 participants are practical to
obtain through survey methodology.
Another
design consideration researchers need to wrestle with when determining sample
size is whether they are conducting a “between subjects” or a “within subjects”
design. In a “between subjects” design, only one data point for a given variable or relationship of interest is collected per participant. In a “within subjects” design, more than one data point for a given variable or relationship of interest is collected per participant. Because researchers
collect more data points per person in a within- than in a between-subjects design,
within-subjects designs require data to be collected from fewer participants overall relative to
between-subjects designs.
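A rough sense of this difference comes from comparing power calculations for an independent-samples t-test (between subjects) against a paired t-test (within subjects), as in the sketch below. The effect size of d = 0.5 is assumed, and the paired-test effect size is defined on difference scores (so it depends on the correlation between measurements); the comparison is illustrative only.

```python
# Rough between- vs. within-subjects comparison via statsmodels power
# calculations; d = 0.5 is an assumed, illustrative effect size.
from statsmodels.stats.power import TTestIndPower, TTestPower

d, alpha, power = 0.5, 0.05, 0.80

# Between subjects: independent-samples t-test, solved per group.
n_between = TTestIndPower().solve_power(effect_size=d, alpha=alpha, power=power)

# Within subjects: paired t-test on each participant's difference
# scores, so every participant contributes two data points.
n_within = TTestPower().solve_power(effect_size=d, alpha=alpha, power=power)

print(f"Between subjects: about {2 * n_between:.0f} participants in total")
print(f"Within subjects:  about {n_within:.0f} participants in total")
```

Under these assumptions the between-subjects design calls for roughly 128 participants in total, while the within-subjects design calls for roughly 33.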
In
addition to population size, cost, and design, researchers must also consider
their analytical plans when determining an appropriate sample size. The first
question researchers must wrestle with is the complexity of the statistical models they intend to assess. If a
researcher is testing a simple main-effects model, where only one outcome and one predictor are used, a relatively small sample is required. The
more effects are added to a statistical model of interest (e.g., additional
outcomes or predictors, interaction terms, etc.), the more participants will be
needed to obtain estimates of each effect. So, the more complex the statistical
model, the more participants should be sampled.
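One widely cited heuristic that captures this pattern for multiple regression is Green's (1991) rule of thumb, N ≥ 50 + 8k for k predictors. The toy sketch below simply tabulates that rule; it is a rough floor, not a substitute for a formal power analysis.

```python
# Toy tabulation of Green's (1991) regression rule of thumb,
# N >= 50 + 8k: each added predictor raises the minimum sample size.
def green_minimum_n(num_predictors: int) -> int:
    return 50 + 8 * num_predictors

for k in (1, 3, 5, 10):
    print(f"{k:>2} predictor(s): at least {green_minimum_n(k)} participants")
```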
Finally,
researchers must consider the statistical estimation technique they intend to
employ when determining their sample size. Statistics are calculated using one of many possible estimation
methods. Descriptive statistics, such as means and variances, are the least sophisticated and require the fewest participants. Similarly, some
estimation methods, such as Ordinary Least Squares, involve a relatively
straightforward equation where unknowns are acquired in one step. These
techniques are the basis of most common inferential statistics, including
linear regression, correlation, and so on, which are most commonly accessed through the statistical software programs SPSS, SAS, and Stata. According to the central limit theorem, the distribution of means estimated from repeated samples of the same population approaches a normal distribution as sample size grows, with 30 often cited as a practical threshold. Since most parametric OLS analyses rest on this normality assumption, smaller sample sizes might be sufficient for analyses of simple models using OLS estimation procedures or descriptive statistics.
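The central limit theorem claim is easy to check by simulation. The sketch below draws repeated samples from a deliberately skewed (exponential) population and shows that the distribution of the sample means becomes more symmetric as the sample size grows; the population and the sample sizes are arbitrary choices for illustration.

```python
# Quick central-limit-theorem check: means of samples drawn from a
# skewed (exponential) population look increasingly normal as n grows.
import numpy as np

rng = np.random.default_rng(0)

for n in (5, 30):
    # 10,000 samples of size n from an exponential population (mean = 1).
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # Skewness of the distribution of sample means; 0 means symmetric.
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print(f"n = {n:>2}: skewness of the sample means = {skew:.2f}")
```

The skewness of the sample means shrinks roughly in proportion to 1/sqrt(n), so by n = 30 the distribution is already far closer to normal than the heavily skewed population it came from.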
In contrast, many modern estimation procedures require the iterative maximization of a function. These procedures,
including maximum likelihood (ML) estimation and its variants, underlie
advanced modeling techniques such as Structural Equation Modeling, Mixed
Models, Generalized Estimating Equations, Item Response Theory, and Random
Coefficient Modeling, which are typically run through the statistical software
programs SPSS, SAS, Stata, Mplus, LISREL, AMOS, HLM, and so on. Iterative estimation procedures are sometimes called “large sample” procedures -- and for good reason. For some of these advanced models, parameter estimates will be biased if the sample size is below 100. So, for ML estimation techniques
and other iterative estimation procedures, a researcher will need to collect
samples of 100 participants, 250 participants, or in some cases, even more than
500 participants to estimate parameters of interest.
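To make the contrast with one-step OLS concrete, the sketch below fits a logistic regression by maximum likelihood using statsmodels; the optimizer reaches its estimates through repeated iterations rather than a single closed-form calculation. The simulated data, the sample size of 250, and the true coefficients are all illustrative assumptions.

```python
# Sketch of iterative ML estimation: statsmodels' Logit maximizes the
# log-likelihood numerically instead of solving a one-step equation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 250  # an assumed "large sample" size for illustration

x = rng.normal(size=n)
X = sm.add_constant(x)                  # intercept plus one predictor
p = 1 / (1 + np.exp(-(0.5 + 1.0 * x)))  # assumed true logistic model
y = rng.binomial(1, p)

# disp=True prints the optimizer's progress; note the iteration count
# reported before convergence, unlike a one-step OLS fit.
result = sm.Logit(y, X).fit(disp=True)
print(result.params)
```

With far fewer than 100 cases, fits like this one can fail to converge or can return unstable, biased estimates, which is exactly why such procedures carry the “large sample” label.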
The
issue of determining sample size is obviously not an easy one to resolve. Researchers
must consider the size of their effect of interest, the size of their
population, the costs of research, their research design, their statistical
model, and the types of analyses they intend to use when deciding how large a sample they need to collect. Some of these considerations may offer conflicting advice about sample size. Ultimately, as with all
research, the researcher needs to weigh all of these factors and come up with
their “best answer” that minimizes cost and maximizes the likelihood that they
will find a significant result if one indeed exists in the population.