Wednesday, October 2, 2013

A (very brief) primer on reliability


Classical Test Theory (CTT) is the measurement theory implicitly employed in most psychological research. It is so commonly employed that it is the default; unless a different measurement theory (e.g., Item Response Theory, Signal Detection Theory, Generalizability Theory) is mentioned, it can be assumed that a researcher operated under a CTT perspective.

According to CTT, an observed score (O) is the sum of two random variables: a “true score” (T) and random error (E). This relationship is often written as follows:

O = T + E

A true score is the expected value of an individual's observed score on a given attribute. CTT -- as you may have noticed -- only addresses random errors (e.g., errors in measurement due to random fluctuations in mood, the testing environment, etc.). Systematic errors, including social desirability and other response biases and styles, are ignored in the CTT framework. This is because systematic errors only affect the true score (i.e., the expected value), not random error. Since systematic errors only shift the mean/true score, and since the majority of social science (especially psychological) measurement does not have a “true zero,” systematic errors do not really interfere with measurement -- well, at least according to CTT.
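To make the point about systematic error concrete, here is a minimal simulation sketch. It assumes normally distributed true scores and errors and a bias that is the same constant for every respondent; the function name simulate_observed and all of the numbers are made up for illustration. Under those assumptions, the bias moves the mean but leaves the spread of observed scores untouched.

```python
import random
import statistics

def simulate_observed(n, bias=0.0, seed=0):
    """Return n observed scores O = T + E, plus an optional constant bias.

    The distributions (normal, mean 50, SDs of 10 and 5) are illustrative
    assumptions, not part of CTT itself.
    """
    rng = random.Random(seed)  # same seed -> same T and E draws on every call
    true_scores = [rng.gauss(50.0, 10.0) for _ in range(n)]
    errors = [rng.gauss(0.0, 5.0) for _ in range(n)]
    return [t + e + bias for t, e in zip(true_scores, errors)]

unbiased = simulate_observed(1000)
biased = simulate_observed(1000, bias=3.0)  # hypothetical uniform response bias

# The constant bias shifts the mean but does not change the variance.
mean_shift = statistics.mean(biased) - statistics.mean(unbiased)
variance_change = statistics.variance(biased) - statistics.variance(unbiased)
print(f"mean shift: {mean_shift:.3f}, variance change: {variance_change:.6f}")
```

A bias that differs from person to person would not behave this way, so the sketch only covers the constant-shift case described above.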

In CTT, then, reliability is conceptualized as the extent to which differences in observed scores are consistent with differences in true scores (formally, the proportion of observed-score variance that is true-score variance). Ideally, every measurement you make should be supported with reliability evidence -- which means that you will need to have two or more data points for each construct you measure. These measurements may come from two or more questions designed to measure the same construct (internal consistency reliability; as in a scale/survey), from two or more measurements using the same question or scale over time (test-retest reliability; in longitudinal research), or from two or more raters (inter-rater reliability; in observational research, for example). Depending on how the measures are acquired, different estimates of reliability can be calculated, and indeed, are preferred. Note the use of the term "estimate" in the prior sentence -- since reliability is a theoretical concept defined in terms of unobservable true scores, it can never be calculated directly. Instead, it must be estimated. So, the phrase, “the reliability of this scale is . . .” would be inaccurate. In contrast, the phrase, “the estimated reliability of this scale is . . .” would be more appropriate.
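As a rough illustration of why reliability has to be estimated, the sketch below computes the ratio of true-score variance to observed-score variance in a simulation, which is the one setting where true scores are actually known. The helper reliability_ratio and the chosen standard deviations (10 for true scores, 5 for error, giving a theoretical ratio of 100 / 125 = 0.8) are assumptions for the example.

```python
import random
import statistics

def reliability_ratio(true_scores, observed_scores):
    """Ratio of true-score variance to observed-score variance."""
    return statistics.variance(true_scores) / statistics.variance(observed_scores)

rng = random.Random(42)
# Simulated true scores (SD = 10) and observed scores with random error (SD = 5).
true_scores = [rng.gauss(0.0, 10.0) for _ in range(20000)]
observed = [t + rng.gauss(0.0, 5.0) for t in true_scores]

estimate = reliability_ratio(true_scores, observed)
print(f"variance ratio in this sample: {estimate:.3f}")
```

With real data the true_scores list does not exist, which is why the estimates described in the next section work from repeated items, occasions, or raters instead.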

Estimating Reliability

Internal consistency reliability. Cronbach's alpha -- which provides an estimate of the extent to which the items “cling” together (covary) -- is the most commonly employed estimate of the internal consistency reliability of a scale.
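For readers who want to see the arithmetic, here is a small sketch of the usual alpha formula: k / (k - 1) times one minus the ratio of summed item variances to the variance of the total score. The function name cronbach_alpha and the four-respondent, three-item data set are invented for the example.

```python
import statistics

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items
    item_variances = [
        statistics.variance([row[i] for row in scores]) for i in range(k)
    ]
    total_variance = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses: 4 respondents (rows) by 3 items (columns).
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 4],
]
alpha = cronbach_alpha(responses)
print(f"estimated alpha: {alpha:.3f}")
```

The sketch assumes complete data (no missing responses) and at least two items; a real analysis would more likely rely on an established statistics package.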

Test-retest reliability. When estimating test-retest reliability (or reliability over time), a simple correlation between scores from the two occasions provides a reliability estimate. For this type of reliability, researchers need to ensure that they are spacing their measures appropriately in time; measures should be far enough apart that participants can’t recall their answers from the first test, but not so far apart that participants have substantively changed in their true scores on the construct of interest.
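The sketch below shows the test-retest idea in a simulation, assuming true scores that do not change between occasions and independent random errors at each occasion. The helper pearson_r and the variable names time1 and time2 are made up for the example; with a true-score SD of 10 and an error SD of 5, the correlation should land near 0.8.

```python
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(7)
# Assumption: true scores are stable across the two occasions.
true_scores = [rng.gauss(0.0, 10.0) for _ in range(20000)]
time1 = [t + rng.gauss(0.0, 5.0) for t in true_scores]  # first administration
time2 = [t + rng.gauss(0.0, 5.0) for t in true_scores]  # second administration

retest_r = pearson_r(time1, time2)
print(f"test-retest correlation: {retest_r:.3f}")
```

If true scores drift between occasions, or if participants remember their earlier answers, the correlation no longer isolates random error, which is the spacing problem described above.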

Inter-rater reliability. Inter-rater reliability can be estimated using any number of indices, including % agreement, Intraclass Correlation Coefficients, Fleiss’s or Cohen’s Kappa, or rwg. The type of estimate you use for inter-rater reliability will depend largely on whether your data are dichotomous (e.g., % agreement), categorical (e.g., Fleiss's Kappa), or continuous (e.g., rwg), as well as on whether or not you have the same raters evaluating each observation.
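As one example of these indices, here is a minimal sketch of percent agreement and Cohen's kappa for two raters who each code the same set of observations. The function names and the ten yes/no ratings are hypothetical; kappa corrects the observed agreement for the agreement expected by chance from each rater's category frequencies.

```python
from collections import Counter

def percent_agreement(rater1, rater2):
    """Proportion of observations on which the two raters agree."""
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters coding the same observations."""
    n = len(rater1)
    p_observed = percent_agreement(rater1, rater2)
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = set(counts1) | set(counts2)
    # Chance agreement: product of the raters' marginal proportions per category.
    p_expected = sum(counts1[c] * counts2[c] for c in categories) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical codes from two raters for the same 10 observations.
rater_a = ["yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no"]
rater_b = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no", "yes"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"agreement: {agreement:.2f}, kappa: {kappa:.3f}")
```

Here the raters agree on 8 of 10 observations, but kappa comes out lower than 0.8 because some of that agreement would be expected by chance alone. The sketch does not handle the degenerate case where chance agreement equals 1 (both raters using a single category).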

All reliability coefficients detailed above are typically interpreted on a scale from 0 to 1, although some of them (e.g., correlations and kappa) can take negative values. The higher the coefficient -- that is, the closer to 1 -- the higher the estimated reliability of the test. Of course, standards for “appropriate” reliability vary by the estimate employed. For example, a test-retest correlation of 0.5 might be sufficient to support reliability, whereas a Cronbach's alpha below 0.8 might not be sufficient for some purposes. However, while “guidelines” for reliability estimates are common throughout the literature, there is no empirical justification behind these guidelines. They are truly “rules of thumb.”

Reliability theory, while elegant, has several drawbacks. First, CTT relies on weak assumptions, which means that no empirical test can prove that it doesn’t hold for any given measure. Second, reliability theory assumes that tests measuring the same construct are parallel, which is typically too strong an assumption for most scales. Finally, reliability theory necessarily lumps all sources of error into one error term. Given these drawbacks, it should not be surprising that researchers are increasingly turning to alternative measurement theories, such as Item Response Theory.

Recommended Books
Nunnally, J. C. & Bernstein, I. H. (1994). Psychometric Theory (Third Edition). NY: McGraw-Hill.

Embretson, S. E. & Reise, S. P. (2000). Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
