Reliability
Classical Test Theory (CTT) is the measurement theory implicitly employed in most psychological research. It is so commonly employed that it serves as the default; unless a different measurement theory (e.g., Item Response Theory, Signal Detection Theory, Generalizability Theory) is mentioned, it can be assumed that a researcher operated under a CTT perspective.
According to CTT, observed scores are a function of two random variables: "true scores" and random error. This relationship is often denoted as follows:

O = T + E
True scores are the expected value of an individual on a given attribute. CTT -- as you may have noticed -- addresses only random errors (e.g., errors in measurement due to random fluctuations in mood, the testing environment, etc.). Systematic errors, including social desirability and other response biases and styles, are ignored in the CTT framework. This is because systematic errors affect only the true score (i.e., the expected value), not the random error. Since systematic errors shift only the mean/true score, and since the majority of social science (especially psychological) measurement lacks a "true zero," systematic errors do not really interfere with measurement -- well, at least according to CTT.
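To see why, consider a minimal simulation of O = T + E (the variable names and values below are purely illustrative, not part of CTT itself): a constant systematic bias moves the mean of the observed scores, but adds nothing to the variability around it.

```python
# A minimal simulation of O = T + E (all names and values are illustrative).
# A constant systematic error shifts the mean of the observed scores but
# contributes nothing to the error variance.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

true_scores = rng.normal(loc=50, scale=10, size=n)  # T: each person's expected value
random_error = rng.normal(loc=0, scale=5, size=n)   # E: mean-zero noise

observed_clean = true_scores + random_error         # O = T + E
observed_biased = observed_clean + 3.0              # add a constant response bias

print(observed_biased.mean() - observed_clean.mean())  # exactly 3.0: the mean shifts
print(observed_biased.var() - observed_clean.var())    # exactly 0.0: the variance does not
```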
In CTT, then, reliability is conceptualized as the extent to which differences in observed scores are consistent with differences in true scores. Ideally, every measurement you make should be supported with reliability evidence -- which means that you will need two or more data points for each construct you measure. These measurements may come from two or more questions designed to measure the same construct (internal consistency reliability; as in a scale/survey), from two or more measurements using the same question or scale over time (test-retest reliability; in longitudinal research), or from two or more raters (inter-rater reliability; in observational research, for example). Depending on how the measures are acquired, different estimates of reliability can be calculated and, indeed, are preferred. Note the use of the term "estimate" in the prior sentence: because reliability is a theoretical quantity, it can never be calculated directly. Instead, it must be estimated. So the phrase "the reliability of this scale is . . ." would be inaccurate. In contrast, the phrase "the estimated reliability of this scale is . . ." would be more appropriate.
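Formally, CTT assumes that T and E are uncorrelated, so Var(O) = Var(T) + Var(E), and reliability can be written as a variance ratio:

reliability = Var(T) / Var(O) = Var(T) / (Var(T) + Var(E))

Since Var(T) involves unobservable true scores, the ratio itself is unobservable -- hence the need for the estimates described next.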
Estimating Reliability
Internal consistency reliability. Cronbach's alpha -- which provides an estimate of the extent to which the items "cling" or covary together -- is the most commonly employed estimate of the internal consistency reliability of a scale.
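As a sketch of how alpha is computed (assuming a respondents-by-items data matrix; the cronbach_alpha helper and data below are illustrative, not from any particular package):

```python
# A sketch of Cronbach's alpha, assuming a respondents x items numpy
# array (the cronbach_alpha helper is illustrative, not from a package).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array with rows = respondents, columns = scale items."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Made-up example: five noisy items driven by a single latent trait.
rng = np.random.default_rng(0)
trait = rng.normal(size=(500, 1))
items = trait + rng.normal(scale=0.8, size=(500, 5))
print(round(cronbach_alpha(items), 2))  # high alpha, because the items covary
```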
Test-retest reliability. When estimating test-retest reliability (or reliability over time), a simple correlation between scores at the two occasions provides a reliability estimate. For this type of reliability, researchers need to ensure that they are spacing their measures appropriately in time: measures should be far enough apart that participants cannot recall their answers from the first administration, but not so far apart that participants' true scores on the construct of interest have substantively changed.
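A minimal sketch of this estimate, assuming participants' true scores are stable between the two waves (all names and values below are made up):

```python
# A sketch of a test-retest estimate, assuming true scores are stable
# between the two waves (all names and values are made up).
import numpy as np

rng = np.random.default_rng(1)
n = 300

true_scores = rng.normal(50, 10, size=n)         # stable construct
time_1 = true_scores + rng.normal(0, 5, size=n)  # wave 1 observed scores
time_2 = true_scores + rng.normal(0, 5, size=n)  # wave 2 observed scores

# The simple correlation between waves is the reliability estimate.
print(round(float(np.corrcoef(time_1, time_2)[0, 1]), 2))  # ~0.80 here
```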
Inter-rater reliability. Inter-rater reliability can be estimated using any number of indices, including % agreement, Intraclass Correlation Coefficients, Fleiss's or Cohen's Kappa, or rwg. The type of estimate you use will depend largely on whether your data are dichotomous (e.g., % agreement), categorical (e.g., Fleiss's Kappa), or continuous (e.g., rwg), as well as on whether or not the same raters evaluate each observation.
All of the reliability coefficients detailed above should fall between 0 and 1 (in practice, estimates such as Cronbach's alpha or Kappa can occasionally dip below 0 when consistency is worse than chance). The higher the coefficient -- that is, the closer to 1 -- the more reliable the test. Of course, standards for "appropriate" reliability vary by the estimate employed. For example, a test-retest correlation of 0.5 might be sufficient to support reliability, whereas a Cronbach's alpha below 0.8 might not be sufficient for some purposes. However, while "guidelines" for reliability estimates are common throughout the literature, there is no empirical justification behind these guidelines. They are truly "rules of thumb."
Limitations
Reliability theory, while elegant, has several drawbacks. First, CTT relies on weak assumptions, which means that no empirical test can show that it fails to hold for any given measure. Second, reliability theory assumes that tests measuring the same construct are parallel, which is typically too strong an assumption for most scales. Finally, reliability theory necessarily lumps all sources of error into a single error term. Given these drawbacks, it should not be surprising that researchers are increasingly turning to alternative measurement theories, such as Item Response Theory.
Recommended Books
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.