Test Reliability: Should You Care?
by Steven B. Just

If you were a lab scientist, you would run your experiment and record your measurements. Then, if those measurements were important, what’s the first thing you would do? You would repeat the experiment, perhaps several times, to ensure that your results were accurate. If you then published your results in an academic journal, other scientists would try to replicate them before accepting them. If your results could be replicated, there would be a general consensus that your measurements were reliable.

A test, of course, is a form of measurement, one sometimes used for important corporate personnel decisions. Yet most corporate trainers accept their measurements based on a single administration of the test, without questioning the reliability of their results. This is not true for standardized tests, which are rigorously analyzed, but it is certainly true for most of the testing done by our clients: a single summative result that measures the outcome of a one-time learning experience (instructor-led, print, eLearning, etc.).

The reasons for this are clear: We work under budget and time constraints and we don’t have the luxury of administering a test multiple times to comparable groups over a period of time to assure ourselves of the reliability of our results.

Fortunately, there are statistical methods that allow us to estimate test reliability from a single administration of a test. These are called internal consistency reliability measures, and the two most common statistics are Kuder-Richardson Formula 20 (K-R 20) and Cronbach's alpha. For typical knowledge-based assessments where items are scored dichotomously (i.e., right or wrong), these two measures are equivalent.
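As a rough illustration of how these two statistics are computed, here is a minimal Python sketch. The function names `kr20` and `cronbach_alpha` and the small `scores` table (six test takers, five items) are made up for this example, and the sketch assumes population variances throughout, which is the convention under which the two formulas agree for right/wrong items.

```python
from statistics import pvariance

def kr20(responses):
    """K-R 20 for rows of 0/1 item scores (one row per test taker)."""
    k = len(responses[0])                     # number of items
    n = len(responses)                        # number of test takers
    totals = [sum(row) for row in responses]  # total score per test taker
    item_var_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n  # proportion answering item i correctly
        item_var_sum += p * (1 - p)               # variance of a 0/1 item
    return (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))

def cronbach_alpha(responses):
    """Cronbach's alpha, using each item's population variance."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    item_var_sum = sum(pvariance([row[i] for row in responses]) for i in range(k))
    return (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))

# Hypothetical data: 1 = right, 0 = wrong
scores = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

print(round(kr20(scores), 3))
print(round(cronbach_alpha(scores), 3))
```

For this made-up table both functions return about 0.76. Note that both will fail with a division by zero if every test taker earns the same total score, since there is then no score variance to analyze.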

How do these reliability measures work? Imagine that we arbitrarily divide a single test in half (say, odd questions and even questions), score each half independently, and correlate one half with the other. In theory, if the test is internally consistent, the two sets of scores should correlate. Then we divide the test in half in a different way (say, the first half of the test and the second half) and correlate these two halves. And we keep doing this. In effect, these two reliability measures consider all possible split-half results for the test and average them to give one reliability estimate. The estimate is read like a correlation and will normally fall between 0 and 1; the closer to 1, the better.
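The two splits described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the full averaging over every possible split; the helper names `pearson` and `split_half` and the six-item `scores` table are invented for the example.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half(responses, first_idx, second_idx):
    """Score two halves of the test separately and correlate them."""
    half_a = [sum(row[i] for i in first_idx) for row in responses]
    half_b = [sum(row[i] for i in second_idx) for row in responses]
    return pearson(half_a, half_b)

# Hypothetical data: six test takers, six right/wrong items
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Odd questions (1, 3, 5) versus even questions (2, 4, 6)
odd_even = split_half(scores, [0, 2, 4], [1, 3, 5])
# First half of the test versus second half
first_second = split_half(scores, [0, 1, 2], [3, 4, 5])

print(round(odd_even, 2), round(first_second, 2))
```

The two splits give different correlations for the same data, which is why a single statistic that accounts for every possible split is more convenient than picking one split by hand.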

An Important Caveat

These reliability estimates were developed for norm-referenced tests (the type that give a nice wide distribution of scores). The type of testing most corporations do is criterion-referenced (passing is set at a high cut score, typically 90, and most students pass, so the grades tend to bunch up at the high end of the curve). Because these statistics depend on how much total scores vary from one test taker to the next, reliability estimates for criterion-referenced tests tend to be low; the full statistical explanation is beyond the scope of this article. Many psychometricians feel that these reliability measures are therefore not meaningful for criterion-referenced tests.
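A small Python sketch can show the effect of bunched scores. The data below are invented, the `kr20` helper is a plain implementation of the K-R 20 formula using population variances, and the cut of 9 out of 10 is simply an assumption chosen to mirror the 90 cut score mentioned above. The sketch computes K-R 20 for a group with a wide spread of scores, then again for only those who met the cut.

```python
from statistics import pvariance

def kr20(responses):
    """K-R 20 for rows of 0/1 item scores (one row per test taker)."""
    k = len(responses[0])
    n = len(responses)
    totals = [sum(row) for row in responses]
    item_var_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n  # proportion correct on item i
        item_var_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))

# Hypothetical 10-item test; each string is one test taker's right/wrong pattern
raw = [
    "1111111111", "1111111111", "1111111110", "1111111101", "1111101111",
    "1111111100", "1111110000", "1111000000", "1100000000", "0000000000",
]
everyone = [[int(c) for c in row] for row in raw]

# Keep only test takers at or above the assumed cut of 9 out of 10
passers = [row for row in everyone if sum(row) >= 9]

print(round(kr20(everyone), 2))  # wide spread of scores
print(round(kr20(passers), 2))   # scores bunched at the top
```

With the wide spread of scores the estimate is high (above 0.9 for this table). For the bunched group it collapses and even turns negative, which the formula permits when total scores barely vary. The items are identical in both cases; only the spread of scores changed.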

The Bottom Line

If you are doing criterion-referenced testing, run reliability statistics, but view the results critically.
