Subjectivity and the Angoff Process
by Steven B. Just
Readers of these articles and attendees at our
workshop on “Best Practices in Test Development”
know that the Angoff process is the most commonly recommended
method for setting legally defensible passing scores on criterion-referenced
tests. The Angoff method is known as a “conjectural”
method because it involves recording the judgments of subject
matter experts as to the difficulty of test items. For those
not familiar with the Angoff process, it works like this (there
are a number of variations on this process):
- Gather together subject
matter experts (a minimum of three).
- For each item on the
test have them rate the probability that a minimally competent
test taker would get the question correct.
- Sum each judge’s
ratings and convert the total to a percentage of the maximum
possible score (the number of items).
- Average the judges’
percentages. This average is the cut score.
So, for example:

Angoff Method: Example

| Item | Judge 1 | Judge 2 | Judge 3 |
|---|---|---|---|
| 1 | .75 | .80 | .85 |
| 2 | .80 | .90 | 1.00 |
| 3 | .75 | .75 | .90 |
| 4 | .90 | .90 | .80 |
| 5 | .95 | .75 | .85 |
| TOTAL | 4.15 | 4.10 | 4.40 |
| PERCENT | 83% | 82% | 88% |

Averaging the three judges’ percentages gives a cut score of about 84%.
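The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the basic variation described in this article: the function name `angoff_cut_score` and the layout of the `ratings` list (one row per item, one column per judge) are my own assumptions, not part of any standard tool.

```python
from statistics import mean

# Ratings from the example table: one row per item, one column per judge.
ratings = [
    [0.75, 0.80, 0.85],
    [0.80, 0.90, 1.00],
    [0.75, 0.75, 0.90],
    [0.90, 0.90, 0.80],
    [0.95, 0.75, 0.85],
]

def angoff_cut_score(ratings):
    """Return each judge's percentage and the average of those percentages.

    Hypothetical helper: sums each judge's ratings, converts the sum to a
    percentage of the number of items, then averages across judges.
    """
    n_items = len(ratings)
    judge_columns = zip(*ratings)  # regroup the ratings by judge
    judge_percents = [100 * sum(col) / n_items for col in judge_columns]
    return judge_percents, mean(judge_percents)

judge_percents, cut_score = angoff_cut_score(ratings)
print([round(p) for p in judge_percents])  # [83, 82, 88]
print(round(cut_score, 1))                 # 84.3
```

The per-judge figures match the PERCENT row of the table, and their average is the cut score.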
While it is possible to strive for “scientific
accuracy” through rigorous training, practice and discussion,
ultimately we are asking humans to make judgments -- a process
fraught with subjectivity.
I have run Angoff processes
in two ways, each with its own pros and cons:
Method 1: Have the group as a whole discuss
each item before the individuals make their ratings, or
hold the discussion after each individual has made his/her rating
and give the individuals a chance to change their scores based
on the discussion.
Pros: The process
tends to generate fewer outliers. The individuals tend to
reach consensus.
Cons: A single
dominant individual (either because he/she knows more, has
more seniority or simply has a stronger personality) can easily
sway the group.
Method 2: Have each individual rate the items without discussion.
Pros: You get an honest rating from each individual, unbiased
by group dynamics.
Cons: You tend
to get wildly disparate estimates. If you have only three
judges, for example, the average of three very
different estimates is not necessarily an accurate
estimate. (If it’s 30 degrees today and 90 degrees tomorrow,
you are not living in a pleasant 60-degree climate.)
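To make the temperature analogy concrete, here is a small sketch showing how two sets of ratings can share the same mean while disagreeing to very different degrees. The ratings and the `summarize` helper are hypothetical, invented purely for illustration; the range (highest minus lowest rating) is just one simple way to look at spread.

```python
from statistics import mean

def summarize(estimates):
    """Return the mean and the range (max minus min) of one item's ratings."""
    return mean(estimates), max(estimates) - min(estimates)

# Hypothetical ratings for a single item from three judges.
close_ratings = [0.70, 0.75, 0.80]      # judges largely agree
disparate_ratings = [0.50, 0.75, 1.00]  # judges disagree widely

for label, item in [("close", close_ratings), ("disparate", disparate_ratings)]:
    item_mean, item_range = summarize(item)
    print(f"{label}: mean={item_mean:.2f}, range={item_range:.2f}")
# close: mean=0.75, range=0.10
# disparate: mean=0.75, range=0.50
```

Both items average .75, but only the first reflects any real agreement among the judges, so it is worth looking at the spread before trusting the mean.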
Here are some ways to address these problems:
- Train the judges, practice,
and discuss the results, so everyone understands what you
are estimating. Admittedly, “estimating how many minimally
competent test takers out of 100 will get the item right”
is a difficult concept to get one’s mind around.
- Have five or more judges,
especially if you are using Method 2. Then, if you want, you
can throw out the highest and lowest ratings for each item.
- Pilot the test!! I
repeatedly implore my clients to do this, but under time
pressure to get the test out they often don’t. Often
the results of the pilot are fascinating, and disturbing.
I recently went through a multi-test Angoff process with
a client. In theory, the students who took the different
tests were from a homogeneous group, and all of the passing
scores we derived from the Angoff process were within five
points of one another. We found:
  - The average scores on the pilot tests
    were very different from one another (much more than the
    five-point Angoff spread).
  - A much higher percentage of students
    failed one of the tests than the others (i.e., the actual
    scores on this one test were much lower than the judges
    predicted they would be).
  - Despite our best efforts at item review,
    an incorrect answer slipped through.

  Because we had the pilot results, we
  were able to modify the tests and the cut scores to deliver
  consistent outcomes across all of the tests.
- Ask the Angoff
judges one additional question: “What percentage of
students who take this test would you expect to pass on the first
try?” Use this number as a reasonableness check against
the pilot results and the original Angoff scores to arrive
at a final cut score (this is known as the Beuk adjustment).
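The suggestion above to throw out the high and low ratings for each item can be sketched as follows. This is one possible variation, not a prescribed procedure: the function names and the five-judge ratings are hypothetical, and I am assuming the cut score is then taken as the average of the trimmed per-item means, expressed as a percentage.

```python
from statistics import mean

def trimmed_item_mean(item_ratings):
    """Drop the single highest and lowest rating for an item, then average.

    Assumes at least five judges, so at least three ratings remain.
    """
    if len(item_ratings) < 5:
        raise ValueError("need at least five judges to trim high and low")
    ordered = sorted(item_ratings)
    return mean(ordered[1:-1])

def trimmed_cut_score(ratings):
    """Average the trimmed per-item means and express the result as a percent."""
    return 100 * mean(trimmed_item_mean(row) for row in ratings)

# Hypothetical ratings: one row per item, five judges per row.
ratings = [
    [0.40, 0.75, 0.80, 0.85, 0.95],  # .40 and .95 are dropped
    [0.70, 0.70, 0.75, 0.80, 1.00],  # one .70 and 1.00 are dropped
]

print(round(trimmed_cut_score(ratings), 1))  # 77.5
```

Trimming this way keeps one unusually harsh or lenient judge from pulling an item's estimate, at the cost of discarding two of the ratings you collected for every item.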
The Angoff process is inherently subjective.
There are steps you can take to minimize this subjectivity, but it’s
a fact you must accommodate in your testing process.