In Testing, How Reliable Are Year-to-Year Comparisons?
A central tenet of the federal No Child Left Behind Act is that educational improvement at a school can be measured by comparing student scores on standards-based tests from one year to the next. An important question about such a strategy, one that has gotten surprisingly little attention, is this: How accurate are such year-to-year comparisons? The answer is that they are much less accurate than people assume—and in some cases, wildly inaccurate.
Psychometric methods are used to equate proficiency cut-scores from year to year, so that a consistent level of knowledge is required over time, irrespective of a particular year’s test. The margins of error in these equated proficiency cut-scores are difficult to compute, and equating calculations can easily be off by a point or two, sometimes by much more.
Suppose a school is expected to show an annual 10 percent relative improvement in the proficiency rate (the percentage of students scoring at or above the proficiency cut-score) on an 8th grade math test; say, the proficiency rate of 40 percent last year must rise to at least 44 percent this year. Unfortunately, the proficiency rate rises from 40 percent to only 42 percent, which represents just a 5 percent improvement, a failing performance. Suppose the proficiency cut-score was calculated to be 28 out of 50 on last year’s test and 30 out of 50 on this year’s test. If the proficiency cut-score this year had been set a point lower, at 29, the school would probably have exceeded the 44 percent target, because a 1-point drop in the cut-score on a 50-point test would typically increase the percentage of proficient students from 42 percent to 45 percent (or higher).
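The arithmetic in this example can be checked with a short sketch. The score list below is invented purely for illustration (a hypothetical 100-student school in which three students happen to score exactly 29); it is not data from any real test, and the function names are assumptions of this sketch.

```python
def proficiency_rate(scores, cut_score):
    """Percentage of students scoring at or above the cut-score."""
    proficient = sum(1 for s in scores if s >= cut_score)
    return 100 * proficient / len(scores)


def relative_improvement(old_rate, new_rate):
    """Year-to-year change, expressed as a percentage of last year's rate."""
    return 100 * (new_rate - old_rate) / old_rate


# Hypothetical scores on a 50-point test (invented for illustration only):
# 55 students well below the cut, 3 students at exactly 29, 42 students above.
scores = [20] * 55 + [29] * 3 + [35] * 42

last_year_rate = 40                        # last year's rate, from the example
target_rate = last_year_rate * 110 / 100   # 10 percent relative improvement

rate_cut_30 = proficiency_rate(scores, 30)  # cut-score as actually set
rate_cut_29 = proficiency_rate(scores, 29)  # cut-score one point lower

print("target:", target_rate)
print("cut 30:", rate_cut_30, relative_improvement(last_year_rate, rate_cut_30))
print("cut 29:", rate_cut_29, relative_improvement(last_year_rate, rate_cut_29))
```

With these invented scores, the same students on the same test fall short of the 44 percent target when the cut-score is 30 and clear it when the cut-score is 29. How large the swing is in practice depends on how many students sit right at the cut-score, which this sketch simply assumes.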
At a recent conference at the Mathematical Sciences Research Institute in Berkeley, Calif., I presented an analysis of a state test with a huge error in the proficiency cut-score. Flawed psychometric equating over the past four years on the New York Math A graduation test set the proficiency cut-score about 20 points too high, at 51 out of 85, instead of about 30 out of 85. If the No Child Left Behind law were tracking Math A proficiency rates (graduation tests are not yet mandatory), most New York high schools would probably have been labeled as "needing improvement" on the Math A test. Its high cut-score led to a huge failure rate on the June 2003 Math A test, which in turn led New York to rework all its state math tests. However, many other state tests likely have less drastic problems with their test-equating calculations that could lead some schools to be unfairly labeled as needing improvement under the No Child Left Behind Act.
Most standards-based tests rely on a technical psychometric methodology called item response theory, or IRT. Item response theory makes many critical assumptions, both practical (getting all the technical details of test development right) and theoretical (its one-dimensional model for assessing student knowledge). Few states have the resources to implement IRT-based tests with the attention to detail they require. Such tests should be reliable for assessing standard procedural skills, such as solving a quadratic equation. Unfortunately, the more thoughtful, and thus unpredictable, a test, the more likely it is that equating methods will misperform.
A big problem with New York’s Math A tests over time was that teachers’ instruction evolved as students’ skills improved. The Math A equating calculations were missing year-to-year improvements, because they used a dated set of "anchor" questions assessing skills that were no longer emphasized.
Item-response theory assumes that a single "ability value" can be assigned to each student, and that this value accurately predicts, within small bounds, how that student will perform on a future question. Coaching is known to undermine this assumption. On the New York state math test, students’ performance on a question appeared, not surprisingly, to be a function of whether they were drilled on that type of question, as much as of their general mathematical ability.
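To make the single "ability value" assumption concrete, here is a minimal sketch using the one-parameter logistic (Rasch) model, one common form of item response theory. This article does not say which IRT model New York used, so the choice of model, the numbers, and the `drill_boost` term are all assumptions made for illustration.

```python
import math


def p_correct(ability, difficulty):
    """Rasch model: chance of a correct answer depends only on
    the gap between one ability value and the question's difficulty."""
    return 1 / (1 + math.exp(-(ability - difficulty)))


# Hypothetical student and question (illustrative values only).
ability = 0.0
difficulty = 0.5

# What the one-dimensional model predicts for this student.
predicted = p_correct(ability, difficulty)

# Hypothetical coaching effect: drilling on this question type acts like
# extra ability on that type only, which the model has no way to express.
drill_boost = 1.0
coached = p_correct(ability + drill_boost, difficulty)

print(round(predicted, 3), round(coached, 3))
```

In this sketch the coached student outperforms the model's prediction on the drilled question type while the single ability value is unchanged, which is the kind of mismatch described above.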
The use of IRT-based tests for high-stakes, year-to-year comparisons has been controversial in the educational testing community. The New York Math A test crisis resulted in the first well-documented analysis of what can go wrong with year-to-year equating on such tests—and how badly it can go wrong. (For a nontechnical analysis, see the New York State Regents Math A Panel report; for a more technical analysis, see http://www.ams.sunysb.edu/~tucker/MathA.html.)
Tests have a role to play in efforts to improve our schools. But great care is needed in annual comparisons of test performance. Often, the margin of error in setting the proficiency rate is too large to permit any meaningful assessment of how much year-to-year improvement has occurred.
Alan Tucker is a distinguished teaching professor in the department of applied mathematics and statistics at the State University of New York at Stony Brook.