Putting Assessments to the Test
One in an occasional series looking at the culture of testing.
by Valerie Strauss
No Child Left Behind, President Bush's signature education law, requires
that millions of students across the country be tested annually and that
the tests produce "reliable and valid" data to measure how well they --
and their schools -- are doing.
Testing experts say that one part of that equation is fairly easy to do,
but the other . . . not so much.
Reliability essentially means that a test is, well, reliable; perfect
reliability would mean that a student performs the same way on a test
every time it is given. Things get in the way -- including the health or
frame of mind of the test-taker, the sampling of content on the test and
scoring errors -- but it is possible to quantify those mistakes and put
error bands around a score that say how much it might vary.
Many of the standardized tests being used can be considered reliable,
experts say. But reliability alone doesn't mean much, said Bob
Schaeffer, public education director of the National Center for Fair and
Open Testing, a nonprofit group that advocates against standardized testing.
"If you got on a scale, and every time you got on, it said it was 237
pounds, it would be reliable, even if you weighed 120," he said. "You
could rely on it to say 237 pounds. But it's not accurate or meaningful."
And that's where the problem with validity comes into play, some
experts say.
Broadly, experts say, a valid test is one that measures what its authors
say it will measure. Tests assess children in many different areas;
validity is all about the specific purpose of the test.
"A test itself is not valid or invalid," said Daniel Koretz, a professor
of education at Harvard University. "The conclusion you base on the
result is valid or invalid."
That means, for example, that under the standard of validity:
- A test designed to screen students for learning disabilities is not
used to measure student progress in reading acquisition.
- A test that says it predicts college performance actually does. The
old SAT said it did, but experts said the test had limited ability to
predict a student's performance in the first year and none beyond that.
The test has been changed and, experts say, does not intend to predict
- A test is not used to guide curriculum.
"There has been an explosion of mandates for more and more standardized
tests with very little evidence to support their use," said Walter Haney
of Boston College's Center for the Study of Testing, Evaluation and
Educational Policy.
The No Child Left Behind program has ushered in an unprecedented era of
high-stakes standardized testing, which has dramatically changed what
goes on in classrooms across the United States and caused fierce debate
over the approach.
The issue of what the tests actually measure has become more important
than ever because the results do, indeed, have high stakes, with jobs of
teachers and administrators sometimes riding on the single
administration of a test. Many experts say that, in this environment,
there should be much more effort to ensure that tests are valid.
"If indeed in the long run No Child Left Behind and the accountability
movement is going to really have traction in improving education for
kids in the United States, I think it's going to have to subject itself
to a serious level of scrutiny," said Robert Pianta, director of the
University of Virginia's Center for Advanced Study of Teaching and Learning.
What does validity actually mean in the context of student testing?
Testing experts generally refer to three major areas of validity:
- Content validity deals with, not surprisingly, content. A key
component, curricular validity, demands that a test actually cover
material in the curriculum (especially important in high school
- Criterion-related validity includes predictive validity. Gerald
Bracey, an educational researcher and author of "Reading Educational
Research: How to Avoid Getting Statistically Snookered," said that he
does not know of any state that has tried to validate its tests against
what happens in the future.
- Construct validity deals with the broad picture of whether a test
assesses exactly what it is intended to measure; a science test trying
to measure knowledge of geologic time might have questions that are so
difficult to understand that what really is being measured is vocabulary
and reading skills.
Another form of validity, identified in the 1990s, is "consequential
validity," which says that a test's validity is determined by how the
results are used. It has the testing world in a verbal brawl because
some experts think it is essentially nonsense.
"You can have a good test of, say, mathematics, and have school boards
make ridiculous policy decisions based on the scores," Bracey said. "To
me, that says nothing about the test."
Complicating matters, educators say, is the fact that the pipeline of
newly trained testing experts charged with improving standardized tests
is nowhere close to keeping pace with the skyrocketing demand.
Training started falling 25 years ago, and there has been no big
resurgence. And the capacity of the commercial sector to produce the
vastly increased number of tests has significantly lagged, experts say.
Roger Farr is director of the Center for Innovation and Assessment at
Indiana University, a special consultant on testing and assessment to
the education company Harcourt, and an author of several standardized
tests. He said he thinks the country is placing too much emphasis on
testing.
"Teach children to read and write well and the . . . tests will take
care of themselves," he said. "What we've got to do is know what to
teach kids. The goal of education is not coming up with answers. The
goal of education is how you find answers."