The untested theories behind No Child Left Behind
President Bush’s signature education bill is up for renewal. Two Boston College education professors suggest that all this testing may be doing more harm than good.
But be careful of what you wish for: Those short, computer-based tests are already driving teachers and kids crazy. Ask people who suffer from the Kaplan domination in New York City.
By George Madaus
and Michael Russell
The No Child Left Behind Act (NCLB), passed in 2002, was designed to reduce the gap in achievement between specific groups of students and to ensure that all students develop basic skills in reading and mathematics. NCLB is the culmination of testing’s extraordinary growth that began in the 1950s. Now, five years later, NCLB is up for reauthorization.
To measure progress toward reducing the achievement gap and improving school quality, states must annually administer tests in language arts and mathematics to all students in grades 3 to 8 and in high school. NCLB requires each state to establish its own testing programs and criteria for student performance. These requirements result in testing of more than 30 million public school children each year. Additionally, all students for whom English is a second language (ESL) must also be assessed in listening, speaking, reading, and writing. Under these state and federal requirements, students entering kindergarten today must take a minimum of sixteen state tests before graduating. The cost to develop, administer, score and report all of these tests ranges from $3 billion to $7 billion a year.
NCLB requires states to classify all students into one of the following categories: Advanced, Proficient, Basic, and, by default, Failure. The federal government uses these classifications to hold schools accountable for improving student performance each year for each of the following categories of students: ESL; those with disabilities; American Indian/Alaska Native; Asian; African American; Hispanic; and White. Schools that fail to meet these improvement goals for one or more sub-groups face closure or state takeover. It is consequences such as these that make these high-stakes tests.
It is important to recognize that the NCLB test-based approach to accountability is not a full-scale reform plan designed to transform our system of public education. Instead, test-based accountability is a fallible tactic that produces paradoxical outcomes. For example, it is relatively easy to increase test scores without improving what students know and can do. For this reason alone, it is misleading to equate school quality with student test performance.
From its inception, NCLB has been criticized for inadequate funding and for too much testing, among other concerns. But even if adequate funds were provided and the amount of testing were reduced, several important questions about test-based accountability would remain. As the President, Congress, and the nation debate the reauthorization of NCLB this year, the following questions need serious consideration.
Q. What are the unintended, predictable, negative consequences of the test-driven accountability provisions of NCLB?
The term iatrogenic refers to doctor-induced illness: a negative, unanticipated effect on a patient of a well-intended treatment. The paradox of high-stakes testing can be called peiragenics – test-induced illness – the unintended negative consequences of well-intentioned test policies. These include: narrowing the curriculum; ignoring non-tested subjects; test preparation and tutoring that may increase test scores without actually improving students’ knowledge and skills; cheating; giving extra attention to students close to the cut score at the expense of those seen to have little chance of moving to the next performance level; retaining students in grade; dropping out; and decreasing motivation to learn. Paradoxically and most importantly, these unintended consequences corrupt the truthfulness of the inferences and decisions about student achievement and school quality based on changes in test scores. Further, these negative consequences are chronic and predictable; they have occurred over centuries and across continents.
There are three predictable reasons why a high-stakes test produces negative consequences.
First, within a subject field (e.g., English), teachers give greater attention to topics most likely to appear on the test (e.g., grammar and persuasive writing) and decrease coverage of non-tested topics (e.g., poetry and creative writing). Students then adjust their focus accordingly. This combined effect narrows the content and skills taught and learned within a subject.
Second, a high-stakes test preempts time and coverage from subjects not tested. Art, physical education, science, foreign languages, and social studies are short-changed in favor of the tested subjects, math and language arts. This narrows the curriculum across subject fields.
Third is a “trickle down” effect on lower grades not directly subject to a high-stakes test. The content and skills covered on the high-stakes tests diminish the content and skills in the non-tested lower grades, thus altering the curriculum across grades.
Q. Why aren’t testing companies and state departments of education subject to independent monitoring?
The nation has long needed, but never had, an independent means to monitor high-stakes testing. What other institution would contemplate a nearly universal treatment for children without moving slowly and systematically, with an independent mechanism to monitor the consequences? There is independent monitoring in a variety of fields including medicine, the stock market, the work of tradesmen, transportation, food, even pet food. There is no comparable independent group that evaluates a testing program before adoption or monitors test use and impact after implementation.
There are three reasons for monitoring high-stakes testing programs. First, such a body is long overdue. Since the end of the 19th century, there have been repeated calls for such oversight. The benefits and risks to institutions and individuals that result from high-stakes testing policies are real and serious. Currently, examinees, educators, parents and the public have only the assurances of those who build the tests or control testing programs that the tests, procedures, uses, and classification of students are fair and valid.
Second, policy makers rely on test results to ensure students are learning and that taxpayers receive value for expenditures on education. As George W. Bush proclaimed, “We’ve got to hold people accountable…. It is so important to have an accountability system become the cornerstone of [educational] reform in America.” However, the testing programs and tests used to hold students, teachers, and schools accountable are themselves not subject to independent, transparent accountability to ensure test quality, validity, and proper use. How do we really know any of these tests measure what they’re intended to measure?
Third, testing is a useful but fallible technology. All technology is subject to errors, misuse and unintended consequences. This does not mean you abandon a useful technology but instead work to make it better by minimizing its shortcomings. Because of testing’s importance and imperfections, oversight and monitoring are needed to assure the public that test scores are as accurate as possible, and that the benefits of the test far outweigh any harm. We do not have such assurances now, and can only acquire them through independent monitoring.
Q. How timely and useful is the information teachers receive from the tests mandated by NCLB?
Not very. Most high-stakes testing occurs in the spring and results are not available until after school begins the following fall. By then students have moved on to the next grade, often to a different school. When testing occurs in the fall, the test focuses on content covered the previous year. When the fall test results are received later that year, they have little relevance to the content and skills being taught in the current grade. Proponents of NCLB testing claim that the tests provide accurate diagnostic information that can help teachers improve and individualize their instruction. The tests can tell teachers that a student is not doing well in mathematics – something they already know – but not why. Put simply, high-stakes tests are not built to provide diagnostic information in a timely manner.
Q. What other assessment technologies can be used to help teachers tailor their instruction?
Today’s high-stakes tests generally contain forty to sixty items sampling knowledge and skills students are expected to develop over the year. The majority of items are multiple-choice, with perhaps a few short-answer questions and one essay item. This test format is nearly identical to that of eighty years ago. Since then several advances in testing’s technology have occurred, but current testing policies inhibit their use.
Consider three examples. First, rather than have all examinees take the same fixed set of test items, computer-adaptive tests tailor the test so that the examinee does not take items that are too easy or too difficult. By tailoring, fewer items are needed to obtain a reliable test score. For a high-stakes test that covers an entire year, the increased efficiency enables test developers to more broadly sample the year’s work. Initially, federal policy required all students within a state to take the same fixed test. After considerable controversy, this restriction was only recently reversed.
Second, the costs of NCLB high-stakes tests led to the disappearance of two advantageous methods of assessing student achievement: performance assessments and student portfolios. Performance assessments consist of complex problems that require the use of materials and equipment to solve. Portfolios contain samples of student work collected over the year. Both assessment techniques increase the cost of testing. Given the need and costs to develop multiple tests for grades 3-8 and high school to satisfy NCLB, states stopped investing in these alternate forms of assessment.
Finally, an alternative to the NCLB single test administered once a year is to take smaller, more specific samples at multiple points in time. Short computer-based tests designed to measure a sub-set of content and skills collectively result in a broader sampling of attainment over the course of the year. By measuring a smaller set of skills and knowledge, each test provides teachers with more precise information about each student’s learning in a timely manner. In addition, being computer-based, the tests could be adaptive, allowing the test to probe misconceptions and misunderstandings held by low-performing students. This broader body of information, timely return of results, and ability to identify reasons for low performance would greatly improve the instructional value of the tests while increasing the information used to hold schools accountable for student learning.
Q. How realistic is the goal of having all students proficient by the year 2014?
The 2014 goal was viewed by many in the testing community as unrealistic and unattainable. These warnings were ignored. And even if the goal were met, what would it mean? Each state uses different tests keyed to its own curriculum; each state sets its own proficiency scores for each performance level; there is no way to equate “proficient” performance across 50 states. How valid, then, is the label “proficient”? It is accurate if, and only if, “proficient” is whatever each state decides it means – a classic circular argument.
Q. Is NCLB leaving some students behind?
One indicator of whether no child is left behind is school completion rates. One might think that if scores were improving, more students should be graduating high school. But the opposite pattern has occurred in several states. Texas saw improved test scores, but a noticeable decrease in its graduation rates, particularly for minority students. When Massachusetts introduced a test-based graduation requirement, a similar pattern occurred. This pattern occurs in several other states and large urban areas as well. These consequences are especially severe for many minority students.
The use of test results in many schools also leads to students being held back, either because they did poorly on the state test or so that they do not take the state test required in the next year. Holding students back due to low performance might appear to be educationally sound, but research shows that such practices lead to negative consequences. There is a persuasive relationship between grade retention, being overage for grade, disengagement, and dropout rates. Being overage for grade predicts dropping out better than do below-average test scores.
George Madaus and Michael Russell