

The Pending Reauthorization of NCLB: An Opportunity to Rethink the Basic Strategy

Susan Notes: Daniel Koretz asks schools to take on the real challenges of test-based educational accountability and its effects.


by Daniel Koretz

Harvard Graduate School of Education

The pending reauthorization of NCLB is generating intense debate about possible modifications of many of its provisions, such as the requirements for disaggregated reporting, AYP, the draconian requirements for the assessment of students with disabilities, and the provisions for testing students with limited proficiency in English.

But as important as it is, the debate about the specifics of NCLB obscures three more important problems that the civil rights community cannot afford to ignore:

* First, we know far too little about how to hold schools accountable for improving student performance. NCLB and its state-level forebears, dating back to the first minimum-competency testing programs more than three decades ago, have been based on a shifting combination of common sense and professional judgment, not hard evidence.

* Second, some important aspects of NCLB (and its antecedent state programs) are inconsistent with the evidence we do already have.

* Third, much of the apparent progress generated by NCLB and similar programs is spurious, a comforting illusion that we maintain for ourselves, at a great cost to students, by failing to perform appropriate evaluations.

In this chapter, I will briefly sketch a few of the most important things we know, and don't know, about test-based educational accountability and its effects. I will end with a plea that we use the coming reauthorization as an opportunity to belatedly ramp up the hard work of research, development, and evaluation needed to create effective accountability systems: not as a substitute for alterations to the requirements for AYP and disaggregated reporting and the like, but as an essential complement to them.

As a teacher, parent, and educational researcher for more than a quarter of a century, I remain convinced that the educational system needs more effective accountability systems and that achievement testing has to be one element of them. But research has shown that we are making a hash of it. It is our obligation to children, particularly to those faring poorly in the current system, to do better.

What the evidence does and does not tell us

Clues to more productive approaches to educational accountability (in particular, approaches that are most likely to benefit the students whose well-being is the focus of the civil rights community) lie both in what research has found and in the questions it has not yet answered.

Does High-Stakes Testing Work?

A modest number of studies argue that high-stakes testing does or doesn't improve student performance in tested subjects. This research tells us little. Much of it is of very low quality, and even the careful studies are hobbled by data that are inadequate for the task. Moreover, this research asks too simple a question. Asking whether test-based accountability "works" is a bit like asking whether medicine works. What medicines? For what medical conditions? Similarly, test-based accountability takes many forms that are likely to have different effects. Its impact is likely to vary among types of schools and students.

Test-based accountability also has diverse effects that go beyond the test scores that serve as outcomes in these studies. A program that succeeds in raising mathematics scores may reduce achievement in science, for example, if teachers rob Peter to pay Paul, taking time away from other important subjects. And education has important goals that are not easily measured with standardized tests and that therefore remain unevaluated (Rothstein & Jacobsen, 2006).

Thus, the debate about whether high-stakes testing "works" is a red herring, distracting us from the question we ought to be asking: What types of accountability systems will most improve opportunities for the students about whose welfare the civil rights community is particularly concerned, while minimizing the inevitable negative side-effects? We need research and evaluation to address this question, because we still lack a well-grounded answer. We need to look at a wide range of outcomes beyond test scores. We need to create opportunities for designing these programs and for rigorously evaluating their positive and negative effects.

Can Score Increases Be Trusted?

Although research does not tell us whether high-stakes testing works, evidence does show that it works far less well than it seems. Just as economic work on incentives predicts, people try, often successfully, to game the system. As a consequence, scores on high-stakes tests can become dramatically inflated, creating an illusion of progress that is comforting to policymakers and educators but of no help whatever to children.

The issue of score inflation remains oddly controversial. Many in the policy world ignore it altogether or treat it as something that we really need not worry about. One superintendent of a large urban district recently dismissed the entire issue with a single sentence: "That's just a matter of opinion." He was wrong. Score inflation is a matter of evidence, not merely opinion, and the problem is severe.

The inflation of test scores should not be surprising, since similar corruption of measures occurs in many other fields. Over the years, the press has documented corruption of measures of postal delivery times, airline on-time statistics, computer chip speeds, diesel engine emissions, TV program viewership, and cardiac surgery outcomes, as well as scores on achievement tests (e.g., Cushman, 1998; Farhi, 1996; Hickman et al., 1997; Lewis, 1998; Markoff, 2002; McAllister, 1998; and Zuckerman, 2000). If many cardiac surgeons avoid performing procedures on high-risk patients who might benefit, for fear of worsening their numbers, as the majority of respondents to a recent survey admitted (Narins et al., 2005), it is hardly remarkable that some teachers and students will take shortcuts that inflate test scores.

The few relevant studies are of two types: detailed evaluations of scores in specific jurisdictions, and a few broad comparisons of trends on state tests and NAEP. The former are far fewer than we should have. The reason is not hard to fathom. Imagine yourself as superintendent of a district or state with rapidly increasing test scores. A researcher asks you for permission to evaluate the validity of these gains, to explore whether they are inflated and, if so, whether there are any useful patterns in the amount of inflation. Not a politically appealing prospect.

The logic of both types of study is the same. The goal of education is to teach students skills and knowledge. A test score, which reflects performance on a very small sample of this material, is valuable only to the extent that it accurately represents students' overall mastery. A test is in this respect much like a political poll. For example, two months before the 2004 election, a Zogby International poll of 1,018 likely voters showed George W. Bush with a 4 percentage point lead over John Kerry. Not too bad a prediction: Bush's margin two months later was about 2.5 percent. But should we have cared how the specific 1,018 respondents themselves actually voted? In general, no; the specific voters sampled are just a drop in the bucket of millions of voters, and we worry about their opinions only because of what they suggest about the inclinations of the electorate as a whole. Analogously, we should not be too concerned about performance on the few specific items on a given test. Instead, we need to worry about the much larger domain of knowledge and skill that these few items are designed to represent.
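
A rough calculation, my own illustration rather than Koretz's, shows why a sample of 1,018 can stand in for millions of voters. The 95 percent margin of error for a polled proportion near one-half is

\[
\mathrm{MOE}_{95} \approx 1.96\sqrt{\frac{p(1-p)}{n}} = 1.96\sqrt{\frac{0.5 \times 0.5}{1018}} \approx 3.1\ \text{percentage points},
\]

so a 4-point polled lead that turns into a 2.5-point actual margin is well within sampling error. The same logic governs tests: a handful of sampled items can tell us about the whole domain, but only if nothing distorts how the sample relates to the domain.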

For that reason, gains in scores on a high-stakes test, if they represent real gains in achievement, should generalize. Higher scores should predict better performance in the real world outside of the students' current schools, whether that be later studies or the world of work. By the same token, score increases should generalize to better performance on other tests designed to measure similar bundles of knowledge and skills. Gains will not be exactly the same from one test to the next, but when tests are designed to support similar inferences about performance, gains ought to generalize reasonably well.

The results of the relatively few relevant studies are both striking and remarkably consistent: gains on high-stakes tests typically do not generalize well to other measures, and the gap is usually huge. When students do show improvements on other, lower-stakes measures used to audit gains (most often the National Assessment of Educational Progress), the gains on the audit test have generally been one-third to one-fifth the size of the gains shown on the high-stakes test. And in several cases, large gains on high-stakes tests have been accompanied by no improvement whatever on an audit test. For example, during the first two years of the high-stakes testing program Kentucky instituted in the early 1990s (in several respects, a precursor of NCLB), fourth-graders showed a staggering increase of about three-quarters of a standard deviation on the state's high-stakes reading test. NAEP, however, showed no increase at all (Hambleton et al., 1995). Other studies have found similar results in Chicago, Houston, Texas as a whole, and an anonymous district I studied earlier (Jacob, 2002; Klein, Hamilton, McCaffrey, & Stecher, 2000; Koretz, Linn, Dunbar, & Shepard, 1991; Schemo & Fessenden, 2003).

These few studies are complemented by the second group, which provides a broad overview of the comparability of trends on state tests and NAEP. Its results are consistent, showing that in many but not all states, gains on state tests are substantially, sometimes dramatically, larger than the same states' gains on NAEP (Fuller, Gesicki, Kang, & Wright, 2006; Lee, 2006; Linn & Dunbar, 1990).

The implication of this research is inescapable: much of the apparent progress shown by increasing scores on high-stakes tests is simply bogus, an illusion that allows us to proclaim success while students continue to be deprived of opportunity.

Research indicates that score inflation varies markedly from school to school, but it does not provide any general guidance about which types of schools are most affected. Given the current state of our knowledge, we cannot accurately predict which schools have sizable inflation and which do not, and we usually have no means of determining this from available scores. This has two unfortunate consequences.

First, it vitiates conclusions about the relative effectiveness of schools. If inflation were uniform, overall gains would be exaggerated, but one could still identify the schools with relatively large or relatively small improvements in learning. But given our inability to pin down school-level variations in score inflation, conclusions about relative effectiveness are entirely untrustworthy if they are based only on scores on high-stakes tests, and we can expect to reward or sanction the wrong schools a good bit of the time.

Second, we cannot ascertain the relative impact of test-based accountability programs on the groups of students who are the focus of the civil rights community's concern. I and several others have hypothesized that score inflation will often be worse in low-achieving schools. Our logic is simple. Systems such as NCLB require teachers in high-achieving schools to make relatively modest gains. (This depends on states' performance standards, of course, but it is also built into the AYP system and the "straight-line" systems many states used before NCLB.) Moreover, many high-achieving schools are in communities that offer relatively substantial out-of-school supports for student achievement, such as well-educated parents who press for high grades and can re-teach material at home and buy after-school tutoring. Teachers in low-achieving schools must generate far larger gains, and in many cases must do it with weaker community support. Faced with the need to do more with less, teachers in low-achieving schools will face stronger incentives to cut corners in ways that inflate scores. But this remains only a hypothesis, not yet tested by strong evidence.

Some researchers have argued that unrealistically high performance standards are analogous to auto emissions controls and that if you require more improvement than manufacturers can provide, you end up with some fraction of what you demand and thus are better off than you were before. Whether or not this is true of emissions controls, it is not likely to be true of test-based accountability. Under NCLB one gets no credit for getting part way to AYP, and the tools for inflating scores are ready at hand. Therefore, one might get less real improvement by requiring too much gain, because teachers will have incentives to abandon legitimate instructional improvements that generate slower gains in favor of short-cuts (inappropriate test preparation, or simple cheating) that generate faster gains. After more than three decades of high-stakes testing in the U.S., we ought to have some hard evidence on this point, but we do not.

I encountered educators' responses to excessive expectations when I recently gave a talk on test preparation to a large group of principals, many from inner-city schools. I explained the principle that tests represent very small samples from larger domains of knowledge and skills. Therefore, the good way to prepare students for high-stakes tests is to focus on the knowledge and skills the tests are supposed to represent, so that students will have better capabilities when they leave school. The bad way to prepare them is to focus narrowly on the specifics of their own test (that is, to focus on raising scores on that specific test as an end in itself), which can lead to spurious gains limited to that one measure. By analogy, they should try to persuade the entire electorate in order to win the election, rather than trying to persuade Zogby's 1,018 respondents to change their votes.

I then gave the principals a dozen real examples of test preparation activities, ranging from egregiously bad to reasonable by this criterion. I asked them to decide whether each one would teach the underlying knowledge and skills and therefore produce real gains that would generalize to more than one test.

A minority of the principals identified the particularly bad examples, and a few added examples of their own. One said that they are told what parts of the state's standards will be emphasized on the test so that teachers need not spend much time on the others, a sure recipe for score inflation. (There is now a term for this that makes it seem innocuous: "power standards.")

But many of the principals steadfastly defended every single example of test preparation, even those that were unarguably bad. The most extreme was a case in which a district provided the actual test item in advance, changing only three trivial details, which is no more than simple cheating. That too was fine with many of them. Many of the participants became hostile.

In retrospect, these responses are not surprising, given the incentives and sanctions these principals face under NCLB. For several years, they have been struggling to make AYP, which requires many of them to make far more rapid gains than any of us can tell them how to achieve by legitimate means. And the consequences of failure are dire. Then I explained to them that many of the methods they have been using in their desperate fight to keep their noses above water are simply inflating test scores. Upton Sinclair's principle applies: "It's difficult to get a man to understand something when his salary depends on his not understanding it." Until we impose a system that creates the right incentives, it is not reasonable to expect educators to ignore the perverse incentives we have already put into place.

How Do Educators Respond to High-Stakes Testing?

A substantial number of studies over the past few decades have investigated teachers' responses to high-stakes testing. These studies show a mix of desirable and undesirable responses, and they help explain the inflation of scores found in the studies noted above (Stecher, 2002).

On the positive side, research suggests that high-stakes testing has in some cases motivated teachers to work harder and more effectively. It leads many teachers to align their instruction more closely with the tested content, which, as we will see, can be both good and bad. Some teachers report that the results of high-stakes tests are useful for diagnosis. (However, it is the test, not the high stakes attached to it, that is useful in this respect; tests designed for diagnostic purposes were widely used in American schools for decades before high-stakes testing became common.) Some studies have found specific instructional effects consistent with the goals of the accountability systems of which they are a part, such as an increase in writing instruction when tests require substantial writing.

At the same time, research has shown a variety of negative effects of high-stakes testing on educational practice. Many of these can inflate test scores, and some are undesirable for other reasons as well. It is helpful to distinguish among different types of test preparation in terms of their potential to generate either meaningful gains in achievement, score inflation, or both (Koretz & Hamilton, 2006; Koretz, McCaffrey, & Hamilton, 2001). I use "test preparation" to refer to all techniques used to prepare students for tests, whether good or bad, and deliberately avoid terms like "teaching the test" and "teaching to the test," which come freighted with inconsistent and often poorly reasoned connotations. The types are:

* Teaching more;
* Working harder;
* Working more effectively;
* Reallocation;
* Alignment;
* Coaching; and
* Cheating.

The first three are what most proponents of high-stakes testing programs, including NCLB, want and expect. "Teaching more" and "working harder" can both be carried to excess, to a point at which the marginal effects on learning are negative or at which they have other negative effects (such as an aversion to schooling or to learning) that offset short-term gains in achievement. But within reason, all three of these forms of test preparation can be expected to lead to meaningful gains in scores that signal higher achievement.

Cheating is the other extreme: it can only produce bogus gains in scores. There are limited systematic data about cheating, but there are enough news accounts to make it clear that it is hardly rare (see, for example, www.caveon.com/resources_news.htm). It takes all manner of forms: providing inappropriate hints during test administration, changing answer sheets after tests are completed, circulating actual test items (or items that are nearly identical) before a test, and so on. It is not clear that all instances of cheating are intentional, but it inflates scores regardless. My speculation is that cheating is more common in low-scoring schools, again because of the far greater pressure to raise scores, but there are no systematic data to test this hypothesis.

The controversial types of test preparation are the remaining three: reallocation, alignment, and coaching. All three can produce real gains, score inflation, or both. The general principle is clear: these forms of test preparation are desirable when they improve students' mastery of the broad domains of achievement (say, eighth-grade mathematics) that the tests are designed to represent. They are undesirable and inflate test scores when they focus unduly on the particulars of the specific test chosen and therefore produce greater gains on that particular test than true improvements in learning warrant. In practice, however, the dividing line between the good and bad forms of reallocation, alignment, and coaching is sufficiently indistinct that keeping educators on the right side will be very hard until we do a better job of creating incentives for them.

Reallocation refers simply to shifting resources (instructional time, students' study time, and so on) to better fit the particulars of a testing program. Research has found that educators report reallocating their instruction in response to high-stakes tests. Reallocation occurs across subject areas, as shown by the reports of districts and schools reducing or eliminating time allocated to untested subject areas to make more time for the subjects that count in the accountability system (e.g., Rothstein & Jacobsen, 2006; Sunderman et al., 2004). Reallocation can also be carried out within subject areas, by emphasizing the particular portions that are emphasized by the test. Reallocation within subject areas is a key piece of the score-inflation puzzle.

Some amount of reallocation within subjects is desirable and is one of the intended effects of test-based accountability. If a testing program shows that students in a given school are not learning Topic A, and Topic A is important, one would want the school's teachers to put more effort into teaching Topic A.

The problem is that instruction is very nearly a zero-sum game: more resources for Topic A necessarily mean fewer for Topic B. If Topic B is also important for the inference about performance, then taking resources away from it can inflate test scores.

Remember that a test is a small sample of a large domain of achievement, just as a poll is a small sample of voters. The key to the success of both is that the small sample has to represent the larger domain. If teachers take resources away from relatively unimportant material to make way for emphasizing Topic A, then all is fine. But if the material that gets less emphasis is an important part of the domain (if it is an important part of what users of the scores think they are measuring), then performance on the tested sample will show improvements even when mastery of these other important parts of the domain is stagnant or declining. This is precisely what studies of score inflation have found.

The more predictable a test is, the easier it becomes for teachers to reallocate in a way that inflates scores. For any number of reasons (the pressure of time, costs, a desire to keep test forms similar to facilitate linking of scores from year to year, the creativity needed to avoid similarities), most testing programs show a considerable resemblance from year to year. In many programs, much of the specific content is replaced each year, but the types of content and the style and format of test items show noticeable similarities from year to year. Some educators try hard to discern these recurrences, but they need not do it on their own; there is a vibrant industry of test-prep firms that will do it for them, and many districts and states provide this as well.
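
To make the mechanism concrete, here is a minimal simulation of my own; all of its numbers (a domain of 50 skills, a budget of 10 skills taught to mastery, mastery probabilities of 0.9 and 0.3) are invented assumptions, not figures from the studies cited above.

```python
# A hypothetical simulation of reallocation toward a predictable test.
# Two teachers invest the same instructional budget; one targets the
# predictable portion of the high-stakes test.
import random

random.seed(0)

DOMAIN = list(range(50))    # the achievement domain: 50 skills
TESTED = set(DOMAIN[:10])   # the high-stakes test predictably reuses these 10

def mastery(focus):
    """Mastery probability per skill: 0.9 where teaching focuses, 0.3 elsewhere."""
    return {s: (0.9 if s in focus else 0.3) for s in DOMAIN}

def expected_score(m, items):
    """Expected percent correct on a test built from the given skills."""
    return 100 * sum(m[s] for s in items) / len(items)

# Both teachers can teach exactly 10 skills to mastery.
broad = mastery(set(random.sample(DOMAIN, 10)))  # test-blind choice of skills
narrow = mastery(TESTED)                         # reallocated to the predictable 10

audit = random.sample(DOMAIN, 10)  # an audit test draws a fresh sample of the domain

for label, m in [("broad ", broad), ("narrow", narrow)]:
    print(label,
          f"high-stakes: {expected_score(m, TESTED):5.1f}",
          f"audit: {expected_score(m, audit):5.1f}",
          f"whole domain: {expected_score(m, DOMAIN):5.1f}")

# The narrow teacher scores ~90 on the high-stakes test versus ~42 for the
# broad teacher, yet whole-domain mastery is identical (~42 for both):
# the difference is score inflation, not learning.
```

The particular numbers do not matter; the point is that a predictable test lets the same instructional budget buy a much higher score without buying any more learning, and that only an audit drawn freshly from the domain reveals the gap.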

Alignment is a cornerstone of current education policy and is noted repeatedly in NCLB. Instruction is to be aligned with content and performance standards, and assessments must be aligned with both. Up to a point, alignment is clearly a good thing: we want teachers to focus on important material, and no one would want to judge teachers or schools by testing students on content that schools are not expected to teach.

Alignment is often cast as an unmitigated good, and not infrequently one will hear alignment presented as a means of preventing score inflation. Not long ago, for example, a principal well known for achieving high scores in a poor, mostly minority school angrily told a crowd of college students that the warnings of critics about teaching to the test in her state are completely unwarranted. We don't have to worry about teaching to the test, she maintained, because her state's test covers important knowledge and skills that the students need to have.

This is nonsense. I do not intend to disparage her state's test; her argument would have been specious regardless of which state's test her students took. She was mistaking the test for the domain it represents, confusing Zogby's 1,018 respondents with the electorate. Alignment is nothing more than reallocation by another name, albeit with the constraint that the material emphasized must be consistent with standards. But whether alignment or other reallocation inflates scores depends on more than the quality of the material given additional emphasis. It also depends, critically, on the material given less emphasis. Because tests are such small samples from large domains, it is entirely practical to give more emphasis to some important material while taking it away from other equally important material. There is ample room to take it away from other material aligned with standards (hence test preparation focusing on "power standards"). Research confirms this. Studies of Kentucky's KIRIS assessment program of the 1990s, which was an archetypal standards-based system, found severe score inflation in every comparison examined (Koretz & Barron, 1998).

The final form of test preparation is coaching, a term that I use to refer to focusing instruction on fine details of the test, such as the format of test items, the particular scoring rubrics, or minor details of content. Encouraging students to use format-dependent test-taking strategies, such as plugging in and process of elimination, is a form of coaching, and it generates gains that evaporate when students are presented with tasks that have no choices to plug in or eliminate.

Inflation of scores does not require that teachers or students focus on unimportant material. It can arise that way (for example, if teachers focus on test-taking tricks rather than important content), but this is not necessary. Inflation can occur from excessive narrowing of instruction, even if the material taught is valuable. One secondary-school mathematics teacher told me that her state's test presented only regular polygons, and therefore, she asked, why would she bother teaching about irregular polygons? What she meant was: "Since my goal is to raise scores, why would I...?" If her question had been, "Since my goal is to teach plane geometry, why would I...?", the answer would have been different and equally obvious.

The lesson is that the incentives we currently give teachers are too crude and simply don't work as advertised. The goal has become raising scores as an end in itself (persuading Zogby's 1,018 respondents) rather than improving learning. The incentives teachers face do not favor the good forms of reallocation, alignment, and coaching over the bad. Many educators take the path of least resistance and, by doing so, they inflate scores. The system cheats kids of the education they deserve.

A common but mistaken response is that inappropriate reallocation and coaching arise because we use "bad" tests. If we just built better tests, the argument goes, these problems would be solved. This was an argument made for moving from multiple-choice to performance assessments nearly 20 years ago, and for moving from those to today's standards-referenced tests. Neither change solved the problems of inappropriate test preparation and score inflation, and we are not going to solve them now with better tests. With enough creativity, time, resources, and evaluation, tests could be improved to lessen these problems, for example, by deliberately avoiding unneeded recurrences over time and by building in novel content and novel forms of presentation for purposes of auditing score gains. But numerous factors limit how much we can ameliorate the problem: the need to keep tests sufficiently similar from year to year to allow meaningful linking of scores, resource limitations, the limited and already strained capacity of the testing industry, and the requirement, when students are given scores, that students within a cohort be administered the same or comparable sets of items. Moreover, there are many important outcomes of education that are difficult or impossible to measure with standardized testing.

Finally, there is the problem of incentives for chief state school officers. Under the provisions of NCLB, what would motivate one to spend considerably more money to buy a somewhat inflation-resistant test that would generate smaller observed gains in scores? Better tests (by which I mean tests designed with an eye to the problems caused by test-based accountability) might indeed be an important step, but they will not suffice, and they are no substitute for putting in place a more reasonable set of incentives for teachers.

How Much Gain Is Feasible?

One of the most remarkable and dysfunctional aspects of the test-based accountability systems in place now under NCLB is that performance targets are usually made up from whole cloth, with no basis in experience, historical evidence, or evaluations of previous programs. And for political rather than empirical reasons, the targets are uniform for all schools in a state with similar initial levels of performance, regardless of the impediments they face in improving scores.

Proponents of standards-based reporting of test scores will bristle at the word "arbitrary," but that is a reasonable label for performance targets set without empirical evidence of attainable improvements. We do have relevant evidence, but policymakers have generally ignored it. We might start with the data we have on long-term trends in achievement. For example, the achievement decline of the 1960s and 1970s created great consternation and was a major impetus for the waves of education reform that continue with NCLB. Should we assume that schools can quickly implement reforms that produce gains as rapid as the declines of that era? Or, as Linn has suggested, we might identify the most rapidly improving schools, perhaps the top 10 percent, and use their gains to set goals (Linn, 2005). We could also use international comparisons to help us decide what is reasonable. For example, given international differences in factors outside of the control of schools, even ideal policies would presumably only bring us to the level of the highest-performing countries over a long period, if at all.

Finally, we could use research, development, and evaluation to help set targets, as we do in policy domains as varied as public health and auto safety. That is, we could design new reforms, implement them on a limited but planned basis, and subject them to rigorous tests to determine their effects before putting them into operation nationwide or even statewide.

Because evidence is not brought to bear, standards are inconsistent and often unrealistic. Linn (2000; also, this volume) has shown that if states set eighth-grade mathematics standards comparable in difficulty to the NAEP Proficient standard, as many critics of state standards suggest they should, we would be setting targets that roughly a third of the students in Japan and Korea, two of the highest-scoring countries in the world, would fail to meet. Is it realistic to expect that virtually all of our students (including most students with disabilities and most students learning English as a second language) will exceed such a threshold in a mere 12 years? That is not only unrealistic; it is counterproductive, increasing the incentives to cut corners and inflate scores, and cruel to some lower-performing students. Without basis in historical data and research, we simply assume that educators and students can reach these targets by legitimate means and that we are doing more good than harm. Worse, we rarely have in place credible mechanisms for measuring the good and the harm.

How Much Can the Variability of Achievement Be Shrunk?

One of the most positive aspects of NCLB is its focus on equity. Many of the key aspects of NCLB (disaggregated reporting; the conjunctive AYP system, which classifies a school as failing if any one of the mandated reporting groups fails to make AYP; the uniformity of the ultimate performance targets; and the AYP requirement of greater gains by lower-scoring schools) are motivated by a laudable desire to decrease inequities in educational outcomes. These provisions follow in the footsteps of widespread state initiatives that had the same goals, such as the "straight-line" accountability systems that required all schools to progress continually from their initial performance to the uniform statewide goal. They reflect one of the principal policy mantras of the past 15 years: "all students can learn to high levels." This raises an obvious question that to my knowledge has received almost no attention in the debate about NCLB or other accountability systems: Just how much can we shrink the variability of achievement?

To answer this question, it is essential to separate two distinct issues that are often confounded. One issue is variation among groups, for example, differences in performance between poor and rich children, or between minority and majority children. The second is the total variability of performance among individual children, which arises both from the variation among children within groups and from the differences among the groups.

It might seem reasonable to expect that we can shrink the variability of performance a great deal. We have enormous and well-documented inequities in school quality and in the opportunities afforded to students. Despite intermittent progress for several decades, we still have very large gaps in performance between racial/ethnic groups and between the poor and the well-off. One might expect that if we garnered the political will to combat these inequities (which would take a great deal more than an educational accountability system such as NCLB), the total variation in student achievement would shrink dramatically.

However, this is not the case: the mean differences in scores between groups in the U.S., while very large, contribute little to the total variability of performance among individual students. Most of the variation in the entire population arises from the huge variation in scores within groups, not from the differences between groups. If one entirely eradicated the mean differences between racial/ethnic groups in the U.S., so that scores in every group were distributed just as they now are among non-Hispanic whites, the total variation in student performance would shrink modestly. I calculated this with two nationally representative tests (NAEP and NELS) for reading and mathematics in grade 8. The reductions in the standard deviations, a conventional measure of the spread of scores, ranged from about half of one percent to nine percent.
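
A back-of-the-envelope decomposition, my illustration with invented numbers rather than Koretz's calculation, shows why the effect is so small. The total variance is the within-group variance plus the between-group variance:

\[
\sigma^2_{\text{total}} = \sigma^2_{\text{within}} + \sigma^2_{\text{between}}.
\]

For two groups making up 15 and 85 percent of students, separated by a mean gap of 0.8 within-group standard deviations, the between-group component is only \(0.15 \times 0.85 \times 0.8^2 \approx 0.08\) in within-group variance units. Eliminating the gap entirely would therefore shrink the total standard deviation from \(\sqrt{1.08} \approx 1.04\) to 1.00, a reduction of roughly 4 percent, squarely within the range reported above.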

International data similarly give us reason to dampen our expectations. Most other countries, including more homogeneous ones such as Japan and Korea, are roughly similar to the U.S. in the variability of students' scores (Beaton et al., 1996).

The implication is clear: Even if we finally create a more equitable educational system and more equitable community supports for learning, we are going to be stuck with enormous variations in student performance, perhaps appreciably smaller than the variation we have now, but still very large indeed.

Therefore, what we need is a system that will put pressure on underperforming schools and schools serving historically low-achieving students, to increase the equity of educational outcomes between groups, while still sensibly and realistically acknowledging the large variability that will persist within groups. We have at this time no good models for this, in part because the policy community has not seen the need for them. This gap in our knowledge should be especially worrisome to the civil rights community, which cannot afford to have its core demand for greater equity of opportunity (reflected in differences among groups) held hostage to unrealistic expectations about the reduction of within-group variability of performance. If between-group equity is not clearly distinguished from variability within groups, a failure to meet unrealistic expectations about the latter might lead cynics to become pessimistic about addressing the former.

What Are the Advantages and Disadvantages of Focusing on "Percent Proficient"?

As one state official said to me recently when discussing how to report performance on his state's test, "Proficiency is the coin of the realm." And NCLB, of course, carries this to an extreme. The accountability apparatus of NCLB hinges largely on a single statistic: the percentage of students above the Proficient standard.

While reporting performance in this way does have one substantial merit (it helps to focus attention on expectations), it has numerous severe disadvantages. Two are particularly relevant to this discussion (Linn, 2003).

The first of these disadvantages is that focusing only on percent Proficient leaves all other changes in the distribution of performance unmeasured. Perhaps worse, it makes these other changes irrelevant to the accountability system. For example, take a state that has imposed a high standard for Proficiency. If one measures only percent above Proficient, all progress with students below that cut score, no matter how large, goes unnoted and unrewarded. Conversely, a school that makes very small gains among students just below the Proficient standard, just enough to get them over it, will be mistakenly credited with having effected major improvements. Many educators frankly admit that they use this fact as a way of gaming the system, focusing disproportionate attention on students near the standard and giving short shrift to students well below or well above it. There is even a common slang term for the students who are the focus of this approach: "bubble kids." Given the low performance characteristic of many of the schools that disproportionately serve minority and poor youth, this problem ought to be of great concern to the civil rights community.

The second disadvantage is not obvious: reporting in terms of standards distorts comparisons among groups. To be more precise, it distorts comparisons of trends among groups that differ in their initial level of achievement. This is a consequence of the distribution of scores: a great many students are bunched near the average, and progressively fewer as one moves toward the high and low extremes of performance. For example, if African American and white students in a given state were making identical progress, measures such as "percent above Proficient" would create the misleading appearance of differences in their rates of gain (Koretz, 2003; Koretz & Hamilton, 2006). This too should be of concern to the civil rights community.
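
A small worked example, my own with assumed numbers, makes the distortion concrete. Both groups below gain exactly 0.2 standard deviations, yet the percent-above-Proficient metric reports very different "progress":

```python
# A hypothetical illustration of how "percent above Proficient" distorts
# comparisons. The cut score, the group means, and the 0.2-SD gain are all
# invented; only the statistical point is from the text.
from statistics import NormalDist

std_normal = NormalDist()  # scores expressed in standard-deviation units
CUT = 0.0                  # the Proficient cut score
GAIN = 0.2                 # identical true improvement for both groups

def pct_above(mean):
    """Percent of a normally distributed group scoring above the cut."""
    return 100 * (1 - std_normal.cdf(CUT - mean))

for label, mean in [("group starting near the cut", -0.1),
                    ("group starting far below   ", -1.5)]:
    before, after = pct_above(mean), pct_above(mean + GAIN)
    print(f"{label}: {before:4.1f}% -> {after:4.1f}% (+{after - before:.1f} points)")

# The group near the cut appears to gain about 8 percentage points, the group
# far below only about 3, although both improved by exactly 0.2 SD.
```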

What Should We Do Now?

The course we are now on is not working well, and over time, as the unrealistic targets we have set draw near, it is likely to work even more badly. But research has not yet provided us good alternative designs. So what is to be done?

First, we should complement in-school programs with out-of-school interventions. There is currently some debate about the proportion of the variance in test scores that is attributable to out-of-school factors, but it is clearly large. It is therefore simply unrealistic to expect improved educational services to fully offset the disadvantages faced by historically lower-scoring groups. Interventions that go beyond educational accountability could include additional services both in school settings and outside of them, for example, high-intensity preschool services focused on cognitive development and language acquisition. Many of the needed interventions outside of the classroom are much more difficult and expensive than simply holding educators accountable for scores, but if we are serious about equity, they are probably necessary.

Second, we need to set more realistic targets for improvement. We need more research on performance targets, but we already know enough to recognize that the current system is simply not sensible. We need to rely on such data as we have (international and other normative data, historical data, and program evaluations) to set targets that educators are able to reach by legitimate means.

Third, we need to use better metrics for reporting and rewarding performance on tests. We need measures that reflect improvement across the entire range of performance and that do not create perverse incentives to ignore students in certain ranges. If we persist in using percent Proficient as part of our reporting and accountability system, we need to supplement this with other measures, and these other measures must count.

Fourth, we need to do all we can to lessen the narrowing of instruction that current test-based accountability systems produce: both the excessive focus on tested subjects at the expense of others, and the excessive focus on the content of the test at the expense of other important content within the same subject area. Here too, we need more research, but we cannot afford to ignore the problem in the meantime. For example, states and districts should stop disseminating test-preparation materials that focus on test-taking tricks or inappropriate forms of coaching. They can encourage the vendors from which they purchase tests to lessen the unintended recurrences (for example, of details of format or content) that facilitate coaching. Professional development activities can focus on the differences between good and bad test preparation. Principals and others can be on the lookout for undesirable reallocation of instruction. All of these in theory might help, although given the incentives NCLB creates to raise scores at any cost, particularly for schools serving historically low-scoring students, it is naïve to expect their effects to be large until we develop a more reasonable accountability system. Most actors in the system, from teachers to state chief school officers, currently have strong incentives to ignore this advice.

Fifth, we must stop taking score gains on high-stakes tests at face value. To be clear, research thus far does not suggest that all gains on all high-stakes tests are spurious. But it does unambiguously show that score inflation is not rare and that it can be very large, dwarfing true gains in achievement. And currently, the data commonly reported do not allow us to distinguish routinely between the bogus and the real.

To address the fourth and fifth of these issues, we must begin seriously and routinely evaluating the performance of our accountability systems. This evaluation must include auditing of the gains on high-stakes tests. This auditing and evaluation will have two benefits. First, it will give us better information about student performance. We should not tolerate a situation in which real improvements in equity are slowed by illusory achievement gains. Second, auditing may substantially improve the incentives faced by teachers, thus reducing the gaming of the system that currently inflates scores and cheats children. As of now, teachers who inflate scores stand virtually no chance of getting caught, and admonitions to "teach to the standards" rather than "to the test" are empty rhetoric, particularly when districts and states provide the tools for doing the reverse. But if educators know that auditing will sometimes expose score inflation, they will have more reason to avoid the shortcuts that cause it.

At the moment, NAEP can provide an audit measure in many states, but it is not available in most grades, and over-reliance on that one test may lead some people to start gaming NAEP as well. Routine auditing will likely require additional measures, either separate from operational tests or embedded within them. The key is that a good audit test measures the same core knowledge and skills as the accountability test but differs in many of the particulars (the details of content, format, scoring rubrics, and so on) that individual educators and test-preparation firms capitalize on to inflate test scores. While this principle is clear and simple, the practical details of constructing audit tests remain unexplored, for the simple reason that the people who buy tests have had no incentive to ask for audit measures. That must change, and the reauthorization of NCLB provides a powerful opportunity to change it.

Finally, and I would argue most important of all, we need to start, belatedly, on a serious program of research, development, and evaluation to facilitate the design of better educational accountability programs that will do more to improve the achievement of historically low-scoring groups while generating fewer negative side-effects. The list of important unanswered questions remains daunting. How can we best pressure schools to reduce inequities while accommodating the inevitably wide variations in performance within groups? How can we better design assessment systems to reduce the problem of bogus gains in scores? How can we create a better mix of incentives for educators, one that will encourage greater effort with less narrowing of instruction? What types of formative assessments and other test-preparation activities produce the largest gains in learning and the least score inflation? The list goes on.

Some people now argue that we have a solution at hand that circumvents the need for more R&D: value-added assessments. Value-added systems measure the gains students make while in a given grade, rather than tracking the improvements of successive cohorts of students in a single grade. Value-added systems indeed have many important advantages over the current system. For example, they are less sensitive to bias from differences in the characteristics of students, and they measure what many people consider a more appropriate variable for accountability: what a teacher or school teaches a group of students while they are in the school's care. Value-added assessments, however, still confront a substantial set of difficulties. For example, if testing is annual, value-added systems work only in subjects in which the curriculum is largely cumulative; they are highly error-prone, so most of the apparent differences among teachers or schools are measurement error rather than real differences in output; they are seriously problematic where there is substantial differentiation of curricula, for example, in most middle-school mathematics programs; the rankings they provide are not always consistent from one test to another; their results can be highly sensitive to arcane technical aspects of test construction and scaling; and their results are sometimes sensitive to choices in the statistical models used, many of which are extremely complex and not understood by most users of the data (McCaffrey, Lockwood, Koretz, & Hamilton, 2003). None of these difficulties argues against exploring value-added approaches as a part of educational accountability systems. But they do argue persuasively against accepting this approach as a new silver bullet that would once again free us from the hard but needed work of rigorous research and evaluation.
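
The error-proneness deserves a concrete illustration. The sketch below is stylized and entirely my own; the parameters (true teacher effects with a standard deviation of 0.1, measurement error with a standard deviation of 0.25, both in test-score SD units) are assumptions chosen only to show how noise swamps signal.

```python
# A stylized sketch of why annual value-added estimates are unstable: when
# true teacher effects are small relative to measurement error, rankings
# shift markedly from one year to the next. All parameters are assumed.
import random

random.seed(2)
N = 100
true_effects = [random.gauss(0, 0.10) for _ in range(N)]  # small true differences

def one_year_estimates(effects, noise_sd=0.25):
    """Each year's value-added estimate = true effect + measurement error."""
    return [e + random.gauss(0, noise_sd) for e in effects]

year1 = one_year_estimates(true_effects)
year2 = one_year_estimates(true_effects)

def ranks(xs):
    """Rank positions of each value (0 = lowest)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

r1, r2 = ranks(year1), ranks(year2)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
spearman = 1 - 6 * d2 / (N * (N ** 2 - 1))  # Spearman rank correlation (no ties)
print(f"year-to-year rank correlation: {spearman:.2f}")

# With these assumptions the correlation comes out low (around 0.15): most of
# the apparent differences among teachers are noise, so a teacher ranked near
# the bottom one year may look average or better the next.
```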

One of the most serious negative ramifications of NCLB is that it impedes the R&D needed for long-term improvements in policy. This may seem an odd claim, given that NCLB has encouraged the creation of vast databases of test scores. These data, however, are generally not useful for serious evaluation of alternative policies, because of the problem of score inflation and because reporting is often limited to percents above standards. And the pressure created by NCLB makes experimentation with alternatives too risky. When everyone is in a race, often a desperate race, to raise scores on the few measures that count under NCLB, and to raise them continually, it is simply unrealistic to expect states, districts, and schools to agree to participate in R&D and experimentation. Experimentation runs the risk of smaller gains over the short term in the service of greater benefits over the long term, and this is a trade that NCLB makes very costly. This constraint is especially severe for schools serving the students who are the primary concern of civil rights advocates, because those schools face particularly great pressure to raise scores by a large amount very quickly.

This, then, is my reason for arguing that we should not let concerns about changes to the details of NCLB, even the most important details, blind us to the need for a longer-term plan for creating better educational accountability systems. Building those better systems requires more systematic, empirical data, and that in turn requires a serious agenda of R&D. Whether this R&D is carried out over the coming years will depend substantially on whether the reauthorization of NCLB makes it more feasible. Because of the frantic race for score increases created by NCLB in its current form, serious R&D will be hindered unless the provisions of NCLB are substantially changed. One option might be to provide waivers from NCLB accountability provisions to jurisdictions willing to do the needed, difficult work of helping to design the programs our children need and deserve. In addition, the incorporation of audit provisions into NCLB might also encourage R&D by reducing the inflated gains of the jurisdictions to which the experimenting ones might be compared. And given the costs of the needed R&D, which are too large for many jurisdictions to take on, NCLB could include a mechanism for providing federal support for these efforts.

REFERENCES

Beaton, A. E., et al. (1996). Mathematics achievement in the middle school years: IEA's Third International Mathematics and Science Study (TIMSS), Appendix E. Chestnut Hill, MA: TIMSS International Study Center, Boston College.

Cushman, J. H. (1998, February 11). Makers of diesel truck engines are under pollution inquiry. The New York Times, [internet copy, not paginated].

Farhi, P. (1996, November 17). Television's 'sweeps' stakes: Season of the sensational called a contest out of control. The Washington Post, p. A01.

Fuller, B., Gesicki, K., Kang, E., & Wright, J. (2006). Is the No Child Left Behind Act working? The reliability of how states track achievement. University of California, Berkeley: Policy Analysis for California Education.

Hambleton, R. K., Jaeger, R. M., Koretz, D., Linn, R. L., Millman, J., & Phillips, S. E. (1995). Review of the measurement quality of the Kentucky Instructional Results Information System, 1991-1994. Frankfort: Office of Education Accountability, Kentucky General Assembly.

Hickman, A., Levin, C., Rupley, S., & Willmott, D. (1997, January 6). Did Sun cheat? PC Magazine, [internet copy, not paginated].

Jacob, B. (2002). Accountability, incentives and behavior: The impact of high-stakes testing in the Chicago public schools (Working paper W8968). Cambridge, MA: National Bureau of Economic Research.

Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What do test scores in Texas tell us? (Issue Paper IP-202). Santa Monica, CA: RAND. Retrieved January 12, 2004, from http://www.rand.org/publications/IP/IP202/.

Koretz, D. (2003, April 22). Attempting to discern the effects of the NCLB accountability provisions on learning. In K. Ercikan (Chair), Effects of accountability on learning. Presidential invited session, annual meeting of the American Educational Research Association, Chicago.

Koretz, D., & Barron, S. I. (1998). The validity of gains on the Kentucky Instructional Results Information System (KIRIS) (MR-1014-EDU). Santa Monica, CA: RAND.

Koretz, D., & Hamilton, L. S. (2006). Testing for accountability in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed.). Westport, CT: American Council on Education/Praeger.

Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991). The effects of high-stakes testing: Preliminary evidence about generalization across tests, in R. L. Linn (chair), The Effects of High Stakes Testing, symposium presented at the annual meetings of the American Educational Research Association and the National Council on Measurement in Education, Chicago, April.

Koretz, D., McCaffrey, D., & Hamilton, L. (2001). Toward a framework for validating gains under high-stakes conditions. CSE Technical Report 551. Los Angeles: Center for the Study of Evaluation, University of California.

Lee, J. (2006). Tracking achievement gaps and assessing the impact of NCLB on the gaps: An in-depth look into national and state reading and math outcome trends. Cambridge, MA: The Civil Rights Project at Harvard University.

Lewis, P. H. (1998, September 10). How fast is your system? That depends on the test. The New York Times, p. E1.

Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.

Linn, R. L. (2003). Performance standards: Utility for different uses of assessments. Education Policy Analysis Archives, 11(3). Retrieved January 26, 2007, from http://epaa.asu.edu/epaa/v11n31/.

Linn, R. L. (2005, June 28). Conflicting demands of No Child Left Behind and state systems: Mixed messages about school performance. Education Policy Analysis Archives, 13(33). Retrieved January 26, 2007, from http://epaa.asu.edu/epaa/v13n33/.

Linn, R. L., & Dunbar, S. B. (1990, October). The Nation's report card goes home: Good news and bad about trends in achievement. Phi Delta Kappan, 72(2), 127-133.

Markoff, J. (2002, August 27). Chip maker takes issue with a test for speed. The New York Times, p. C3.

McAllister, A. (1998, January 10). 'Special' delivery in W. Virginia: Postal employees cheat to beat rating system. The Washington Post, p. A1.

McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L. S. (2003). Evaluating value-added models for teacher accountability (MG-158-EDU). Santa Monica, CA: RAND. http://www.rand.org/publications/MG/MG158/.

Narins, C. R., Dozier, A. M., Ling, F. S., & Zareba, W. (2005). The influence of public reporting of outcome data on medical decision making by physicians. Archives of Internal Medicine, 165(January 10), 83-87.

Rothstein, R., & Jacobsen, R. (2006, December). The goals of education. Phi Delta Kappan, 88(4).

Schemo, D. J., & Fessenden, F. (2003, December 3). Gains in Houston schools: How real are they? The New York Times. Retrieved December 3, 2003, from http://www.nytimes.com.

Stecher, B. (2002). Consequences of large-scale, high-stakes testing on school and classroom practice. In L. Hamilton et al., Test-based accountability: A guide for practitioners and policymakers. Santa Monica, CA: RAND. http://www.rand.org/publications/MR/MR1554/MR1554.ch4.pdf.

Sunderman, G. L., Tracey, C., Kim, J., & Orfield, G. (2004). Listening to teachers: Classroom realities and No Child Left Behind. Cambridge, MA: The Civil Rights Project at Harvard University.

Zuckerman, L. (2000, December 26). In airline math, an early arrival doesn't mean you won't be late. The New York Times, [internet copy, not paginated].

— Daniel Koretz
Key Reforms Under the No Child Left Behind Act: The Civil Rights Perspective
2006-11-16
http://www.law.berkeley.edu/centers/ewi/research/k12equity/Koretz.html



