
7 reasons why teacher evaluations won't work

Susan Notes:

Bruce Baker is an associate professor in the Department of Educational Theory, Policy and Administration at Rutgers University Graduate School of Education in New Brunswick. He explains why using student test scores as a measure of teacher effectiveness doesn't work--and he does it in a way the public can understand.

By Bruce Baker

THE Teacher Effectiveness Task Force report issued March 3 by a panel appointed by Governor Christie recommended basing teacher evaluation significantly on student test scores. A few weeks earlier, acting Education Commissioner Christopher Cerf recommended that teacher tenure and dismissal, as well as compensation decisions, should be based largely on student assessment data.

Implicit in these recommendations is that the state and local districts would design a system for linking student assessment data to teachers for purposes of estimating teacher effectiveness. The goal of statistical “teacher effectiveness” measurement systems, including the most common approach called value-added modeling, is to estimate the extent to which a specific teacher contributes to the learning gains of a group of students assigned to that teacher in a given year.

Unfortunately, while this all sounds good, it just doesn't work, at least not well enough even to begin considering it for high-stakes decisions about teacher tenure, dismissal or compensation.


Here's why:

1) It is not possible to equate the difficulty of moving a group of children 5 points (or rank and percentile positions) at one end of a test scale to moving children 5 points at the other end.

Yet that is precisely what the proposed evaluations endeavor to accomplish. In such a system, the only fair way to compare one teacher to another would be to ensure that each has a randomly assigned group of children whose initial achievement is spread similarly across the testing scale.

Real schools and districts don’t work that way. It is also not possible to compare a 5 point gain in reading to a 5 point gain in math. These limitations undermine the entire proposed system.

2) Even with the best models and data, teacher ratings are highly inconsistent from year to year, and have very high rates of misclassification.

According to one recent major study, there is a 35 percent chance of identifying an average teacher as poor given one year of data, and a 25 percent chance given three years. Getting a good rating is a statistical crapshoot.
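The mechanism behind those misclassification rates can be illustrated with a small simulation. Every number below (the spread of true teacher effects, the size of the measurement noise, the "bottom quartile equals poor" rule) is an invented assumption for illustration, not a figure from the study the article cites; the point is only that averaging a noisy rating over more years reduces, but does not eliminate, the mislabeling of average teachers.

```python
import random

random.seed(0)

def simulate(n_teachers=100_000, years=1, true_sd=0.5, noise_sd=1.0):
    """Simulate noisy value-added ratings (all parameters are illustrative).

    Each teacher has a fixed 'true' effect; each year's observed rating
    adds independent noise. Averaging over more years shrinks the noise.
    Returns the fraction of 'average' teachers labeled 'poor'.
    """
    teachers = []
    for _ in range(n_teachers):
        true = random.gauss(0, true_sd)
        obs = sum(random.gauss(true, noise_sd) for _ in range(years)) / years
        teachers.append((true, obs))
    # "Poor" = bottom quartile of observed ratings
    cutoff = sorted(o for _, o in teachers)[n_teachers // 4]
    # "Average" teachers: true effect in the middle half of the true distribution
    mid = [(t, o) for t, o in teachers if abs(t) < 0.67 * true_sd]
    return sum(1 for _, o in mid if o < cutoff) / len(mid)

print(f"1 year : {simulate(years=1):.0%} of average teachers rated 'poor'")
print(f"3 years: {simulate(years=3):.0%} of average teachers rated 'poor'")
```

Under these assumed parameters, roughly a fifth of genuinely average teachers land in the "poor" bin on one year of data, and more years of data only partly fixes it, which is the qualitative pattern the cited study reports.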

3) If we rate the same teacher with the same students, but with two different tests in the same subject, we get very different results. University of California at Berkeley economist Jesse Rothstein, re-evaluating the findings of a much-touted Gates Foundation study, noted that more than 40 percent of teachers who placed in the bottom quarter on one test were in the top half when using an alternative test.

That is, teacher ratings based on the state assessment were only slightly better than a coin toss at identifying which teachers did well using the alternative assessment.
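The same instability shows up if two different tests are treated as two noisy measurements of the same underlying teacher effect. This toy simulation uses invented parameters, not Rothstein's or the Gates Foundation's actual data; it only shows why a low correlation between two tests of the same subject produces the kind of quarter-to-top-half flipping the article describes.

```python
import random

random.seed(1)

# Assumed (illustrative) spread of true teacher effects vs. per-test noise
n = 100_000
true_sd, noise_sd = 0.5, 1.0

teachers = [random.gauss(0, true_sd) for _ in range(n)]
test_a = [t + random.gauss(0, noise_sd) for t in teachers]  # test A rating
test_b = [t + random.gauss(0, noise_sd) for t in teachers]  # test B rating

q1_a = sorted(test_a)[n // 4]    # bottom-quartile cutoff on test A
med_b = sorted(test_b)[n // 2]   # median on test B

bottom_on_a = [i for i in range(n) if test_a[i] < q1_a]
flipped = sum(1 for i in bottom_on_a if test_b[i] > med_b) / len(bottom_on_a)
print(f"{flipped:.0%} of bottom-quartile teachers (test A) land in the top half on test B")
```

With noise of this assumed size, well over a third of the "bottom quarter" on one test scores in the top half on the other, in the same ballpark as the 40 percent figure quoted above.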

4) No matter how hard statisticians try, and no matter how good the data and statistical model, it is very difficult to separate a teacher’s effect on student learning gains from other classroom effects, like peer effect (race and poverty of peer group).

New Jersey schools are highly segregated, hampering our ability to make valid comparisons across teachers who work in vastly different settings. Statistical models attempt to adjust away these differences, but usually come up short.

5) Kids learn over the summer too, and higher-income kids learn more over the summer than their lower-income peers. As a result, annual testing data aren't very useful for measuring teacher effectiveness: annual (rather than fall-to-spring) testing puts teachers serving children whose summer learning lags at a significant disadvantage. Setting aside all of the unresolvable problems above, this one can be fixed with fall and spring assessments.

But it cannot be resolved in any fast-tracked plan involving current New Jersey assessments, which are annual. The task force report irresponsibly ignores this huge concern, recommending fast-tracked use of current assessment data.

6) As noted by the task force, only those teachers responsible for reading and math in Grades 3 to 8 could readily be assigned ratings (less than 20 percent of teachers). Testing everything else is a foolish and expensive endeavor. This means school districts will need separate contracts for separate classes of teachers and will have limited ability to move teachers from one contract type to another.

Further, pundits have been arguing that we should be using effectiveness measures instead of experience to implement layoffs due to budget cuts and that we shouldn’t be laying off core classroom teachers in Grades 3 to 8. But those are the only teachers for whom “effectiveness” measures would be available.

7) Basing teacher evaluations, tenure decisions and dismissal decisions on scores that may be influenced by which students a teacher serves provides a substantial disincentive for teachers to serve kids with the greatest needs, disruptive kids or kids with disruptive family lives.

Many of these factors are not, and cannot be, captured by variables in even the best models. Some have argued that including value-added metrics in teacher evaluation reduces the ability of school administrators to arbitrarily dismiss a teacher.

Opportunities for sabotage

Rather, use of these metrics provides new opportunities to sabotage a teacher’s career through creative student assignment practices.

In short, we may be able to estimate a statistical model that suggests that teacher effects vary widely across the education system, that teachers matter. But we would be hard-pressed to use that model to identify with any degree of certainty which individual teachers are good teachers and which are bad.

Contrary to education reform wisdom, adopting such problematic measures will not make the teaching profession a more desirable career option for America’s best and brightest college graduates.

In fact, it will likely make things much worse. Establishing a system where achieving tenure or getting a raise becomes a roll of the dice and where a teacher’s career can be ended by a roll of the dice is no way to improve the teacher work force.

Using these metrics as a basis for dismissing teachers will not reduce the legal hassles associated with removal of tenured teachers. As the first rounds of teachers are dismissed by random error of statistical models alone, by manipulation of student assignments or when larger shares of minority teachers are dismissed largely as a function of the students they serve, there will likely be a new flood of lawsuits like none ever previously experienced.

Employment lawyers, sharpen your pencils and round up your statistics experts.

Authors of the task force report might argue that they are putting only 45 percent of the weight of evaluations on these measures, with the rest a mix of other objective and subjective measures. But when an evaluation places a single large, or even significant, weight on one quantified factor, that factor necessarily becomes the tipping point, or trigger mechanism.

It may be 45 percent of the evaluation weight, but it becomes 100 percent of the decision, because it’s a fixed, clearly defined (though poorly estimated) metric.
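A little arithmetic shows how 45 percent of the weight can decide 100 percent of the outcome. The pass line and the score values below are hypothetical (the report specifies only the 45 percent figure); the key assumption is that the subjective components cluster tightly for nearly all teachers, so only the quantified component varies.

```python
# Hypothetical weights and scores: only the 45% weight comes from the
# task force report; everything else is an illustrative assumption.
WEIGHT_VAM, WEIGHT_OTHER = 0.45, 0.55
PASS_LINE = 0.70  # assumed cutoff for a satisfactory overall rating

def overall(vam_score, other_score):
    """Weighted overall evaluation on a 0-1 scale."""
    return WEIGHT_VAM * vam_score + WEIGHT_OTHER * other_score

# Suppose observational ratings cluster near 0.8 for nearly everyone,
# while value-added scores (rescaled to 0-1) spread widely.
other = 0.8
for vam in (0.2, 0.5, 0.6, 0.9):
    total = overall(vam, other)
    verdict = "pass" if total >= PASS_LINE else "fail"
    print(f"VAM {vam:.1f} -> overall {total:.2f} -> {verdict}")
```

Because the 55 percent component contributes a near-constant 0.44 for everyone, whether a teacher passes or fails is determined entirely by where the value-added score falls, exactly the trigger-mechanism effect described above.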

Self-proclaimed “reformers” make the argument that the present system of teacher evaluation is so bad as to be non-existent. Reformers argue that the current system has a 100 percent error rate (assuming current evaluations label all teachers as good, when all, they suggest, are actually bad).

From the “reformer” viewpoint, something is always better than nothing.

Value-added is something.

We must do something.

Therefore, we must do value-added.

Self-fulfilling prophecy

Reformers also point to studies showing that teachers' value-added scores are the best predictor (albeit a weak and error-prone one) of a teacher's future value-added scores, a self-fulfilling prophecy.

These arguments are incredibly flimsy.

In response, I often explain that if we lived in a society where people walked everywhere, and a new automotive invention came along, but had the tendency to burst into a ball of flames on every third start, I think I’d walk.

Now is a time to walk!

Some innovations just aren't ready for broad public adoption, and some may never be. Some, like this one, may not be a very good idea to begin with.

That said, improving teacher evaluation is not a simple either/or and now may be a good time to step outside the false dichotomy and discuss more productive alternatives.

— Bruce Baker
The Record
