Due Diligence and the Evaluation of Teachers
The research on which the Los Angeles Times relied for its August 2010 teacher-effectiveness reporting was demonstrably inadequate to support the published rankings. Using the same L.A. Unified School District data and the same methods as the Times, this study probes deeper and finds the earlier research to have serious weaknesses. Why, then, does the L.A. Times claim that this study confirms the earlier one?
The Executive Summary is posted below. To read this report, go to the URL below.
A REVIEW OF THE VALUE-ADDED ANALYSIS UNDERLYING THE EFFECTIVENESS RANKINGS OF LOS ANGELES UNIFIED SCHOOL DISTRICT TEACHERS BY THE LOS ANGELES TIMES
by Derek C. Briggs & Ben Domingue
On August 14, 2010, the Los Angeles Times published the results of a statistical analysis of student test data to provide information about elementary schools and teachers in the Los Angeles Unified School District (LAUSD). The analysis, covering the period from 2003 to 2009, was put forward as an evaluation of the effects of schools and their teachers on the performance of students taking the reading and math portions of the California Standardized Test.
The first step of the analysis presented in the L.A. Times was to predict student test scores for students on the basis of five factors: test performance in the previous year, gender, English language proficiency, eligibility for Title I services, and whether they began schooling in the LAUSD after kindergarten. These predicted scores were then subtracted from the scores that students actually obtained, with the difference being attributed to each student's teacher. If this difference was positive, this was considered to be evidence that a teacher had produced a positive effect on a student's learning. If negative, a negative effect was presumed. This process, known as value-added modeling, is increasingly being used to make strong causal judgments about teacher effectiveness, often with high-stakes consequences attached to those judgments.
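The predict-then-subtract procedure described above can be sketched in a few lines of code. The following is an illustrative simulation with synthetic data; the variable names, coefficients, and sample sizes are hypothetical stand-ins, not the actual LAUSD fields or Buddin's exact specification:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only.
n = 1000
prior_score = rng.normal(0, 1, n)        # previous-year test score
covariates = rng.normal(0, 1, (n, 4))    # stand-ins for gender, ELL, Title I, late entry
teacher = rng.integers(0, 50, n)         # 50 hypothetical teachers
true_effect = rng.normal(0, 0.2, 50)     # unobserved "true" teacher effects
score = (0.7 * prior_score
         + covariates @ np.array([0.1, -0.1, -0.2, 0.05])
         + true_effect[teacher]
         + rng.normal(0, 0.5, n))

# Step 1: predict each student's score from prior achievement and background factors.
X = np.column_stack([np.ones(n), prior_score, covariates])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
predicted = X @ beta

# Step 2: attribute each student's residual (actual minus predicted) to the
# teacher; the average residual per classroom is the value-added estimate.
residual = score - predicted
value_added = np.array([residual[teacher == t].mean() for t in range(50)])
```

Because assignment is random in this toy simulation, the estimates track the true effects; the report's point is precisely that this need not hold when assignment is non-random.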
The value-added analysis of elementary school teachers in the LAUSD was conducted independently by Richard Buddin, a senior economist at the RAND Corporation. As part of his analysis, Buddin produced a white paper entitled How Effective Are Los Angeles Elementary Teachers and Schools? We, in this new report, provide a critical review of the analysis and conclusions reached by Buddin. We conducted this review in two ways. First, we evaluated whether the evidence presented in Buddin's white paper supports the use of value-added estimates to classify teachers as effective or ineffective. This part of our report directly investigates the strength of his analysis. Second, we attempted to replicate Buddin's empirical findings through an independent re-analysis of the same LAUSD data. A hallmark of a sound analysis is that it can be independently replicated.
This new report also scrutinizes a premise of Buddin's analysis that was left unexamined: did he successfully isolate the effects of teachers on their students' achievement? Simply finding that the model yields different outcomes for different teachers does not tell us whether those outcomes are measuring what's important (teacher effectiveness) or something else, such as whether students have learning resources outside of school. Fortunately, there are good ways that a researcher can test whether such results are true or are biased. This can be done through a series of targeted statistical analyses within what we characterize as an overall "sensitivity analysis" of the robustness of Buddin's value-added model. One would expect inclusion of such a sensitivity analysis as part of any researcher's due diligence whenever a value-added model is being proposed as a principal means of evaluating teachers.
Buddin posed two specific research questions in his white paper related to the evaluation of teachers using value-added models:
1. How much does quality vary from teacher to teacher?
2. What teacher qualifications or background characteristics are associated with success in the classroom as measured by the value-added estimates?
Regarding the first question, Buddin concludes that there is in fact significant variability in LAUSD teacher quality as demonstrated by student performance on standardized tests in reading and math. To make this case, he first uses value-added modeling to estimate the effect of each teacher on student achievement. He then examines the distribution of these estimates for teachers in each test subject (mathematics and reading). Buddin reports a difference between high- and low-performing teachers that amounts to 0.18 student-level test score standard deviations in reading and 0.27 standard deviations in math. These are practically significant differences.
Regarding the second question, Buddin finds that available measures of teacher qualifications or backgrounds--years of experience, advanced degrees, possession of a full teaching credential, race and gender--have only a weak association with estimates of teacher effectiveness. On this basis, he concludes that school districts looking to improve teacher quality would be well served to develop "policies that place importance on output measures of teacher performance" such as value-added estimates, rather than input measures that emphasize traditional teacher qualifications.
In replicating Buddin's approach, we were able to agree with his finding concerning the size of the measured reading and math teacher effects. These are approximately 0.20 student-level test score standard deviations in reading, and about 0.30 in math. Our results, in fact, were slightly larger than Buddin's. Our other findings, however, raise serious questions about Buddin's analysis and conclusions. In particular, we found evidence that conflicted with Buddin's finding that traditional teacher qualifications have no association with student outcomes. In our reanalysis of the data we found significant and meaningful associations between our value-added estimates of teachers' effectiveness and their experience and educational background.
We then conducted a sensitivity analysis in three stages. In the first stage we looked for empirical evidence that students and teachers are sorted into classrooms non-randomly on the basis of variables that are not controlled for in Buddin's value-added model. To do this, we investigated whether a student's teacher in the future could have an effect on the student's test performance in the past: something that is logically impossible and a sign that the model is flawed (has been misspecified). We found strong evidence that this is the case, especially for reading outcomes. If students are non-randomly assigned to teachers in ways that systematically advantage some teachers and disadvantage others (e.g., stronger students tending to be in certain teachers' classrooms), then these advantages and disadvantages will show up whether one looks at past teachers, present teachers, or future teachers. That is, the model's outputs result, at least in part, from this bias, in addition to the teacher effectiveness the model is hoping to capture. Because our sensitivity test did show this sort of backwards prediction, we can conclude that estimates of teacher effectiveness in LAUSD are a biased proxy for teacher quality.
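The logic of this falsification test can be illustrated with a small simulation. The data and the sorting rule below are hypothetical; the point is only to show why a future teacher "predicting" a past score signals non-random assignment:

```python
import numpy as np

rng = np.random.default_rng(1)

# A student's *future* teacher cannot cause the student's *past* score,
# so any association between the two signals non-random sorting.
n, k = 2000, 40
past_score = rng.normal(0, 1, n)

# Scenario A: random assignment of students to next year's teachers.
random_teacher = rng.integers(0, k, n)

# Scenario B: sorted assignment, with stronger students concentrated in
# certain classrooms (rank students by past score, then slice into classes).
ranks = past_score.argsort().argsort()
sorted_teacher = ranks * k // n

def between_teacher_spread(assignment):
    """Std. dev. of mean past score across future-teacher classrooms."""
    means = [past_score[assignment == t].mean() for t in range(k)]
    return np.std(means)

# Under sorting, classroom means of *past* scores spread out far more than
# under random assignment, which is the signature the falsification test
# detects: the "effect" of a future teacher on a past score.
```

Under random assignment the classroom means of past scores differ only by sampling noise; under sorting they fan out widely, and a value-added model that ignores the sorting will credit or blame teachers for it.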
The second stage of the sensitivity analysis was designed to illustrate the magnitude of this bias. To do this, we specified an alternate value-added model that, in addition to the variables Buddin used in his approach, controlled for (1) a longer history of a student's test performance, (2) peer influence, and (3) school-level factors. We then compared the results (the inferences about teacher effectiveness) from this arguably stronger alternate model to those derived from the one specified by Buddin that was subsequently used by the L.A. Times to rate teachers. Since the Times model had five different levels of teacher effectiveness, we also placed teachers into these levels on the basis of effect estimates from the alternate model. If the Times model were perfectly accurate, there would be no difference in results between the two models. Our sensitivity analysis indicates that the effects estimated for LAUSD teachers can be quite sensitive to choices concerning the underlying statistical model. For reading outcomes, our findings included the following:
Only 46.4% of teachers would retain the same effectiveness rating under both models, 8.1% of those teachers identified as effective under our alternative model are identified as ineffective in the L.A. Times specification, and 12.6% of those identified as ineffective under the alternative model are identified as effective by the L.A. Times model.
For math outcomes, our findings included the following:
Only 60.8% of teachers would retain the same effectiveness rating, 1.4% of those teachers identified as effective under the alternative model are identified as ineffective in the L.A. Times model, and 2.7% would go from a rating of ineffective under the alternative model to effective under the L.A. Times model.
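The kind of cross-model comparison reported above can be sketched as follows. The estimates below are simulated from two hypothetical, correlated model specifications, not the actual LAUSD results; the sketch only shows how a rating-agreement figure is computed once each model has placed teachers into five levels:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two value-added estimates for the same teachers under two hypothetical
# model specifications: correlated, but each with its own estimation noise.
k = 5000
true = rng.normal(0, 1, k)
model_a = true + rng.normal(0, 0.6, k)   # stand-in for one specification
model_b = true + rng.normal(0, 0.6, k)   # stand-in for the other

def quintile(x):
    """Assign each teacher to one of five effectiveness levels (0-4)."""
    return np.searchsorted(np.quantile(x, [0.2, 0.4, 0.6, 0.8]), x)

# Share of teachers who keep the same rating under both specifications.
agree = np.mean(quintile(model_a) == quintile(model_b))
```

Even with strongly correlated estimates, a substantial share of teachers changes rating between the two specifications, which is the pattern behind the 46.4% (reading) and 60.8% (math) agreement figures reported above.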
The impact of using a different model is considerably stronger for reading outcomes, which indicates that elementary-school-age students in Los Angeles are more strongly sorted into classrooms with regard to reading (as opposed to math) skills. But depending on how the measures are being used, even the lesser level of disagreement for math could be of concern.
Finally, in the third and last stage of our analysis we examined the precision of Buddin's teacher effect estimates: whether the approach can be used to reliably distinguish between teachers given different value-added ratings. We began by computing a 95% confidence interval, which attempts to take potential "sampling error" into account by providing the range that will capture the true value-added for that teacher 95 of 100 times. Once the specific value-added estimate for each teacher is bounded by a confidence interval, we find that between 43% and 52% of teachers cannot be distinguished from a teacher of "average" effectiveness. Because the L.A. Times did not use this more conservative approach to distinguish teachers when rating them as "effective" or "ineffective," it is likely that there are a significant number of false positives (teachers rated as effective who are really average) and false negatives (teachers rated as ineffective who are really average) in the L.A. Times' rating system.
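The confidence-interval check can be illustrated with a simple simulation. The effect sizes, class sizes, and standard errors below are hypothetical; in the report itself they are derived from the LAUSD data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical teacher effects and noisy value-added estimates.
k, students = 200, 25
true_effect = rng.normal(0, 0.2, k)
noise_sd = 0.5                                    # student-level residual noise
se = np.full(k, noise_sd / np.sqrt(students))     # standard error of each estimate
estimate = true_effect + rng.normal(0, 1, k) * se

# 95% confidence interval for each teacher's value-added estimate.
lower = estimate - 1.96 * se
upper = estimate + 1.96 * se

# A teacher whose interval covers zero cannot be statistically
# distinguished from a teacher of "average" effectiveness.
indistinguishable = np.mean((lower < 0) & (0 < upper))
```

With realistic class sizes, a large share of intervals straddles zero, which is why ignoring the intervals (as the Times ratings did) produces false positives and false negatives of the kind described above.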
Derek C. Briggs & Ben Domingue
National Education Policy Center