One of the hot debates in education policy centers on the use of quantitative measures to evaluate teachers -- measuring student achievement and tying teacher compensation, in some way, to those measurements. Jim Manzi has some interesting, though flawed, thoughts about the problems with such measuring systems:
I’ve seen a number of such analytically-driven evaluation efforts up close. They usually fail. By far the most common result that I have seen is that operational managers muscle through use of this tool in the first year of evaluations, and then give up on it by year two in the face of open revolt by the evaluated employees. This revolt is based partially on veiled self-interest (no matter what they say in response to surveys, most people resist being held objectively accountable for results), but is also partially based on the inability of the system designers to meet the legitimate challenges raised by the employees.
Here is a typical (illustrative) conversation between a district manager delivering an annual review based on such an analytical tool, and the retail store manager receiving it:
District Manager: Your 2007 performance ranking is Level 3, which represents 85% payout of annual bonus opportunity.
Store Manager: But I was Level 2 (with 90% bonus payout) last year, and my sales are up more than the chain-wide average this year.
DM: [Reading from a laptop screen] We now establish bonus eligibility based on your sales gain versus the change in the potential of your store’s trade area over the same time period. This is intended to fairly reflect the actual value-added of your performance. We average this over the past three years. Your sales were up 5% this year, but Measured Potential for your store’s area was 10% higher this year, so your actual value-added averaged over 2005 – 2007 declined versus 2004 – 2006.
SM: My “area potential” increased 10%? That’s news to me. Based on what?
DM: The new SOAP (Store Operating Area Potential) Model.
DM: [Reading from a laptop screen] “SOAP is based on a neural network model that has been carefully statistically validated.” Whatever that means.
[Continues reading] “It considers such factors as trade area demographic changes, competitor store openings, closures and remodels, changes in traffic patterns, changes in co-tenancy, and a variety of other important factors.”
SM: What factors are up that much in my area?
DM: [Skipping to the workbook page for this specific store, and reading from it] A combination of factors, including competitor openings and the training investment made in your store.
SM: But Joe Phillips had the same training program in his store, and he had no new competitor openings – and he told me that he got Level 2 this year, even though his sales were flat with last year. How can that be?
DM: Look, the geniuses at HQ say this thing is right. Let me check with them.
[2 weeks later, via cell phone]
DM: Well, I checked with the Finance, Planning & Analysis Group in Dallas, and they said that “the model is statistically valid at the 95% significance level” (whatever that means), “but any one data point cannot be validated.”
[10 second pause]
Let me try to take this up the chain to VP Ops, and see what we can do, OK?
SM: Whatever. I’ve got customers at the register to deal with. [Hangs up]
That's an interesting insight into the general problem with quantitative measures. Here are a few points in response:
1. You need some system for deciding how to compensate teachers. Merit pay may not be perfect, but tenure plus single-track longevity-based pay is really, really imperfect. Manzi doesn't say that better systems for measuring teachers are futile, but he's a little too fatalistic about their potential to improve upon a very badly designed status quo.
2. Manzi's description...
evaluating teacher performance by measuring the average change in standardized test scores for the students in a given teacher’s class from the beginning of the year to the end of the year, rather than simply measuring their scores. The rationale is that this is an effective way to adjust for different teachers being confronted with students of differing abilities and environments.
...implies that quantitative measures are being used as the entire system to evaluate teachers. In fact, no state uses such measures for more than half of the evaluation. The other half involves subjective human evaluations.
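The value-added idea Manzi describes is, at bottom, simple arithmetic: score teachers on how much their students' test scores change over the year, not on where the scores end up. A minimal sketch of that calculation -- all names and numbers here are hypothetical illustrations, not any state's actual formula:

```python
# Hypothetical illustration of the "value-added" idea: rank a teacher on
# the average *change* in student test scores over the year, rather than
# on the raw end-of-year scores.

def value_added(start_scores, end_scores):
    """Average per-student gain from beginning of year to end of year."""
    if len(start_scores) != len(end_scores) or not start_scores:
        raise ValueError("need matched, non-empty score lists")
    gains = [end - start for start, end in zip(start_scores, end_scores)]
    return sum(gains) / len(gains)

# A teacher whose students start low but gain a lot outscores one whose
# students start high but barely move -- the adjustment for "differing
# abilities and environments" the quoted rationale refers to.
teacher_a = value_added([40, 50, 60], [55, 65, 75])   # average gain: 15.0
teacher_b = value_added([80, 85, 90], [82, 87, 92])   # average gain: 2.0
```

On raw end-of-year scores, teacher B's class (82-92) would look far better than teacher A's (55-75); on value added, the ranking reverses.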
3. In general, he's fitting this issue into his "progressives are too optimistic about the potential to rationalize policy" frame. I think that frame is useful -- indeed, of all the conservative perspectives on public policy, it's probably the one liberals should take most seriously. But when you combine the facts that the status quo system is demonstrably terrible, that nobody is trying to devise a formula to control the entire teacher evaluation process, and that nobody is promising the "silver bullet" he assures us doesn't exist, his argument has a bit of a straw-man quality.