[Guest post by Simon van Zuylen-Wood]

On February 16, New York state officials agreed on a new teacher evaluation system that will use students' standardized test scores to help determine teacher tenure and dismissals. The previous model, in which 97 percent of New York City teachers were deemed “satisfactory,” was based solely on classroom observations.

While the deal signals an important compromise between Governor Andrew Cuomo, who has pushed for more teacher accountability, and the state teachers union, the real news came a week later. On February 24, media outlets were able to publish New York City's teacher rankings for the first time. The rankings draw on data from 2008 to 2010 and cover 18,000 teachers. They seemed poised to provide a clear picture of the so-called “teacher effectiveness gap” – the tendency of schools in wealthier neighborhoods to attract stronger teachers, saddling lower-income children with inferior instructors and, presumably, inferior educations.

But the early reports suggested the gap wasn’t so wide after all. The New York Times and The Wall Street Journal released a set of demographic maps showing that good teachers were not exclusively clustered in middle-class public schools and that bad teachers were not all mired in the South Bronx and Harlem.

When I looked at the rankings, I reached the same conclusion. The Manhattan public and charter schools in which a majority of teachers performed “above average” were usually the schools with the poorest students. During the three years that the data was collected, 28 out of 39 schools whose teachers scored highly in math qualified automatically for federal Title I funding for disadvantaged children. Thirty-one of 47 schools with the best English teachers were also Title I.

So is the teacher effectiveness gap just a myth? Not necessarily. New York’s new evaluation system adjusts for demographic disparities. But the adjustments are very crude, making it difficult to compare teachers from different schools. As NYU Professor of Educational Economics Sean Corcoran told me, “The fact that the [rankings] do adjust for so many things … basically leaves them uncorrelated with anything you can think of.”

The value-added rankings also rest on a limited pool of scores: some teachers’ “career average” is based on only ten or fifteen students, creating a high likelihood of error. As the Times reported, math scores could be off by 35 percent and English scores by 53 percent. A teacher deemed mediocre could be disastrous, or she could be excellent; we just don’t know. It’s “extremely difficult to make meaningful comparisons between teachers,” Corcoran said, adding that the system accurately sorts out only the very best and worst teachers.

Other states are also using these “value-added” metrics to measure teacher performance, in part because doing so improves their odds of landing Race to the Top grants from the federal government. New York won’t be dismissing teachers based on the results alone, though they could be a factor, but other states put more emphasis on them. By 2014, Florida will be able to let low-performing teachers go after two years, and in Washington, D.C., one teacher with two consecutive years of high scores has already earned $30,000 in bonuses. According to education blog Gotham Schools, the scholar who created value-added evaluation recommends using at least five years of data before passing judgment.

Merit pay for teachers is a laudable goal. But if a system as unreliable as value-added is our “best available model,” as Corcoran put it, we have a lot of work to do before we can accurately judge teacher quality.