PPP's Results Don't Excuse Its Bad Methodology

Last week, I wrote a long piece about PPP’s troubling methodological choices. Some people assumed it was a continuation of a fight the previous day between Nate Silver and PPP, but it’s not. And other people latched onto various elements at the expense of others. Many of those elements are important, like PPP’s baffling incompetence, or its decision to surreptitiously let the self-reported ’08 vote inform the racial composition of the electorate. But the broad point was more important than the details: I’m not sure it’s fair to say PPP has a methodology at all. And no, PPP’s results are not an adequate justification.

Polls work for a reason. It’s not voodoo. Polling works because pollsters appreciate and employ sound statistical principles to produce a representative sample of a target population—like, say, American voters. So, no, the “goal” of a poll is not to “get it right,” or another variant of the point that “PPP gets the results right, so I don’t see what the problem is.” The goal of a poll is to take a representative sample, because a representative sample should give you a representative answer. And whether a sample is representative is determined by methodological choices, not the accuracy of the result.
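
To make that concrete, here is a minimal sketch with entirely made-up numbers. If the sample is drawn properly, the estimate lands near the truth without anyone touching the data after the fact; the accuracy is a byproduct of the sampling, not a target you steer toward.

```python
import random

# A toy "population" of one million voters, 52 percent of whom back
# Candidate A. Every number here is invented purely for illustration.
random.seed(0)
population = [1] * 520_000 + [0] * 480_000  # 1 = backs A, 0 = backs B

# A simple random sample is representative by construction, so its estimate
# falls near the true 52 percent, give or take ordinary sampling error.
sample = random.sample(population, 1_000)
print(sum(sample) / len(sample))
```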

Everything about PPP’s polling screams “unrepresentative sample.” They don’t call cell phones, so voters who can only be reached that way are left out (although, to their credit, that will be changing). They’ve done polls that ask voters more than 40 questions, presumably driving completion rates through the floor. Their initial sample is about 80 percent white; even crazier, about 83 percent is over age 45.

So PPP has to weight its data heavily. More than everyone else. But unlike everyone else, it does so without clear assumptions about what a representative sample looks like. Other pollsters might weight to Census targets, or to variables in a voter file. Instead, PPP weights within extremely broad target ranges, with a splash of thorough incompetence, based on whatever metric it chooses, which changes from poll to poll. As a result, PPP’s methodology gives it complete freedom to construct whatever electorate it wants, however it wants, whenever it wants.
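
For contrast, here is roughly what conventional post-stratification looks like, with invented shares standing in for Census-style targets. Because the targets are fixed and published, the same raw sample always produces the same weighted electorate.

```python
# Invented shares for illustration; a real pollster would use published
# Census or voter-file targets rather than these numbers.
sample_shares = {"white": 0.80, "nonwhite": 0.20}  # composition of the raw sample
target_shares = {"white": 0.72, "nonwhite": 0.28}  # composition of the assumed electorate

# Each respondent gets the weight that moves their group's share to the target.
weights = {group: round(target_shares[group] / sample_shares[group], 2)
           for group in sample_shares}
print(weights)  # {'white': 0.9, 'nonwhite': 1.4}
```

Weighting to a broad range instead of a fixed target removes exactly that constraint: any electorate inside the range is fair game.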

Start at PPP’s public methodology page. It outlines procedures, but none of PPP’s assumptions for a representative sample. It’s like a cooking recipe that says “first we use measuring cups to put flour, butter, sugar, and eggs in a pan; then we put it in the oven until it’s done.” That’s not really a recipe: it never says how much flour, what temperature to set the oven, or how long to bake.

Actually, that was a generous comparison. Measuring cups are modestly precise instruments. Random deletion and sequential weighting is like grabbing too much flour with your bare hand, knowing some will fall between your fingers, and dropping whatever’s left into the pan. And it turned out PPP didn’t disclose its use of an entire ingredient—the ’08 election—until our email exchange a few weeks ago.

When pressed on how all of this works, Jensen didn’t really have an explanation. PPP’s target ranges aren’t defined by any research or reason (indeed, he never responded to my question about how the target ranges were defined). PPP weights within those wide target ranges based on changing metrics, without any “rules” from poll to poll.

So it’s not just that PPP makes ad hoc decisions, like other pollsters do from time to time; it’s that everything PPP does is ad hoc. The only thing that’s consistent is the procedure—random deletion. And even if PPP used the same metrics and the same targets, sheer incompetence might lead it to produce two very different electorates from the same sample.
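
Here is a hypothetical rendering of what deletion into a broad target range can do. The range, the raw sample, and the stopping rule are all invented (this is not PPP's actual code), but the punch line holds: run the same procedure twice on the same data and you can get two different electorates.

```python
import random

def delete_into_range(respondents, lo=0.64, hi=0.75, seed=None):
    """Delete white respondents until the white share hits some point inside [lo, hi]."""
    rng = random.Random(seed)
    respondents = respondents[:]
    target = rng.uniform(lo, hi)  # any stopping point inside the range is "acceptable"

    def white_share(rs):
        return sum(r == "white" for r in rs) / len(rs)

    while white_share(respondents) > target:
        whites = [i for i, r in enumerate(respondents) if r == "white"]
        respondents.pop(rng.choice(whites))
    return round(white_share(respondents), 3)

raw = ["white"] * 800 + ["nonwhite"] * 200  # an 80 percent white raw sample
print(delete_into_range(raw, seed=1), delete_into_range(raw, seed=2))
# Two runs of the same procedure on the same sample stop at different points
# inside the range, and both count as "hitting the target."
```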

What determines the composition of any given PPP poll? In Mark Blumenthal’s words, it’s Jensen’s “gut.” That’s even a word Jensen’s used before. But there’s an outstanding issue: What, exactly, is his “gut” telling him about? Is it his gut about the racial demographics of a representative sample? In the past, that’s what Jensen said. But even a quick glance at PPP’s polls is sufficient to dispel that explanation. If Jensen had strong views about the racial composition of the electorate, he wouldn’t allow huge shifts in the racial composition of the electorate. Period.

My email exchange with Jensen introduces a second possibility: Jensen’s real assumption might have been about respondents’ self-reported vote in the ’08 election. This possibility has another variant: that Jensen basically sought to balance the ’08 vote and the racial composition of the electorate on some subjective, poll-by-poll basis. But if PPP only used the ’08 electorate to push back against “liberal” samples, as Jensen said, that wouldn’t fully produce the self-correcting patterns observed in my last piece. And since Jensen deleted the ’08 question from the crosstabs, there’s no way to verify his assertion. (As an aside: PPP would get a positive piece out of me if they released the crosstabs for the ’08 question on 2012 state and tracking polls.)

There's then the possibility that PPP was weighting toward Jensen's gut about the result, or something else, like a polling average. And although it's impossible to prove, the evidence seems to line up. Jensen said he weights his polls with the goal of getting it right. PPP only defends its methodology by saying it produces accurate results. PPP decided not to release a counter-intuitive result in Colorado, saying it didn’t believe the result in the absence of other public pollsters. Jensen did not deny that he weighted toward an intended result in Georgia. And how could knowledge of the polling averages not influence PPP’s completely ad hoc weighting, even if only subconsciously?

Last Friday, PPP offered a modestly more nuanced defense: It listed 15 instances when it led the pack—either by showing a trend first, or by nailing the result on its own. This isn’t persuasive.

The question isn’t whether PPP produces good results on its own. Even a mediocre pollster will do that if they poll enough (which PPP does). The question is whether PPP does as well without other pollsters as it does with other pollsters.

The only rigorous test, then, would be looking at every PPP poll, ever, and determining whether they’re more or less accurate when other pollsters are in the field. Unfortunately, I don’t have the time for that, and, worse still, PPP’s decision to withhold last week’s Colorado recall poll jeopardizes the credibility of such a study, since PPP might only be releasing the polls that “make sense” when PPP is on its own.
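
The design itself is simple to describe, though. Here is a sketch on made-up data; every number below is invented, and the point is the comparison, not the answer.

```python
# For each PPP poll, record its error and whether another pollster surveyed
# the same race in a similar window, then compare average error across groups.
polls = [
    {"error": 1.0, "others_in_field": True},
    {"error": 2.5, "others_in_field": True},
    {"error": 1.5, "others_in_field": True},
    {"error": 6.0, "others_in_field": False},
    {"error": 8.0, "others_in_field": False},
]

def avg_error(polls, others_in_field):
    errors = [p["error"] for p in polls if p["others_in_field"] == others_in_field]
    return sum(errors) / len(errors)

print("with other pollsters in the field:", avg_error(polls, True))   # roughly 1.67 on this toy data
print("with the field to itself:", avg_error(polls, False))           # 7.0 on this toy data
```

The hard part isn't the arithmetic; it's assembling every poll, including the ones that were never released.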

So cherry picking 15 examples doesn’t prove anything. Let’s illustrate with something else that doesn’t prove anything, either: cherry picking more than 92 polls consistent with PPP following the polls:

  • In my last piece, I showed that 50 PPP surveys from 2012 displayed an unusual, self-correcting pattern: when the numbers would have shown Obama doing too well, the white share of the electorate got higher; when they would have shown Obama doing worse, the white share got lower.

    Consider Florida. In the summer, PPP showed Obama ahead by a narrow margin with an electorate that was about 70 percent white. If they had stuck with this electorate, Romney would have led by a clear margin in PPP’s final polls, since PPP’s late polls showed a (false) collapse in Obama’s support among Hispanic voters. But rather than show Romney ahead, the white share of the electorate plunged as low as 64 percent—keeping Obama ahead and much closer to surveys that called cell phones, like Quinnipiac, SurveyUSA, CNN, and Marist, which showed Obama ahead by an average of half a point, rounding away from PPP’s final poll. This pattern repeats itself across the battleground states and in PPP’s tracking poll. For more, refer to my original article. (A back-of-the-envelope sketch after this list shows how far a shift in the white share like that one can move the topline margin.)

  • After Romney secured the nomination, PPP’s early polls showed Obama up by a much wider margin than other polls. They showed Obama ahead by as much as 8 points nationally, 7 points in Ohio, 8 points in Virginia, 10 points in Iowa, 12 points in New Hampshire, 8 points in Nevada, 13 points in Colorado, and 4 points in Florida. By the end of the summer, Obama’s early edge vanished and PPP came into agreement with the other pollsters.

  • After Missouri Senate candidate Todd Akin’s “legitimate rape” comments, PPP led the way with a poll showing Akin up by 1 point. The first live interview poll came three days later and showed McCaskill up by 9. PPP would never show Akin ahead again.

  • In early 2010, PPP showed North Carolina Senator Richard Burr locked in a tight race—showing him ahead by as little as 2 points as late as July 2010. After a wave of polls showed Burr up by a huge margin, PPP showed Burr with a double digit lead for the first time. In the end, PPP came into perfect alignment with the polling averages and the results.

  • PPP’s polls are generally very near the average of non-IVR pollsters. In 2012, PPP’s presidential battleground state polls were an average of just 1.41 points away from the average of non-IVR pollsters, as defined by the Huffington Post custom trendline. They were closer to the non-IVR polling averages than to the result, which they missed by an average of 1.96 points.

  • Along the same lines, what are the odds that Jensen's gut happens to have no house effect? 



  • In situations where it’s tough to weight to polling averages, there’s evidence that PPP does worse than other pollsters.

    --PPP is worse than other pollsters in special congressional elections—with many of its biggest misses coming when it’s on its own. According to data compiled by the indispensable Harry Enten, 5 of the 7 worst special congressional election polls from the final two weeks of a campaign over the last nine years come from PPP, usually operating on its own. (Note: the other SC-01 poll comes from RedRacingHorses, which is not, well, a real pollster.) PPP also missed a special congressional primary by 19 points.


    --A study by two political scientists, Joshua Clinton and Steven Rogers, found that automated polls (of which PPP’s surveys represented the majority) were significantly less accurate when a live interview pollster was in the field.

    I should note that I have several reservations about the study (for that reason, I never cited it until my PPP piece), so take it with some grains of salt. But it seems more credible today than it did in spring 2012, so here's the graph.



    --Even when there is a polling average, PPP tends to fall farthest from it when it’s toughest to weight a poll to match the average. What situations are those? When a pollster needs to produce two accurate results, not just one. That happens, for instance, in states with both a presidential contest and a down-ballot race.

    So there’s exactly the pattern you’d expect if they were weighting toward the polls: (1) PPP nails the presidential results but sacrifices the close Senate races in the presidential battlegrounds, and (2) PPP nails the close Senate and gubernatorial contests but badly misses the presidential margin in states with a clear favorite for President. So, for instance, PPP nailed the presidential race in Florida, but missed the Senate race by 8 points; conversely, PPP nailed the gubernatorial contest in Washington, but missed the presidential race by 7 points. In other words, it just looks like PPP can’t have it both ways.



    Live interview pollsters didn't have the same issues. 



    I should emphasize that these Pollster averages are kind of shaky, since there aren't always very many live interview polls. But looking at it from the perspective of a "simple average," like RCP, basically yields the same result.
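
As promised above, here is the back-of-the-envelope arithmetic behind the Florida example, with invented support numbers rather than PPP's actual crosstabs. When the two groups vote very differently, the racial mix alone can decide who “leads.”

```python
def obama_margin(white_share, white_margin=-22.0, nonwhite_margin=48.0):
    """Overall Obama-minus-Romney margin, in points, for a given white share.

    The group margins are hypothetical; they only illustrate the mechanism.
    """
    return white_share * white_margin + (1 - white_share) * nonwhite_margin

print(round(obama_margin(0.70), 1))  # -1.0: Romney narrowly ahead with a 70 percent white electorate
print(round(obama_margin(0.64), 1))  # +3.2: Obama ahead with a 64 percent white electorate
# Same respondents, same group preferences, different weights, different
# "leader." That is the leverage a weighting decision carries.
```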

Do I think these bullet points prove anything? No, of course not. But I also don't think PPP's 15 cherry picked results tell us anything, either. Instead, I'm focused on the method.

***

No matter what, the extent to which PPP relies on its “gut” makes it “hard to distinguish the pollster’s judgments from the poll’s measurement of voter preferences,” as The Huffington Post’s Mark Blumenthal put it. In Nate Silver’s view, what PPP is doing “barely qualifies as POLLING at all.” But no one knows whether Jensen’s gut is telling him about the results, self-reports of the ’08 vote, polling averages, the racial composition of the electorate, or something else. And the fact that we don't know is quite troubling.

But here’s what we do know: PPP has adopted a methodology that gives it a tremendous amount of power to determine the composition of its sample, and it applies that power in a highly inconsistent but apparently accurate fashion. So ultimately, PPP is just asking us to trust it with that power. Unfortunately, withholding methodological details and deleting questions does not inspire confidence (and again, I'd write a nice article about PPP if it released the '08 questions). Neither does withholding polls based on whether it thought the results looked right. Justifying methodological choices in decidedly unscientific terms doesn't inspire confidence either, especially the admission that PPP changed its polls to avoid Republican criticism—which is terrible if true, but hard to believe.

And if PPP's record of opacity and inconsistency leaves you a little skeptical, then PPP's defense starts looking flimsy. PPP is basically saying "yes, our methodology reserves us the power to do whatever we want [potentially including weighting to polling averages], but you should trust us because our results are good." Well, your results would be good, wouldn't they? Certainly, cherry picking a handful of results doesn't prove anything. There are plenty of bad PPP results when it has the field to itself. And even a mediocre pollster, without a polling average crutch, should occasionally lead the pack.

But again, we really have no idea what PPP's doing. And with that much uncertainty, so many methodological problems, and the knowledge that Jensen's "gut" is in play, the safe bet is to stay away. I remember being told in 2004 that Zogby should be watched closely, since they got 2000 right. Not a great bet. It was just four years ago that Silver said he would “want [Rasmussen] with [him] on a desert island.” Four years later, Rasmussen’s probably the pollster Silver would banish to a desert island. When choosing between a pollster with good results and a pollster with good methodology, that’s a lesson worth remembering.

Updated on 9/20 to include a bullet point about PPP's house effect