Poll analysis FAQ

This page describes the simulation analyses that appear on HorsesAss.

What are these simulations all about?

We do four types of Monte-Carlo-based simulation analyses here.

  1. Local and statewide elections and ballot measures.
  2. 2012 presidential election based on state head-to-head polls.
  3. 2012 senatorial elections based on state head-to-head polls.
  4. 2012 gubernatorial elections based on state head-to-head polls.

Essentially, we have a hobby of collecting polls and using them to simulate election results. Darryl has been doing local elections for many years on HA. The more intensive effort to analyze presidential elections began in October 2007 (at HominidViews) for the 2008 presidential election. Darryl began systematically collecting state head-to-head polls and using them to assess the state of the election (the score, if you will). Later he added the 2008 senatorial and gubernatorial elections. This FAQ mostly discusses the presidential election, although the methods apply to the other simulation analyses presented here. For the 2012 elections, we have moved the entire operation to HorsesAss and expanded the number of people involved.

How are you doing the electoral college analyses?

The analyses are Monte-Carlo simulations of the Electoral College outcome based on state head-to-head polling data. The results are driven by poll results (or the average of the 2004 and 2008 elections for states with no polls). Essentially, we simulate a large number of elections (typically 100,000) for all states, D.C., and Nebraska and Maine districts based, whenever possible, on recent polling data. The winner for each state is determined randomly according to the proportions of people selecting each candidate in recent polls. After all state elections have been simulated, we tally the number of Electoral College votes for each candidate. Details of the methods follow.

What polling data do you use?

For the presidential race, we collect state polls in which Barack Obama is matched-up, head-to-head, with the most likely Republican challenger. For the 2008 election season we did multiple match-ups: Obama v. McCain, Obama v. Romney, Obama v. Huckabee, Clinton v. McCain, Clinton v. Romney, …, etc. This election season, we are just going to go with the most likely party nominee. Currently this is Obama versus Romney. If this changes, we will include one or more other match-ups.

For the senatorial and gubernatorial elections we do the same thing using data for each state in which there is an election.

Where do your polling data come from?

Several places. We find polls on the web sites of well-known polling firms. However, if there is sufficient information, a secondary source (e.g., a news summary of a poll) can be acceptable. The polling firms that most commonly release head-to-head polls are SurveyUSA, Rasmussen and Quinnipiac. But there are many, many more polls and polling companies. Some of them are listed here. Frequently, we are made aware of a poll through a polling aggregation site like Atlas of US elections, Pollster.com, or Real Clear Politics, but we try to find (and link to) original poll reports.

How do you select which polls to include?

To be considered acceptable, each poll must come from a reputable pollster and must include the following information:

  1. The name of the poll or polling firm
  2. The inclusive dates on which the poll was taken
  3. The state in which the poll was taken
  4. The number of individuals polled [a]
  5. The count or percentage of individuals supporting each candidate

[a] Most reputable polls include the number of individuals sampled. As I write this (Dec 2011), all polls have included this number, but some poll will surely omit it before we are through. When a 95% margin of error (MOE) is provided instead, I can estimate the number of sampled individuals as (0.98/MOE)^2, with the MOE expressed as a proportion. This is based on the standard error of a binomial distribution and, as is commonly done by pollsters, assumes the true proportion for each candidate is 0.5. This is not ideal; we prefer the numbers or percents.
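The margin-of-error footnote can be sketched in a few lines of Python. This is only an illustration: the function name `sample_size_from_moe` is made up here, and it assumes the MOE is given as a proportion (0.04 for ±4%) and that the result is rounded to a whole respondent.

```python
def sample_size_from_moe(moe: float) -> int:
    """Estimate a poll's sample size from its 95% margin of error.

    `moe` is a proportion, e.g. 0.04 for a +/-4% margin (an assumption of this sketch).
    """
    if moe <= 0:
        raise ValueError("margin of error must be positive")
    # 1.96 * sqrt(0.5 * 0.5 / n) = moe   =>   n = (0.98 / moe) ** 2
    return round((0.98 / moe) ** 2)


print(sample_size_from_moe(0.04))  # 600
```

A ±4% margin of error therefore corresponds to roughly 600 respondents.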

We usually do not include internet-based polls or polls from discredited pollsters (say…Research 2000 or Strategic Vision). We also ignore polls released by party organizations or candidates. The problem is that such polls are released strategically. Including them would bias the results.

Some polls include results for multiple categories. For example, “all adults” and a subset of “registered voters”. We take the “registered voters” sample. When results are given for both “registered voters” and “likely voters” we take the “likely voters” results.

Do you include push polls?

No. A push poll is not a real poll. Rather, it is a marketing tool. In any case, results of push polls are rarely, if ever, published.

Do your simulations include all polls in each state?

No. We use recent polls whenever possible. For example, in December 2007, “recent” meant polls that were up to one month old; that is, we used a one-month “window”. Then, beginning the following August, polls were coming in fast and furiously, and the window was reduced to three weeks. It was subsequently reduced to 14 days in September 2008. A 10-day window was used beginning Oct 21, 2008, when there were ten days remaining. The final reduction was to a one-week window at a week before the election. We will do something similar for 2012.

For example, if there are two polls in the last three weeks for Missouri and four older polls, only the two “recent” polls are included in the simulations.

What if there are no polls that have been conducted in the “current poll” window?

In that case, we use the single most recent poll taken, even if it was taken some months ago.
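The window rule and its fallback can be sketched as follows. This is an illustration, not our production code: the function name `select_polls`, the dict layout with an `end` date, and the choice to judge recency by the poll's last day in the field are all assumptions of the sketch.

```python
from datetime import date, timedelta


def select_polls(polls, as_of, window_days):
    """Return the polls inside the current window, else the single most recent poll.

    `polls` is a list of dicts with an 'end' key holding a datetime.date
    (a hypothetical layout chosen for this sketch).
    """
    past = [p for p in polls if p["end"] <= as_of]
    cutoff = as_of - timedelta(days=window_days)
    recent = [p for p in past if p["end"] >= cutoff]
    if recent:
        return recent
    if past:
        # No poll in the window: fall back to the most recent poll, however old.
        return [max(past, key=lambda p: p["end"])]
    return []  # no polls at all: the caller falls back to past election results


polls = [{"end": date(2008, 1, 2)}, {"end": date(2007, 11, 1)}]
print(select_polls(polls, as_of=date(2008, 2, 15), window_days=30))
```

Here neither poll falls within the 30-day window, so only the 2 Jan 2008 poll is returned.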

What if there are no polls whatsoever taken in the state?

In that case, that state always goes the way it did in the past. We use the average of the 2004 and 2008 election results. States that went for Bush and McCain are always assumed to go for the Republican nominee. States that went for Kerry and Obama are assumed to go for Obama. When there are mixed results (say, Bush, Obama) we average the percentages based on party and give that state’s electors to the “winning” party.

For example, as this is being written in late Dec, 2011, Idaho has no known polls with head-to-head match-ups. Since Idaho went for Bush and McCain, I assume Obama loses in Idaho for every simulated election.
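For the mixed-result case, the averaging can be sketched like this. The function name `fallback_winner` is hypothetical, the percentages in the example are made up purely for illustration (they are not real election results), and the handling of an exact tie in the averages is an assumption, since nothing above specifies it.

```python
def fallback_winner(dem_pcts, rep_pcts):
    """Pick the party with the higher average vote share across past elections.

    An exact tie in the averages goes to "R" here; that choice is an
    assumption of this sketch, not a documented rule.
    """
    dem_avg = sum(dem_pcts) / len(dem_pcts)
    rep_avg = sum(rep_pcts) / len(rep_pcts)
    return "D" if dem_avg > rep_avg else "R"


# Made-up shares for an unpolled state with mixed results in the two elections.
print(fallback_winner(dem_pcts=[45.0, 52.0], rep_pcts=[54.0, 47.0]))  # R
```

In every simulated election, that state's electors would then go to the returned party.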

Why use past election results for states lacking any polls?

This seems like the best strategy given an absence of polls. The states with no polling are those that the media and polling firms believe are highly predictable, so there is no reason to pay good money to conduct a poll. They’re probably right. For example, D.C. is, almost certainly, not going to go for the Republican candidate in 2012, so nobody is going to pay for polling D.C. until we get much closer to the election.

Of course, as the election season goes on, there will be fewer and fewer unpolled states. In 2004, all 50 states plus D.C. were eventually polled, but there was only a single poll in some cases (like D.C.). In 2008, there were polls conducted in every state & D.C. Again, there was only one poll for D.C.

How can I see the polls being used?

From the map, click on a state to jump to the results table. From there, click on the number in the “# polls” column, and you will be taken to a list of polls.

How are the simulations done?

For each simulation, an election is “held” in each state (plus D.C. plus districts for Nebraska and Maine) using “current” polls. For the presidential analyses, state results are then combined as would happen in the electoral college—winner takes all in 48 states plus D.C., and by the special rules for Nebraska and Maine.

As an example, in Feb 2008, there was a single poll conducted in Maryland for the Obama–McCain match-up. A Rasmussen poll was conducted on 2 Jan 2008 and surveyed 500 voters, finding that 42% support McCain, 48% support Obama and 10% were either undecided or supported someone else. (This poll was “old” because it was more than a month old, but it was the most current poll at the time, so it was the best information available.) Here are the steps using data from the 2008 election:

  1. The number of people supporting each candidate is found. Some polling companies (like SurveyUSA) make the actual numbers available. Otherwise the numbers are computed: 500*0.48 gives 240 votes for Obama and 500*0.42 gives 210 McCain supporters in the poll. There were 240 + 210 = 450 decided voters.
  2. The computer normalizes the percentage who voted for each candidate. For Obama it was 240/450 = 53.33%, and it was 46.67% for McCain. “Normalized” means that the percentages for Obama and McCain sum to 100%.
  3. The estimated probability of a voter voting for Obama in Maryland in Jan was p = 0.533. But since p itself is estimated from a sample, p is more properly described as a distribution of possible Obama preferences. That is, we really have a distribution of ps.
  4. Thus, in each simulation for each poll, the computer randomly draws a value from the distribution of ps (let’s call it p’). So for the current simulation we might draw the value p’ = 0.527. Technical details: We draw p’ from a beta distribution with parameters (Dvotes + 1) and (Rvotes + 1). So, in this example we draw randomly from a beta distribution with parameters 241 & 211. This corresponds to a binomial distribution with a uniform (uninformative) prior distribution on p.
  5. Now, we simulate 450 voters, who each have a p’ (here, a 52.7%) probability of voting for Obama and 1 – p’ probability of voting for McCain. How is this done? The easy way is to draw a uniform random number between 0 and 1. If the number is less than 0.527 then the vote goes to Obama, otherwise it is a vote for McCain. The process is repeated 450 times. Technical details: In practice we use a much faster method that yields identical results. A number of votes for the candidate is drawn from a binomial quantile function with a uniform random number as its argument and parameters N and p’ (here, 450 and 0.527).

When there are multiple current polls this process is repeated for each poll in each state and the number of votes for each candidate tallied.
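The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not our actual code: the function names are made up, it uses the slow "easy way" of drawing one uniform number per voter rather than the binomial quantile shortcut, and counting a simulated state-level tie as a non-win for the Democrat is an assumption of the sketch.

```python
import random


def simulate_state_once(polls, rng):
    """Hold one simulated election in a state.

    `polls` is a list of (dem_votes, rep_votes) counts of decided respondents,
    one pair per current poll.
    """
    dem_total = rep_total = 0
    for dem_votes, rep_votes in polls:
        n = dem_votes + rep_votes
        # Step 4: draw p' from Beta(Dvotes + 1, Rvotes + 1).
        p_prime = rng.betavariate(dem_votes + 1, rep_votes + 1)
        # Step 5 (easy way): each simulated voter picks the Democrat with probability p'.
        dem_sim = sum(rng.random() < p_prime for _ in range(n))
        dem_total += dem_sim
        rep_total += n - dem_sim
    return dem_total, rep_total


def dem_win_probability(polls, n_sims=2000, seed=0):
    """Fraction of simulated elections the Democrat wins outright."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        dem, rep = simulate_state_once(polls, rng)
        wins += dem > rep  # a tie is not counted as a win (assumption)
    return wins / n_sims


maryland = [(240, 210)]  # the Jan 2008 Rasmussen poll from the example
print(dem_win_probability(maryland))
```

With several current polls in a state, each poll gets its own p’ draw and the simulated votes are pooled before the winner is decided.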

How are you incorporating undecided voters in your analysis?

We ignore undecided voters. In the absence of any information, the method assumes that the undecided fraction would break as the decided sample breaks.
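In code, dropping the undecided respondents amounts to renormalizing over the decided ones. A small sketch, with a hypothetical function name and the assumption that counts are rounded to whole respondents when only percentages are published:

```python
def decided_counts(sample_size, dem_pct, rep_pct):
    """Convert poll percentages to counts of decided respondents and normalized shares."""
    dem_votes = round(sample_size * dem_pct / 100)
    rep_votes = round(sample_size * rep_pct / 100)
    decided = dem_votes + rep_votes
    return dem_votes, rep_votes, dem_votes / decided, rep_votes / decided


dem, rep, dem_share, rep_share = decided_counts(500, 48, 42)
print(dem, rep, round(dem_share, 4), round(rep_share, 4))  # 240 210 0.5333 0.4667
```

The 10% who were undecided or supported someone else simply do not appear in the counts passed to the simulation.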

Maine and Nebraska use a different method of assigning electoral college votes. Shouldn’t you treat them differently?

All states but Maine and Nebraska use winner-take-all for electoral votes. For Maine’s two and Nebraska’s three districts, one elector is given to the candidate who wins the district’s popular vote. The other two electors go to the candidate who wins the state’s popular vote.
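The district rule can be sketched as a tiny allocation function. The name `allocate_split_state` and the "D"/"R" labels are just conventions chosen for this illustration.

```python
def allocate_split_state(statewide_winner, district_winners):
    """Electoral votes by party for a state using the district method (Maine, Nebraska)."""
    votes = {"D": 0, "R": 0}
    votes[statewide_winner] += 2      # two electors follow the statewide popular vote
    for winner in district_winners:   # one elector per congressional district
        votes[winner] += 1
    return votes


# Nebraska in 2008: Republican statewide, with one of three districts going Democratic.
print(allocate_split_state("R", ["R", "D", "R"]))  # {'D': 1, 'R': 4}
```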

We ignored this little detail in the 2008 election season because neither state had ever split its electoral votes among candidates in the past. But one district in Nebraska did split from the statewide vote in 2008. As a consequence, our final mean electoral outcome was off by a single electoral vote. Doh! We will consider Nebraska and Maine districts for 2012. (As of 2 Jan 2012, there is no district-level polling data in Maine.)

Are you doing your analyses to favor a particular candidate or party?

We are most certainly not neutral on politics, but these election analyses are done as objectively as we can possibly make them.

Are you trying to predict the result of the 2012 election?

No. The analyses make no projections to election day, 2012 (except the one we run on election day, 2012). Rather, we view this as showing what the state head-to-head polls indicate would happen if the election were held today.

Let’s use a sports metaphor. During a basketball game, the current score does not always predict the winner. Rather, it provides information on the past and current performance of each team. We get some indication of the eventual winner, but only as the end of the game approaches or the point difference gets very large. Still, do you think it would be acceptable to not give the score until late in the game? Probably not. Fans want to see the score right from the start.

Likewise in an election contest, these analyses serve as a score for each team. I fully expect the score and the point spread to change as the game goes on, but I want to know who is in the lead and by how much at every point of the game.

Aren’t these exercises futile early in the election season when the parties are focused on the primary instead of messaging?

No. Likewise, I don’t think the score should remain hidden from spectators for the first half of a basketball game. The strategy may change throughout the game, but the strategy adopted for the second half will be based on the current “score.”

In fact, the ebbs and flows over time—particularly with respect to events and media coverage—are fascinating. For example, Giuliani’s 2008 fall from grace in the polls after LoverGate, and in the absence of any showing in IA and NH was nothing short of stunning. That sort of thing is at least as interesting as any attempts to predict a final outcome. The 2012 season should be filled with many similar “events.”

Why not use national head-to-head polls instead?

National polls have the advantage of being current—that is, people express their support for each candidate all at the same time. The state head-to-head polls suffer because some polls are older, and public opinion may have changed since the older polls were taken. But the national head-to-head polls have a big disadvantage. Most importantly, they predict the outcome of a national popular vote. We don’t elect our presidents by popular vote. As we learned in 2000, the national popular vote doesn’t always give the same election outcome as the Electoral College vote.

How are you incorporating the margin of error of each poll in your analysis?

The margin of error is inherently incorporated into the analyses. This is done by simulating elections in each state using the number of polled individuals, and by drawing a new value of p’ (described above) for each poll in every simulated election.

What is the distribution of electoral votes?

The “distribution of electoral votes” graph looks like this (from the 2008 election):

To produce this graph, the computer saves the electoral vote from each of the (typically, 100,000) simulated elections. Then, the relative frequency (on the y-axis) of each possible electoral vote outcome (x-axis) is plotted. The graph can tell you several things:

  1. The highest bar is the most likely outcome for an election—this is the mode of the distribution.
  2. The vertical dashed line is simply a marker for 269 votes, which reflects a tie in the Electoral College. The blue bars to the right of the center line are wins for the Democrat and the red bars to the left are wins for the Republican.
  3. If you squint a bit you can estimate where the graph would balance on a fulcrum. That would be your estimate of the mean (or average, or expected) electoral vote total.
  4. The point on the x-axis where half of the bar mass falls above and half falls below is the median electoral vote.
  5. The spread of the distribution is an indication of how variable the outcomes are.
  6. The raggedness of the bars reflects the differing numbers of votes per state with an Electoral College system. With 100,000 simulations, we would expect a pretty smooth distribution if a popular vote was being simulated. Not necessarily so with an electoral college system because states are won wholesale.
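The quantities read off the graph can also be computed directly from the saved electoral vote totals. A sketch using the standard library follows; the function name `summarize` is hypothetical and the five totals in the example are a made-up toy list, not simulation output.

```python
from collections import Counter
from statistics import mean, median


def summarize(ev_totals):
    """Relative frequencies, mode, mean and median of simulated electoral vote totals."""
    counts = Counter(ev_totals)
    n = len(ev_totals)
    rel_freq = {ev: c / n for ev, c in sorted(counts.items())}  # bar heights by x value
    return {
        "rel_freq": rel_freq,
        "mode": max(counts, key=counts.get),  # the highest bar
        "mean": mean(ev_totals),              # the balance point
        "median": median(ev_totals),          # half the mass on each side
    }


toy = [270, 270, 285, 300, 260]  # made-up totals for illustration only
result = summarize(toy)
print(result["mode"], result["mean"], result["median"])  # 270 277 270
```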

How are the trend graphs produced?

The trend graphs look like this:

The graph results from simulations done over time. This graph was created by simulating weekly elections over an eight month period. Basically, this comes from a series of 100,000 simulated elections for every week between 01 Dec 2007 and 01 Aug 2008. For each simulated election:

  1. Polls collected in the month preceding the focal week are included
  2. If no polls occur in the month preceding the focal week, the most recent poll taken prior to that week is used
  3. Or if no polls are available prior to the focal week, all the electoral votes are assigned according to the outcome of the 2004 election

The graph shows the median electoral vote count (purple line) for Obama. The blue lines enclose the central 75% mass of Obama’s electoral vote counts, and the green lines enclose the central 95% of Obama’s electoral vote counts.
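For a single week, the three lines can be sketched as percentiles of that week's simulated totals. The function names are hypothetical, and the simple nearest-rank percentile used here is an assumption; the real graphs may use a different percentile convention.

```python
def percentile(sorted_vals, q):
    """Nearest-rank style percentile of an already sorted list, with q in [0, 1]."""
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[idx]


def trend_point(ev_totals):
    """Median plus central 75% and 95% bands for one week's simulated totals."""
    s = sorted(ev_totals)
    return {
        "median": percentile(s, 0.5),
        "band75": (percentile(s, 0.125), percentile(s, 0.875)),
        "band95": (percentile(s, 0.025), percentile(s, 0.975)),
    }


# Evenly spread toy totals, only to show which percentiles are used.
print(trend_point(list(range(200, 401))))
```

Repeating this for every focal week and joining the points gives the purple, blue and green lines.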

What do the colors mean?

The “party colors” are used at several intensities to convey information in four places:

  1. On the map (like this)
  2. On the state results summary table (like this)
  3. On the poll list (like this)
  4. On lists of polls for an individual state (usually in poll results posts like this)

For the first two cases (map and results table), the colors are coded according to the probability that the Democrat wins based on the actual results of the simulation analysis:

Probability that the Democrat wins (color swatches not reproduced):

From      To
100%      99.999%
99.999%   90%
90%       60%
60%       50%
Exactly 50%
50%       40%
40%       10%
10%       0.001%
0%        0.001%

The poll results table and state poll lists are different, because the simulation results are not saved by poll (and the state poll lists don’t involve simulations at all). Instead the colors reflect the result of a t-test of the hypothesis that the Democratic result is greater than the Republican result. Technically, we compute

t test

where d is the normalized Democratic proportion, r is the normalized Republican proportion and n is the number of individuals who responded for either the Democratic or Republican candidate. The T statistic is compared to a Student’s t distribution to decide the probability of the Democrat winning given the observed poll results. The same cut-offs as in the table above are used.
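The exact formula is not reproduced in this text, so the following is only a sketch of one standard way such a statistic could be computed from d, r and n. Two things are assumptions here and should be read as such: the form T = (d - r) / (2 * sqrt(d * r / n)), and the use of the normal distribution from the standard library as a stand-in for Student’s t (the two are very close for samples in the hundreds). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def dem_lead_probability(dem_votes, rep_votes):
    """Approximate probability that the Democrat leads, given one poll's decided counts.

    Assumed form: T = (d - r) / (2 * sqrt(d * r / n)), compared to a normal
    distribution as an approximation to Student's t.
    """
    n = dem_votes + rep_votes
    if dem_votes == 0:
        return 0.0
    if rep_votes == 0:
        return 1.0
    d = dem_votes / n  # normalized Democratic proportion
    r = rep_votes / n  # normalized Republican proportion
    t_stat = (d - r) / (2 * sqrt(d * r / n))
    return NormalDist().cdf(t_stat)


print(round(dem_lead_probability(240, 210), 3))
```

The resulting probability would then be matched against the same cut-offs as the simulation-based colors.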

What is that distorted map thing?

This is a cartogram. Here is one from 2008:

And here is the style for 2012:

The cartogram scales the area of each state according to its electoral vote total. Thus, Alaska is scaled to the same size as Washington D.C.—both have three votes in the Electoral College. The cumulative area covered by each color on the cartogram is an honest representation of the proportion of electoral votes that would be expected if a general election were held.

For more information on cartograms, check out Mark Newman’s web page or Victor L. Vescovo’s book The Atlas of World Statistics (2006, published by Caladan Press).

Note that the cartogram for the electoral college changed between 2008 and 2012. This is because the U.S. census changed the allocation of Representatives among the states, which changed the electoral college electors from each state.

Why do you assign electoral college ties to the Republican?

In the event of a 269–269 tie in the Electoral College, the selection of the next President and Vice President is specified by the 12th Amendment of the U.S. Constitution. The new House of Representatives would vote (using an unorthodox single-vote-per-state method) for the President and the Senate would select the Vice President. Since, at this point, it seems unlikely that the House will be under Democratic control after the November, 2012 election, I assign ties to the Republican candidate.
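In the tally, assigning ties to the Republican just means the Democrat needs 270 or more electoral votes to be counted as the winner. A sketch, with a hypothetical function name and a made-up list of simulated totals:

```python
def dem_win_fraction(dem_ev_totals, total_ev=538):
    """Fraction of simulated elections won by the Democrat, with 269-269 ties going to the Republican."""
    needed = total_ev // 2 + 1  # 270 of 538
    return sum(ev >= needed for ev in dem_ev_totals) / len(dem_ev_totals)


# Made-up totals: one tie, two Democratic wins, one Republican win.
print(dem_win_fraction([269, 270, 300, 250]))  # 0.5
```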