The full numbers from Wednesday’s counts were released this morning by the Secretary of State’s (SoS) office. They’ve driven another nail in the coffin of R-71.
A total of 23,457 signatures have been checked, which is 17% of the total. Overall, 3,122 invalid signatures have been found and eliminated: 68 duplicates, 2,764 individuals not on the voting rolls, 221 signature mismatches, and 69 signatures for which the corresponding signature card is missing.
The cumulative error rate is 13.3% if the signatures with missing signature cards (hereafter “missing”) are thrown out, or 13.0% if the missing signatures are fully counted. As Goldy has explained, the cumulative error rate for the sample is misleadingly low. This is because duplicates are underrepresented in small samples: the chance of catching both members of a duplicate pair falls roughly as the square of the sampling fraction, not in proportion to it.
Given the number of duplicates found in this sample, the best estimate is that about 1.7% of the signatures on the petition are duplicates. That gives an estimated total rejection rate of 14.7% (treating all “missing” signatures as valid). A rejection rate over 12.4% keeps R-71 off the ballot.
Rather than focus on percentages, we can use the number of good and bad signatures to estimate the expected number of valid signatures. This figure shows the daily estimated number of valid signatures on the petition (red line) and the number of signatures required (blue line) for the measure to make the ballot:
These estimates are conservative because I am assuming all “missing” signatures will be treated as valid. (I’ve changed my methods a bit since yesterday—a journey through the methodological details begins below the fold).
The important things to notice here are:
- The estimates are stable rather than bouncing around from day to day. This means there is little evidence of non-sampling error. Such error would arise if batches of petitions showed widely different error rates (more generally, from non-independence among signatures on petitions and in batches of petitions).
- The 95% confidence intervals are now so small that sampling error is no longer relevant. If God plays dice, she clearly doesn’t want R-71 on the ballot.
The trends, so far, indicate that, short of a miracle, this measure will not qualify for the ballot.
At this point, I am going to totally geek-out and discuss methodological stuff. If you’re interested, venture below the fold.
There are three methodological issues in trying to predict the total number of valid signatures based on a smaller sample of signatures.
The first is estimating the total number of duplicates, the second involves actually estimating the number of valid signatures, and the third is finding the sampling error (i.e. the uncertainty that arises because we only sampled some of the signatures).
Estimating the total number of duplicates from a sample: If we assume that there is some constant probability that each person will sign twice (and no more than twice), the percentage of duplicate pairs found increases with the sample size. Specifically, scaling from the sample up to the full petition multiplies the observed duplicate count by nearly (but not quite) the square of the ratio of total signatures to sampled signatures.
The commonly accepted method for estimating duplicates from a sample was proposed by L.A. Goodman in 1949 (Annals of Mathematical Statistics 20:572-9). Goodman’s method estimates the total number of duplicate pairs (D) in a petition with N signatures from observing d duplicates from a sample of n signatures. The equation is:

D = d × N(N − 1) / [n(n − 1)]
The estimate is unbiased under the assumptions stated above.
For example, in yesterday’s data dump, there were d = 68 duplicates found in a sample of n = 23,457 in a petition of N = 137,689 signatures. An estimate of the number of duplicates is 68×137,689×137,688/(23,457×23,456) = 2,343.
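Goodman’s estimator is simple enough to compute directly. Here is a quick sketch in Python (the function name is mine):

```python
def goodman_duplicates(d, n, N):
    """Goodman (1949) estimate of total duplicate pairs D in a
    petition of N signatures, given d duplicate pairs observed
    in a sample of n signatures."""
    return d * N * (N - 1) / (n * (n - 1))

# Yesterday's numbers: 68 duplicates in a sample of 23,457
# from a petition of 137,689 signatures.
print(round(goodman_duplicates(68, 23457, 137689)))  # 2343
```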
The estimate of total duplicates could be thrown off if many people sign more than twice. But as of yesterday, nobody had done so.
Estimating valid signatures: This task is slightly indeterminate because it depends on the specifics of how the signatures are examined. It matters whether signatures are checked for validity first, with only valid signatures then checked for duplication, or whether duplicates are sought first and validity checked afterward; of course, there may well be a mixture. To me, the most logical processing order would be to first check whether the person is on the voter rolls, then check for duplication, and finally check the signature against the one on file.
Since I don’t know the exact method the SoS uses, let’s start with the assumption that all invalid signatures are first removed and then the remaining signatures are checked for duplication. Under these assumptions, an unbiased estimate of V valid signatures for a petition, given b invalid signatures in the sample, is:

V = N − (b/n)N − d × Nv(Nv − 1) / [(n − b)(n − b − 1)], where Nv = N − (b/n)N
(This estimator, and others that make different signature processing assumptions, are found in MM Whiteside and ME Eakin (2008) The American Statistician 62(1):17-21.)
The logic behind this equation is that we start with N signatures (first term on the right hand side), remove the invalid signatures (second term) and finally remove duplicates from the remaining valid signatures (third term).
Using this estimator with the numbers from yesterday’s count, we estimate 117,807 valid signatures.
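In code, the remove-invalid-then-deduplicate estimator looks something like this (a sketch under the assumptions above; here b pools the not-on-the-rolls and mismatch counts, with “missing” signatures treated as valid):

```python
def goodman_duplicates(d, n, N):
    return d * N * (N - 1) / (n * (n - 1))

def estimate_valid(N, n, b, d):
    """Estimate valid signatures: remove invalid signatures at the
    sampled rate, then remove duplicates estimated among the rest."""
    N_valid = N * (n - b) / n  # signatures surviving the validity check
    return N_valid - goodman_duplicates(d, n - b, N_valid)

# b = 2,764 not on the rolls + 221 mismatches = 2,985
print(round(estimate_valid(137689, 23457, 2985, 68)))
```

With yesterday’s counts this sketch lands within a couple dozen signatures of the 117,807 figure; small differences presumably reflect rounding in the published counts.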
I wrote this post yesterday, while sitting around waiting for a data dump. It didn’t come yesterday evening, but the wait gave me time to derive a hitherto unknown estimator for the number of valid signatures based on a more realistic processing scenario. The scenario is that the SoS office first checks to see if a person is in the voter rolls, then checks to see if that person has already been recorded as signing the petition, and finally checks the signature against the signature on file.
For a petition of N signatures, with a sample of size n, a total of r signatures from people not on the voter rolls, d duplicates, and m signature mismatches, we get this estimator for the total number of valid signatures:
So using yesterday’s totals, the equation gives us an estimated 118,330 valid signatures (assuming the “missing” signatures are all valid). So just by making more realistic assumptions about the processing sequence, the estimated number of valid signatures has increased by about 500. Still, the measure falls well short of the 120,577 needed to make the ballot.
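One way to write down a sequential estimator of this general kind is to apply each rejection step in turn, at the rate observed among the signatures that survive the previous step. This is my own reconstruction of the idea for illustration; with yesterday’s counts it does not reproduce the 118,330 figure exactly, so the estimator derived above evidently differs in its details:

```python
def goodman_duplicates(d, n, N):
    return d * N * (N - 1) / (n * (n - 1))

def estimate_valid_sequential(N, n, r, d, m):
    """Sequential sketch: voter-roll check, then duplicate check,
    then signature-mismatch check, each applied at the sampled rate.
    (An illustration of the processing-order idea, not necessarily
    the exact estimator derived in the post.)"""
    N1 = N * (n - r) / n                        # survive voter-roll check
    N2 = N1 - goodman_duplicates(d, n - r, N1)  # remove estimated duplicates
    return N2 * (n - r - d - m) / (n - r - d)   # survive mismatch check

print(round(estimate_valid_sequential(137689, 23457, 2764, 68, 221)))
```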
Estimating sampling error: Sampling error arises whenever we use a fraction of signatures to estimate properties of all signatures on a petition. Sampling error is largest for small sample sizes; once the sample becomes large relative to the number of signatures, it becomes negligible. We are now at the point where sampling error is quite small.
I’ve not found an explicit formula for the standard error of V (and haven’t derived one for V2), but an approximation can be found using simulation. Here is a brief description. For each data dump, I simulated 10,000 petitions being sampled at the observed sample size. For each simulated petition, I drew a random number of invalid signatures, based on the observed invalid signature count, and a random number of duplicates, based on the observed duplicate count and the remaining valid signatures in the sample. [Super geeky technical details: I draw a number of invalid signatures, b’, from a beta-binomial distribution with parameters b and (n – b). Then, I draw a number of duplicates, d’, from a beta-binomial distribution with parameters d and (n – b’ – d).] An estimate of V is then found from the equation above. The average of the 10,000 Vs is the expected number of valid signatures (V*), and the standard deviation of the Vs is an approximate standard error for V*. In practice, I used the upper and lower 2.5% quantiles of the 10,000 Vs as a 95% confidence interval for V*.
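A sketch of that simulation in Python, using NumPy. Each beta-binomial draw is composed as a Beta draw followed by a Binomial draw; the exact parameterization follows the bracketed description above, but treat this as an illustration rather than the code actually used (the intervals it produces need not match the post’s to the signature):

```python
import numpy as np

rng = np.random.default_rng(42)

def goodman_duplicates(d, n, N):
    return d * N * (N - 1) / (n * (n - 1))

def simulate_V(N, n, b, d, reps=10_000):
    """Approximate the sampling distribution of V by resampling the
    invalid and duplicate counts from beta-binomial distributions."""
    # b' ~ beta-binomial with parameters b and (n - b):
    # draw p from Beta(b, n - b), then b' from Binomial(n, p)
    b_sim = rng.binomial(n, rng.beta(b, n - b, size=reps))
    # d' ~ beta-binomial with parameters d and (n - b' - d)
    d_sim = rng.binomial(n - b_sim, rng.beta(d, n - b_sim - d))
    # re-estimate V for each simulated petition
    N_valid = N * (n - b_sim) / n
    V = N_valid - goodman_duplicates(d_sim, n - b_sim, N_valid)
    return V.mean(), np.percentile(V, [2.5, 97.5])

mean_V, ci = simulate_V(137689, 23457, 2985, 68)
```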
Extensions to V2 are straightforward. We now require three draws from a beta-binomial distribution, because the invalid signatures are broken into three categories (non-registered, duplicates, signature mismatches) to account for processing order.
Yesterday’s data from the SoS office had sampled 23,457 signatures. I computed V2* = 118,330 with a 95% confidence interval of 118,305 to 118,350. Clearly, the sampling error is negligible, since 120,577 valid signatures are needed for R-71 to qualify, and none of the simulated petitions qualified. Look at what happens when the sample size is smaller: last Friday’s data sampled 5,646 signatures, giving V2* = 117,933 with a 95% confidence interval of 112,348 to 122,059. In those simulations, 13% of the petitions qualified (assuming “missing” signatures all counted).
Last Friday, R-71 had a bit of a chance. The expected number of valid signatures fell short of the threshold, but that could have been bad luck in the sampling. Now, with a larger sample examined, the shortfall is definitely not a function of chance.