Claims of Irregularities in NYT's Edison Election Data

Mick West

Administrator
Staff member
"Data Scraping" is extracting data from a web page, either from the HTML of the web page itself, or from the data sources used by that web page.

The New York Times election result pages have easily readable data sources, including time-stamped state-of-the-race data, and so have been used by independent analysts to look at the trends as the results came in.

One claim is that votes are being "swapped". The claim rests on data like this:

Code:
                    {
                        "eevp": 42,
                        "eevp_source": "edison",
                        "timestamp": "2020-11-04T04:07:43Z",
                        "vote_shares": {
                            "bidenj": 0.42,
                            "trumpd": 0.566
                        },
                        "votes": 2984468
                    },
                    {
                        "eevp": 42,
                        "eevp_source": "edison",
                        "timestamp": "2020-11-04T04:08:51Z",
                        "vote_shares": {
                            "bidenj": 0.426,
                            "trumpd": 0.56
                        },
                        "votes": 2984522
                    },
Here are two data points, at 04:07:43Z and 04:08:51Z. The fraction of the vote share is given to three decimal places, and the exact vote total is given. This was used to calculate how many votes each candidate had at both points, and then the change between them.

Biden:
2984468 * 0.42 = 1253477
2984522 * 0.426 = 1271406
1271406 - 1253477 = 17929

Trump:
2984468 * 0.566 = 1689209
2984522 * 0.56 = 1671332
1671332 - 1689209 = -17877

So, assuming the data is accurate, at this one point in time Trump's vote went down nearly 18K and Biden's went up nearly 18K, essentially taking 18K votes away from Trump and giving them to Biden.
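
Here's a minimal Python sketch of that reverse calculation, using just the two data points quoted above (rounding to whole votes, since the shares only carry three decimal places):
Code:
# The two updates quoted above, straight from the NYT JSON feed
points = [
    {"votes": 2984468, "bidenj": 0.420, "trumpd": 0.566},  # 04:07:43Z
    {"votes": 2984522, "bidenj": 0.426, "trumpd": 0.560},  # 04:08:51Z
]
for cand in ("bidenj", "trumpd"):
    # Reverse-calculate each candidate's total at both points
    before = round(points[0]["votes"] * points[0][cand])
    after = round(points[1]["votes"] * points[1][cand])
    print(f"{cand}: {before} -> {after}, change {after - before:+d}")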

The anonymous person in the link above wrote some code to find all the times Trump went down while Biden went up, and to add together all these declines. The result was a loss of 220,833 votes for Trump (all numbers are with the data I downloaded today).

The immediate problem here is that it's ignoring losses for Biden when Trump gained votes. If we calculate those, we have a loss of 25,712 for Biden, not really changing things.

But what if we add up ANY decline? After all, a decrease in your votes is an anomaly; it does not really matter if it happens simultaneously with a rise in your opponent's votes. If we add up ALL the declines, we have: Trump Lost = -408547 and Biden Lost = -637917. So adjustments down hurt Biden more than Trump.

But I said earlier "assuming the data is accurate" - and clearly, if there are negative votes, then something is going wrong. Let's take a step back and look at the bigger picture. If we extract all the calculated vote totals and graph them, it looks like this:

[Attached graph: calculated vote totals for Biden and Trump over time]


Blue is Biden, he starts out rapidly increasing to about 450K, then there's a sudden jump to around 700K and a more gradual rise, then a correction down, then another sharp spike and correction that's mirrored with a similar, but smaller drop/spike/correction for Trump. After that things settle down.

So clearly there's some bad data there, and errors that were corrected. The worst of it happens in the first 60 data points. If we exclude those and just look at declines after that we have Trump Lost = -93091 and Biden Lost = -24619.
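
For reference, here's a standalone sketch of that exclusion (it assumes the same JSON layout as the scripts below; the cutoff of 60 is just the eyeballed end of the bad early data):
Code:
import json
import sys

SKIP = 60  # eyeballed: the worst corrections are in the first 60 data points

def declines_after(name, skip=SKIP):
    # Sum each candidate's calculated declines, ignoring the first `skip` entries
    with open(name + '.json', encoding="utf8") as f:
        ts = json.load(f)["data"]["races"][0]["timeseries"]
    lost = {"bidenj": 0, "trumpd": 0}
    for prev, curr in zip(ts[skip:], ts[skip + 1:]):
        for cand in lost:
            p = prev["votes"] * prev["vote_shares"][cand]
            c = curr["votes"] * curr["vote_shares"][cand]
            if c < p:
                lost[cand] += int(c - p)
    print(lost)

declines_after(sys.argv[1])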

The problem with this as evidence of election fraud is that it does not really make any sense. Why would you so blatantly subtract votes if the whole point was not to get caught? Where are these subtractions supposedly happening in the counting process? What happens if there's a recount?

The data here is from the NYT. They get data from Edison Research:
https://www.edisonresearch.com/election-polling/

The data we see is formatted for display purposes. The raw data would have actual numbers, not a low-resolution share of the vote. Actual results are tallies at a county level, different outlets report the county results continually, and changes should be identifiable, as they would stand out at the county level. Unfortunately, all of this is hidden in the simplistic and possibly buggy dataset from the NYT.

Ultimately the problem here is a lack of information about where the numbers are coming from, and what factors into them. How often do counties issue a correction? What actually happened at the various points on this graph? The people collecting this data have the capability to explain this issue (and quite possibly already have). While it may not seem important, things like this become the foundation of long-lasting conspiracy theories, and they really need to be addressed to prevent them from creating harm in the future.


Here's a short bash script that can be used to download the NYT data.
Code:
## These are the names as used in the NYT API. 50 US states
declare -a names=("alabama" "alaska" "arizona" "arkansas" "california" "colorado" "connecticut" "delaware" "florida" "georgia" "hawaii" "idaho" "illinois" "indiana" "iowa" "kansas" "kentucky" "louisiana" "maine" "maryland" "massachusetts" "michigan" "minnesota" "mississippi" "missouri" "montana" "nebraska" "nevada" "new-hampshire" "new-jersey" "new-mexico" "new-york" "north-carolina" "north-dakota" "ohio" "oklahoma" "oregon" "pennsylvania" "rhode-island" "south-carolina" "south-dakota" "tennessee" "texas" "utah" "vermont" "virginia" "washington" "west-virginia" "wisconsin" "wyoming")

l=${#names[@]}

for ((i=0; i<${l}; i++));
do
    echo $i " - " ${names[$i]}

    # -nc: don't re-download a file that has already been saved
    wget -nc https://static01.nyt.com/elections-assets/2020/data/api/2020-11-03/race-page/${names[$i]}/president.json -O Race-Pres-${names[$i]}.json
    wget -nc https://static01.nyt.com/elections-assets/2020/data/api/2020-11-03/state-page/${names[$i]}.json -O State-Pres-${names[$i]}.json

    # Pretty-print each download into a -PP.json copy for easier reading
    cat Race-Pres-${names[$i]}.json | python -m json.tool > Race-Pres-${names[$i]}-PP.json
    cat State-Pres-${names[$i]}.json | python -m json.tool > State-Pres-${names[$i]}-PP.json
done

and a modified version of the script used to tally the subtractions
Code:
import json
import sys


def load_timeseries(name):
    # Load a downloaded NYT race file and return the presidential timeseries
    with open(name + '.json', encoding="utf8") as f:
        x = json.load(f)
    return x["data"]["races"][0]["timeseries"]


def findfraud(NAME):
    # Original approach: sum Trump's declines, but only those where Biden
    # simultaneously increased (the "switched votes" premise)
    ts = load_timeseries(NAME)
    TotalVotesLost = 0
    for i in range(1, len(ts)):
        prevTrump = ts[i-1]["votes"] * ts[i-1]["vote_shares"]["trumpd"]
        currTrump = ts[i]["votes"] * ts[i]["vote_shares"]["trumpd"]
        prevBiden = ts[i-1]["votes"] * ts[i-1]["vote_shares"]["bidenj"]
        currBiden = ts[i]["votes"] * ts[i]["vote_shares"]["bidenj"]
        if currTrump < prevTrump and currBiden > prevBiden:
            print("Index : " + str(i) + " Past Index : " + str(i-1))
            print(currTrump - prevTrump)
            TotalVotesLost += currTrump - prevTrump
    print(TotalVotesLost)


def ff1(NAME):
    # Sum ALL declines for each candidate, regardless of whether the other
    # candidate rose, fell, or stayed flat at the same time
    ts = load_timeseries(NAME)
    TrumpVotesLost = 0
    BidenVotesLost = 0
    for i in range(1, len(ts)):
        prev, curr = ts[i-1], ts[i]
        prevBiden = prev["votes"] * prev["vote_shares"]["bidenj"]
        prevTrump = prev["votes"] * prev["vote_shares"]["trumpd"]
        currBiden = curr["votes"] * curr["vote_shares"]["bidenj"]
        currTrump = curr["votes"] * curr["vote_shares"]["trumpd"]

        if currTrump < prevTrump and currBiden < prevBiden:
            print(curr["timestamp"] + ": BOTH Loss: Trump " + str(int(currTrump - prevTrump)) + " Biden: " + str(int(currBiden - prevBiden)))
            TrumpVotesLost += int(currTrump - prevTrump)
            BidenVotesLost += int(currBiden - prevBiden)
        else:
            if currTrump < prevTrump:
                print(curr["timestamp"] + ": TRUMP Loss: " + str(int(currTrump - prevTrump)) + " Biden: " + str(int(currBiden - prevBiden)))
                TrumpVotesLost += int(currTrump - prevTrump)
            if currBiden < prevBiden:
                print(curr["timestamp"] + ": Biden Loss: " + str(int(currBiden - prevBiden)) + " Trump " + str(int(currTrump - prevTrump)))
                BidenVotesLost += int(currBiden - prevBiden)
    print("Trump Lost = " + str(int(TrumpVotesLost)))
    print("Biden Lost = " + str(int(BidenVotesLost)))


def ffDump(NAME):
    # Dump the whole timeseries as CSV, marking points where one candidate's
    # calculated total fell while the other's rose
    ts = load_timeseries(NAME)
    print("Time,Biden Share,Trump Share,Total Votes,Biden Votes,Trump Votes,Biden Mark,Trump Mark")
    for i in range(len(ts)):
        curr = ts[i]
        currBiden = curr["votes"] * curr["vote_shares"]["bidenj"]
        currTrump = curr["votes"] * curr["vote_shares"]["trumpd"]
        trumpMark = 0
        bidenMark = 0
        if i > 0:  # the first row has no predecessor to compare against
            prev = ts[i-1]
            prevBiden = prev["votes"] * prev["vote_shares"]["bidenj"]
            prevTrump = prev["votes"] * prev["vote_shares"]["trumpd"]
            if currTrump < prevTrump and currBiden > prevBiden:
                trumpMark = currTrump
            if currBiden < prevBiden and currTrump > prevTrump:
                bidenMark = currBiden

        print(curr["timestamp"] + "," + str(curr["vote_shares"]["bidenj"]) + "," + str(curr["vote_shares"]["trumpd"]) + "," + str(curr["votes"]) + "," + str(int(currBiden)) + "," + str(int(currTrump)) + "," + str(bidenMark) + "," + str(trumpMark))


def findfraud2(NAME):
    # Sum all declines in the TOTAL vote count
    ts = load_timeseries(NAME)
    TotalVotesLost = 0
    for i in range(1, len(ts)):
        if ts[i]["votes"] < ts[i-1]["votes"]:
            TotalVotesLost += ts[i]["votes"] - ts[i-1]["votes"]
    print(TotalVotesLost)


##findfraud(sys.argv[1])
ff1(sys.argv[1])
##ffDump(sys.argv[1])

This is modified to run from the command line with an argument, like:

python fraudcatch.py Race-Pres-pennsylvania

I've attached a zip file containing the data as of around 9:30AM PST.

There are two data files for each state, a "race" file, and a "state" file. The race file mostly refers to the Presidential race, and is a smaller file. The state file includes all the data in the race file and adds considerably more data regarding the various state races.
 


I think you can also lose perspective by looking at these media numbers. The race is not decided by the media, it's decided by the number of electors that come from the states, which is based on the final numbers that are certified by the states. Those numbers are subject to bipartisan examination, audits, and recounts, and are the actual results.

The graphs above are from hour-by-hour (and sometimes minute-by-minute) updates of data that is desperately being acquired as quickly as possible. Due to the various different ways counties report data this is a partly manual process, and mistakes are made, then later corrected.

But the final result comes from pieces of paper that can be verified and recounted.
 
For sure. If there's a problem with the reporting, that doesn't mean there's a problem with the vote itself.

The idea that vote counts reported at 4 am must somehow be criminal came from the White House.

When people look at something for the first time, they don't have a proper reference for what is normal, and their intuition leads them astray. We have many people looking at elections closely for the first time this year, many poll watchers who don't know what they're looking at, as opposed to poll workers, who have been trained and often have experience.

And then you get fake "authorities" producing YouTube videos, manufacturing outrage by showing you something they claim is not normal, when they don't really know themselves, and don't care.

I think what might help is if local TV stations or newspapers did reports on what exactly happened to the local ballots: how were they stored, transported, processed, and what are the paper trails protecting the counts. If everyone thought about how their own vote was secured, that might help ground people.

A huge part of populism and/or conspiracy theories involves fake claims about things that happen far away that you can't check (and that often contradict everyday experience you actually have); it makes everyone feel terrible and insecure.

In every venue, there are poll workers from both parties. In every state, there are hundreds or thousands of them. If there was a conspiracy to steal the vote, surely there would be some poll workers testifying in court by now. But instead, we have the Trump campaign begging for witnesses. This suggests there aren't any, and that suggests there wasn't anything shady to witness.
 
I think what might help is if local TV stations or newspapers did reports on what exactly happened to the local ballots: how were they stored, transported, processed, and what are the paper trails protecting the counts. If everyone thought about how their own vote was secured, that might help ground people.
Agreed, we really need more of this. And to a certain degree, it's out there - the challenge is getting it to the people who need it.
 
Not that either of these things is relevant to the OP claim...but...

But the final result comes from pieces of paper that can be verified and recounted.

Is this true now? I thought some states were still using old machines. The WP has an article on it, Nov 5 2020, but I can't access it to see which states are still paperless.

In every venue, there are poll workers from both parties.

Is this true? I saw video of Detroit and PA counters with only one person at each table. In my state, each vote-counting table (that we see on the news, anyway) has one Dem and one Repub who count the votes together.
 
Is this true now? I thought some states were still using old machines. The WP has an article on it, Nov 5 2020, but I can't access it to see which states are still paperless.
Article:
Today, counties in Texas, Tennessee, Louisiana, Mississippi, Indiana, Kansas, Kentucky, and New Jersey are still exclusively using paperless machines, also called direct recording electronic systems (DREs).
Derek Tisler, election security analyst with the Brennan Center, said the number of states using DREs has nearly halved since the last election, but there are a smattering of states that, for reasons mostly financial, still have not switched.

"In 2016, there were 14 states that used paperless machines as the primary polling place equipment in at least some of their counties and towns. They represented about 1 in 5 votes that were cast in the 2016 election," said Tisler. "Since then, six of those states have fully transitioned to some sort of paper-based voting equipment."
Those include Arkansas, Delaware, Georgia, South Carolina, Pennsylvania and Virginia. There, counties have been transitioning to ballot-marking devices.


So none of the states still using DREs are real swing states.
 
Inspired by the original, I went looking for more source data. Unfortunately most sites (other than the NYT) that I've looked at so far don't have time series data (history), they just have a snapshot of the results (which now is essentially the final result).

But then I thought about archive.org. They actually took fairly regular snapshots of the data streams for various sites through the election counting process. I focussed on the Pennsylvania results, and the official page:
https://www.electionreturns.pa.gov/

With this data source
https://www.electionreturns.pa.gov/...ned&electiontype=undefined&isactive=undefined

[Screenshot: archive.org snapshot list for the electionreturns.pa.gov data URL]


Results start to show up early morning Nov 4, and there are about 10 per day for the days after. They look broken, but the data is actually still in there. I could not easily script the download, so I just saved each one by hand (took about 15 minutes). Then I wrote a little Python script to convert this to CSV.
Code:
import json

# CSV header: three candidate blocks (party label + four vote columns each)
print("Democrat,Votes,ElectionDayVotes,MailInVotes,ProvisionalVotes,Republican,Votes,ElectionDayVotes,MailInVotes,ProvisionalVotes,Libertarian,Votes,ElectionDayVotes,MailInVotes,ProvisionalVotes")
for n in range(16, 92):
    # Each archive.org snapshot was saved by hand as "GET (16).xml" ... "GET (91).xml"
    filename = 'GET ({0}).xml'.format(n)
    with open(filename, "rt") as f:
        data = f.read()
    # The JSON payload is wrapped in an XML <string> element; slice it out
    start = data.find('{"Election":')
    end = data.find('</string>')
    j = json.loads(data[start:end])
    statewide = j["Election"]["President of the United States"][0]["Statewide"]
    a = statewide[0]  # Biden
    b = statewide[1]  # Trump
    c = statewide[2]  # Jorgensen
    print("BIDEN," + a["Votes"] + "," + a["ElectionDayVotes"] + "," + a["MailInVotes"] + "," + a["ProvisionalVotes"] +
          ",TRUMP," + b["Votes"] + "," + b["ElectionDayVotes"] + "," + b["MailInVotes"] + "," + b["ProvisionalVotes"] +
          ",JORGENSEN," + c["Votes"] + "," + c["ElectionDayVotes"] + "," + c["MailInVotes"] + "," + c["ProvisionalVotes"])

Of interest here, the votes are split into election-day votes and mail-in votes (with a very small number of provisional ballots). So we can see visually how this progressed.
[Attached graph: Trump's Pennsylvania vote totals over time, split into election-day and mail-in votes]


[Attached graph: Biden's Pennsylvania vote totals over time, split into election-day and mail-in votes]
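
In case anyone wants to reproduce these graphs, here's a hypothetical plotting sketch. It assumes the CSV output above was redirected to a file called pa.csv, and uses pandas and matplotlib; the column positions follow the header the script prints:
Code:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical: assumes the script's output was saved as pa.csv.
# Positional columns per the printed header (0-indexed):
# Biden totals in columns 1-4, Trump in columns 6-9.
df = pd.read_csv("pa.csv")
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (label, base) in zip(axes, [("Biden", 1), ("Trump", 6)]):
    ax.plot(df.iloc[:, base], label="Total")
    ax.plot(df.iloc[:, base + 1], label="Election day")
    ax.plot(df.iloc[:, base + 2], label="Mail-in")
    ax.set_title(label)
    ax.set_xlabel("Snapshot #")
    ax.legend()
plt.tight_layout()
plt.show()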
 
Related.

Source: https://www.youtube.com/watch?v=etx0k1nLn78


Summary: Benford's Law describes the distribution of the first digit of numbers in many real-life data sets, including, sometimes, elections. In Chicago, Trump's distribution appeared to follow it, but Biden's did not. The reason is that the precinct sizes are clustered in a narrow range around 500 votes, a distribution to which Benford's law does not apply (as Benford himself explained). Trump got so few votes that his per-precinct counts clustered near zero, which happened to look like Benford's law. Biden's counts just had a roughly normal distribution clustered a bit below 500.

The last two digits are not random because Trump had many results under 100, so a lot of them were two-digit numbers, and those simply followed the distribution just described.
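
To see why Benford's law fails on data like this, here's a quick, purely illustrative simulation. The precinct sizes and vote shares are made up, not the Chicago data; the point is just that counts clustered in a narrow range don't follow Benford, while counts skewed toward zero happen to look like they do:
Code:
import math
import random
from collections import Counter

random.seed(1)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Made-up data: 2000 precincts with totals clustered around 500 votes
precincts = [max(1, int(random.gauss(500, 120))) for _ in range(2000)]
scenarios = {
    "high-share candidate": lambda: random.betavariate(12, 3),  # ~80% per precinct
    "low-share candidate": lambda: random.betavariate(1, 12),   # skewed toward 0
}
for label, draw in scenarios.items():
    first = Counter(str(max(1, round(p * draw())))[0] for p in precincts)
    n = sum(first.values())
    print(label)
    for d in range(1, 10):
        print(f"  digit {d}: observed {first[str(d)] / n:.2f}  Benford {benford[d]:.2f}")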
 
Mick West, I started a similar analysis to yours on the NYT/Edison time series data and essentially came to the same conclusions.

Executive Summary
The claim circulated on social media that 220,883 votes in Pennsylvania were switched from Biden to Trump was the result of an erroneous analysis of time series display data used by NYTimes and other media companies to update real-time dashboards on election night. Running the same analysis from Biden's perspective finds that 120,834 votes were switched from Biden to Trump in Florida. Attempts to fix limitations in the analysis produced even more confusing results, such as suggesting that vote switching also occurred to the benefit of third-party presidential candidates. The analysis method is fundamentally flawed.

The Edison time series data used by the media companies is indeed rife with errors and corrections, but these are more likely the result of manual entry of published state data into a central system prior to publishing data to media companies than due to underlying problems with vote tabulations at the state- or precinct-level. The Edison time series data cannot possibly be valid for an analysis of vote count integrity.

Original Approach
The original program published by PedeInspector had a simple premise. The idea was to compare the votes each candidate had at the time of each update, with the votes he had at the previous update, and to determine if the change in votes was sensible. It seemed reasonable to think that the vote count should never decrease. Any reduction in Trump's vote count, or in the total vote count, was understood as a sign of fraudulently deleted votes. A simultaneous reduction in votes for Trump while Biden's votes increased was understood as a sign that votes were switched from Trump to Biden.

Even if the source data were 100% valid for this type of analysis (more on that later), this approach does not consider the following possibilities:
  • Votes could be subtracted from Trump without being added to Biden, even if Biden's votes increase at the same time
  • Votes could be subtracted from Biden too and switched to Trump
  • Votes could be subtracted from either candidate and added to third-party candidates who are not listed but are included in the total vote count
  • Votes that are subtracted from a candidate in one update can return later; they are not necessarily permanently "lost"
  • Votes could simply be fraudulently added to a candidate (seems easier!)
What the Original Analysis Missed
Focusing only on two states, the original claims were:
  • Pennsylvania
    • Votes switched from Trump to Biden: 220,883
    • Lost Votes: 941,248
  • Florida
    • Votes switched from Trump to Biden: 21,422
    • Lost Votes: 456
The original Python program looked for changes in the vote counts where Trump votes decreased AND Biden votes simultaneously increased.

IF (trumpvotes_now < trumpvotes_before) AND (bidenvotes_now > bidenvotes_before):
THEN VOTES WERE SWITCHED FROM TRUMP

It also assumed that every time votes decreased, they disappeared forever. It summed up all subtractions to determine how many total votes were "lost" in each state, and these figures were then widely reproduced on social media.

The biggest miss was that the reverse case was not considered. It should immediately have been obvious to look for cases where Biden votes decreased AND Trump votes simultaneously increased.

IF (bidenvotes_now < bidenvotes_before) AND (trumpvotes_now > trumpvotes_before):
THEN VOTES WERE SWITCHED FROM BIDEN!!
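
The core of that comparison in runnable form (a minimal reconstruction, assuming the same NYT JSON layout as the scripts earlier in the thread):
Code:
import json
import sys

def switched_both_ways(name):
    # Tally presumed "switches" in BOTH directions, not just Trump -> Biden
    with open(name + '.json', encoding="utf8") as f:
        ts = json.load(f)["data"]["races"][0]["timeseries"]
    from_trump = from_biden = 0
    for prev, curr in zip(ts, ts[1:]):
        pb = prev["votes"] * prev["vote_shares"]["bidenj"]
        pt = prev["votes"] * prev["vote_shares"]["trumpd"]
        cb = curr["votes"] * curr["vote_shares"]["bidenj"]
        ct = curr["votes"] * curr["vote_shares"]["trumpd"]
        if ct < pt and cb > pb:  # Trump down, Biden up
            from_trump += pt - ct
        if cb < pb and ct > pt:  # Biden down, Trump up
            from_biden += pb - cb
    print("Presumed switched from Trump to Biden: " + str(int(from_trump)))
    print("Presumed switched from Biden to Trump: " + str(int(from_biden)))

switched_both_ways(sys.argv[1])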

Running this comparison (see my updated program attached) reveals the following:
  • Pennsylvania
    • Total votes presumed switched from Trump to Biden: 220,884
    • Total votes presumed switched from Biden to Trump: 24,723
  • Florida
    • Total votes presumed switched from Trump to Biden: 21,423
    • Total votes presumed switched from Biden to Trump: 120,834
Clearly, this data anomaly is at work in both directions and does not hurt only Trump! If this anomaly helped Biden win PA, then it helped Trump win FL.

More details are in the attached, along with my theory about why the source data set has these anomalies.
 


"Data Scraping" is extracting data from a web page, either from the HTML of the web page itself, or from the data sources used by that web page.
<snip>

The immediate problem here is that it's ignoring losses for Biden when Trump gained votes. If we calculate those, we have a loss of 25,712 for Biden, not really changing things.

OK, before we get into the meat of this question, I'd like to ask you why you believe the NYT data is scraped?

If I understand correctly, this data that is the focus of concern is obtained by subscription from Edison Polling and is used by CBS, CNN, ABC and NBC (along with its cable sibling MSNBC) which are members of the National Election Pool, a consortium using data from Edison Research.

Source: https://www.usatoday.com/story/ente...orting-different-vote-totals-race/6175428002/

The data in question is thus: https://static01.nyt.com/elections-assets/2020/data/api/2020-11-03/race-page/nebraska/president.json is Nebraska's Presidential Race. There is a URL for each Race and each state.

1. To confirm, is this the data on which you have performed analysis?

My understanding is that Edison collects all data from the states periodically throughout the voting process. I do not know if they source votes via a manual process like the AP using boots on the ground calling in precinct data, or if they have an agreement with each state to submit the data in a certain format, or if the states each have an RSS feed that is checked at set intervals to retrieve additional data.

But, I'd bet good money on them not using web scraping because each web page is formatted differently and there is no assurance the page won't be reformatted immediately prior to or during an election.

2. Thus, my question - why do you say they obtain votes by data scraping?

Edison then augments the state vote data with results from previous years, demographic data, and exit interviews they perform. The media use it for talking points and to display the current situation throughout the election.

3. Agreed on the problem with looking only for Trump's negative votes. Biden also shows negative votes in these files. But there is an important third factor: the third-party candidates.

Would you agree Biden, Trump and the Third Party Candidate should all factor into this analysis?

I wanted to clarify those points with you before going further. Thanks.
 
This is straining the limits of my competence in the subject but... could Simpson's paradox be playing a role here? A trend appears in separate groupings of a data set but disappears when the groupings are combined.
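
A toy illustration with made-up numbers (nothing to do with the election data): candidate A has the higher share within each group, yet the lower share overall.
Code:
# Simpson's paradox in miniature: A wins each group, loses the combined total
groups = {
    "group 1": {"A": (90, 100), "B": (850, 1000)},   # A: 90%, B: 85%
    "group 2": {"A": (30, 1000), "B": (2, 100)},     # A: 3%,  B: 2%
}
totals = {"A": [0, 0], "B": [0, 0]}
for gname, cands in groups.items():
    for cand, (won, n) in cands.items():
        totals[cand][0] += won
        totals[cand][1] += n
        print(f"{gname} {cand}: {won}/{n} = {won / n:.1%}")
for cand, (won, n) in totals.items():
    print(f"overall {cand}: {won}/{n} = {won / n:.1%}")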
 
OK, before we get into the meat of this question, I'd like to ask you why you believe the NYT data is scraped?
<snip>
I wanted to clarify those points with you before going further. Thanks.
Hi Amy, I would also like to find out the methodology Edison used to assemble their data feed. Have a look at my analysis in this thread. I agree with you that Edison most likely did NOT build data readers or integrations with state systems, because each state has its own system. The most likely explanation is that Edison employed a team of data entry clerks to copy published results from the 50 state systems into Edison's central system, from where the data was published to media outlets in this data feed.
 
I created an Edison/NYT Data Simulator (with a writeup) that illustrates why people may be finding irregularities in NYT's Edison Election Data. From the results of the simulations, the "vote switching" seems to be happening when only a small number of votes for one candidate are reported to Edison during a particular round of accessing Edison's data.
 
I created an Edison/NYT Data Simulator (with a writeup) that illustrates why people may be finding irregularities in NYT's Edison Election Data. From the results of the simulations, the "vote switching" seems to be happening when only a small number of votes for one candidate are reported to Edison during a particular round of accessing Edison's data.
Your analysis demonstrates the effect of rounding in the published JSON data stream on attempts to reverse-calculate the votes for each candidate. This certainly must be considered (but was not) by people analyzing this data for fraud. Running your simulation about ten times, it shows that errors in the vote count for each candidate on the order of 2,000 to 3,000 votes are to be expected with 5,000,000 simulated total votes. This stands to reason, since a precision of 0.001 (0.1%) as observed in the published data stream amounts to 5,000 votes on this total.

As stated in my analysis report, I believe that the other, larger errors are likely due to data entry errors or timing errors between updates being keyed in by the staff who are amalgamating results across multiple states into the Edison system for publishing.
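
To make the rounding effect concrete, here's a small sketch along the same lines as the simulator (batch sizes are made up; the point is that a true count that never decreases can still appear to drop once its share is rounded to three decimals):
Code:
import random

random.seed(0)
total = 0
true_count = 0      # this candidate's true count; it only ever increases
prev_est = None
worst_drop = 0
for _ in range(500):
    batch = random.randint(50, 20000)               # made-up reporting batch
    true_count += int(batch * random.uniform(0.3, 0.7))
    total += batch
    share = round(true_count / total, 3)            # 3 decimals, as in the feed
    est = share * total                             # the reverse calculation
    if prev_est is not None and est < prev_est:
        worst_drop = max(worst_drop, prev_est - est)
    prev_est = est
print(f"total votes: {total}, largest spurious 'loss': {int(worst_drop)}")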
 
Here is how the AP does it. How we count the vote | AP. But various tabulation systems may have automated result publishers. For example, Dominion has a results publishing server (bottom right). [Attached diagram: Dominion results publishing setup]
The link to AP's procedure was helpful, because it shows that their process uses data entry clerks. I suspect Edison Research has the same type of process as they amalgamate votes from multiple state systems. This makes the published data streams unreliable for analysis purposes, even though they are good enough for updating dashboards on TV networks.
 
When I tried to recreate the graph you had, I did not get the same result:
[Attached graph: recreated candidate vote totals from the NYT Edison data]

This is the R code:

Code:
# Packages assumed by this script: jsonlite (fromJSON), tidyverse
# (stringr/dplyr/tidyr/ggplot2), lubridate (as_datetime), tidyquant (theme_tq)
library(jsonlite)
library(tidyverse)
library(lubridate)
library(tidyquant)

states_mst <- c('idaho', 'montana', 'wyoming', 'utah', 'arizona', 'new-mexico', 'colorado', 'pennsylvania')

for (i in seq_along(states_mst)){

    nyt_api_data <- fromJSON(str_c("https://static01.nyt.com/elections-assets/2020/data/api/2020-11-03/race-page/",states_mst[i],"/president.json"))
    
    
    race_data <- nyt_api_data$data$races$timeseries %>%
        bind_rows()
    
    race_data_edited <- race_data %>%
        select(-vote_shares) %>%
        mutate(timestamp = as_datetime(timestamp))
    
    final_race_data <- bind_cols(race_data_edited, race_data$vote_shares[1], race_data$vote_shares[2]) %>%
        mutate(trump_votes = votes * trumpd,
               biden_votes = votes * bidenj) %>%
        pivot_longer(cols = trump_votes:biden_votes, names_to = "candidate") %>%
        select(timestamp, candidate, value) %>%
        arrange(timestamp)
        
    assign(str_c("final_race_graph_",states_mst[i]),final_race_data %>%
        filter(timestamp < '2020-11-05') %>%
        ggplot(aes(x = timestamp, y = value, color = candidate)) +
        geom_line() +
        # geom_label_repel(inherit.aes = FALSE,
        #                  data = final_race_data %>% filter(value == 0),
        #                  aes(x = timestamp, y = value, label = as.character(timestamp))) +
        theme_tq() +
        labs(title = str_c("NYT Edison Data - ", str_to_title(states_mst[i]))) +
        theme(axis.text.x = element_text(angle = 90)) +
        scale_y_continuous(labels = scales::comma))

    assign(str_c("final_data_timestamp_",states_mst[i]),final_race_data %>% filter(timestamp < '2020-11-05') %>%     arrange(timestamp))
}

I'm not super familiar with the Edison data, so I'm hoping someone more familiar with it can tell me where I went wrong.
 
@riffology_123 Mick uses a different time axis than you do, compare them!
You also seem to be processing a row that contains no data.

For the period of Nov 4th, the graphs do look similar, but Mick's graph extends beyond that date and yours doesn't.
 
It's been a while, but the code and data are in the top post here, so you should be able to track down the differences.
 
"Data Scraping" is extracting data from a web page, either from the HTML of the web page itself, or from the data sources used by that web page.

The New York times election result pages have easily readable data sources, including time-stamped state-of-the race data, and so have been used by independent analysts to look at the trends as the results came in.

One claim is that votes are being "swapped", the claim rests on data like this:

Code:
                    {
                        "eevp": 42,
                        "eevp_source": "edison",
                        "timestamp": "2020-11-04T04:07:43Z",
                        "vote_shares": {
                            "bidenj": 0.42,
                            "trumpd": 0.566
                        },
                        "votes": 2984468
                    },
                    {
                        "eevp": 42,
                        "eevp_source": "edison",
                        "timestamp": "2020-11-04T04:08:51Z",
                        "vote_shares": {
                            "bidenj": 0.426,
                            "trumpd": 0.56
                        },
                        "votes": 2984522
                    },
Here are two data points, at 04:07:43Z and 04:08:51Z, the fraction of the vote shared is given, to three decimal places, and the exact vote is given. This was used to calculate how many votes they each had at both points, and then the change between then.

Biden:
2984468 * 0.42 = 1253477
2984522 * 0.426 = 1271406
1271406 - 1253477 = 17930

Trump
2984468 * 0.566 = 1689209
2984522 * 0.56 = 1671332
1671332 - 1689209 = -17877

So assuming the data is accurate, then at this one point in time, Trump's vote went down nearly 18K, and Biden's went up 18K, essentially taking 18K votes away from Biden and giving them to Trump.

The anonymous person in the link above wrote some code to calculate all the times Trump when down when Biden when up, and to add together all these declines. The result was a loss of 220,833 votes for trump (all numbers are with the data I downloaded today)

The immediate problem here is that it's ignoring losses for Biden when Trump gained votes. If we calculate those, we have a loss of 25,712 for Biden, not really changing things.

But what if we add up ANY decline? After all a decrease in your votes is an anomaly, it does not really matter if it happens simultaneously with a rise in your opponent's votes. If we add up ALL the declines, we have: Trump Lost = -408547 and Biden Lost = -637917. So adjustment down hurt Biden more than Trump.

But I said earlier assuming the data is accurate - now clearly if there are negative votes then something it going wrong. Let's take a step back, and look at the bigger picture. If we extract all the calculated vote totals and graph them, it looks like this:

View attachment 42100

Blue is Biden, he starts out rapidly increasing to about 450K, then there's a sudden jump to around 700K and a more gradual rise, then a correction down, then another sharp spike and correction that's mirrored with a similar, but smaller drop/spike/correction for Trump. After that things settle down.

So clearly there's some bad data there, and errors that were corrected. The worst of it happens in the first 60 data points. If we exclude those and just look at declines after that we have Trump Lost = -93091 and Biden Lost = -24619.

The problem with this as evidence of election fraud is that it does not really make any sense. Why would you so blatantly subtract votes if the whole point was not to get caught? Where are these subtractions supposedly happening in the counting process? What happens if there's a recount?

The data here is from the NYT. They get data from Edison Research:
https://www.edisonresearch.com/election-polling/

The data we see is formatted for display purposes. The raw data would have actual numbers, not a low-resolution share of the vote. Actual results are tallies at a county level, different outlets report the county results continually, and changes should be identifiable, as they would stand out at the county level. Unfortunately, all of this is hidden in the simplistic and possibly buggy dataset from the NYT.

Ultimately the problem here is a lack of information about where the numbers a coming from, and what factors into them. How often do counties issue a correction? What actually happened at the various points on this graph? The people collecting this data have the capability to explain this issue (and quite possibly already have). While it may not seem important, things like things become the foundation of long-lasting conspiracy theories, and really needs to be addressed to prevent it from creating harm in the future.


Here's a short bash script that can be used to download the NYT data.
Code:
## These are the names as used in the NYT API. 50 US states
declare -a names=("alabama" "alaska" "arizona" "arkansas" "california" "colorado" "connecticut" "delaware" "florida" "georgia" "hawaii" "idaho" "illinois" "indiana" "iowa" "kansas" "kentucky" "louisiana" "maine" "maryland" "massachusetts" "michigan" "minnesota" "mississippi" "missouri" "montana" "nebraska" "nevada" "new-hampshire" "new-jersey" "new-mexico" "new-york" "north-carolina" "north-dakota" "ohio" "oklahoma" "oregon" "pennsylvania" "rhode-island" "south-carolina" "south-dakota" "tennessee" "texas" "utah" "vermont" "virginia" "washington" "west-virginia" "wisconsin" "wyoming")

l=${#names[@]}

for ((i=0; i<${l};i++));
do
    echo $i " - " ${names[$i]}

    wget -nc https://static01.nyt.com/elections-assets/2020/data/api/2020-11-03/race-page/${names[$i]}/president.json -O Race-Pres-${names[$i]}.json
    wget -nc https://static01.nyt.com/elections-assets/2020/data/api/2020-11-03/state-page/${names[$i]}.json -O State-Pres-${names[$i]}.json

    cat Race-Pres-${names[$i]}.json | python -m json.tool > Race-Pres-${names[$i]}-PP.json
    cat State-Pres-${names[$i]}.json | python -m json.tool > State-Pres-${names[$i]}-PP.json
done

and a modified version of the script used to tally the subtractions
Code:
import json
import sys


##print(f"Name of the script      : {sys.argv[0]=}")
##print(f"Arguments of the script : {sys.argv[1:]=}")


def findfraud(NAME):
    with open(NAME + '.json', encoding="utf8") as f:
        x = json.load(f)
    TotalVotesLost = 0
    for i in range(len(x["data"]["races"][0]["timeseries"])):
        if i != 0 and x["data"]["races"][0]["timeseries"][i]["votes"] * x["data"]["races"][0]["timeseries"][i]["vote_shares"]["trumpd"] < x["data"]["races"][0]["timeseries"][i-1]["votes"] * x["data"]["races"][0]["timeseries"][i-1]["vote_shares"]["trumpd"]:
            if x["data"]["races"][0]["timeseries"][i]["votes"] * x["data"]["races"][0]["timeseries"][i]["vote_shares"]["bidenj"] > x["data"]["races"][0]["timeseries"][i-1]["votes"] * x["data"]["races"][0]["timeseries"][i-1]["vote_shares"]["bidenj"]:
                print ("Index : " + str(i) + " Past Index : " + str(i-1))
                print (x["data"]["races"][0]["timeseries"][i]["votes"] * x["data"]["races"][0]["timeseries"][i]["vote_shares"]["trumpd"] - x["data"]["races"][0]["timeseries"][i-1]["votes"] * x["data"]["races"][0]["timeseries"][i-1]["vote_shares"]["trumpd"])
                TotalVotesLost += x["data"]["races"][0]["timeseries"][i]["votes"] * x["data"]["races"][0]["timeseries"][i]["vote_shares"]["trumpd"] - x["data"]["races"][0]["timeseries"][i-1]["votes"] * x["data"]["races"][0]["timeseries"][i-1]["vote_shares"]["trumpd"]
    print (str(str(TotalVotesLost)  + " Flo"))

def ff1(NAME):
    with open(NAME + '.json', encoding="utf8") as f:
        x = json.load(f)
    TrumpVotesLost = 0
    BidenVotesLost = 0
    for i in range(len(x["data"]["races"][0]["timeseries"])):
        prev = x["data"]["races"][0]["timeseries"][i-1]
        curr = x["data"]["races"][0]["timeseries"][i]
        prevBiden = prev["votes"] * prev["vote_shares"]["bidenj"]
        prevTrump = prev["votes"] * prev["vote_shares"]["trumpd"]
        currBiden = curr["votes"] * curr["vote_shares"]["bidenj"]
        currTrump = curr["votes"] * curr["vote_shares"]["trumpd"]


        if i > 0:
            if currTrump < prevTrump and currBiden < prevBiden:
                print (curr["timestamp"] + ": BOTH Loss: Trump " + str(int(currTrump - prevTrump))+ " Biden: "+str(int(currBiden-prevBiden)))
                TrumpVotesLost += int(currTrump - prevTrump)
                BidenVotesLost += int(currBiden - prevBiden)

            else:
                if currTrump < prevTrump: # and currBiden > prevBiden:
                        print (curr["timestamp"] + ": TRUMP Loss: " + str(int(currTrump - prevTrump))+ " Biden: "+str(int(currBiden-prevBiden)))
                        TrumpVotesLost += int(currTrump - prevTrump)
                if currBiden < prevBiden: # and currTrump > prevTrump:
                        print (curr["timestamp"] + ": Biden Loss: " + str(int(currBiden-prevBiden))+" Trump "+ str(int(currTrump - prevTrump)))
                        BidenVotesLost += int(currBiden - prevBiden)
    print (str("Trump Lost = " + str(int(TrumpVotesLost))))
    print (str("Biden Lost = " + str(int(BidenVotesLost))))

def ffDump(NAME):
    with open(NAME + '.json', encoding="utf8") as f:
        x = json.load(f)
    print("Time,Biden Share,Trump Share,Total Votes,Biden Votes,Trump Votes,Biden Mark,Trump Mark")
    for i in range(len(x["data"]["races"][0]["timeseries"])):
        prev = x["data"]["races"][0]["timeseries"][i-1]
        curr = x["data"]["races"][0]["timeseries"][i]
        prevBiden = prev["votes"] * prev["vote_shares"]["bidenj"]
        prevTrump = prev["votes"] * prev["vote_shares"]["trumpd"]
        currBiden = curr["votes"] * curr["vote_shares"]["bidenj"]
        currTrump = curr["votes"] * curr["vote_shares"]["trumpd"]
        trumpMark = 0
        bidenMark = 0
        if currTrump < prevTrump and currBiden > prevBiden:
            trumpMark = currTrump
        if currBiden < prevBiden and currTrump > prevTrump:
            bidenMark = currBiden


        print (curr["timestamp"] + ","+str(curr["vote_shares"]["bidenj"]) + "," + str(curr["vote_shares"]["trumpd"])+","+str(curr["votes"])+","+str(int(curr["vote_shares"]["bidenj"]*curr["votes"]))+","+str(int(curr["vote_shares"]["trumpd"]*curr["votes"]))+","+str(bidenMark)+","+str(trumpMark))



def findfraud2(NAME):
    with open(NAME + '.json', encoding="utf8") as f:
        x = json.load(f)
    TotalVotesLost = 0
    for i in range(len(x["data"]["races"][0]["timeseries"])):
        if i != 0 and x["data"]["races"][0]["timeseries"][i]["votes"] < x["data"]["races"][0]["timeseries"][i-1]["votes"]:
            TotalVotesLost += x["data"]["races"][0]["timeseries"][i]["votes"] - x["data"]["races"][0]["timeseries"][i-1]["votes"]
    print (TotalVotesLost)

##findfraud(f"{sys.argv[1:][0]}")
ff1(f"{sys.argv[1:][0]}")
##ffDump(f"{sys.argv[1:][0]}")

This is modified to run from the command line with a argument, like:

python fraudcatch.py Race-Pres-pennsylvania

I've attached a zip file containing the data as of around 9:30AM PST.

There are two data files for each states, a "race" file, and a "state" file. The race file mostly refers to the Presidential race, and is a smaller file. The state file includes all the data in the race file and adds considerably more data regarding the various state races.
Sorry for the late reply. I have been looking for this analysis for several months after downloading the json files from NYT.
I was wondering about the Pennsylvania data point at the Nov 4 2am timestamp, where Trump goes down by 0.006 and Biden goes up by 0.006, yet only 54 incremental votes are shown in the vote total. The 0.006 fraction is a roughly 17,000-vote swing that cannot be accounted for in the data. It is so disappointing that the data we have for a Presidential election can be this flawed. It should not matter which side of the political spectrum you are on; the process should be better than this. The wild earlier data swings of +/-900,000 votes are really sloppy.
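
For concreteness, checking that swing against the totals quoted at the top of the thread:
Code:
# Biden +0.006 share, Trump -0.006, on a total that grew by only 54 votes
total = 2984522               # total votes at 04:08:51Z
print(total * 0.006)          # ~17907: the swing implied by a 0.006 share shift
print(2984522 - 2984468)      # 54: the actual change in the total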
 