US election prediction with boosted decision trees

This page is no longer being updated – all updates at this page from now on

 

08:15 08/11/2016 – UPDATE#5. Election day. I’m not changing my model any further at this point, so the model accuracy will remain at 63.6%. Including today’s news, I get a confidence of 60.2% for a HILLARY win.

I’ll stick with my estimate of HILLARY TO WIN WITH 52.5% OF THE VOTE (excluding third-party candidate votes). Note, this number is definitely an educated guess – the model isn’t designed to predict voter share, so I’m just using the probability to predict it.

Since I’ve just found out some polling stations declare their result before the nationwide count has even been taken (!!?), I will be making no further predictions – the actual results are now coming in.

23:30 07/11/2016 – UPDATE#4. With the model improved to take training data over weighted 5-day intervals, and a slightly different variant of the adaboost algorithm, the model accuracy is now 63.6% – a significant improvement and above my personal target of 60%. Yes, it’s still not that much better than guessing but it’s a lot better than I started with. HILLARY is still the predicted winner, but with a lower probability – 52.5%. However, the improved model accuracy offsets this.

My final prediction will probably not include tomorrow’s news since it is unlikely to be representative of the training data. However, I’ll carry on updating tomorrow.

At this stage I’ll commit and predict HILLARY TO WIN WITH 52.5% OF THE VOTE RELATIVE TO TRUMP.

So, goodnight, and to any American readers – good luck!

16:30 07/11/2016 – UPDATE #3. Woah, a big increase in Hillary’s chance – the update I gave earlier today gives a 64.2% probability of HILLARY winning now. BUT, the confidence is still just 53.8%. I’ll try and improve that tonight – the last improvement (made by tuning the cross-validation parameters) took 3 hours for my computer to calculate so it takes a while.

13:00 07/11/16 – UPDATE #2. With today’s news included, I still predict HILLARY to win. I made some changes to the parameters for the adaboost method, but unfortunately my laptop battery ran out before I managed to record the probability and model accuracy! I’ll update again tonight.

23:19 06/11/16 – UPDATE #1. I’ve altered the model to take news from the past few days as input (weighted accordingly). I’m now predicting HILLARY to win, with 52.9% probability and 52.3% accuracy (ie, still marginally better than guessing). Next update tomorrow lunchtime (UK) / breakfast time (US).

My prediction will be updated every day on the run-up to the election – keep coming back for my latest prediction. The first prediction (made at 18:00 on 06/11/16) was for TRUMP to win with 52.8% probability and 52.9% accuracy on the model.

 

It’s been ages since my last post. This is largely because I’ve decided that I find data science so interesting I’ve decided to do a Master’s degree in it. Unfortunately this means I’ve not had as much time to write interesting posts. Hopefully I should be able to get more up in the near future.

 

Anyway, let’s get to the point. This is a bit of an ‘impulse post’. I only had the idea on Friday and today is Sunday. The US election is on Tuesday, so I haven’t had much time to put together a particularly robust model. So, this is a semi-light hearted post since I haven’t been able to do anything like the amount of data-collection and validation I’d like. Nevertheless, the methods I have used are robust and, given time and improvements, could probably be used for serious prediction.

Oh, and I’ll do my absolute best to make this post entirely objective…which, given the circumstances, is difficult.

 

Election polls, and why they ain’t what they used to be

Until a few years ago, in the run-up to any major election, poll data was a pretty reliable way of determining the winner. Here’s how to conduct a poll.

  1. Somehow find a group of people who are representative of the entire population.
  2. Ask them who they are going to vote for.
  3. Add up the results – this is the prediction.

Seems pretty trivial. However, in recent years, polls seem to have become less and less reliable. Here in the UK, pretty much every major poll got the EU membership referendum and the last general election results completely wrong.

I’m not going to go into detail about why this may be, but let’s just assume that polls are, in isolation, no longer a reliable predictor for election outcomes.

 

A possible solution – aggregate predictors

Ok, so if polls are inaccurate, maybe we can just take the average result from all of them and use that as our prediction? We see this a lot when news agencies refer to their ‘poll of polls’. This probably gives a better estimate than just a single poll but is still subject to all the same issues.

A better method is to account for polling error somehow:

 

Use historical polls to correct for errors

If we start studying historical elections, we can start to identify patterns in polling data and account for the errors. For example, if the polls for the Florida area consistently underestimated the Republican vote by 4%, we could just add 4% to our predicted Republican vote from Florida in our prediction.
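As a rough illustration of that kind of correction, here is a minimal Python sketch (not the R code used for this post). The function name and the numbers are made up for the example; it simply shifts a poll figure by the average of the past errors.

```python
def correct_poll(poll_share, historical_errors):
    """Shift a polled vote share by the mean historical error.

    historical_errors holds (actual - polled) values in percentage points
    from earlier elections in the same area. All values here are hypothetical.
    """
    bias = sum(historical_errors) / len(historical_errors)
    return poll_share + bias


# Hypothetical example: polls underestimated the Republican vote by 4 points
# in each of three past elections, and the current poll says 45%.
past_errors = [4.0, 4.0, 4.0]
corrected = correct_poll(45.0, past_errors)
print(corrected)  # 49.0
```

A real correction would need many more past elections than this, which is exactly the problem discussed further down.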

All kinds of other factors can be used to correct for poll results. For example, campaigning budget for that area. Using a list of factors like this, an entire prediction model can be built up.

The best poll predictor can be found at FiveThirtyEight. Their models use as much poll data as they can get hold of and then attempt to correct for errors through various means. They then simulate the election thousands of times and look at the average results to give a prediction, with estimated confidence levels.

Before we go any further, I want to stress that the model used by sites such as FiveThirtyEight is vastly superior to anything I’ll be posting here today. But, they do have the issue of having to account for individual errors by having to identify separate factors that may skew poll results. This is extremely difficult to do. In the case of 2016, the nature of the whole run-up to the election is just so different to anything historically that it is impossible to predict what will actually affect the result.

This also brings us nicely onto why data from historical elections is difficult to use. Let’s look at a timeline of previous elections:

 

Figure: Past US election years shown as black lines, alongside the everyday technologies that did or did not exist during those election years.

There are two obvious problems with trying to fit new models to previous election results:

  1. There aren’t many of them. When fitting a model to data, we want as many instances (data-points) as possible. In the last 35 years there have been just 10 elections, which isn’t anything like enough to build a good model with.
  2. It is doubtful whether any of the previous election results are comparable to this one. Sure, we can say ‘the polls got it right for the 2012 election’, but that election was very different to this one. Twitter was used far less back then, the news coverage was very different, and the two candidates were nothing like as controversial as this time. Whether or not these factors make any difference to the reliability of polls is unknown as we have nothing to compare this election against.

Is there any other data we can use to build a model?

 

Proxy Data

In an ideal world, to build a model, we would carry out a nationwide vote every day in the run up to the election. Somehow, we would know that people were giving an honest answer in this vote. We would ask everyone a series of questions before they voted. The responses to these questions (input data) would be recorded along with the overall vote for that day (ideal poll result).

We would then look for correlations between people’s responses and the vote outcome for that day. These correlations would form our model for predicting the election outcome. When we wished to predict the election, we would ask everyone the questions and plug them into our model – this would give us the predicted election result.

Clearly we can’t ask everyone in the US a series of questions every day and then get them to vote in a nationwide poll, so we need proxy data. This is real-life data that is representative of this ideal data.

 

Proxy input data – the news

Perhaps what is in the news can predict the national sentiment. This is the first assumption I have made in my model. My rationale is that there is a huge volume of news published online each day. On a very simple level, on days where a candidate is facing a scandal, their approval is likely to be lower. However, a good model will pick up far less apparent correlations than this. Perhaps a candidate tends to do better on days when the news is talking about employment figures. A good model will identify associations like these.

Since I put my model together in a day, I haven’t had time to write a program to trawl through multiple news sites and strip out the words. I have chosen to use the New York Times. The front page is published online every day and it is easy to search for past issues. So I wrote some code to:

  1. Download the front page of the New York Times for the past 6 months.
  2. Strip out all the words and delete common/boring words (to, the, and etc) and html tags.
  3. Find the 125 most commonly used words over the 6 month period.
  4. Count how often each of these words is used on the front page every day.

What we are left with is a series of lists, one for each day of the last 6 months. Each of these lists is the number of times each of the 125 most commonly used words is used for that day. This is our input data for the model.
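The steps above can be sketched in a few lines of Python. This is only an illustration, not the R code used for this post: the stop-word list, the regular expressions and the two tiny example pages are all assumptions, and the downloading step is left out entirely.

```python
import re
from collections import Counter

# Illustrative stop-word list only; the real list of 'boring' words is an assumption.
STOP_WORDS = {"to", "the", "and", "a", "of", "in"}


def tokenize(html):
    """Strip html tags, lower-case the text and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", html)
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_features(pages, vocab_size=125):
    """Return the most common words overall and one count list per day."""
    daily_counts = [Counter(tokenize(page)) for page in pages]
    total = Counter()
    for counts in daily_counts:
        total.update(counts)
    vocab = [word for word, _ in total.most_common(vocab_size)]
    rows = [[counts[word] for word in vocab] for counts in daily_counts]
    return vocab, rows


# Two made-up 'front pages', one per day.
pages = [
    "<h1>Trump and the economy</h1><p>Jobs and the economy</p>",
    "<h1>Clinton in the polls</h1><p>Economy polls</p>",
]
vocab, rows = build_features(pages, vocab_size=3)
print(vocab[:2])  # ['economy', 'polls']
print(rows[0][0], rows[1][0])  # 2 1
```

Each entry of `rows` is one day's list of counts, in the same word order as `vocab`.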

 

Proxy poll data – the stock market

Now we need to classify each day as a ‘Republican day’ or a ‘Democrat day’. In other words, if everyone voted on that day, would they have voted Democrat or Republican?

What we want is a nationwide daily vote. And we have one – the stock exchange.

But how can the stock exchange tell us whether the public would have voted Republican or Democrat for that day? We look at how certain stocks performed for that day. Stocks that are perceived as performing better under a Democrat president and under a Republican president were identified and grouped into a ‘democrat index’ and a ‘republican index’. The sum of the % change for each stock in the index was calculated and the best performing index of that day was taken as the ‘vote’ result for that day. ie on days when the ‘democrat index’ performed better, we assume the result of a public vote for that day would have been ‘democrat’.

By the way, I did not include weekends or bank holidays in the input data since the stock market was closed then.
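The labelling rule described above could look something like this in Python (again, a sketch rather than the R code used for this post). The percentage changes are invented, and sending ties to ‘republican’ is an arbitrary choice made for the example.

```python
def label_day(dem_changes, rep_changes):
    """Label a trading day by whichever index had the larger summed % change.

    dem_changes / rep_changes are the daily % changes of the stocks in each
    index. Ties are labelled 'republican' here, which is an arbitrary assumption.
    """
    dem_index = sum(dem_changes)
    rep_index = sum(rep_changes)
    return "democrat" if dem_index > rep_index else "republican"


# Hypothetical % changes for three stocks in each index on one day.
label = label_day([0.5, -0.25, 1.0], [0.25, 0.5, -0.5])
print(label)  # democrat
```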

I cannot stress enough how much of a weak assumption all of the above is. Sure, the price of individual stocks moves according to who the market thinks the next president will be. But, thousands of other factors affect this. The hope is that a sufficiently trained model can identify a correlation between the news input data and the vote classification from the stock market, despite all the ‘noise’.

Let’s quickly discuss the stocks I selected:

Democrats – solar power, defence, consumer staples and exports

As with everything else here, I didn’t have time to pull together a huge amount of data. A bit of reading around suggests that companies that do well out of a Democrat president are those selling renewable energy, consumer staples and large exporters, so I included a few large companies which do this. I also included Northrop Grumman, who are a large defence contractor. Apparently Hillary is seen as more likely to take the US to war in the Middle East, so some defence companies favour her.

Republicans – guns, prisons, gold and oil

The republican shares are hilariously stereotypical. The title above says it all. Since Trump is seen as a ‘more disruptive’ candidate, investors would take their money out of more risky stocks and shares and invest in gold (which is seen as safer) if he won. So, I’ve used a couple of gold mining companies in the index.

Ideally I’d have experimented around with different combinations of companies to find one with a good fit. But, as I keep saying, I just didn’t have the time.

 

Building a model

So, we have a bunch of data that we can use to predict the outcome of the election. We also have a simulated poll result for each day this data was taken. Now it’s time to build a model which identifies any correlations between the input data (from the news) and the poll data (from the stock market).

 

Decision Trees

The first concept to understand is that of a decision tree. We’ve all seen these before – something like this:

Figure: A simple decision tree for identifying insects.

 

Given a series of input data, and known classifications (which is what we have with the news and stock market data) it is possible to build up a tree like this. There are various algorithms for doing so and my code tries a few different ones. The one which performed best is called C5.0.

I’m not going to go entirely into how this algorithm works because it’s not very interesting, and it’s actually not known – C5.0 is a commercial algorithm and the exact details have not been published. However, the basic principle is to split the tree on conditions which give the highest ‘information gain’.
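To give a feel for what ‘information gain’ means, here is a small Python sketch of the textbook entropy-based version. This is not C5.0 itself (whose exact splitting criterion may well differ); the word counts, labels and thresholds are invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on 'value <= threshold'."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(side) / n * entropy(side) for side in (left, right) if side)
    return entropy(labels) - remainder


# Hypothetical counts of one word on four days, with that day's 'vote'.
counts = [0, 1, 4, 5]
votes = ["D", "D", "R", "R"]
print(information_gain(counts, votes, threshold=2))  # 1.0
```

A split at 2 separates the two classes perfectly, so it removes all of the uncertainty (1 bit). A split at 0 would leave a mixed group on one side and so scores lower, which is why the tree would prefer the split at 2.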

For the input data we have collected, we get the following tree:

Figure: Decision tree built using C5.0. At each node we take the wordcount for each word and follow the tree down to the final result.

Taking the wordcount from today’s New York Times, this tree gives the following prediction:

probability (Democrat win) = 65.3%

probability(Republican win)=34.7%

But there is a catch. We haven’t discussed how good our prediction model is. When training the model, the data was split into training and testing sets in order to test its accuracy (the method used for sampling was leave-one-out cross-validation, for those interested). A model with absolutely no capability of predicting the winner would have an accuracy of 50% – i.e. a 50/50 chance of getting the right answer. This tree is indicated as having a very high confidence, but it’s still probably not much use.
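For those interested, leave-one-out cross-validation can be sketched as follows. This is a generic Python illustration, not the R code used for this post: the `fit` and `predict` arguments are placeholders for any model, and the majority-class ‘model’ and the toy labels exist only to make the example run.

```python
from collections import Counter


def leave_one_out_accuracy(rows, labels, fit, predict):
    """Train on all days but one, test on the held-out day, repeat for every day."""
    correct = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit(train_rows, train_labels)
        if predict(model, rows[i]) == labels[i]:
            correct += 1
    return correct / len(rows)


# A deliberately useless baseline 'model': always predict the majority class.
def fit_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


toy_rows = [[1], [2], [3], [4]]  # placeholder features, ignored by the baseline
toy_labels = ["D", "D", "D", "R"]
accuracy = leave_one_out_accuracy(toy_rows, toy_labels, fit_majority, predict_majority)
print(accuracy)  # 0.75
```

Note how the baseline scores 75% here purely because the toy labels are unbalanced, so an accuracy figure only means something when compared with what a no-skill model would get on the same data.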

A single decision tree for multidimensional data which is subject to noise (as we have here) is also very prone to overfitting. I won’t go into what that means, but it basically describes a model well-fit to the training data but unable to make effective new predictions.

Improving the model by boosting

Rather than just building one tree with the data and accepting it has a high error rate, why not focus on where the errors were made and use this to build a better tree? We can repeat this process many times to build a more accurate model.

In fact, rather than building entire trees, we can just make a ‘decision tree stump’ which asks only a single question. We then look at the cases this stump gets wrong, and generate a second stump which focuses on getting these right. After many iterations we have many decision stumps which can then be weighted accordingly to give an overall prediction. This is known as ‘adaptive boosting’, or ‘adaboost’.

Let’s look at this idea with a diagram. Imagine we are only using two words as predictors:

Figure: How adaboost builds a model for a 2 dimensional input. At each step, a new decision stump is introduced, with each of the instances weighted appropriately (letter size represents weight). The final model splits up the training data with 100% accuracy.

The above visualisations are for 2-dimensional data. Our data is 125-dimensional (one dimension for each word), but the same principle still applies.
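For anyone who prefers code to diagrams, below is a minimal from-scratch sketch of the standard adaboost recipe with decision stumps, written in Python. It is not the R code or the exact adaboost variant used for this post, and the toy data (a single word count per day, with +1 and -1 standing in for the two parties) is invented so that no single stump can classify it correctly.

```python
from math import exp, log


def stump_predict(stump, row):
    """A stump asks one question: is feature f greater than threshold t?"""
    _, f, t, polarity = stump
    return polarity if row[f] > t else -polarity


def best_stump(rows, labels, weights):
    """Try every feature/threshold/polarity and keep the lowest weighted error."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            for polarity in (1, -1):
                candidate = (0.0, f, t, polarity)
                err = sum(w for w, row, y in zip(weights, rows, labels)
                          if stump_predict(candidate, row) != y)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best


def adaboost(rows, labels, rounds=10):
    """Return a list of (alpha, stump) pairs; labels must be +1 or -1."""
    n = len(rows)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = best_stump(rows, labels, weights)
        err = max(stump[0], 1e-10)  # avoid dividing by zero for a perfect stump
        if err >= 0.5:  # no better than guessing, so stop
            break
        alpha = 0.5 * log((1 - err) / err)  # say of this stump in the final vote
        ensemble.append((alpha, stump))
        # Increase the weight of the days this stump got wrong, then renormalise.
        weights = [w * exp(-alpha * y * stump_predict(stump, row))
                   for w, row, y in zip(weights, rows, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all the stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data: one word count per day; the -1 days sit in the middle of the range.
rows = [[0], [1], [2], [3], [4], [5]]
labels = [1, 1, -1, -1, 1, 1]

single_error = best_stump(rows, labels, [1 / 6] * 6)[0]
ensemble = adaboost(rows, labels, rounds=3)
predictions = [predict(ensemble, row) for row in rows]
print(predictions)  # [1, 1, -1, -1, 1, 1]
```

The best single stump gets two of the six days wrong, but three weighted stumps together get all six right, which is the same behaviour as in the diagram above.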

Building a model using adaboost, and feeding in the wordcount from today’s New York Times, gives the following results (updated intermittently):

Prediction Time (GMT)    Winner     Probability    Model Accuracy
06/11/2016 – 18:00       Trump      52.8%          52.9%
06/11/2016 – 23:30       Hillary    52.9%          52.3%
07/11/2016 – 16:30       Hillary    64.2%          53.8%
07/11/2016 – 23:30       Hillary    52.5%          63.6%
08/11/2016 – 08:15       Hillary    60.2%          63.6%

My final prediction is HILLARY TO WIN WITH 52.5% OF THE NON-THIRD PARTY CANDIDATE VOTES.

I have decided to use the final prediction from the day before the election as the newspapers on the day are likely to have unusual headlines that do not fit the model. Also, a few stations had already announced their results when I made my prediction on the day, so it’d be cheating (sort of).

The model was never really designed to estimate voter share, but the probability is the best estimate I have of it so that’s what I’m using.

All these predictions come with an extremely low accuracy, however – at around 63.6% it isn’t much better than just guessing, but it is better nonetheless.

The prediction method has now been updated to take the wordcounts from the previous 5 days. The most recent days provide the most weight, with weights of 0.5, 0.25, 0.14, 0.07 and 0.04 per day (most recent to least recent). Where did these numbers come from? Basically an educated guess – each day carries roughly half the weight of the day after it.
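One plausible way to apply those weights is a weighted sum of the five daily wordcount lists, sketched below in Python. Treat this as an assumption about how the weighting could work rather than a description of the R code used for this post; the two-word example data is invented.

```python
# Weights from the post, most recent day first.
WEIGHTS = [0.5, 0.25, 0.14, 0.07, 0.04]


def weighted_counts(daily_rows):
    """Combine five days of word counts (most recent first) into one input row."""
    n_words = len(daily_rows[0])
    return [sum(w * day[i] for w, day in zip(WEIGHTS, daily_rows))
            for i in range(n_words)]


# Hypothetical counts for two words over five days, most recent day first.
last_five_days = [[4, 0], [0, 4], [0, 0], [0, 0], [0, 0]]
combined = weighted_counts(last_five_days)
print(combined)  # [2.0, 1.0]
```

A word mentioned four times today counts twice as much as one mentioned four times yesterday, and the weights add up to 1 so the combined row stays on the same scale as a single day's counts.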

Here are the words which the model determined to be the most important for a prediction:

Figure: Variable importance of each word in the model.

The predictive model attaches the above importances to the above words. It seems the number of times the word ‘David’ is mentioned is the biggest predictor of whether a democrat or republican is the winner for that day.

As alluded to before, the adaboost model doesn’t just look for trivial associations. Just because the word ‘republican’ is mentioned more on a day doesn’t mean republicans are leading the ‘polls’ that day. Looking at the above words, we see many that we would not expect to be predictors.

Conclusion

In line with most poll results, my model is currently predicting that Hillary Clinton will win the election.

The model discussed here was put together in just a day and uses extremely limited input data. However, the use of stock market and news data as predictors removes the need to rely on poll data for election prediction. By running the adaboost algorithm, we can use this data and not worry about having to artificially correct for factors caused by unreliable polls.

On the other hand, we are assuming there is indeed a correlation between the day’s news and the likelihood of a certain candidate winning. Likewise, we are assuming the price of the chosen collection of stocks shows at least some correlation with the preferred candidate for that day. For these reasons, the model accuracy (and therefore reliability) is low – not much above a simple guess.

But then again, it seems that polls are not much better than a simple guess at the moment.

In other words, it is almost certainly possible to provide a reasonable prediction for the election result using data mining algorithms such as boosted decision trees. However, months of data collection and model validation and parameter tuning are necessary to develop a reliable model. Perhaps in the future such models will overtake poll data as the main means of predicting the outcome. If I get the time, I will try and develop a more robust model ahead of the next major election and see if I can get a good prediction on the actual votes per candidate.

Until then, get over to FiveThirtyEight for the best prediction available.

Links and further reading

A more in depth description of C5.0 trees and boosting:

http://www.patricklamle.com/Tutorials/Decision%20tree%20R/Decision%20trees%20in%20R%20using%20C50.html

 

An excellent overview of the adaboost algorithm (includes a description of how it is used for facial recognition):

 

MIT Lecture on boosting:

 

FiveThirtyEight – the best election predictor available on the internet!

http://fivethirtyeight.com/

 

Code

The code for the model and data retrieval scripts was written in R and is available here:

https://www.dropbox.com/sh/p0azzzt441t360r/AABSFV0mF9-XpPP-TA6T3O4ga?dl=0

The excel sheets show the stocks I used to build the indices. The R files have comments which describe how to use them to build your own model.

 


