Watching the Watchmen: Finding Spyplanes with Autoencoder Neural Nets

One of the weirder datasets on Kaggle is the ‘spyplanes’ dataset. It consists of flight data from 2015 and was originally used in a Buzzfeed News article in which the author, Peter Aldhous, set out to tell spyplanes apart from other aircraft using machine learning. His analysis worked by training a model on flight data for known FBI and US Security Services aircraft and then looking for other aircraft with similar behaviour.

I think there is another viable approach to this problem.

Surveillance aircraft are, by definition, supposed to be difficult to identify. By comparing aircraft behaviour with that of known government aircraft, we risk missing many other planes which may behave very differently from those we manually identified. Rather than training a model on a known dataset (aka supervised learning), I decided it was more appropriate to scour the whole dataset of flight data for unusual behaviour (aka unsupervised learning). In other words, I’m going to look for aircraft which look ‘weird’…otherwise known as outliers. To do this I’m going to use an autoencoder neural network.

All code and the Jupyter notebook from which this page was created are available at

The page actually looks far better in Jupyter notebook form – unfortunately importing it into the free version of WordPress messes it up a bit. I want to improve the general presentation of these posts but that’s for the future.

The Dataset

Full details of the dataset used are available at

It consists of modified flight data from 2015. As this project was more about finding an excuse to use an autoencoder, I chose a pretty convenient dataset which has already been engineered for features. Were I using raw flight data, I’d probably have performed the feature engineering differently, but that’s not interesting now.

Each entry includes an aircraft with a number of distance/time based metrics, to give an idea of the typical flight pattern. Also included is the aircraft type and transponder info and number of observations/flights seen over the period.

The original author also included a small directory of aircraft which he believed were spy aircraft (as they were operated by the FBI or US Defence Department) and ones which he believed were not. In total he identified 97 probable spy aircraft and 500 non-spy aircraft. He used this ‘known’ dataset to train his model. I am going to use it for model validation, but not in the normal sense. Remember, we do not know that all of the 97 aircraft are actually spyplanes, and any of the 500 others could be too.

Processing the dataset

Rather than training the model on a dataset and then testing it with known results, a kind of hybrid approach was used. A model was trained on a subset of the main dataset and then tested on the table of known aircraft. Since we do not know how many of the labels on the ‘known’ aircraft are correct, the performance was only really an estimate. The model was then applied to the rest of the main dataset to give candidate spyplanes. Normally we’d just run a model on the raw dataset and let it tell us which entries were the outliers, but in this case we have the luxury of some ‘sort of’ labelled data.

First the necessary libraries are imported and the main dataset loaded. A second table called ‘labelled_data’ is created. This contains the ‘known’ spyplanes and non-spyplanes and will be used to test the model. These aircraft are removed from the main dataset.

#import libraries
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from scipy import stats
import tensorflow as tf
import seaborn as sns
from pylab import rcParams
from sklearn.model_selection import train_test_split
from keras.models import Model, load_model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix

#import dataset

#convert type column to integer (there are probably better approaches to use here)

#import labelled aircraft
test_ident = pd.read_csv("train.csv")

#use the labelled data as a train/test set (note the different context of testing to normal for this model)


10% of the aircraft from the main dataset are then removed and used to build a training set. Note, the data was scaled between 0-1:

#use 10% of the input data as a training set

#also consider adding some of the actual data to the train/test set to improve the size
test_set = labelled_data

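#save test set labels for later
test_set_labels = test_set['class']
#get number of positive classes in test set (assumes positives are labelled 'surveil', as in the prediction step)
test_set_positives = (test_set_labels == 'surveil').sum()
test_set = test_set.drop(['class'], axis=1)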

#convert to array and normalise
train_set = preprocessing.MinMaxScaler().fit_transform(train_set.values)
test_set = preprocessing.MinMaxScaler().fit_transform(test_set.values)
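
One thing worth noting about the scaling step: the listing above fits a separate MinMaxScaler to each table, so the same raw value can map to different scaled values in the two sets. A common alternative is to learn the minimum and maximum from the training set only and reuse them for the test set. The sketch below illustrates that idea with plain NumPy on made-up numbers; the array names and values are purely illustrative and are not taken from the notebook.

```python
import numpy as np

# Made-up feature matrices (rows = aircraft, columns = features); illustrative only.
train_raw = np.array([[100.0, 2.0], [200.0, 4.0], [300.0, 6.0]])
test_raw = np.array([[150.0, 3.0], [400.0, 8.0]])

# Learn the per-column range from the training data only.
col_min = train_raw.min(axis=0)
col_max = train_raw.max(axis=0)
col_range = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero

# Apply the same transformation to both sets.
train_scaled = (train_raw - col_min) / col_range
test_scaled = (test_raw - col_min) / col_range

print(train_scaled)
print(test_scaled)
```

Test values that fall outside the training range end up outside 0-1, which is arguably useful here since it flags them as unusual before the model even sees them. With scikit-learn, the equivalent would be to call `fit` on the training set once and `transform` on both sets.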

Applying A Model

The aim was to find an algorithm which would identify outliers in the data. For this dataset, aircraft which behave oddly may well be surveillance aircraft. I used an autoencoder neural network to do this, here’s how they work:

I’m not going to go into too much detail about autoencoders. In a sentence, they take the input data, compress it, and then try to ‘re-predict’ the input data from the compressed version. Here’s how you’d represent one, with nodes and links:


In the example above, a noisy image forms the input data. By compressing it and then recreating it, the noise is removed, giving a clearer version of the original image.
At first glance this seems weird: training something to recreate the input data? It becomes useful when we consider what happens if we feed the network an unusual data point. By training it on typical data points, it learns to approximate something which is typical. However, if the trained network is then given an anomalous point, it is likely to recreate it with a high degree of error. By measuring the difference (ie error) between the input data and the recreated points, we can identify which are the largest outliers in the data.

In other words, a properly trained autoencoder can spot unusual data points.
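
To make the idea concrete without training a network, here is a minimal sketch that swaps the neural autoencoder for its simplest linear cousin: a projection onto one direction found with SVD (essentially PCA). It compresses three features to one value, expands them back, and measures the per-point RMSE. The data is synthetic and every name in it is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 'typical' data: 3 features that all follow one hidden factor, plus small noise.
factor = rng.normal(size=(500, 1))
typical = factor @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(500, 3))

# 'Train': find the single direction that best compresses the typical data.
mean = typical.mean(axis=0)
_, _, vt = np.linalg.svd(typical - mean, full_matrices=False)
components = vt[:1]  # 3 features -> 1 compressed value


def reconstruction_error(points):
    """Compress each point to 1 value, expand it back, and return the RMSE per point."""
    centred = points - mean
    compressed = centred @ components.T
    rebuilt = compressed @ components
    return np.sqrt(np.mean((centred - rebuilt) ** 2, axis=1))


normal_point = np.array([[1.0, 2.0, -1.0]])  # follows the learned pattern
odd_point = np.array([[1.0, -2.0, 1.0]])     # breaks the pattern

err_normal = reconstruction_error(normal_point)[0]
err_odd = reconstruction_error(odd_point)[0]
print(err_normal, err_odd)
```

The point that follows the learned pattern is rebuilt almost exactly, while the one that breaks it comes back with a much larger error. A neural autoencoder does the same job with non-linear layers, so it can learn more complicated notions of ‘typical’.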

To do this with Keras/TensorFlow we first define the layers of the network. The input and output layers must be the same size as the data, with a node for each attribute. The intermediate layers are smaller (half the input size, then a quarter), which forces the network to compress the records as described above.

#define layers
input_dim = test_set.shape[1]
encoding_dim = int(input_dim/2)

input_layer = Input(shape=(input_dim, ))
encoder = Dense(encoding_dim, activation="tanh")(input_layer)  #any further keyword arguments are missing from this listing
encoder = Dense(int(encoding_dim / 2), activation="relu")(encoder)
decoder = Dense(int(encoding_dim / 2), activation='tanh')(encoder)
decoder = Dense(input_dim, activation='relu')(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)

Next the model was trained and the best performing model saved. Note all the parameters, like the optimiser and the loss function. These can be changed, along with the number and size of the layers. This process is called hyperparameter tuning: we seek the combination of hyperparameters that gives the best results. For this exercise, a few arbitrary parameter settings were tried and the best one chosen. Normally I’d tune the parameters far more extensively but since this is just a demo I didn’t want to get too bogged down with this part.

Note, the model was only trained on the 10% of aircraft which were used as the training set. Ideally I’d train it only on aircraft which were known not to be spyplanes. This would make it optimised for reconstructing ‘normal’ aircraft patterns only, and unusual aircraft not seen in the training set would be reconstructed with a high degree of error. I figured very few of the training set aircraft would be spyplanes so this is probably not a huge issue.

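nb_epoch = 100
batch_size = 50
#the compile call and some keyword arguments are missing from this listing;
#the optimiser, loss and checkpoint settings below are assumptions, not the original values
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
checkpointer = ModelCheckpoint(filepath="model.h5", monitor='loss', save_best_only=True)
tensorboard = TensorBoard(log_dir='./logs')
history = autoencoder.fit(train_set, train_set,
                    epochs=nb_epoch,
                    batch_size=batch_size,
                    callbacks=[checkpointer, tensorboard]).history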


Now it’s time to test the model on the dataset of ‘known’ aircraft. Remember, I am not convinced that all of those labelled ‘spyplanes’ are indeed spyplanes. Similarly, many of the aircraft labelled ‘normal’ could well be spyplanes. I was therefore not expecting great performance. To classify aircraft, I ensured the model only labelled the top 97 most anomalous entries as spyplanes, because that is how many were labelled as such in the original ‘known’ aircraft dataset.

#predict on testing set
predictions = autoencoder.predict(test_set)
#root-mean-square reconstruction error per aircraft
rmse = np.sqrt(np.mean(np.power(test_set - predictions, 2), axis=1))
error_table = pd.DataFrame({'reconstruction_error': rmse,
                        'actual_class': test_set_labels})
#we know how many entries are positive in the test set. Take this number and label the predictions with the highest rmse (ie the outliers) as the positive class
error_table['predicted_class'] = np.where(error_table['reconstruction_error'] >= min(error_table.nlargest(int(test_set_positives),'reconstruction_error', keep='first')['reconstruction_error']), 'surveil', 'other')

#get confusion matrix
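
The labelling and counting steps can be sketched without any libraries. The example below uses six made-up aircraft: it flags as many of the highest-error entries as there are labelled positives, then counts each (actual, predicted) pair, which gives the four cells of a confusion matrix. The values and variable names are illustrative assumptions only.

```python
from collections import Counter

# Made-up reconstruction errors and labels for six aircraft; illustrative only.
errors = [0.05, 0.40, 0.10, 0.30, 0.02, 0.25]
actual = ["other", "surveil", "other", "other", "other", "surveil"]

# Predict as many positives as there are labelled positives.
n_positives = actual.count("surveil")

# Indices of the n_positives largest errors.
ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
flagged = set(ranked[:n_positives])
predicted = ["surveil" if i in flagged else "other" for i in range(len(errors))]

# Count each (actual, predicted) pair: these are the four confusion-matrix cells.
cells = Counter(zip(actual, predicted))
print(cells)
```

Because the number of predicted positives is fixed to the number of actual positives, every missed spyplane is matched by one wrongly flagged aircraft. The `confusion_matrix` function imported from scikit-learn at the top would produce the same counts from the two label lists.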

When I initially ran the script, 31 of the 97 spyplanes were correctly identified. By tuning the parameters, I managed to improve this to 54, which gives the following confusion matrix:

                   predicted normal   predicted spyplane
actual normal            457                  43
actual spyplane           43                  54

The rows represent actual classifications and the columns predicted classifications, so the top-left to bottom-right diagonal holds the correct classifications. There were 43 false positives and 43 false negatives (the two will always be equal in this exercise since I fixed the number of positive predictions to the number of actual positives). At first glance this might look like a not-too-great result but I was actually pretty impressed by how well it performed: it found 54 of the 97 aircraft which were known to be government operated. Let’s remember that the algorithm is identifying unusual aircraft just by the fact they fly in an irregular pattern. It is not using any known data to correlate observed aircraft with known spyplanes.
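
For reference, the usual summary metrics follow directly from the four cells of the confusion matrix above. This is just arithmetic on the reported counts; the variable names are my own.

```python
# Cell counts from the confusion matrix above (rows = actual, columns = predicted).
true_neg, false_pos = 457, 43
false_neg, true_pos = 43, 54

total = true_neg + false_pos + false_neg + true_pos
precision = true_pos / (true_pos + false_pos)  # flagged aircraft that were labelled spyplanes
recall = true_pos / (true_pos + false_neg)     # labelled spyplanes that were flagged
accuracy = (true_pos + true_neg) / total

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

Precision and recall are identical (about 0.557) because false positives equal false negatives, while overall accuracy is about 0.856, helped along by the large number of correctly ignored normal aircraft.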

Let’s also remember that some of the aircraft identified as spyplanes which were labelled as ‘normal’ may indeed be spyplanes. The test data relies on accurate labelling of aircraft, which is impossible to guarantee.

Let’s have a look at the reconstruction error for each of the aircraft in the test set. This is a measure of how much of an outlier each aircraft was – recall the top 97 outliers were taken to be spyplanes:

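#plot error, one series per actual class
groups = error_table.groupby('actual_class')
fig, ax = plt.subplots()
for name, group in groups:
    ax.plot(group.index,group.reconstruction_error,marker='o', ms=3.5, linestyle='',
            label= name)
ax.legend()
plt.show()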


Now that is cool. The plot above shows how ‘wrong’ the autoencoder was when reconstructing each aircraft. In other words, the higher a point is on the y axis (each point is an aircraft), the more anomalous that point is. Although far from perfect, the testing examples we believe are spyplanes do tend to be more anomalous. This plot strongly suggests the method is capable of identifying surveillance aircraft. Remember, the model was trained and applied without any labels on the training/testing data. So the algorithm has no idea which examples are spyplanes…it doesn’t even know this is aircraft data at all. It has just identified points in the data which are dissimilar to other points. By colouring the points with their class we have shown that the autoencoder picks out spyplanes far better than chance: flagging 97 of the 597 aircraft at random would identify only about 16 spyplanes correctly on average (97 × 97 / 597 ≈ 15.8).
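
The chance baseline can be checked with a few lines of arithmetic. The sketch below assumes the flagged aircraft are drawn uniformly at random from the 597 labelled ones; the variable names are mine.

```python
# Counts from the labelled test set described above.
n_total = 597   # 97 labelled spyplanes + 500 labelled normal aircraft
n_spy = 97      # labelled spyplanes
n_flagged = 97  # aircraft the model was allowed to flag

# Picking n_flagged aircraft uniformly at random, each pick is a spyplane with
# probability n_spy / n_total, so the expected number of hits is:
expected_hits = n_flagged * n_spy / n_total

observed_hits = 54
print(f"expected by chance: {expected_hits:.1f}, observed: {observed_hits}")
print(f"lift over chance: {observed_hits / expected_hits:.1f}x")
```

That puts random guessing at roughly 16 correct out of 97, so finding 54 is a bit more than three times better than chance.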


Note, since I only allowed it to label 97 aircraft as spyplanes, it is difficult to get a high hit rate, but finding 54/97 = 56% of the labelled spyplanes is way better than I was expecting.

Applying the model to the unlabelled data

Now that we have a good degree of confidence that this approach can distinguish between normal aircraft and spyplanes (well, government operated aircraft anyway) it’s time to apply it to the whole dataset and see which aircraft are identified as most likely to be spyplanes:


The higher the x value of a point, the more likely it is to represent a spyplane. Of course, we don’t know how many spyplanes are in the dataset, so we can’t pick a principled threshold value for the error. However, just for fun, I picked 0.2 as the threshold. It’s pretty arbitrary, but above that we have only the most anomalous data points. So, I’m going to go ahead and say these points represent spyplanes. Let’s get the identification numbers for those aircraft:

#get identifiers for predicted spyplanes
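
As a rough sketch of the thresholding step, the example below keeps the identifiers whose reconstruction error exceeds 0.2 and sorts them from most to least anomalous. The identifiers and error values are made up, and the array names are hypothetical rather than the ones used in the notebook.

```python
import numpy as np

# Made-up identifiers and reconstruction errors; illustrative only.
aircraft_ids = np.array(["AAA001", "AAA002", "AAA003", "AAA004"])
reconstruction_error = np.array([0.03, 0.27, 0.11, 0.21])

threshold = 0.2  # the arbitrary cut-off chosen in the text

# Keep the identifiers whose error is above the threshold, most anomalous first.
mask = reconstruction_error > threshold
order = np.argsort(-reconstruction_error[mask])
candidates = aircraft_ids[mask][order]

print(candidates.tolist())
```

Raising or lowering `threshold` trades the number of candidates against how anomalous they are, which is why the final count depends so heavily on an arbitrary choice.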

That’s 160 candidate aircraft identified. Let’s see how many of the ones I identified were in common with the 101 identified by Peter Aldhous, the original author of the Buzzfeed News article:

#compare to previous results

Only 17 were in common with the previous results. This is not necessarily a bad thing since the two methods used completely different approaches. Both could be good classifiers, neither could be, or one could be good and the other bad (or anything between these extremes). Without further data it is impossible to know.
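
Comparing two candidate lists is a natural fit for Python sets. The sketch below uses made-up identifiers to show the overlap, what each method found on its own, and the Jaccard similarity as a single-number summary; none of these names come from the notebook.

```python
# Made-up identifier sets; illustrative only.
autoencoder_candidates = {"AAA001", "AAA002", "AAA003", "AAA004"}
previous_candidates = {"AAA003", "AAA004", "AAA005"}

in_both = autoencoder_candidates & previous_candidates
only_autoencoder = autoencoder_candidates - previous_candidates
only_previous = previous_candidates - autoencoder_candidates

# Jaccard similarity: shared candidates as a fraction of all distinct candidates.
jaccard = len(in_both) / len(autoencoder_candidates | previous_candidates)

print(sorted(in_both), len(only_autoencoder), len(only_previous), round(jaccard, 2))
```

With the figures above (160 and 101 candidates, 17 shared), the same formula gives 17 / 244, or about 0.07, which underlines how differently the two methods behave.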


You can run the script yourself to get the list of candidates, but let’s look at the flight history of a couple of them from Flightradar24. These flights are from the last week (as of 2018-01-10). You’ll have to take my word for it, but I didn’t cherry-pick these: they were the first aircraft I found which had flown within the last week:
Aircraft ADF7A5 (registration = N9999)

That’s definitely odd – this flight happened just 4 hours before time of writing. It looks like the flight departed from a small airport in the middle of Fort Lauderdale; then kind of circled around for 25 minutes before apparently disappearing.


How about this one:

Aircraft 1996CB (registration N171HP)


This flight happened on 2018-01-05. It took off from Columbus, Ohio and then just kind of circled over the city for more than 3 hours. That has to be a surveillance aircraft! I used to do a bit of flying myself and I’ve never seen a flight plan anything like that.

In fact, I have a good guess (and it’s just a guess) as to exactly what this might be. A while ago, I listened to a Radiolab podcast about a trial which used aircraft repeatedly taking high-res photos over cities to try and detect crime. You can listen to it here: . The website for the system is here: . It looks awfully like this: a small aircraft circling a city for several hours.

For the record, I’m dead against this kind of mass surveillance but we’ve discussed that in another post.


And finally, I’m happy to admit that not all of my predictions will be spyplanes. Look at this absolute beast which I came across in the results. It’s a converted Hercules owned by the California Highway Patrol, so in a sense it is kind of a surveillance plane. I can see why my method would identify this as a spyplane even though it isn’t (…or is it?):




Using an unsupervised approach is the way to go when we cannot be sure whether our data is labelled correctly. For this problem, we had some labelled data but personally I wasn’t very confident in its reliability. The unsupervised approach looked for ‘odd’ data points rather than trying to fit them to known examples.

This application was a bit of fun really, rather than a practical example. However, autoencoders are genuinely useful tools for anomaly detection. We can imagine this approach being used for all kinds of applications, such as equipment failure detection, monitoring medical patients or early warning systems.

With some more tuning, I believe my approach could be improved further. I may revisit this project in the future, so keep an eye on the GitHub repository if you are interested.

