Leaked Password Analysis

There have been a number of good analyses on leaked password databases over the years. It’s about time I did one, but in this post we’re going to go beyond just looking at how many people used ‘password’ or ‘123456’. What I want to know is whether we can use machine learning to find interesting relationships or statistics that a human can miss.


How I got the data (and why this is not illegal)


Most articles similar to this use a database of passwords that have been publicly leaked by a hacker.

Although the title is ‘leaked password analysis’, it’s not entirely accurate, since I am not aware that the data I am using has been publicly leaked. That said, I was able to get it so easily that it almost certainly has been by now.

So how did I get hold of a list of usernames, passwords and other credentials? As long as you aren’t fussy about where they come from, it’s staggeringly simple. I just did a Google search for Excel documents containing the phrases ‘username’, ‘password’ and ‘email’. Yeah, that’s it. If some idiot anywhere in the world has accidentally saved a spreadsheet of members’ details and made it publicly accessible, it will show up in that search. Surely that never happens, or if it does it’s like a one in a million event? Well, there are millions of organisations around the world, so that search picks up the ones that are that one in a million.

What I found was a list of user credentials for an organisation in the USA. I am not going to state which organisation it was, and I am not going to provide the data here – just the results of my analysis. In fact, I emailed them and let them know they were broadcasting their members’ details to the whole world (they took the spreadsheet down pretty quickly after that).


The data

The data consists of a spreadsheet of 6614 users’ details. It includes:

  • First name
  • Last name
  • username
  • password (encrypted)
  • plain text password (!!!)
  • date of registration
  • date of last login
  • personal email
  • signup IP address

Yep, the organisation took the effort to encrypt the passwords but someone has saved the plaintext version in the same document. Amazing.

The fact that this dataset contains personal emails made me hesitant to use it for anything. But I decided since I am not making it public, and all I am doing is an analysis of the dataset itself, it is acceptable to go ahead.

The other really interesting attribute is the ‘signup IP address’. This gives us the location of users when they signed up (barring any using a proxy). Again, in the wrong hands this would be extremely sensitive data, but we can use it to draw some really interesting conclusions.

At only 6614 users, it’s a small dataset, but that gives us a lot more scope to experiment with different methods that would otherwise take hours. It’s still definitely large enough to look for interesting trends.


Part 1 – Location Data

The signup IP addresses can easily be translated to location data. Putting any valid IP address into this link will return lat-long coordinates, the city, the region, the country and other relevant information:

http://freegeoip.net/json/”insert IP address here”

So, I wrote an R script to feed in all the IP addresses one by one and save the location data. This gives us the approximate location of each user at signup. Let’s plot them out:


Users by longitude and latitude (it’s shaped like the USA for anyone who hasn’t noticed!)

Look familiar? You can probably make a good guess at which state the organisation is based in from that. I particularly like how users in Hawaii and Alaska show up too. Along with the lat and long coordinates, I saved the state of residence for each user – we’ll be using that later.
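The R script itself isn’t shown here, but the lookup loop is simple enough to sketch. The following is a rough Python equivalent, not the original code. It assumes the endpoint quoted above still responds (it may well not) and that the JSON reply carries `latitude`, `longitude`, `region_name` and `country_name` fields; apart from `region_name`, which shows up in the rules later on, those field names are assumptions. The function names are made up for illustration.

```python
import json
from urllib.request import urlopen

# Endpoint pattern quoted in the post; the service may no longer be available.
BASE_URL = "http://freegeoip.net/json/"


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def locate(ip, fetch=http_get):
    """Look up one IP address and keep only the fields used in the analysis."""
    record = json.loads(fetch(BASE_URL + ip))
    # Field names are assumed; only region_name appears in the post itself.
    return {
        "ip": ip,
        "latitude": record.get("latitude"),
        "longitude": record.get("longitude"),
        "region_name": record.get("region_name"),
        "country_name": record.get("country_name"),
    }


def locate_all(ips, fetch=http_get):
    """Look up the addresses one by one, skipping any that fail."""
    results = []
    for ip in ips:
        try:
            results.append(locate(ip, fetch))
        except (OSError, ValueError):
            continue  # network error or malformed response
    return results
```

Passing the fetch function in as a parameter makes it easy to swap in a different geolocation service, or a canned response when testing.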


Part 2 – General analysis, aka ‘stand back, I know regular expressions!’

With the location data and the raw data saved in a single file, I then started to go through it to get some basic statistics.


Most common passwords

This is the most obvious place to start. I was actually quite impressed that no password was used an overwhelmingly large number of times. Perhaps the members of this particular organisation are more security aware than most. Here are the most commonly used passwords:

password number of occurrences
[name of the organisation]* 6
baseball 5
buster 5
cowboy 5
hotrod 5
pepper 5
123456 4
abc123 4
[removed] 4
[removed] 4
retired 4
1990ford 4
777777 3
bigdog 3
blademan 3
boomer 3
harley 3
hunter 3
mustang 3
password 3
union 3

To be honest, nothing too unusual here. Many of these appear in similar analyses. As I mentioned, I was surprised by the low number of repeating passwords: the most common one only appears six times, so I definitely think the users of the website in question are better at picking passwords than most.

Speaking of the most common password in the list, I had to remove it since it is the name of the organisation! More on that in a bit. I had to remove a couple of others too since they give clues to the nature of the organisation.


Dictionary search

From this point on, everything in this section involves matching the data against regular expressions. For those unfamiliar, these are basically like using a wildcard but much more powerful.

I downloaded a list of about 500,000 English words. I then counted the number of passwords which consisted of a dictionary word. Next, I looked for passwords CONTAINING a dictionary word.

Here are the most common words, in descending order of occurrence, from the dictionary found within passwords (5 letters or more only to exclude trivial cases):


100 most common dictionary words appearing in the passwords

 “[removed]” “[removed]” “union” “[removed]” “uster” “opera” “chevy”  “hunter” “retire” “stang” “buster” “cowboy” “erato” “harle”  “operator” “[removed]” “brand” “grade” “honda” “mustang” “sword”  “cooper” “hotrod” “retired” “rober” “robert” “tired” “willi”  “aider” “angel” “baseball” “cooter” “dodge” “eeler” “ember”  “[removed]” “steve” “white” “atman” “bronc” “bronco” “butte”  “chest” “christ” “daddy” “ester” “horse” “iller” “inter”  “james” “password” “pepper” “ronco” “steel” “aiders” “bandi”  “bandit” “black” “blade” “bubba” “buddy” “butter” “charlie”  “chester” “cookie” “dakota” “[removed]” “drill” “fishing” “football”  “granite” “green” “happy” “hawaii” “honey” “jesus” “jordan”  “lores” “michael” “molly” “olden” “peter” “river” “roost”  “rooster” “sammy” “sierra” “steele” “steeler” “super” “track”  “utter” “acker” “actor” “ammer” “andre” “anger” “annie” “boomer” “[removed]”

Unfortunately I had to remove a few since they gave clues to the nature of the organisation. Let’s just say, had the organisation been an airline, for example, those removed would have been words like ‘plane’ and ‘pilot’.

Then I looked for passwords containing a dictionary word with a number at the end. And then passwords containing a dictionary word with more than one number at the end.
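For anyone who wants to try this on their own data, here is a rough Python sketch of the four checks. It is not the code behind the figures in this post. It assumes `words` is a set of lowercase dictionary words, that matching is case-sensitive, that ‘digit’ means exactly one trailing digit, and that the five-letter minimum applies to the ‘contains’ check; the function names are hypothetical.

```python
import re


def contained_words(password, words, min_len=5):
    """Return every dictionary word of at least min_len letters inside password."""
    found = set()
    for start in range(len(password)):
        for end in range(start + min_len, len(password) + 1):
            if password[start:end] in words:
                found.add(password[start:end])
    return found


def classify(password, words, min_len=5):
    """Return the dictionary-related properties of one password."""
    # Lowercase letters followed by zero or more digits, and nothing else.
    match = re.fullmatch(r"([a-z]+)(\d*)", password)
    stem, digits = (match.group(1), match.group(2)) if match else ("", "")
    return {
        "is_word": digits == "" and stem in words,
        "is_word_plus_digit": len(digits) == 1 and stem in words,
        "is_word_plus_digits": len(digits) > 1 and stem in words,
        "contains_word": bool(contained_words(password, words, min_len)),
    }
```

Checking each substring of the password against a set is much quicker than testing all 500,000 words against every password, because passwords are short and set lookups are cheap.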


Dictionary search results (note: this only picks up lowercase matches):

Password characteristic % of passwords with this property
is dictionary word 8.3
contains dictionary word 39.3
is dictionary word + digit 12.3
is dictionary word + multiple digits 6


All the above passwords are vulnerable to a dictionary attack. Sure, they are a lot better than useless, but if someone really wanted to get into an account, these would be relatively easy to circumvent. It’s also worth noting, my dictionary search only looked for lowercase passwords, so many more would probably show up if I made the first letter a capital.


Update – non-case sensitive dictionary search

In addition to the above, I also looked for dictionary words which appear in passwords regardless of the case:


Password characteristic % of passwords with this property
Contains dictionary word (non case-sensitive) 45.5

While 39.3% of passwords contain a dictionary word in lowercase only, 45.5% contain a dictionary word in any case. This is a smaller difference than I expected actually. It seems that people have the tendency to not include capitals when they use a dictionary word.

Here are the 100 most common dictionary words appearing in passwords when the case is ignored:

 “[removed]” “[removed]” “[removed]” “opera” “union” “chevy” “hunter” “buster” “retire” “uster” “[removed]” “brand” “erato” “harle” “operator” “cooper” “cowboy” “stang” “angel” “bandi” “bandit” “grade” “honda” “james” “mustang” “retired” “rober” “robert” “sword” “tired” “white” “christ” “dodge” “eeler” “hotrod” “peter” “willi” “aider” “andre” “baseball” “black” “cooter” “[removed]” “ember” “fishing” “green” “hawaii” “horse” “iller” “[removed]” “password” “raider” “sierra” “steve” “atman” “bronc” “bronco” “buddy” “butte” “charlie” “chest” “daddy” “david” “[removed]” “[removed]” “ester” “football” “granite” “inter” “jesse” “jesus” “michael” “pepper” “river” “ronco” “scoot” “scooter” “steel” “super” “survey” “aiders” “annie” “blade” “boomer” “bubba” “butter” “chester” “cookie” “daisy” “dakota” “diesel” “grand” “happy” “honey” “jordan” “kelly” “lance” “lores” “marie” “molly”

This isn’t too different to the case-sensitive dictionary matches. But, we do see more words related to the actual organisation used here (they are the ones I had to remove). That is kind of interesting – folk are using these obvious words and throwing in a few capitals and numbers in an attempt to make their passwords more secure, but keep them memorable.

Now let’s look at the passwords that are useless:


Stupid passwords

To be fair, there were fewer of these than I expected. We’ve already seen a few stupid ones above – namely ‘password’, ‘baseball’ and ‘123456’. Let’s look more into the ones which contain the name of the organisation.

The organisation itself has a couple of alternative names, so I searched for passwords which contain any of the names of the organisation.

Then I went for passwords containing the user’s actual first and last names.

And finally, passwords which either are or contain the username.
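These checks boil down to a handful of substring tests per user. Here is a minimal Python sketch, assuming case-insensitive matching (the post doesn’t say which way the original checks went) and using made-up function and field names:

```python
import re


def personal_flags(password, username, first_name, last_name, org_names):
    """Flag passwords built from the organisation's or the user's own details."""
    pw = password.lower()
    user = username.lower()

    def inside(text):
        # An empty field should never count as a match.
        return bool(text) and text.lower() in pw

    return {
        "contains_org": any(inside(name) for name in org_names),
        "is_org": pw in {name.lower() for name in org_names},
        "contains_first_name": inside(first_name),
        "contains_last_name": inside(last_name),
        "is_username": bool(user) and pw == user,
        "contains_username": inside(username),
        "is_username_plus_digit": bool(user)
        and re.fullmatch(re.escape(user) + r"\d", pw) is not None,
    }
```

The `re.escape` call matters because usernames can contain characters such as `.` or `+` that would otherwise be read as regular expression syntax.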

stupid passwords summary:

password characteristic % of passwords with this property
password contains name of organisation 1
password IS name of organisation 0.15
password contains user’s forename 1.2
password contains user’s last name 1
password IS the username 0.35
password CONTAINS the username 0.85
password is the username + one digit 0.27


The last ones are stronger if the username is not known. However, let’s see how many people’s username is contained in their public email:

32.6% of users have their username in their email address

Wow, that was higher than I expected. If you knew the email of a user, there’s a 1 in 3 chance you have a good idea what their username is. Even if you get it wrong, you can just keep guessing by using substrings from the email until you get it right.

The site gives 10 attempts at guessing a password (yeah, this is actually written in the raw data). We could get about 1/3 of the usernames from email addresses if we had those available. Then if we guessed the password as the username, followed by the 9 most common passwords from above, we would gain access to an average of:

0.326 * (6+5+5+5+5+5+4+4+4 + (0.0035*6614)) ≈ 22 accounts

The bracket is the number of accounts whose password is one of our ten guesses (about 66), and we know the username for roughly 32.6% of them.

Sure, this isn’t a huge number, but we could get into that number of accounts (on average) just by guessing. Incidentally, if we were only allowed 3 attempts at each password, we’d still get into 10 accounts.
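The estimate above is the share of accounts whose username can be inferred, multiplied by the number of accounts using one of the ten guesses. Here it is as a few lines of Python, under the same simplifying assumptions: knowing a username is treated as independent of the password choice, and the ten guesses are treated as not overlapping. The function name is hypothetical.

```python
def expected_accounts(guess_counts, known_username_rate):
    """Expected number of accounts opened by trying every guess on every
    account whose username we know.

    guess_counts: number of accounts using each guessed password.
    known_username_rate: fraction of accounts whose username can be inferred.
    """
    return known_username_rate * sum(guess_counts)


total_users = 6614
top_nine = [6, 5, 5, 5, 5, 5, 4, 4, 4]       # counts from the table of common passwords
username_as_password = 0.0035 * total_users   # 0.35% of accounts

estimate = expected_accounts(top_nine + [username_as_password], 0.326)
print(round(estimate))
```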

Weak passwords

Ok, so we’ve seen the really bad passwords, and the ones that could be easily cracked by someone determined enough. Now let’s look at better ones, but still not strong passwords.


weak passwords:

password characteristic % of passwords with this property
text only passwords 28.1
lowercase only passwords 25.3
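Both of these are one-line regular expressions. Here is a small Python sketch, assuming ‘text only’ means letters only in any case and ‘lowercase only’ means lowercase letters only; the helper names are made up.

```python
import re


def is_text_only(password):
    """True if the password consists solely of letters (either case)."""
    return re.fullmatch(r"[A-Za-z]+", password) is not None


def is_lowercase_only(password):
    """True if the password consists solely of lowercase letters."""
    return re.fullmatch(r"[a-z]+", password) is not None


def percentage(passwords, predicate):
    """Percentage of passwords for which predicate returns True."""
    return 100.0 * sum(1 for p in passwords if predicate(p)) / len(passwords)
```

For example, `percentage(passwords, is_lowercase_only)` gives the second figure in the table for whatever list of passwords is passed in.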


Ok, that’s enough basic figures for now, here’s a quick summary so far:

Password statistics summary

Password characteristic % of passwords with this property
password IS name of organisation 0.15
password is the username + one digit 0.27
password IS the username 0.35
password CONTAINS the username 0.85
password contains name of organisation 1
password contains user’s last name 1
password contains user’s forename 1.2
is dictionary word + multiple digits 6
is dictionary word 8.3
is dictionary word + digit 12.3
lowercase only passwords 25.3
text only passwords 28.1
contains dictionary word 39.3
contains dictionary word (non case-sensitive) 45.5


Part 3 – finding associations with the apriori algorithm

This is my favourite part. We could look through the data ourselves and try and find rules based on various hypotheses. For example we may try and determine whether people in a certain area have weaker passwords. However, humans are not very good at this. Not only are we slow, but we don’t notice unexpected rules. The unexpected ones are usually the most interesting ones.

This is what unsupervised machine learning algorithms do. You basically feed in your data and the algorithm will look for any clustering or association rules (depending on the type of algorithm) without explicitly being told what it is looking for.

Since we do not have a huge dataset with many attributes, we can use one of the simplest unsupervised algorithms there is – the Apriori algorithm. I won’t go into the details of how it works, but it’s notable for not really using any maths beyond counting. It basically just counts itemsets, makes candidate rules, and then counts the number of occurrences.

Rules generated by the apriori algorithm may be ranked by ‘support’, ‘confidence’ or ‘lift’. We’ll be focusing on ‘confidence’: of all the records that match the left-hand side of a rule, the proportion that also match the right-hand side.
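To make the three measures concrete: for a rule A => B, support is the fraction of all records containing both A and B, confidence is support(A and B) / support(A), and lift is confidence / support(B). The ‘Confidence for RHS only’ column in the tables further down appears to be support(B), the baseline the rule is compared against. Below is a bare-bones Python sketch of the counting involved. It is a toy version for illustration, not the implementation used for this post, and it only builds rules with a single item on the right-hand side.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates, then prune
        # any candidate with a k-item subset that is not frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence, lift) tuples for single-item RHS."""
    found = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            rhs = frozenset([item])
            lhs = itemset - rhs
            confidence = support / frequent[lhs]
            lift = confidence / frequent[rhs]
            if confidence >= min_confidence:
                found.append((lhs, rhs, support, confidence, lift))
    return found
```

Each record is fed in as a set of ‘attribute=value’ strings, for example `{"region_name=Hawaii", "password contains dictionary word=FALSE"}`. A lift above 1 means the right-hand side is more common among records matching the left-hand side than in the data as a whole.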


Password strength by date

After running the algorithm a few times with various parameters tuned, I started to notice something interesting about the weaker passwords. The dataset contains a lot of accounts that are dormant, i.e. they haven’t been signed into for years. If we plot the proportion of passwords that are weak (dictionary matches, lowercase only and text only) against the year the user last signed in, we get this:


% of passwords that are weak, against year the user last signed in

So, it looks like password quality improves with time. The dormant accounts will generally have older passwords. So, we conclude that the users of the website have become better at picking stronger passwords over time. This isn’t that surprising but it is certainly interesting to see the relationship.
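The numbers behind a plot like that are a simple group-by. Here is a minimal Python sketch, assuming each record has already been reduced to a (year of last login, is weak) pair; the function name and the record layout are made up for illustration.

```python
from collections import defaultdict


def weak_share_by_year(records):
    """Return {year: percentage of that year's passwords that are weak}.

    records: iterable of (last_login_year, is_weak) pairs.
    """
    totals = defaultdict(int)
    weak = defaultdict(int)
    for year, is_weak in records:
        totals[year] += 1
        weak[year] += bool(is_weak)
    return {year: 100.0 * weak[year] / totals[year] for year in sorted(totals)}
```

Years with only a handful of logins will give noisy percentages, so it is worth looking at the totals per year before reading too much into any single point.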


Geographic-based rules

Hawaiian Users

After a little more playing around, I discovered rules relating to users from ‘Hawaii’. Here’s what I found:

Rule Confidence for the rule (%) Confidence for RHS only (%) Probability the difference occurred by chance (%)
{region_name=Hawaii} => {password contains dictionary word=FALSE} 71.90% 60.70% <0.1%
{region_name=Hawaii} => {password is dictionary word=FALSE} 96.60% 91.70% <0.1%
{region_name=Hawaii} => {text only password=FALSE} 75.90% 71.80% 18%

The first figure for each rule is its confidence: the share of Hawaiian users for whom the right-hand side holds. The second figure is the share of all users for whom it holds, so comparing the two shows how Hawaii differs from the dataset as a whole. Looking at the first two rules, we see that fewer passwords from users in Hawaii contain dictionary words. The difference is smaller for the bottom rule (and with an 18% probability of occurring by chance, it isn’t statistically significant), suggesting that users from Hawaii use text-only passwords about as often as those from other regions.

Why is this? I’m going to guess that users in Hawaii tend to use Hawaiian words in their passwords. I didn’t even know this was a language before writing this! So if you wanted to do a dictionary attack on Hawaiian users, best modify your dictionary.
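The last column of the table comes from a significance test that isn’t described in this post. One plausible way to get a figure like it is a one-sided exact binomial test: given the baseline rate for all users, how likely is it that a group of this size would show a rate at least as high purely by chance? The sketch below takes that approach, which is an assumption rather than the method actually used, and the user counts in it are invented because the number of Hawaiian users isn’t given.

```python
from math import comb


def binomial_upper_tail(successes, trials, baseline):
    """P(X >= successes) for X ~ Binomial(trials, baseline)."""
    return sum(
        comb(trials, k) * baseline**k * (1 - baseline) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts: the post does not say how many users are in Hawaii.
hawaii_users = 200
without_dictionary_word = 144   # 72% of the hypothetical 200
p_value = binomial_upper_tail(without_dictionary_word, hawaii_users, 0.607)
```

The smaller the group, the larger the gap has to be before the result means anything, which is why the same few percentage points can be significant for one rule and not for another.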


Users with the same username and email tend to have poor passwords

The final rule I’ll present here relates to users who have their username in their email AND have lowercase only passwords:


Rule Confidence for the rule (%) Confidence for RHS only (%) Probability that the difference occurred due to chance (%)
{email contains username, password is lowercase text only} => {password is a dictionary word} 34.20% 8.20% <0.1%

This indicates something not particularly surprising: users who pick poor usernames also tend to pick poor passwords. What is surprising is the prevalence above the ‘background rate’. Only 8.2% of users use a dictionary word for a password, but 34.2% of users with their username in their email and a lowercase-only password do, which is a huge difference. Part of that gap is to be expected, since my dictionary search only matched lowercase passwords in the first place.



While this dataset contained a higher percentage of strong passwords than in other analyses I’ve read, there is still a very high chance that numerous accounts could be accessed by an unauthorised user simply by guessing. The most commonly used words are in keeping with those discussed in other studies.

Unsurprisingly there are a small number of users with extremely vulnerable passwords, using the name of the organisation, or their email address as a password.

The rules found by the Apriori algorithm are a little more surprising; it would be interesting to see whether similar rules hold in other password leaks. It is encouraging to see that password strength appears to have increased over time.

The most important thing to take away, however, is how easily I found this dataset. Every time you sign up to anything online, you are trusting the administrators on the other side with your details. This is why you should ALWAYS use different passwords for different websites. It’s laborious and difficult, but it’s the only way to keep your personal information safe.


