Leaked Password Analysis – 2011-06 Edition
As most of you likely know, several months ago saw a shift in how a certain type of attack was being done on the Internet. Instead of breaking into a website and simply stealing information, people began breaking into sites to steal information and then release it publicly on the Internet. It is not my intent to discuss the choice of targets or the motivations of these groups. Others have written plenty on this topic and really, if you’re not working for a target or one of the attackers, anything you can say about their motivations is likely to be guesswork at best.
Instead, I want to talk about the passwords. I’ve been following these leaks and collecting password information. My goal is not to break into people’s accounts or to discuss whether or not the leaked data supports the claims of either side. I only have one goal in doing this. I want to find out what I can about people and passwords so I can help everyone choose better ones. So here is my initial analysis. If time permits, I hope to come back to this and do the analysis with more rigor and dive more deeply. However, since my initial rough analysis is done, I wanted to share my preliminary findings. I think they’re interesting, so I hope that you will as well.
I’ve combed leaked data for all the cleartext passwords I could fine. Realistically, this means that the passwords I’ve analyzed here fall into two categories. The first category is passwords that were stored unencrypted or very weakly. The second is passwords that were weak to begin with and were easily cracked by those who released the data sets or analyzed them later. So, the important takeaway here is that this is not and analysis of typical passwords used on the Internet. This is an analysis of bad passwords used on the Internet combined with passwords that were stored poorly. Still, since I want to learn what not to do, this seems like a worthy use of my time.
The data set exceeded half a million passwords… but likely involved some duplicate records. I hope to tighten up the analysis in my next go-around.
Everyone starts these analysis with a list of the most common passwords. I do not wish to disappoint, so here is what I found.
So what can we learn from this? First of all, note the number of passwords that are just numbers. 123456, 12345678, 12345, 111111, 1234, 1234567, and 123456789 were seven of the top 20 bad passwords. This is ridiculous. Who on earth thinks this is a good idea? A lot of people, apparently.
Second, notice the surprisingly large number of people who thought that trustno1, baseball and superman were good choices. Perhaps choosing passwords based on popular culture is unwise.
I then looked at the average password length. There’s not much of a surprise here, but here’s the graph if you’re interested:
What I found most interesting was how relatively few passwords were seven characters long. I expected six and eight to be large, but not for seven to be so short. Also, note how quickly it drops off after 8. Nine characters and up are ridiculously small.
This is where things get interesting. We have been talking for years about how people should use a mix of lower case, upper case, numbers and symbols in their passwords. I don’t want to bore you with math, but the reason is that the more characters you have to pick from, the longer it’s going to take to guess the password. If, for example, your password is one character long, if you use a lowercase letter and the attacker tries those first, it will only take 26 tries to get it. If you use a character from any of these sets, it will take 26 (lower case) + 26 (upper case) + 10 (numbers) + 32 (symbols) = 94 tries. If your password is longer, then it will be increasingly harder.
Let’s use a few pictures to make this easier to talk about.
This is what we’d like to think people are doing. We know that not everyone is following our advice, but at a guess, we’d expect there to be a reasonable mix of people doing it the right way and some overlaps within the other spaces.
Our ideal, of course, would be to widen the overlapping space. This way, more people are using more complex passwords and would be safer.
… and this is where we actually are today. The spaces aren’t the same size, which isn’t terribly surprising I guess. However, I didn’t expect not only for the special characters space to be so small, but I also didn’t expect the overlap to be so tiny. In fact, of the 519,229 I analyzed, only 315 had a mix of lower case letters, upper case letters, numbers and special characters. No wonder they got hacked. This means that 0.06% of all the passwords were considered minimally secure.
Really… is it so hard to add an exclamation point or question mark in there somewhere? Here, I’ll even give you some you can use. I mean, really!?!?!?!?!?!?
Other Metrics of Interest
When I compared the list of passwords to itself and weeded out the duplicates, I found that 65.71% of the passwords overlapped. I must say, folks are just not as creative as I had hoped.
For those that follow math, the average entropy score of the password set was 29.63. I hope to make a neat graph comparing entropy to things like length and commonality, but will apparently have to get more proficient with better graphing tools first. My existing tools found graphing 500,000+ data points somewhat challenging. :)
When I ran the list of passwords against the standard Linux word list, I got 85,196 hits out of 178,049 unique passwords. That’s a 47.85% rate of people that aren’t even trying. Again, we’re talking about the easily-cracked passwords, so this number is inflated… but it’s still much too high.
Surprisingly, I did not see many passwords that were just dates. Those stories of people using their kids’ birthdays as passwords seem to have been exaggerated… or perhaps people today don’t care about their kids very much. :)
So What Do We Do?
Given that this was a set of easily broken passwords, the key things to do to prevent your password from being broken is to make them not fit these patterns. This means:
- Use a mix of lower case letters, upper case letters, numbers and special characters. Use at least one of each.
- Make your passwords longer than eight characters. To lay outside of this data set, 10 would be fine. Personally, I’m going up to 16. After all, if you can remember an eight character password, you should be able to remember two of them stuck together.
- Avoid basing your password on popular culture, sequences of numbers (or keys on the keyboard) or sports. Those passwords are much more common than you’d think.
That’s it. If you do these three steps, you’ll be well outside of this data set and therefore, much less likely to get your password stolen. Of course, the one thing I couldn’t measure was how much these passwords are shared between accounts of the same person. The 65.71% overlap rate suggests that there is a lot of this going on, but I can’t prove it. Still, it’d be a good idea not to do that.
Do these suggestions sound familiar? They should. If you’re still not following them, maybe you should. We don’t suggest them to be annoying or to help protect against some amorphous threat in the future. We suggest it because if you don’t follow these rules, you will be hacked.
We’ve just seen it happen.
Over half a million times in the last six months.