Humans are Insecure Password Generators

Published on May 21, 2025 5:58 AM GMT

And They're Currently Being Cracked

(A crosspost of Humans are Insecure Password Generators)

 

Any password longer than 12 characters or so is invulnerable to brute force attacks. So why do so many longer ones keep getting breached?

Because humans don't pick passwords randomly; we tend to pick ones that are easy to think of and easy to remember. This tendency can be exploited; attackers can identify passwords that are more likely than others and try those first.

The best way to think about this is that humans are picking passwords from some underlying distribution; a distribution that is not uniform across all possible strings of characters.[1]

Any time you're trying to find a specific value from a non-uniform distribution, the optimal strategy is to try the most likely values first. If you're trying to guess the first digit of a number, you should guess "1" first, not "9".[2] Passwords are no different; a traditional brute force attack that tries each string in alphabetical order is horribly inefficient, because it treats the space as uniform when it is not. The ideal attack checks the most likely passwords first.

Today's hackers can already do something similar. One of the best password cracking techniques today is a dictionary attack: you take known passwords that have been released in previous security breaches and test them against other people's accounts. This is, effectively, taking thousands or millions of samples from the underlying "humans picking passwords" distribution. But this is still only a tiny fraction of the passwords that humans actually use; vast numbers will be inaccessible with this technique.
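To make the mechanics concrete, here is a minimal sketch of a dictionary attack against a list of leaked, unsalted SHA-256 hashes. (The file names and the unsalted-hash assumption are mine, purely for illustration; real tools like Hashcat are far more sophisticated.)

```python
import hashlib

# Hypothetical inputs: a wordlist of previously breached passwords, and a set
# of password hashes leaked from some other service (unsalted SHA-256 here
# only to keep the sketch short).
with open("breached_passwords.txt", encoding="utf-8", errors="ignore") as f:
    wordlist = [line.strip() for line in f if line.strip()]

with open("leaked_hashes.txt") as f:
    targets = {line.strip().lower() for line in f if line.strip()}

cracked = {}
for candidate in wordlist:  # each entry is a sample from the "humans picking passwords" distribution
    digest = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
    if digest in targets:
        cracked[digest] = candidate

print(f"Recovered {len(cracked)} of {len(targets)} hashes")
```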

Clever attackers can do a bit better than this by identifying common password-mangling patterns like "replace S with $" and writing a program to try those variations. This lets them generalize beyond the specific examples they've seen and estimate other, unknown areas of the distribution. But this is still highly imperfect and will miss many things.
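A rule-based mangler of this kind takes only a few lines; the substitution table below is illustrative, not the rule set of any particular tool.

```python
from itertools import product

# Common "leetspeak" substitutions. Each base word is expanded into every
# combination of replaced / unreplaced characters, plus a few common suffixes.
SUBS = {
    "a": ["a", "@", "4"],
    "e": ["e", "3"],
    "i": ["i", "1", "!"],
    "o": ["o", "0"],
    "s": ["s", "$", "5"],
}

def mangle(word: str):
    choices = [SUBS.get(ch.lower(), [ch]) for ch in word]
    for combo in product(*choices):
        variant = "".join(combo)
        yield variant
        yield variant.capitalize()
        yield variant + "1"   # trailing digit, another common habit
        yield variant + "!"

print(sorted(set(mangle("password")))[:10])
```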


 

Enter neural networks. A neural network takes in millions or billions of examples, distills them down into a probability distribution, and then outputs new things from that distribution. This is exactly what is needed to crack passwords. Neural networks allow an attacker to brute-force the efficient way: by going over the entire probability distribution, from most likely to least likely.
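A real neural model is beyond the scope of this post, but even a character-bigram Markov chain (a crude stand-in I'm using here, not PassGAN itself) shows the basic move: fit a distribution to leaked passwords, then generate fresh guesses weighted by how likely a human is to have picked them.

```python
import random
from collections import Counter, defaultdict

def fit(passwords):
    """Count character-bigram transitions, with start (^) and end ($) markers."""
    counts = defaultdict(Counter)
    for pw in passwords:
        chars = ["^"] + list(pw) + ["$"]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts

def sample(counts, max_len=20):
    """Draw one candidate password from the fitted distribution."""
    out, prev = [], "^"
    while len(out) < max_len:
        chars, weights = zip(*counts[prev].items())
        nxt = random.choices(chars, weights=weights)[0]
        if nxt == "$":
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

# Toy training set standing in for the millions of breached passwords a real
# attacker would use.
model = fit(["password1", "letmein", "dragon99", "iloveyou", "sunshine"])
print([sample(model) for _ in range(5)])
```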

This idea isn't new. In 2017, researchers released PassGAN, a deep learning system that does exactly this.[3] It had a... less than stellar reception from the security community. Ars Technica called it "mostly hype" and "mediocre performance dressed up as something to worry about", pointing out that it wasn't significantly better than other existing password crackers.

But this completely misses the point; the makers of PassGAN weren't trying to break into people's accounts. Quite the opposite: they were security researchers demonstrating a new attack vector.[4] Proofs of concept exist to prove that a concept is viable, not to actually put it into practice. Work on the subject has progressed slowly but surely since then; a 2021 paper improved on PassGAN's results, and a 2022 attempt did even better, doubling the 2021 paper's accuracy.

The real advance will only arrive once a bad actor brings such a method into the age of big data. We've observed robust scaling laws for how LLM performance improves with larger amounts of compute, and there's no reason to expect that this won't apply to password generation just as it has to everything else. PassGAN was trained on a single GPU for a few hours with a dataset of 23,679,744 known passwords. Compare this to the tens of thousands of GPUs used over the course of several months to train frontier AI chatbots, and the hundreds of millions of passwords available in other data breaches, and there are clearly large improvements on the table for anyone willing to commit enough resources to the task.

Of course, pure password cracking of this form is of limited utility. It can be applied to any hash that was released in a data breach, increasing the risk that the source password becomes known. In the worst case, an advanced model would effectively be able to invert most password hashes, defeating the protection they provide. This would be quite bad (though note that existing crackers can already reverse more than 50% of all hashed passwords), but it still requires that the parent organization's security be breached, so most users of most services would still be safe. The jackpot would be guessing accounts with unknown hashes, but attacking "clean" accounts almost always requires going through an online API with rate limits. Even a program drawing from the theoretically optimal probability distribution wouldn't be able to guess the median human's password in fewer than 10 attempts.[5]


 

The scarier risk arises from personalized guesses. Most of us have quite a lot of information about ourselves online: our name, hobbies, friends, location, workplace, and much more. Feeding all of this into an AI would allow it to build a distribution for that particular person rather than for all people, which can be many orders of magnitude more accurate. (Consider: existing password cracking algorithms are smart enough to check for birthdays in passwords, but there are around 20,000 different birthdays people can have, and the cracker has to try them all. A personalized algorithm would only need to check one.) This sort of thing is already being done by password recovery services with the help of the account owner, and the large-scale data gathering done by many companies shows that the subject's consent is not really needed for stuff like this.
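As a toy illustration (the facts and combination rules below are mine, not drawn from any real cracking tool), personalization just means seeding candidate generation with what is publicly known about the target:

```python
from itertools import permutations

# Hypothetical publicly scraped facts about one person.
facts = {"name": "alice", "pet": "whiskers", "team": "arsenal", "birth_year": "1987"}

def personalized_candidates(facts):
    tokens = list(facts.values())
    for r in (1, 2):
        for combo in permutations(tokens, r):
            base = "".join(combo)
            yield base
            yield base.capitalize()
            yield base + facts["birth_year"][-2:]   # "alice87"-style suffix
            yield base.capitalize() + "!"

# Deduplicate while preserving the (rough) likelihood ordering.
candidates = list(dict.fromkeys(personalized_candidates(facts)))
print(len(candidates), candidates[:8])
```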

The fact is, when a human thinks up a password for themselves, this is a fundamentally insecure process. We have no cryptographic guarantees about the human brain like we do about a carefully designed computer algorithm, and mounting evidence shows that humans are in fact quite predictable. One 2019 study found that, using a neural network, researchers were able to guess almost half of people's passwords in under 1,000 attempts just by knowing a single one of that person's other passwords. It's only going to get worse from here.


 

But I think this is actually a good thing.

Traditional password cracking tools could be created by one smart programmer, meaning there was no inherent advantage for the defenders; the bad guys can come up with new cracking methods just as easily as the good guys can warn people about them. But it's pretty hard for a criminal organization to rent 10,000 GPUs in a datacenter without being noticed. And the current largest store of real life password data is probably the white-hat website HaveIBeenPwned, with over 5.5 billion breached passwords in its (private) database.

This presents an opportunity for prosocial organizations to gain the upper hand. They could build private models that warn users with potentially guessable passwords before criminals have an opportunity to gain the same capabilities. (And the criminals will gain these capabilities soon enough.) This would allow us to move away from the current model of urgently notifying users that a hash of their password has been released on the dark web, forcing password resets, and getting a lot of hacked accounts anyway. Instead, we'd be able to warn users of the weakness of their passwords before they're ever created. (And do this in a reliable, consistent way, not by using easily-defeated kludges like "if it doesn't contain a special character it's unsafe".) In a world where everyone is using secure passwords, it won't even matter if the hashes are stolen![6]
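Model-based strength scoring of that kind doesn't exist off the shelf, but the breach-lookup half of the picture already does. For instance, HaveIBeenPwned's k-anonymity range API lets a signup form check a candidate password against the breach corpus without ever sending the password itself; the sketch below shows one way that check might look (error handling and rate-limit etiquette omitted).

```python
import hashlib
import urllib.request

def breach_count(password: str) -> int:
    """Return how many times a password appears in HaveIBeenPwned's breach
    corpus, via the k-anonymity range API: only the first 5 hex characters of
    the SHA-1 hash ever leave the machine."""
    sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = sha1[:5], sha1[5:]
    req = urllib.request.Request(
        f"https://api.pwnedpasswords.com/range/{prefix}",
        headers={"User-Agent": "password-strength-check-example"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp.read().decode().splitlines():
            candidate, _, count = line.partition(":")
            if candidate == suffix:
                return int(count)
    return 0

print(breach_count("password123"))  # a very large number: never use this one
```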

The end state of this arms race is simple: people need to use random passwords. It is fundamentally unsafe to entrust your digital security to the hope that your thought process is sufficiently inscrutable that no possible technology could figure it out; a hope that is consistently being dashed.

As technology inexorably improves, the "let people pick their own passwords" model becomes less and less viable. We need to encourage a transition away from that model as soon as we can, and to minimize harm in the meantime. The status quo taunts users with the possibility of having an easily-remembered-and-secure password, but the majority of them will fail at this. Better to go all the way and get rid of the idea entirely. Choosing one's own passwords should be regarded as foolish for anyone but experts, akin to representing yourself in court or installing a high-voltage electrical appliance.

The safe solution is already known: get a password manager and let it pick your passwords for you. Use a fully random passphrase generator for the master password, or if you absolutely can't remember a passphrase, write it down on a piece of paper and keep that paper in a secure place. Any method that involves your brain is simply not going to remain secure for much longer.
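For the master passphrase itself, Python's standard library is enough to do this properly. (The wordlist path below is a placeholder; any diceware-style list works, such as the EFF's freely available large wordlist.)

```python
import secrets

# Placeholder path: any diceware-style wordlist with one word per line works,
# e.g. the EFF large wordlist (7,776 words, about 12.9 bits of entropy each).
with open("wordlist.txt", encoding="utf-8") as f:
    words = [w.strip() for w in f if w.strip()]

# Six words drawn with a cryptographically secure RNG from a 7,776-word list
# gives roughly 77 bits of entropy, far beyond any guessing attack.
passphrase = " ".join(secrets.choice(words) for _ in range(6))
print(passphrase)
```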

  1. ^

    Of course the exact distribution will vary from person to person; someone with a cat is more likely to choose a cat-related password. But without knowing anything about the person, there's also a global distribution across all humans that's equally valid.

  2. ^

    Another neat example of this principle: when searching an ordered set of objects (so binary search applies), in many practical cases you do not want to pick the midpoint at each iteration of the search, as many computer science 101 courses will tell you to. You actually want to first estimate a probability distribution over where your target element is, and then divide the probability mass in two with each iteration. This can significantly speed up the search whenever the target's location is not uniformly distributed.

  3. ^

    It actually wasn't the first; this paper from 2016 beat it to the punch. But they didn't give the AI a catchy name, and it didn't get much attention. (See also probabilistic context-free grammars from 2009, a neural-network-powered password strength checker from 2006, etc.)

  4. ^

    Ok, they were also doing a publicity stunt for their company.

  5. ^

    I can guarantee this is true without needing to know anything about the ceiling of theoretical AI capabilities simply because there are enough passwords in use that the most common 10 of them together don't account for 50% of all accounts.

  6. ^

    As long as they're salted, not using a broken hashing algorithm, etc.


