The largest collection of usernames and passwords has been leaked… or not (I)

ElevenPaths    28 January, 2019
Every so often, someone releases, by mistake or not, an enormous set of text files containing millions of passwords: an almost endless list of e-mail accounts with their passwords or the equivalent hashes. Headlines then appear again and again in the media: “Millions of passwords have been leaked…”. Even when such a headline is not false, it can be misleading. Here we are talking about the latest massive leak, named “Collection #1”.

We have analyzed this huge leak. Beyond the “Collection #1” that reached the media, we obtained a superset with more than 600 GB of passwords. It is so large that during our analysis we counted more than 12,000,000,000 unfiltered username and password combinations. It is an astronomical figure. However, the important point is that these are raw data. What remains of interest once the data have been cleaned? We must bear in mind that a leak of a leak is not a leak. If months or years ago someone extracted and published the database of a given website, that is a leak. If someone later concatenates that file with others and publishes the result, it is not a new leak: they are simply making their particular collection of leaks available on the Internet.

Demystifying the leak: Repetitions

Repetitions fall into two types:

  • Occurrence of the same account and password
  • Finding the same account but with a different password 

In both cases, the cause may simply be the reuse of an e-mail account and password across multiple sites, which surfaces when different leaked databases are merged. In both cases (regardless of whether the credentials are valid or out of context) we can reduce the amount of truly distinct data. A quick look at these 600 GB reveals many repeated accounts. Although this information may be valid, the repetitions lower the number of users who may actually be affected.
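To make the two types of repetition concrete, here is a minimal sketch of how they could be counted. It is not the tooling we used: the function name is hypothetical, it assumes one `email:password` entry per line, and it keeps everything in memory, so it only suits a small extract and not the full 600 GB.

```python
from collections import defaultdict


def classify_repetitions(lines):
    """Count exact duplicates and accounts that appear with several passwords.

    Assumes each line looks like 'email:password' (an assumption about the
    file format, not something guaranteed for every file in the leak).
    """
    passwords_by_account = defaultdict(set)
    total = 0
    for line in lines:
        line = line.strip()
        if ":" not in line:
            continue  # skip lines that do not match the assumed format
        email, password = line.split(":", 1)
        passwords_by_account[email.lower()].add(password)
        total += 1

    unique_pairs = sum(len(pwds) for pwds in passwords_by_account.values())
    return {
        "total_lines": total,
        "unique_accounts": len(passwords_by_account),
        # Type 1: same account and same password seen more than once.
        "exact_duplicates": total - unique_pairs,
        # Type 2: same account seen with different passwords.
        "accounts_with_several_passwords": sum(
            1 for pwds in passwords_by_account.values() if len(pwds) > 1
        ),
    }
```

For a collection of this size, the same idea would need an external sort or a database instead of an in-memory dictionary.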

Data expiration
How valuable is a 6-month-old leak? What about a 5-year-old one? And a 10-year-old one? Obtaining an e-mail account and its password does not mean having permanent access to the secrets behind the authentication process. These data lose value every single day. In general, this kind of data is like fish: it must be eaten fresh, otherwise it rots very quickly. Whoever gains access to an account with valid credentials has a limited time frame before the owner is alerted and changes the password, or before the service itself detects the leak and disables or preventively deletes the account.
This tight time frame, or access lifetime, determines the account’s initial value (later, other properties come into play, such as the domain it belongs to or, better still, who its owner is). Afterwards, the e-mail address and credentials are useful only for trying one’s luck on other services, or for sending spam and committing other frauds; but that is another matter.
We performed a simple test. We concatenated all the files containing e-mails within the megaleak and removed all the passwords. The result: a “todos.txt” file of around 200 GB. From it, we selected a group of accounts on a pseudorandom basis (as random as mathematics and system generators allow):
[Image: command used to extract the sample]
The ‘0.0001’ value extracts a minimal sample, yet it still amounts to more than a thousand e-mail accounts. Moreover, “salida.txt” has been filtered to remove e-mails with non-existent domains, duplicates, and servers that do not allow an account to be verified through VRFY (an SMTP command).
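The exact command is not reproduced in the text, so the following is only a sketch of one way to draw such a sample: each line is kept independently with probability 0.0001. The function names and the `seed` parameter are hypothetical, as is the assumption that “todos.txt” holds one address per line.

```python
import random


def sample_lines(lines, probability=0.0001, seed=None):
    """Yield each line independently with the given probability.

    The input is streamed, so a very large file is never loaded in memory.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    for line in lines:
        if rng.random() < probability:
            yield line


def write_sample(src_path, dst_path, probability=0.0001, seed=None):
    """Stream src_path and write the sampled lines to dst_path."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(sample_lines(src, probability, seed))
```

A call such as `write_sample("todos.txt", "salida.txt")` would produce the raw sample; removing non-existent domains, duplicates and servers without VRFY support would be a separate step.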
Based on that sample of more than a thousand e-mails, we verified whether the addresses exist. The result: 9.8% no longer exist or never existed on that domain. In other words, nearly 10% of the supposedly “working” e-mail addresses are not available on their corresponding e-mail servers. We dare say that this result can be extrapolated to the 12,000,000,000 combinations mentioned above. And all this without considering that in many cases the passwords are not even valid.
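As a rough check on how far a sample of this size can be extrapolated, the sketch below computes a normal-approximation 95% interval for the observed proportion. The sample size of exactly 1,000 is an assumption (the text only says “more than a thousand”), and the calculation presumes the sample is representative of the whole collection.

```python
import math


def proportion_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion.

    p_hat: observed proportion (0.098 for the 9.8% reported above)
    n:     sample size (assumed here, not stated exactly in the text)
    z:     1.96 corresponds to a 95% confidence level
    """
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


low, high = proportion_interval(0.098, 1000)
print(f"Between {low:.1%} and {high:.1%} of addresses would not exist")
```

Under these assumptions the interval runs from roughly 8% to 12%, so “nearly 10%” is a reasonable summary, although the sample was drawn only from servers that allow verification.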
Fictitious data?
Let’s look at some entries. Note the domains that do not exist or never existed, since they are not top-level domains listed by IANA.
This is only an illustrative example. There are thousands of non-existent TLDs across the multiple files that make up the leak.
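A check of this kind can be sketched as follows. The function names are hypothetical, and `valid_tlds` is assumed to have been loaded beforehand, for example from the list of top-level domains that IANA publishes. Internationalized TLDs would need extra handling that this sketch leaves out.

```python
def extract_tld(email):
    """Return the lowercased TLD of an e-mail address, or None if malformed."""
    _, separator, domain = email.strip().rpartition("@")
    if not separator or "." not in domain:
        return None
    return domain.rsplit(".", 1)[1].lower()


def split_by_tld(emails, valid_tlds):
    """Separate addresses whose TLD is in valid_tlds from all the others."""
    valid = {tld.lower() for tld in valid_tlds}
    known, unknown = [], []
    for email in emails:
        tld = extract_tld(email)
        (known if tld in valid else unknown).append(email)
    return known, unknown
```

Addresses that end up in the second list either are malformed or use a TLD that is not in the reference set, which is the symptom described above.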
Another suspicious example is the content of some of the files themselves. Let’s examine it:
[Image: excerpt from one of the files, with the data hidden behind a grey rectangle]
The grey rectangle we have placed to avoid exposing the data may be misleading, but the image shows a list in which every [email]:[password] string is exactly 32 characters long; no more, no less. Whatever the individual lengths of the e-mail and the password, every line has the same total size and forms a suspiciously perfect column. The attacker may have sorted them, but it is curious in any event, since this is not the only file containing thousands of entries of exactly the same length. Within the leak there are other files where the string length is greater or smaller, but always homogeneous. We cannot think of any practical use for lists of same-length e-mail and password strings. Might we assert that they were generated this way by some artificial means?
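Files with this property are easy to flag automatically. The sketch below builds a histogram of line lengths and reports a file as suspicious when every line has the same length. The function names and the `min_lines` threshold are hypothetical choices for illustration, not values taken from our analysis.

```python
from collections import Counter


def line_length_profile(lines):
    """Count how many lines have each length, ignoring line terminators."""
    return Counter(len(line.rstrip("\r\n")) for line in lines)


def is_suspiciously_uniform(lines, min_lines=1000):
    """True if there are at least min_lines lines and all share one length.

    min_lines avoids flagging tiny files where equal lengths are plausible.
    """
    profile = line_length_profile(lines)
    total = sum(profile.values())
    return total >= min_lines and len(profile) == 1
```

Run over each file of the collection, a check like this would list the candidates worth inspecting by hand; a sorted file would still show several lengths, whereas the files described above show only one.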
So, is it serious?
In theory, a number of factors would have to be validated; but with 12,000,000,000 combinations, the operation is complex, to say the least. From these samples and examples alone, we venture to assert that this collection is a valuable set of data, not in terms of the loss of users’ privacy, but as a dictionary of accounts.
We think that concluding that “a leaked account amounts to access to someone’s e-mail or data” is reckless reasoning. The number of useful accounts is much smaller, because they have expired or simply never existed. We think that the leak contains out-of-date or unverified information and that it has even been artificially enlarged.
In any case, the upside of these announcements is that they lead a small proportion of the general public to change their passwords, an even smaller proportion to start using a password manager, and just a few to enable a second authentication factor. It’s better than nothing.
In the second part we will look at more curiosities about this huge file.
David García
Innovation and Labs (ElevenPaths)
