We have analyzed this huge leak. Beyond the “Collection #1” that has reached the media, we obtained a superset containing more than 600 GB of passwords. It is so large that during our analysis we counted more than 12,000,000,000 raw, unfiltered username and password combinations. That is an astronomical figure. The important point, however, is that the data is raw. What remains of interest once it has been cleaned? We must bear in mind that a leak of a leak is not a new leak. If months or years ago someone leaked the database of a given website, that is a leak. Conversely, if someone concatenates that file with others and publishes the result, it is not a new leak: they are simply making their particular collection of leaks available on the Internet.
Repetitions fall into two types:
- The same account appears with the same password
- The same account appears with a different password
In both cases, the repetition may simply reflect the reuse of an e-mail account and password across multiple sites, which surfaces when different leaked databases are merged. In both cases (regardless of whether the credentials are valid or out of context) the amount of genuinely distinct data shrinks. A quick glance at these 600 GB of information shows a lot of repeated accounts. Although this information may be valid, the repetition lowers the number of users who may be affected.
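To illustrate how the two kinds of repetition reduce the distinct data, here is a minimal Python sketch. It is not the tooling used in our analysis. It assumes each row has the form `account:password` and that accounts compare case-insensitively. Both are assumptions for illustration, as are the function name `summarize_credentials` and the sample rows.

```python
from collections import defaultdict


def summarize_credentials(lines):
    """Count raw rows, exact repeats, and distinct accounts.

    Assumes (hypothetically) one 'account:password' pair per line.
    """
    passwords_by_account = defaultdict(set)
    raw_rows = 0
    for line in lines:
        line = line.strip()
        if ":" not in line:
            continue  # skip rows that do not match the assumed format
        account, password = line.split(":", 1)
        raw_rows += 1
        # Assumption: account names are compared case-insensitively.
        passwords_by_account[account.lower()].add(password)

    distinct_pairs = sum(len(p) for p in passwords_by_account.values())
    return {
        "raw_rows": raw_rows,
        # Type 1: same account, same password.
        "exact_duplicates": raw_rows - distinct_pairs,
        "distinct_accounts": len(passwords_by_account),
        # Type 2: same account, different passwords.
        "accounts_with_several_passwords": sum(
            1 for p in passwords_by_account.values() if len(p) > 1
        ),
    }


sample = [
    "alice@example.com:hunter2",
    "alice@example.com:hunter2",
    "alice@example.com:letmein",
    "bob@example.com:qwerty",
]
print(summarize_credentials(sample))
```

In this made-up sample, four raw rows collapse to two distinct accounts. At the scale of hundreds of gigabytes an in-memory dictionary would not be practical, so a real count would more likely rely on external sorting or a database.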

We think it is reckless to conclude that “a leaked account amounts to access to someone’s e-mail or data”. The number of usable accounts is much smaller, because many have expired or simply never existed. We believe the leak contains out-of-date or unverified information and, even so, has been artificially inflated.