In the early 20th century, when calculators, computers and smartphones did not yet exist, scientists and engineers relied on tables of logarithms compiled in thick volumes for their calculations.
For example, a shortcut for multiplying two large numbers is to look up their logarithms in the tables, add them together (adding is easier than multiplying, isn’t it?) and then look up the anti-logarithm of the result in the tables.
In the 1930s, physicist Frank Benford worked as a researcher at General Electric. One day, Benford noticed that the first pages of the logarithm books were more worn than the last ones. This mystery could only have one explanation: his colleagues were looking for numbers starting with smaller digits more often than those starting with larger digits. 
As a good scientist, he asked himself: why did he and his colleagues look up numbers starting with small digits more often in their work? Intuitively, we assume the first digit of any number follows a uniform distribution, i.e. the probability of a number starting with 1, 2, 3, … up to 9 should be the same and equal to 1/9 ≈ 11.111…%. But no!
Frequency of digit occurrence
Benford was puzzled to see how the frequency of the first digit in the numbers of many natural phenomena follows a logarithmic distribution. Intrigued by this discovery, he sampled data from various sources (from river lengths to population censuses) and observed that the probability of the first digit of any number being equal to d is given by the following logarithmic law:
P(d) = log(d + 1) − log(d) = log((d + 1)/d) = log(1 + 1/d)

where log denotes the base-10 logarithm.
The following table lists all the values of P(d) for d from 1 to 9:

| d | P(d) |
|---|-------|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
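These probabilities follow directly from the formula; a quick Python sketch reproduces the whole distribution:

```python
import math

# Benford's law: P(d) = log10(1 + 1/d) for the first digit d.
for d in range(1, 10):
    p = math.log10(1 + 1 / d)
    print(f"P({d}) = {p:.1%}")
```

Note that the nine probabilities sum exactly to 1, since the sum of log10((d + 1)/d) from d = 1 to 9 telescopes to log10(10).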
On the Testing Benford’s Law page you will find numerous examples of datasets that follow this law, such as the number of followers on Twitter or the user reputation on Stack Overflow.
Why digits form this distribution
The explanation of why digits form this distribution is (relatively) simple. Look at the following logarithmic scale bar. If you pick random points on this bar, 30.1% of the values will fall between 1 and 2; 17.6% will fall between 2 and 3; and so on, until you find that only 4.6% of the values will fall between 9 and 10.
Therefore, in a numerical series following a logarithmic distribution, there will be more numbers starting with 1 than with any higher digit (2, 3, …), more numbers starting with 2 than with any higher digit (3, 4, …), and so forth.
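You can watch this happen with any multiplicative process. The sketch below is an illustrative assumption, not data from the article: a hypothetical quantity growing 1% per step sweeps repeatedly along that logarithmic bar, and its leading digits settle onto Benford's frequencies:

```python
import math
from collections import Counter

def first_digit(x: float) -> int:
    # The fractional part of log10(x) places x in [1, 10);
    # truncating gives the leading digit.
    return int(10 ** (math.log10(x) % 1))

# A hypothetical quantity growing 1% per step
# (10,000 steps span about 43 orders of magnitude).
values = [1.01 ** k for k in range(10_000)]
counts = Counter(first_digit(v) for v in values)

for d in range(1, 10):
    observed = counts[d] / len(values)
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:6.1%}   Benford {expected:6.1%}")
```

Any fixed growth rate would do; the point is that multiplicative growth spends more "time" in the stretch between 1 and 2 than between 9 and 10.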
But we are not going to stop here, are we? The next interesting question that arises is: how can one identify data sets that normally conform to Benford’s law?
To understand the answer, we need to travel with our imagination to two very different countries: Mediocristan and Extremistan.
In Extremistan, Benford’s law rules
If you line up all the employees in your organisation and measure their heights, you will get a normal distribution: most people will be of average height; a few will be rather tall and a few rather short; and a couple of people will be very tall or very short.
If an employee arrives late to the measurement session, when we add his or her height to the rest, it will not significantly alter the group average, regardless of how tall or short he or she is. If instead of measuring height you record weight or calories consumed each day or shoe size, you will get similar results. In all cases, you will get a curve similar to the following one.
Now that you have them all together, you could write down the wealth of each one. What a difference! Now the majority will have rather meagre total capital, a much smaller group will have accumulated decent capital, a small group will have a small fortune and a very few will enjoy outrageous fortunes.
And if the CEO arrives late and we add his wealth to that of the group, the impact on the average is likely to be brutal. If you measure the number of Instagram followers of your colleagues and there is a celebrity among them, you will get similar results. Graphically represented, all these results will have a shape similar to the following.
As you can see, not all random distributions are the same. In fact, there is great variety among them. We could group them into two broad categories: those following (approximately) normal distributions and those following (approximately) power-law distributions.
Nassim Nicholas Taleb describes them very graphically in his famous book The Black Swan as two countries:
- Mediocristan, where individual events do not contribute much when considered one at a time, but only collectively.
- Extremistan, where inequalities are such that a single observation can disproportionately influence the total.
So to answer the question of which data sets fit Benford’s law, we are clearly talking about data in the country of Extremistan: large data sets comprising multiple orders of magnitude in values and exhibiting scale invariance.
The latter concept means that you can measure your data using a range of different scales: feet/metres, euros/dollars, gallons/millilitres, etc. If the digit Frequency Law is true, it must be true for all scales. There is no reason why only one scale of measurement, the one you happen to choose, should be correct.
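This is easy to check empirically. In the sketch below, the dataset is a synthetic stand-in (a geometric series, assumed to be Benford-like); converting it from metres to feet leaves the first-digit frequencies essentially untouched:

```python
import math
from collections import Counter

def first_digit(x: float) -> int:
    # Fractional part of log10 maps x into [1, 10); truncate for the digit.
    return int(10 ** (math.log10(x) % 1))

def digit_freq(data):
    counts = Counter(first_digit(x) for x in data)
    return {d: counts[d] / len(data) for d in range(1, 10)}

# Synthetic Benford-like sample (a geometric series spanning ~15 decades).
sample = [1.007 ** k for k in range(5_000)]

metres = digit_freq(sample)
feet = digit_freq([x * 3.28084 for x in sample])  # same data, new unit

for d in range(1, 10):
    print(f"{d}: metres {metres[d]:.3f}   feet {feet[d]:.3f}")
```

Rescaling multiplies every value by a constant, which merely shifts the data along the logarithmic bar; the Benford frequencies are the only first-digit distribution left unchanged by that shift.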
A couple of additional restrictions for a dataset to follow Benford’s Law: it should consist of positive numbers, have no built-in minimum or maximum values, not be composed of assigned numbers (such as telephone numbers or postcodes), and ideally be transactional data (sales, refunds, etc.). Under these conditions, it is possible, but not guaranteed, that the dataset follows this law.
OK, so you have a dataset that is perfectly in line with Benford’s law. What good does it do you? Well, it is useful, for example, to detect fraud, manipulation and network attacks. Let’s see how.
How to apply Benford’s Law to fight cybercrime
The pioneer of applying the law to fraud detection was Mark Nigrini, who recounts in his book Benford’s Law: Applications for Forensic Accounting, Auditing, and Fraud Detection a multitude of fascinating examples of how he caught fraudsters and scammers.
Nigrini explains, for example, that many aspects of financial accounts follow Benford’s Law, such as:
- Expense claims.
- Credit card transactions.
- Customer balances.
- Journal entries.
- Stock prices.
- Inventory prices.
- Customer refunds.
- And so on.
He proposes special tests, which he calls digital analysis, to detect fraudulent or erroneous data that deviate from the law when they have been fabricated. I found it particularly revealing how he unmasks Ponzi schemes such as the Madoff scam: the fabricated financial results did not follow Benford’s Law and set off all the alarm bells.
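Nigrini’s digital analysis comprises a whole battery of tests, but its core first-digit test can be sketched as a chi-square comparison against the Benford frequencies. The code and the review threshold below are illustrative assumptions, not his exact procedure:

```python
import math
from collections import Counter

# Expected Benford frequencies for each first digit.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x: float) -> int:
    # Fractional part of log10 maps |x| into [1, 10); truncate for the digit.
    return int(10 ** (math.log10(abs(x)) % 1))

def benford_chi2(amounts):
    """Chi-square statistic of the observed first-digit counts
    against the Benford expectation (8 degrees of freedom)."""
    data = [a for a in amounts if a != 0]
    n = len(data)
    counts = Counter(first_digit(a) for a in data)
    return sum((counts[d] - n * BENFORD[d]) ** 2 / (n * BENFORD[d])
               for d in range(1, 10))

# A statistic above ~20 (p < 0.01 with 8 degrees of freedom) would flag
# the batch of transactions for closer review: a screening signal,
# not proof of fraud.
```

An auditor would run something like this over, say, a month of expense claims and investigate the batches whose statistic is improbably high.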
The method is not infallible, but it works so well that these tests have been integrated into software used by auditors, such as CaseWare IDEA or ACL.
In another paper, the authors showed that the Discrete Cosine Transform (DCT) coefficients of images closely follow a generalisation of Benford’s law, and used this property for image steganalysis, i.e. to detect whether a given image carries a hidden message.
Benford’s law can also be used to detect anomalies in:
- Economic and social data collected in surveys.
- Election data.
- Cryptocurrency transactions.
- The keystroke dynamics of different users.
- Errors or manipulations in drug discovery data.
In the Benford Online Bibliography you will find a non-commercial, open-access database of articles, books and other resources related to Benford’s law.
Another use case of Benford’s law is the detection of Internet traffic anomalies, such as DDoS attacks. It has been known for many years that packet inter-arrival times exhibit a power-law distribution, which follows Benford’s law.
In contrast, DDoS attacks, being flooding attacks, break any normality in a network’s traffic behaviour. In particular, the packet inter-arrival times become abnormally short and regular, showing up as noticeable deviations from Benford’s law, as can be seen in the following figure.
The best thing about this anomaly-based DoS attack detection method is that, unlike other approaches, “it requires no learning, no deep packet inspection, it is hard to fool and it works even if the packet content is encrypted”.
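The detector described in the paper is more elaborate, but the idea fits in a few lines. Everything below is a synthetic illustration: the “normal” traffic is a log-uniform stand-in for power-law inter-arrival times, the “attack” is a near-constant packet rate, and any real alert threshold would have to be tuned on actual traffic:

```python
import math
import random
from collections import Counter

random.seed(7)  # reproducible synthetic traffic

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x: float) -> int:
    # Fractional part of log10 maps x into [1, 10); truncate for the digit.
    return int(10 ** (math.log10(x) % 1))

def benford_distance(inter_arrival_times):
    """Total absolute deviation of the observed first-digit
    frequencies from Benford's law (0 = perfect agreement)."""
    times = [t for t in inter_arrival_times if t > 0]
    counts = Counter(first_digit(t) for t in times)
    return sum(abs(counts[d] / len(times) - BENFORD[d])
               for d in range(1, 10))

# "Normal" traffic: inter-arrival times spread over six orders of
# magnitude (1 microsecond to 1 second, log-uniform stand-in).
normal = [10 ** random.uniform(-6, 0) for _ in range(5_000)]

# Flooding attack: packets arriving at a nearly constant rate.
attack = [0.001 + random.uniform(0, 0.0001) for _ in range(5_000)]

print(f"normal: {benford_distance(normal):.3f}")
print(f"attack: {benford_distance(attack):.3f}")
```

The flood concentrates almost all inter-arrival times on a single leading digit, so its distance from the Benford frequencies jumps by an order of magnitude compared with the varied timings of normal traffic.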
Benford’s future in cyber security
Biometrics, steganalysis, fraud, network attacks… The world of cybersecurity is beginning to incorporate the analysis of logarithmic probability distributions, with very promising results.
It is a flexible technique: it consumes hardly any resources, is very fast and requires no training. It does require, however, that the normal dataset meet the conditions needed to conform to Benford’s law.
Next time you are faced with a dataset, ask yourself if the first digit of each number follows Benford’s law. You may find unexpected anomalies.
In fact, the same observation had already been made in 1881 by the astronomer and mathematician Simon Newcomb, who published a paper on it that went unnoticed.
Featured photo: This is Engineering RAEng / Unsplash