The base rate fallacy or why antiviruses, antispam filters and detection probes work worse than what is actually promised

ElevenPaths    17 March, 2019

Before starting your workday, while you’re savoring your morning coffee, you open your favorite cybersecurity newsletter and an advertisement for a new Intrusion Detection System catches your attention:

THIS IDS IS CAPABLE OF DETECTING 99% OF ATTACKS!

“Hmmm, not bad”, you think, while taking another sip of coffee. You scroll down, skimming a few more news items, when you see another IDS advertisement:

THIS IDS IS CAPABLE OF DETECTING 99.9% OF ATTACKS!

At first glance, which IDS is better? It seems obvious: the best will be the one capable of detecting a higher number of attacks, that is, the IDS that detects 99.9% rather than the one that detects 99%.

Or maybe not? I’m going to make it easier. Imagine that you find a third advertisement:

THIS IDS IS CAPABLE OF DETECTING 100% OF ATTACKS!

This IDS is definitely the bomb! It detects everything!

Ok, it detects everything but…  at what price? See how easy it is to build an IDS with a 100% detection rate: you only have to tag every incoming packet as malicious. You will obtain 100% detection at the cost of 100% false positives. Here a second factor comes into play, one often overlooked when data on attack detection effectiveness is provided: how many times has the system raised the alarm when there was no attack?
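To make the trick concrete, here is a minimal sketch of such a "perfect" detector. The function name `always_alert` and the four sample packets are invented for illustration; they do not come from any real IDS.

```python
def always_alert(packet: str) -> bool:
    """Hypothetical 'detector' that tags every packet as malicious."""
    return True

# Toy traffic, invented for illustration: (payload, is_malicious)
traffic = [
    ("GET /index.html", False),
    ("GET /login", False),
    ("' OR 1=1 --", True),
    ("GET /logo.png", False),
]

attacks = [payload for payload, bad in traffic if bad]
legitimate = [payload for payload, bad in traffic if not bad]

# Fraction of real attacks that raised an alert
detection_rate = sum(always_alert(p) for p in attacks) / len(attacks)
# Fraction of legitimate packets that raised an alert
false_alert_rate = sum(always_alert(p) for p in legitimate) / len(legitimate)

print(detection_rate, false_alert_rate)  # 1.0 1.0
```

Both rates come out as 1.0: every attack is caught, and every legitimate packet raises an alarm too.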

The detection problem
There is a large number of cybersecurity applications that address the challenge of detecting an attack, an anomaly or malicious behavior:

  • IDSs must detect malicious packets among the legitimate traffic.
  • Antispam filters must find the junk mail (spam) among regular mail (ham).
  • Antiviruses must discover disguised malware among harmless files.
  • Application firewalls must separate malicious URLs from benign ones.
  • Airport metal detectors must pick out weapons and potentially dangerous metallic objects among harmless ones.
  • Vulnerability scanners must warn about vulnerabilities in services or code.
  • Cyberintelligence tools such as Aldara must know whether a conversation on social networks might become a reputational crisis, or whether a Twitter account is a bot or is used by a terrorist cell.
  • Log analysis tools must identify correlated events.
  • Network protocol identification tools must correctly tag the packets.
  • Lie detectors must discern if a suspect is telling the truth or lying.
  • And many other applications. You can add more examples in the comments below. 

In spite of their disparity, all these systems have a common feature: they generate an alert when they consider that they have found what they are looking for. When the alert is right, it is a True Positive (TP). Unfortunately, they are not perfect and also generate alerts when there is no malicious or anomalous activity, which is known as a False Positive (FP).

The following table shows all the possible outcomes when an IDS faces an incident. If the system raises an alert when an incident has actually occurred, it is working correctly: a True Positive (TP) has taken place. However, the system is failing if the incident has occurred but the system does not warn about it: the result is a False Negative (FN). Similarly, if there is no incident and the system raises an alert anyway, we are facing a False Positive (FP), while we are dealing with a True Negative (TN) if the system stays silent in that case.
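The four outcomes described above can be written down as a small lookup. This is only a sketch: the function name `outcome` and its two boolean parameters are assumptions made for illustration.

```python
def outcome(incident: bool, alert: bool) -> str:
    """Classify one detector decision as TP, FN, FP or TN."""
    if incident:
        # A real incident: alerting is correct, silence is a miss.
        return "TP" if alert else "FN"
    # No incident: alerting is a false alarm, silence is correct.
    return "FP" if alert else "TN"

# Enumerate the four combinations of reality and detector response.
for incident in (True, False):
    for alert in (True, False):
        print(f"incident={incident!s:5}  alert={alert!s:5}  ->  {outcome(incident, alert)}")
```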

False alerts are as important as detections
Let’s think about any detection system. For instance, an IDS that detects 99% of attacks is capable of tagging as malicious 99% of the packets that are indeed malicious. In other words, its Detection Rate (DR), also known as True Positive Rate (TPR), is 0.99. Let’s also assume that, when a non-malicious packet comes in, the IDS tags it as non-malicious in 99% of cases, meaning that its False Alert Rate (FAR), also called False Positive Rate (FPR), is 0.01. The truth is that in a conventional network the percentage of malicious packets is extremely low compared to legitimate packets. In this case, let’s assume that 1 in 100,000 packets is malicious, a rather conservative figure. Given these conditions, our IDS warns that a packet is malicious. What is the likelihood that it really is malicious?
Don’t rush to give an answer. Think about it again. 

Have you thought about it carefully once again? Do you have an answer? We will reach it step by step. The following table lists all the data for a specific example of 10,000,000 analyzed packets. Of these, 1 out of 100,000 is malicious, that is, 100 packets; 99% of them will have been correctly identified as malicious, that is, 99 packets, while 1% (a single packet) has not been detected and no alarm has been raised: that packet has slipped through the system. The first column is complete. The remaining 9,999,900 packets are legitimate. The alarm will have sounded erroneously for 1% of these packets, that is, 99,999 packets, while for the remaining 99% the system raised no alert: it stayed silent for a total of 9,899,901 packets. The second column is ready. Obviously, rows and columns must add up to the totals shown in the table.
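The counts above can be reproduced with a few lines of integer arithmetic. The sketch below assumes the layout the text describes (malicious and legitimate packets as columns, alert and no alert as rows); the variable names are illustrative.

```python
TOTAL = 10_000_000                    # analyzed packets

malicious = TOTAL // 100_000          # base rate of 1 in 100,000 -> 100
legitimate = TOTAL - malicious        # 9,999,900

detected = malicious * 99 // 100      # 99% detection rate -> 99
missed = malicious - detected         # 1 packet slips through
false_alerts = legitimate // 100      # 1% false alert rate -> 99,999
silent = legitimate - false_alerts    # 9,899,901

rows = [
    ("Alert", detected, false_alerts),
    ("No alert", missed, silent),
    ("Total", malicious, legitimate),
]
print(f"{'':10}{'Malicious':>12}{'Legitimate':>14}{'Total':>14}")
for label, mal, leg in rows:
    print(f"{label:10}{mal:>12,}{leg:>14,}{mal + leg:>14,}")
```

The "Alert" row adds up to 100,098 alerts, of which only 99 are real attacks.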

With this table, now we are able to quickly answer the previous question: What is the likelihood that a packet will be malicious if the system has tagged it as such?

The answer is provided by the first row: only 99 of the 100,098 generated alerts corresponded to malicious packets. This means that the probability that an alert corresponds to a malicious packet is tiny: only 0.0989031%! You can check the calculations. The IDS gets it right fewer than one in every thousand times the alarm is raised. Welcome to the false positive problem!
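The same figure follows directly from Bayes’ theorem, without counting packets. Using M for "the packet is malicious" and A for "the IDS raises an alert", and plugging in the assumed rates (DR = 0.99, FAR = 0.01, base rate = 1/100,000):

```latex
\[
P(M \mid A)
  = \frac{P(A \mid M)\,P(M)}{P(A \mid M)\,P(M) + P(A \mid \neg M)\,P(\neg M)}
  = \frac{0.99 \times 10^{-5}}{0.99 \times 10^{-5} + 0.01 \times 0.99999}
  \approx 0.000989
\]
```

That is roughly 0.0989%, matching the ratio 99 / 100,098.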

Many people are shocked by this result: how can it fail so badly if the detection rate is 99%? Because the volume of legitimate traffic is overwhelmingly higher than the volume of malicious traffic!

The following Venn diagram helps to better understand what is happening. Even if it is not to scale, it shows how legitimate traffic (¬M) is much more frequent than malicious traffic (M). Indeed, it is about 100,000 times more frequent. The answer to our question can be found in the ratio between area 3) and the whole area A. Area 3) is tiny compared to A. Consequently, the fact that an alarm is raised says very little in absolute terms about the danger of the analyzed packet.

The most paradoxical fact is that, no matter how much the Detection Rate of this hypothetical IDS improves, hardly anything will change until its False Alert Rate decreases. Even in the extreme case of DR = 1.0, with the remaining parameters left untouched, when the IDS sounds an alarm the probability that this alarm corresponds to a real attack remains insignificant: 0.0999011%! As we can see, it does not even reach one per thousand. This is why IDSs have such a bad name: if an IDS is right only once in every thousand warnings, you will end up ignoring all its alerts. The only solution is to improve the False Alert Rate, bringing it as close to zero as possible.
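A short sketch makes it easy to check how little the Detection Rate matters here. The function name `p_malicious_given_alert` is an assumption for illustration; the rates are those of the hypothetical IDS above, and the extra DR values (0.90 and 0.999) are added only for comparison.

```python
def p_malicious_given_alert(detection_rate, false_alert_rate, base_rate):
    """P(M|A) by Bayes' theorem: share of alerts that are real attacks."""
    true_alerts = detection_rate * base_rate
    false_alerts = false_alert_rate * (1 - base_rate)
    return true_alerts / (true_alerts + false_alerts)

BASE_RATE = 1 / 100_000   # 1 malicious packet in 100,000
FAR = 0.01                # False Alert Rate held fixed

# Only the Detection Rate varies; the result barely moves.
for dr in (0.90, 0.99, 0.999, 1.0):
    p = p_malicious_given_alert(dr, FAR, BASE_RATE)
    print(f"DR = {dr:<6} P(M|A) = {p:.7%}")
```

With DR = 0.99 and DR = 1.0 this reproduces the two figures quoted in the text, 0.0989031% and 0.0999011%.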

The following graphic shows how detection effectiveness evolves, in other words, how the probability P(M|A) that there is real malicious activity (M) when the system raises an alert (A) changes as the False Alert Rate P(A|¬M) decreases. Looking at the graphic, it is evident that however much the detection rate improves, effectiveness will never exceed the maximum possible for a given False Alert Rate… unless the False Alert Rate decreases.

In fact, the results are rather bleak: even with a perfect 100% Detection Rate (P(A|M) = 1.0), reaching an effectiveness higher than 50%, P(M|A) > 0.5, would require reducing the False Alert Rate below 1E-05, a feat not likely to be achieved.
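The 1E-05 threshold can be derived by solving Bayes’ theorem for the False Alert Rate. The helper below is a sketch under that assumption; the name `max_false_alert_rate` and its `target` parameter are hypothetical.

```python
def max_false_alert_rate(detection_rate, base_rate, target):
    """Largest FAR for which P(M|A) still reaches `target`.

    Obtained by solving
        target = DR * b / (DR * b + FAR * (1 - b))
    for FAR, where b is the base rate.
    """
    return detection_rate * base_rate * (1 - target) / (target * (1 - base_rate))

BASE_RATE = 1 / 100_000

for dr in (0.99, 1.0):
    limit = max_false_alert_rate(dr, BASE_RATE, target=0.5)
    print(f"DR = {dr}: FAR must stay below {limit:.3e}")
```

For a 50% target the limit is essentially DR times the base rate, which is why a base rate of 1 in 100,000 demands a False Alert Rate of about 1E-05.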

In summary, it becomes clear that the effectiveness of detection systems (for malware, network attacks, malicious URLs or spam) does not depend so much on the system’s capability to detect intrusive behavior as on its capability to discard false alerts.

Why you cannot ignore the Base Rate when evaluating the effectiveness of a system intended to detect any type of malicious behavior
When evaluating the behavior of an IDS, three variables are interconnected:

  • Detection Rate: how well the system identifies a malicious event as such. Ideally, DR = 1.0. This information is usually highlighted by the manufacturer, particularly when it is 100%.
  • False Alert Rate: how often the system wrongly tags a legitimate event as malicious. Ideally, FAR = 0.0. In practice, this value is usually far from ideal. It is common to find values between 1% and 25%.
  • Base Rate: what percentage of events are malicious in the context of the study. The higher the value is (in other words, the more dangerous the environment is), the more effective the IDS will appear, simply because there are more malicious events, so logically a larger share of its alerts will be right. The same IDS in two different contexts, one with a high base rate (many background attacks) and the other with a low base rate (few malicious events), will seem to magically improve its effectiveness in the first. Actually, all that happens is that the more attacks you receive, the more often you will be right when you tag something as an attack.
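To see the base rate effect in numbers, the sketch below keeps the same hypothetical IDS (DR = 0.99, FAR = 0.01) and changes only the environment. The base rates other than 1 in 100,000 are invented for illustration, and the function name is an assumption.

```python
def p_malicious_given_alert(detection_rate, false_alert_rate, base_rate):
    """P(M|A) by Bayes' theorem: share of alerts that are real attacks."""
    true_alerts = detection_rate * base_rate
    false_alerts = false_alert_rate * (1 - base_rate)
    return true_alerts / (true_alerts + false_alerts)

DR, FAR = 0.99, 0.01   # same IDS in every environment

# From a quiet network to a heavily attacked one (illustrative values).
for base_rate in (1e-5, 1e-3, 1e-1):
    p = p_malicious_given_alert(DR, FAR, base_rate)
    print(f"base rate = {base_rate:<8g} P(M|A) = {p:.2%}")
```

The IDS itself never changes, yet the share of alerts that are real attacks climbs steeply as malicious events become more common.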

Manufacturers tend to highlight the first value and leave out the remaining two. Yet the False Alert Rate is as important as the Detection Rate. An organization may waste thousands of hours investigating false alerts in malware, IDSs, logs, spam, etc. Important e-mails or critical files might be kept in quarantine, or even deleted. In the worst-case scenario, when the FAR is very high, validating the alerts becomes so annoying and resource-intensive that the system may be disabled or completely ignored. You have fallen into alarm fatigue!

At a time when most manufacturers of this type of tool reach detection rates close to 100%, pay attention to False Positives as well.

And remember that the effectiveness of any solution will be completely conditioned by the base rate. The less prevalent a problem is, the more times the system will falsely shout: “The Wolf is coming!”.

Gonzalo Álvarez Marañón
@gonalvmar
Innovation and Labs (ElevenPaths)
www.elevenpaths.com
