How to forecast the future and reduce uncertainty thanks to Bayesian inference (II)

ElevenPaths    23 April, 2019

In the first part of this article we explained how Bayesian inference works. According to Norman Fenton, author of Risk Assessment and Decision Analysis with Bayesian Networks: Bayes’ theorem is adaptive and flexible because it allows us to revise and change our predictions and diagnoses in light of new data and information. In this way if we hold a strong prior belief that some hypothesis is true and then accumulate a sufficient amount of empirical data that contradicts or fails to support this, then Bayes’ theorem will favor the alternative hypothesis that better explains the data. In this sense, it is said that Bayes’ Theorem is scientific and rational, since it forces our model to “change its mind”.

The following case studies show the variety of applications of Bayes’ theorem to cybersecurity.

Case Study 1: The success of anti-spam filters
One of the first successful applications of Bayesian inference in cybersecurity was the Bayesian filter (or classifier) used in the fight against spam. Determining whether a message is spam (junk mail, S) or ham (legitimate mail, H) is a classic classification problem for which Bayesian inference is especially well suited.

The method is based on comparing how likely certain words are to appear in spam messages versus legitimate ones. For example, by examining historical logs of spam and ham, we can estimate the probability that a word (P) appears in a spam message (S), written Pr(P|S).

Likewise, the probability that the word appears in a legitimate message is Pr(P|H). To calculate the probability that a message is spam given that it contains the word, Pr(S|P), we can once again use Bayes’ equation, where Pr(S) is the base rate: the probability that any given e-mail is spam.
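For reference, with the symbols defined above and only two possible classes (spam and ham), Bayes’ equation takes the following standard form:

```latex
\Pr(S \mid P) = \frac{\Pr(P \mid S)\,\Pr(S)}{\Pr(P \mid S)\,\Pr(S) + \Pr(P \mid H)\,\Pr(H)}
```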

Statistics report that 80% of the e-mail circulating on the Internet is spam. Therefore, Pr(S) = 0.80 and Pr(H) = 1 - Pr(S) = 0.20. Typically, a threshold for Pr(S|P) is chosen, for instance 0.95. Depending on the word P found in the message, the resulting probability will fall above or below the threshold, and the message will be classified as spam or as ham accordingly.
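As a minimal sketch of this single-word decision rule, the following Python snippet applies Bayes’ equation with the 80% base rate and the 0.95 threshold mentioned above. The function names and the word likelihoods in the usage lines are hypothetical values chosen for illustration; they do not come from any real mail corpus.

```python
def spam_posterior(p_word_given_spam, p_word_given_ham, p_spam=0.80):
    """Return Pr(S|P) for the two classes spam (S) and ham (H)."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    evidence = numerator + p_word_given_ham * p_ham  # Pr(P)
    return numerator / evidence


def classify(p_word_given_spam, p_word_given_ham, p_spam=0.80, threshold=0.95):
    """Label a message as spam when Pr(S|P) reaches the threshold."""
    posterior = spam_posterior(p_word_given_spam, p_word_given_ham, p_spam)
    return "spam" if posterior >= threshold else "ham"


# Hypothetical likelihoods: (Pr(P|S), Pr(P|H)) for two illustrative words.
strong_indicator = (0.20, 0.01)  # far more frequent in spam than in ham
weak_indicator = (0.05, 0.04)    # almost as frequent in ham as in spam

print(classify(*strong_indicator))
print(classify(*weak_indicator))
```

With these made-up numbers, the first word pushes the posterior above the threshold, while the second one does not, even though the base rate alone already favors spam.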

A common simplification consists in assuming that spam and ham are equally likely: Pr(S) = Pr(H) = 0.50. Moreover, if we change the notation so that p = Pr(S|P), p1 = Pr(P|S) and q1 = Pr(P|H), the previous formula simplifies to p = p1 / (p1 + q1).
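Spelling out the substitution, the equal priors cancel between numerator and denominator, which is why the base rate disappears from the simplified expression:

```latex
p = \frac{p_1 \cdot 0.5}{p_1 \cdot 0.5 + q_1 \cdot 0.5} = \frac{p_1}{p_1 + q_1}
```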

But of course, trusting a single word to determine whether a message is spam or ham may lead to a high number of false positives. For this reason, many other words are usually included in what is commonly known as a Naive Bayes classifier. The term “naive” comes from the assumption that the words searched for are independent of one another, which is an idealization of natural language. The probability that a message is spam when it contains these n words may be calculated as follows:
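Under the notation above (pi = Pr(Pi|S), qi = Pr(Pi|H)), the independence assumption and equal priors, the combined probability works out to p = (p1·p2·…·pn) / (p1·p2·…·pn + q1·q2·…·qn). The following Python sketch computes it; the function name and the example likelihoods are hypothetical, and the computation is done in log space, which is a common implementation choice rather than something the article prescribes.

```python
import math


def naive_bayes_spam_probability(word_likelihoods, p_spam=0.5):
    """Combine per-word likelihoods into Pr(S | P1, ..., Pn).

    word_likelihoods: iterable of (p_i, q_i) pairs, with
    p_i = Pr(P_i | S) and q_i = Pr(P_i | H), both strictly positive.
    """
    log_spam = math.log(p_spam)
    log_ham = math.log(1.0 - p_spam)
    for p_i, q_i in word_likelihoods:
        log_spam += math.log(p_i)  # naive assumption: words are independent
        log_ham += math.log(q_i)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the posterior is effectively 0
        return 0.0
    # Equivalent to spam_term / (spam_term + ham_term).
    return 1.0 / (1.0 + math.exp(diff))


# Hypothetical likelihoods for three words found in one message.
words = [(0.20, 0.01), (0.10, 0.02), (0.05, 0.04)]
print(round(naive_bayes_spam_probability(words), 4))
```

Summing logarithms instead of multiplying raw probabilities avoids numerical underflow when a message contains many words, since a product of hundreds of small numbers quickly rounds to zero.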

So next time you open your e-mail inbox and find it free of spam, you can thank Mr. Bayes (and Paul Graham as well). If you wish to examine the source code of a successful anti-spam filter based on Naive Bayes classifiers, take a look at SpamAssassin.

Case Study 2: Malware detection
Of course, these classifiers may be applied not only to spam detection, but also to other types of threats. For instance, malware detection solutions based on Bayesian inference have gained popularity in recent years.

Case Study 3: Bayesian Networks
Naive Bayes classification assumes that the features under study are conditionally independent. Despite its resounding success in classifying spam, malware, malicious packets, etc., the truth is that in more realistic models these features are interdependent. To capture that conditional dependence, Bayesian networks were developed; they are capable of improving the efficiency of rule-based detection systems (for malware, intrusions, etc.). Bayesian networks are a powerful type of machine learning model for helping decrease the false positive rate of these systems.

A Bayesian network consists of nodes (or vertices) that represent random variables, and arcs (or edges) that represent the dependencies between those variables, quantified by conditional probabilities. Each node holds the probability of its variable conditioned on the values of its parent nodes. For example, the following figure shows a simple Bayesian network:

And here you have the probability of the whole network:

Pr(x1,x2,x3,x4,x5,x6) = Pr(x6|x5)Pr(x5|x3,x2)Pr(x4|x2,x1)Pr(x3|x1)Pr(x2|x1)Pr(x1)
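To make the factorization concrete, here is a minimal Python sketch that evaluates the joint probability for binary variables by multiplying the six factors in the formula above. Every number in the conditional probability tables is a hypothetical value invented for illustration; only the parent structure (x2 and x3 depend on x1, x4 on x2 and x1, x5 on x3 and x2, x6 on x5) is taken from the formula.

```python
from itertools import product

# Hypothetical tables: each entry is Pr(node = True | parent values).
P_X1 = 0.3
P_X2 = {True: 0.8, False: 0.1}   # keyed by x1
P_X3 = {True: 0.6, False: 0.2}   # keyed by x1
P_X4 = {(True, True): 0.9, (True, False): 0.5,
        (False, True): 0.4, (False, False): 0.05}   # keyed by (x2, x1)
P_X5 = {(True, True): 0.95, (True, False): 0.6,
        (False, True): 0.3, (False, False): 0.01}   # keyed by (x3, x2)
P_X6 = {True: 0.7, False: 0.1}   # keyed by x5


def bernoulli(p_true, value):
    """Probability of a binary outcome given Pr(True)."""
    return p_true if value else 1.0 - p_true


def joint(x1, x2, x3, x4, x5, x6):
    """Pr(x1,...,x6) as the product of the six factors in the formula."""
    return (
        bernoulli(P_X6[x5], x6)
        * bernoulli(P_X5[(x3, x2)], x5)
        * bernoulli(P_X4[(x2, x1)], x4)
        * bernoulli(P_X3[x1], x3)
        * bernoulli(P_X2[x1], x2)
        * bernoulli(P_X1, x1)
    )


# Sanity check: the joint distribution must sum to 1 over all 2**6 states.
total = sum(joint(*state) for state in product([True, False], repeat=6))
print(round(total, 10))
```

The factorization needs far fewer parameters than a full joint table: 13 conditional probabilities here, instead of the 63 free entries required to list every combination of six binary variables.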

The greatest challenges for Bayesian networks are learning the structure of the network and, once that structure is known, training it. The authors of Data Mining and Machine Learning in Cybersecurity present several examples of applications of Bayesian networks to cybersecurity, such as the following one:

In this network configuration, a file server (host 1) provides several services: File Transfer Protocol (FTP), Secure Shell (SSH), and Remote Shell (RSH). The firewall allows FTP, SSH, and RSH traffic from a user workstation (host 0) to the server (host 1). The two numbers in parentheses denote the origin and destination hosts. The example addresses four common vulnerabilities: sshd buffer overflow (sshd_bof), ftp_rhosts, rsh login (rsh), and setuid local buffer overflow (Local_bof). An attack path can be described as a sequence of nodes, for example ftp_rhosts (0, 1) → rsh (0, 1) → Local_bof (1, 1). The conditional probability values for each variable are shown in the network graphic. For instance, the Local_bof variable holds the conditional probability of an overflow (or no overflow) on host 1 for each combination of values of its parents, rsh and sshd_bof. As can be seen:

Pr(Local_bof(1, 1) = Yes | rsh(0, 1) = Yes, sshd_bof(0, 1) = Yes) = 1,

Pr(Local_bof(1, 1) = No | rsh(0, 1) = No, sshd_bof(0, 1) = No) = 1.

By using Bayesian networks, human experts can easily understand the structure of the network as well as the underlying relationships among the attributes of the data. Moreover, they can modify and improve the model.

Case Study 4: The CISO against a massive data breach
In How to Measure Anything in Cybersecurity Risk, the authors present an example of Bayesian inference applied by a CISO. In the scenario, the CEO of a large organization calls his CISO because he is worried about news of an attack against another organization in their sector. What is the probability that they will suffer a similar cyberattack?

The CISO gets to work. What can he do to estimate the probability of suffering an attack, apart from checking the base rate (the occurrence of attacks against similar companies in a given year)? He decides that a pentest could provide good evidence about whether a remotely exploitable vulnerability exists, which in turn would influence the probability of suffering such an attack. Based on his extensive experience and skills, he estimates the following probabilities:

  • The probability that the pentest result suggests that there is a remotely exploitable vulnerability: Pr(T) = 0.01
  • The probability that the organization has a remotely exploitable vulnerability if the pentest is positive: Pr(V|T) = 0.95; and if the pentest is negative: Pr(V|¬T) = 0.0005
  • The probability of a successful attack if such a vulnerability exists: Pr(A|V) = 0.25; and if it does not exist: Pr(A|¬V) = 0.01

These estimates make up his prior knowledge. Equipped with all of them as well as Bayes’ equation, he can now calculate the following probabilities, among others:

  • The probability of a successful attack: Pr(A) = 1.24%
  • The probability of a remotely exploitable vulnerability: Pr(V) = 1.00%
  • The probability of a successful attack if the pentest provides positive results: Pr(A|T) = 23.80%; and if it provides negative results: Pr(A|¬T) = 1.01%
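These figures can be reproduced with the law of total probability, as the following Python sketch shows. It relies on one assumption that the text does not state explicitly: the pentest result T influences the attack A only through the vulnerability V (a chain T → V → A), so that Pr(A|V) and Pr(A|¬V) do not change once the pentest result is known. The variable and function names are illustrative.

```python
# Estimates stated by the CISO.
P_T = 0.01                # Pr(T): pentest comes back positive
P_V_GIVEN_T = 0.95        # Pr(V|T)
P_V_GIVEN_NOT_T = 0.0005  # Pr(V|not T)
P_A_GIVEN_V = 0.25        # Pr(A|V)
P_A_GIVEN_NOT_V = 0.01    # Pr(A|not V)


def total_probability(p_given_yes, p_given_no, p_yes):
    """Law of total probability over a yes/no condition."""
    return p_given_yes * p_yes + p_given_no * (1.0 - p_yes)


# Pr(V): average over the two possible pentest outcomes.
p_v = total_probability(P_V_GIVEN_T, P_V_GIVEN_NOT_T, P_T)
# Pr(A): average over vulnerability present / absent.
p_a = total_probability(P_A_GIVEN_V, P_A_GIVEN_NOT_V, p_v)
# Pr(A|T) and Pr(A|not T): same average, using Pr(V|T) or Pr(V|not T)
# (assumes A depends on T only through V).
p_a_given_t = total_probability(P_A_GIVEN_V, P_A_GIVEN_NOT_V, P_V_GIVEN_T)
p_a_given_not_t = total_probability(P_A_GIVEN_V, P_A_GIVEN_NOT_V, P_V_GIVEN_NOT_T)

print(f"Pr(V)       = {p_v:.2%}")              # 1.00%
print(f"Pr(A)       = {p_a:.2%}")              # 1.24%
print(f"Pr(A|T)     = {p_a_given_t:.2%}")      # 23.80%
print(f"Pr(A|not T) = {p_a_given_not_t:.2%}")  # 1.01%
```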

It is clear how critical the pentest results are for estimating the remaining probabilities, given that Pr(A|T) > Pr(A) > Pr(A|¬T). If a condition increases the prior probability, then its opposite must reduce it.
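The ordering is not a coincidence. By the law of total probability, Pr(A) is a weighted average of the two conditional probabilities, so it must lie between them. With the values of this example:

```latex
\Pr(A) = \Pr(A \mid T)\,\Pr(T) + \Pr(A \mid \neg T)\,\Pr(\neg T)
       = 0.238 \times 0.01 + 0.01012 \times 0.99 \approx 0.0124
```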

Of course, the CISO’s real working life is much more complex. This simple example gives us a glimpse of how Bayesian inference may be applied to revise judgements as evidence accumulates, in an attempt to reduce uncertainty.

Beyond classifiers: Bayesians’ everyday life in cybersecurity
Does this mean that from now on you need to carry an Excel sheet everywhere in order to estimate prior and posterior probabilities, likelihoods, etc.? Fortunately, it doesn’t. The most important aspect of Bayes’ theorem is the concept behind the Bayesian view: getting progressively closer to the truth by constantly updating our beliefs in proportion to the weight of the evidence. Bayes reminds us how necessary it is to feel comfortable with probability and uncertainty.

A Bayesian cybersecurity practitioner:

  • Has the base rate in mind: most people focus on the new evidence and forget the base rate (prior knowledge). It is a well-known bias that we discussed when we explained the representativeness heuristic: we pay more attention to anecdotal evidence than to the base rate. This bias is also known as the base rate fallacy.
  • Imagines that his model is flawed and asks himself: what could go wrong? Excessive trust in the extent of your own knowledge may lead you to very bad decisions. In a previous article we examined confirmation bias: we tend to prioritize information that confirms our hypotheses, ideas and personal beliefs, whether they are true or not. The greatest risk of confirmation bias is that if you look for a single type of evidence, you will certainly find it. You need to look for both types of evidence, including the kind that refutes your model.
  • Updates his beliefs: new evidence affects the initial belief. Changing your mind in light of new evidence is a sign of strength, not of weakness. These changes need not be extreme; they can be incremental, as evidence accumulates in one direction or another.
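The incremental nature of these updates can be illustrated with a short Python sketch in which yesterday’s posterior becomes today’s prior. All numbers are hypothetical: a 1% base rate for some hypothesis (say, “this host is compromised”) and three successive pieces of evidence, each assumed to be five times more likely if the hypothesis is true and assumed independent of the others given the hypothesis.

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """One Bayesian update: return Pr(hypothesis | evidence)."""
    numerator = p_evidence_if_true * prior
    return numerator / (numerator + p_evidence_if_false * (1.0 - prior))


belief = 0.01        # hypothetical base rate
history = [belief]
for _ in range(3):   # three hypothetical, conditionally independent alerts
    belief = update(belief, p_evidence_if_true=0.50, p_evidence_if_false=0.10)
    history.append(belief)

print([round(b, 3) for b in history])
```

No single alert is enough to overturn the low base rate, but the belief grows with each one and ends above 50% after the third. That is the behaviour described above: neither ignoring the prior nor clinging to it.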

Riccardo Rebonato, author of Plight of the Fortune Tellers: Why We Need to Manage Financial Risk Differently, asserts:

According to the Bayesian view of the world, we always start from some prior belief about the problem at hand. We then acquire new evidence. If we are Bayesians, we neither accept in full this new piece of information, nor do we stick to our prior belief as if nothing had happened. Instead, we modify our initial views to a degree commensurate with the weight and reliability both of the evidence and of our prior belief.

First part of the article:

» How to forecast the future and reduce uncertainty thanks to Bayesian inference (I).
