Adversarial Attacks: The Enemy of Artificial Intelligence (II)

Franco Piergallini Guida    28 September, 2020

In machine learning and deep learning, as in any system, there are vulnerabilities and techniques that let an attacker manipulate a model's behaviour. As we discussed in the first part of this article on Adversarial Attacks, one of these techniques is the adversarial example: an input carefully crafted by an attacker to alter a model's response. Let’s look at some examples.

The simplest one dates back to the early days of spam detection. Standard classifiers such as Naive Bayes were very successful against emails containing text like: Make rapid money! Refinance your mortgage, Viagra… Since these messages were automatically detected and classified as spam, spammers learned to trick the classifiers by inserting punctuation, special characters or HTML code such as comments or even fake tags. So they started using “disguises” like: v.ia.g.ra, Mα∑e r4p1d mФn €y!…
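To make the evasion concrete, here is a minimal sketch of a keyword-based filter and of one possible normalisation step. It is not a Naive Bayes classifier; the keyword list and the function names (`naive_is_spam`, `normalise`, `normalised_is_spam`) are hypothetical and only serve to illustrate why inserting dots defeats exact token matching.

```python
import re

# Hypothetical keyword list; a real filter would learn word weights from data.
SPAM_KEYWORDS = {"viagra", "mortgage"}

def naive_is_spam(text: str) -> bool:
    """Flag a message if any whitespace-separated token is a known spam keyword."""
    tokens = text.lower().split()
    return any(token in SPAM_KEYWORDS for token in tokens)

def normalise(text: str) -> str:
    """Lower-case the text and drop every character that is not a-z, 0-9 or whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

def normalised_is_spam(text: str) -> bool:
    """Apply the same keyword check after normalisation."""
    return naive_is_spam(normalise(text))

print(naive_is_spam("buy viagra now"))          # plain keyword is caught
print(naive_is_spam("buy v.ia.g.ra now"))       # dotted disguise slips through
print(normalised_is_spam("buy v.ia.g.ra now"))  # caught again after normalisation
```

This normalisation only undoes inserted punctuation. Substitutions such as digits or look-alike characters from other alphabets would need additional handling, which is exactly the kind of back-and-forth the article describes.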

And they went further. Once the classifiers had solved this problem, the attackers invented a new trick: to evade classifiers that relied on text analysis, they simply embedded the message in an image.

Picture 1: Adversarial examples Ebay

Several countermeasures were quickly developed, based on hashes of images already known to be spam and on OCR to extract the text from images. To evade these defences, attackers began applying filters and random-noise transformations to the images, making character recognition considerably harder.

Picture 2: Random noise
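The article does not say which kind of image hash those defences used. As a minimal sketch, assuming an exact cryptographic hash (SHA-256) over the raw image bytes, the following shows why even a one-byte change is enough to slip past a blocklist. The byte string standing in for an image and the names `digest`, `blocklist` and `is_blocked` are hypothetical.

```python
import hashlib

def digest(image_bytes: bytes) -> str:
    """SHA-256 digest of the raw image bytes, as a hex string."""
    return hashlib.sha256(image_bytes).hexdigest()

# Stand-in for the raw bytes of a known spam image (hypothetical data).
known_spam_image = bytes(range(256)) * 4
blocklist = {digest(known_spam_image)}

def is_blocked(image_bytes: bytes) -> bool:
    """An image is blocked only if its digest matches a known spam digest exactly."""
    return digest(image_bytes) in blocklist

# The attacker flips a single bit in one byte (think: one barely visible pixel change).
noisy = bytearray(known_spam_image)
noisy[10] ^= 1
noisy_image = bytes(noisy)

print(is_blocked(known_spam_image))  # the original image matches the blocklist
print(is_blocked(noisy_image))       # the slightly altered copy does not
```

Hashes designed to tolerate small changes behave differently, so this sketch only illustrates the exact-match case.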

As in cryptography, we find ourselves in an endless game in which new defence and attack techniques are constantly being found. Let’s stop at this point.

Image Classification and Adversarial Attacks

In image classification, attackers learned to generate white noise meticulously and strategically, using algorithms that maximise its impact on the neural network while going unnoticed by the human eye. In other words, they stimulate the internal layers of the network in a way that completely alters its response and prevents the input from being processed correctly.

One reason these attacks are possible on images is the dimensionality of the input and the practically unlimited number of combinations a neural network can receive as input. While we can apply techniques such as data augmentation to increase both the size and variety of our training data sets, it is impossible to capture the great combinatorial complexity involved in the actual space of possible images.
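To get a feel for that combinatorial complexity, here is a quick back-of-the-envelope calculation. The image size (224 x 224 pixels, three colour channels, 256 levels per channel) is an assumption chosen for illustration and does not come from the article.

```python
import math

# Assumed input size: 224 x 224 RGB with 8 bits per channel (illustrative only).
height, width, channels, levels = 224, 224, 3, 256

n_values = height * width * channels    # number of input dimensions
digits = n_values * math.log10(levels)  # log10 of the number of distinct images

print(f"{n_values} input dimensions")
print(f"about 10^{digits:.0f} distinct images")
```

Any training set, however heavily augmented, covers a vanishing fraction of a space of that size.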

But how is this white noise generated? First, we will formulate the adversarial examples mathematically, from the perspective of optimization. Our fundamental objective in supervised learning is to provide an accurate mapping from an input to an output by optimizing some parameters of the model. This can be formulated as the following optimization problem:

$$\min_{\theta} \; \mathrm{loss}(\theta, X_i, Y_i)$$

This is what is typically known as training a neural network. The optimisation is carried out with algorithms such as stochastic gradient descent, among others.
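As a minimal sketch of this optimisation, the following runs stochastic gradient descent on a tiny linear model with a squared-error loss. The synthetic data, the learning rate and all names (`TRUE_THETA`, `predict`, `loss`) are assumptions made for illustration; a real network would have many more parameters and a non-linear model.

```python
import random

random.seed(0)

# Synthetic data (hypothetical): targets follow a known linear rule plus small noise.
TRUE_THETA = [2.0, -1.0, 0.5]

def predict(theta, x):
    """Linear model: dot product of parameters and input."""
    return sum(t * xi for t, xi in zip(theta, x))

X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
Y = [predict(TRUE_THETA, x) + random.gauss(0, 0.01) for x in X]

def loss(theta, X, Y):
    """Mean squared error over the whole data set."""
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(X, Y)) / len(X)

theta = [0.0, 0.0, 0.0]
learning_rate = 0.01
initial_loss = loss(theta, X, Y)

for epoch in range(50):
    order = list(range(len(X)))
    random.shuffle(order)  # visit the samples in a new random order each epoch
    for i in order:
        error = predict(theta, X[i]) - Y[i]
        # Gradient of the squared error on sample i with respect to theta is 2 * error * x_i.
        theta = [t - learning_rate * 2 * error * xi for t, xi in zip(theta, X[i])]

final_loss = loss(theta, X, Y)
print(initial_loss, final_loss, theta)
```

Here the input data stay fixed and only the parameters theta move, always in the direction that lowers the loss.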

A very similar approach can be used to make a model misclassify a specific input. To generate an adversarial example, we take the parameters to which the network converged during training and optimise over the space of possible inputs instead. That is, we look for a perturbation δ that, when added to the input, maximises the model’s loss function:

$$\max_{\delta \in \Delta} \; \mathrm{loss}(\theta, X_i + \delta, Y_i)$$
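One common way to make this problem concrete, offered here as an assumption rather than something the formulation above specifies, is to take ∆ to be a small ball in the max norm and to approximate the loss linearly around the input. Under those two assumptions the maximiser has a closed form: move every input coordinate by ε in the direction of the sign of the gradient.

```latex
% Assumed constraint set: every coordinate of the perturbation is at most epsilon in magnitude.
\Delta = \{\delta : \|\delta\|_\infty \le \varepsilon\}

% First-order approximation of the loss around the clean input X_i.
\mathrm{loss}(\theta, X_i + \delta, Y_i) \approx
  \mathrm{loss}(\theta, X_i, Y_i) + \delta^{T} \nabla_{X} \mathrm{loss}(\theta, X_i, Y_i)

% Maximiser of the linear term over the assumed set.
\delta^{*} = \varepsilon \cdot \mathrm{sign}\big(\nabla_{X} \mathrm{loss}(\theta, X_i, Y_i)\big)
```

Other choices of ∆ lead to different perturbations, and without the linear approximation the problem generally has to be solved iteratively.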

Toy Example

Let’s think for a moment about a simple example where we have a linear regression neuron, with a 6-dimensional input:

$$y = Wx + b, \qquad x \in \mathbb{R}^{6}$$

After training, it converged to the following weights: $W = (0, -1, -2, 0, 3, 1)$, $b = 0$. Given the input:

[Image: the input vector x (image-40.png)]

The neuron outputs:

$$y = Wx + b = -1$$

So how do we change $x \to x^*$ so that $y(x^*)$ changes radically while $x^* \approx x$? The derivative $\partial y / \partial x = W^T$ tells us how small changes in $x$ affect $y$. To generate $x^*$ we add a small perturbation $\varepsilon W^T$, with $\varepsilon = 0.5$, to the input $x$:

$$x^* = x + \varepsilon W^T = x + (0, -0.5, -1, 0, 1.5, 0.5)$$

And if we do forward propagation to our new x* input, if we are lucky, we will notice a difference from the output provided by the model for x.

$$y(x^*) = W x^* + b = (Wx + b) + \varepsilon W W^T = -1 + 0.5 \cdot 15 = 6.5$$

Indeed, for the input $x^*$ we get 6.5 as output, whereas for $x$ we had -1. This technique (with some minor differences from the toy example we have just seen) is called the fast gradient sign method and was introduced in 2015 by Ian Goodfellow et al. in the paper Explaining and Harnessing Adversarial Examples.
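The toy example above can be reproduced in a few lines. The weights, bias and ε are the ones given in the text. The input `x` below is a hypothetical vector chosen only because it gives the same output of -1, and the helper names `neuron` and `sign` are illustrative. The last line shows one of the "minor differences": a variant that uses only the sign of the gradient, so that no coordinate moves by more than ε.

```python
# Weights, bias and epsilon from the toy example in the text.
W = [0, -1, -2, 0, 3, 1]
b = 0
epsilon = 0.5

def neuron(x):
    """Linear neuron: y = W . x + b."""
    return sum(w * xi for w, xi in zip(W, x)) + b

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

# Hypothetical input chosen so that the neuron outputs -1.
x = [1, 1, 1, 1, 0, 2]

# For a linear neuron, dy/dx is simply W, so no backpropagation is needed.
x_gradient_step = [xi + epsilon * w for xi, w in zip(x, W)]    # perturbation used in the text
x_sign_step = [xi + epsilon * sign(w) for xi, w in zip(x, W)]  # sign-of-gradient variant

print(neuron(x))                # -1
print(neuron(x_gradient_step))  # 6.5
print(neuron(x_sign_step))      # 2.5
```

Note that this sketch pushes the neuron's output directly, as the toy example does. The method named above applies the sign step to the gradient of a loss function with respect to the input.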

Future Adversarial Examples: Autonomous Cars

Adversarial examples are an innate feature of all optimisation problems, including deep learning. But if we go back about 10 years, deep learning did not even do a good job on normal, unaltered data. The fact that we are now searching and investigating ways to “hack” or “break” into neural networks means that they have become incredibly advanced.

But can these attacks have an impact on the real world, such as the autopilot system in a car? Elon Musk gave his opinion on Lex Fridman’s podcast, stating that these types of attacks can be easily controlled. In a black-box setting, where attackers have no access to the internal details of the neural network such as its architecture or parameters, the probability of success is relatively low, approximately 4% on average. However, Keen Labs researchers have managed to generate adversarial examples that alter the behaviour of Tesla’s autopilot system. Furthermore, in white-box settings, adversarial examples can be generated with an average success rate of 98% (An Analysis of Adversarial Attacks and Defences on Autonomous Driving Models). This implies a high susceptibility in open-source self-driving projects such as comma.ai, where the architecture and parameters of the models are fully exposed.

Waymo, a developer of autonomous vehicles belonging to the Alphabet Inc. conglomerate, publishes a range of high-resolution sensor data collected by its cars in a wide variety of conditions, in order to help the research community move this technology forward. This data could be used to train a wide variety of models and to generate adversarial attacks that, in some cases, could affect the networks used by Waymo because of transferability: a property of neural networks whereby two models trained for the same objective tend to rely on the same features, so an adversarial example crafted for one often fools the other.

We must mention that there is a big gap between fooling a model and fooling a system that contains a model. Neural networks are often just one component in an ecosystem where different types of analysis interact to reach a decision. In the case of autonomous cars, under an adversarial attack the decision to slow down because the front camera detected a possible nearby object may not agree with the data from another component such as a LIDAR. But in other kinds of decision, such as reading traffic signs, only video analysis is involved, so an attack could have a really dangerous effect, for example by turning a stop sign into a 50 km/h speed limit sign.

Picture 4: Stop signal

This technique undoubtedly constitutes a latent threat to the world of deep learning. But that is not all: there are other types of attack that an attacker can exploit at each stage of the machine learning pipeline:

  • Training stage: poisoning of the data set.
  • Learned parameters: parameter manipulation attacks.
  • Inference stage: adversarial attacks.
  • Test outputs: model theft.
