A neural network has a simple objective: to recognise patterns inherent in data sets. To achieve this, it must be able to “learn” through a training process in which thousands of parameters are adjusted until a combination is found that minimises a given error metric. If it ultimately finds a combination of parameters that allows it to generalise, it will be able to recognise these patterns and predict, within an acceptable error tolerance, inputs never seen during training. This data can be images, video, audio, or tabular data. But what if someone knows how to manipulate this data to produce whatever output suits them?
Imperceptibly and unconsciously, we use neural networks in a multitude of tasks that we perform every day. Some of the simplest examples are the film recommendation system on Netflix and the music recommendations on Spotify, the identification and categorisation of emails, the interpretation of queries and next-word predictions in search engines, virtual assistants and their natural language processing, facial recognition in cameras and, of course, the identification of friends on social networks as well as the funny filters that change our facial features.
Neural networks succeed, without discrimination, in an immense variety of fields. We can use them to diagnose COVID-19, track down drug dealers on social networks, or even detect fake news. However, it has been shown that they can also be hacked, taking us back to the essence and definition of hacking: manipulating the normal behaviour of a system.
One of the techniques used to arbitrarily manipulate neural networks is commonly known as an “adversarial attack”: by crafting a carefully constructed input, we can produce the output we desire. For example, if we have a neural network that, based on the sound of a cough, predicts the probability of having COVID-19, we could manipulate the recorded spectrograms by adding noise that raises or lowers the response probability. We could even generate a spectrogram that is meaningless, or similar to those produced by a real cough, and obtain any desired response probability.
Example with Deep Fakes
Let’s see a specific example: we have a system that is very good at predicting whether a video is a deepfake. One traditional approach begins by collecting and aligning the n faces appearing in the video, using a neural network dedicated to this task. Once collected, a second network predicts the probability of each face being a deepfake.
The last step is to take the average of the probabilities of all n collected faces. If this average is greater than an established limit (for example, 0.6), the video is classified as a deepfake; otherwise, it is not. In the example, the quality of the generated deepfake is not very good, so the system is very confident when classifying it (0.86).
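The averaging step above can be sketched in a few lines. The threshold of 0.6 is the one from the example; the per-face probabilities are made-up values chosen to reproduce the 0.86 average:

```python
import numpy as np

# Hypothetical per-face "deepfake" probabilities from the second network,
# one value per face collected from the video.
face_probs = np.array([0.91, 0.82, 0.88, 0.83])

THRESHOLD = 0.6  # limit from the example above

# Average the per-face probabilities and compare against the limit.
video_score = face_probs.mean()
label = "deepfake" if video_score > THRESHOLD else "not deepfake"
print(video_score, label)  # ≈ 0.86, classified as deepfake
```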
To modify the output probability of the system, we need to generate noise strategically and insert it into the video. The noise is subject to three restrictions:
- It must be subtle enough that the face-detection network continues to do its job well.
- It must lower the probability that the second network predicts for every collected face.
- The modifications must be as unnoticeable to humans as possible.
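The third restriction is usually enforced with an L-infinity budget: every pixel of the noise is kept within ±ε of the original frame, and the result is clipped back to the valid pixel range. A minimal sketch, where the frame, the raw noise, and ε are all made-up values:

```python
import numpy as np

# Hypothetical face crop and raw, unconstrained noise.
rng = np.random.default_rng(42)
frame = rng.integers(0, 256, size=(256, 256, 3)).astype(np.float32)
noise = rng.normal(0, 8, size=frame.shape).astype(np.float32)

EPS = 4.0  # maximum per-pixel change, in raw 0-255 pixel units

# Restriction 3: bound the perturbation, then keep pixels in [0, 255].
bounded_noise = np.clip(noise, -EPS, EPS)
adv_frame = np.clip(frame + bounded_noise, 0.0, 255.0)

max_change = float(np.abs(adv_frame - frame).max())
print(max_change)  # never exceeds EPS
```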
Analysing the second network in detail, we can see that the input it receives is always the same size: a 256-pixel high by 256-pixel wide RGB image. Neural networks are deterministic: the same input always produces the same output, and any image that fits the first layer will produce one. Each pixel channel takes a value between 0 and 255, which means that the space of possible inputs to the second network has 256^(256×256×3) combinations, but only a very small subset will meet all three restrictions.
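To get a feel for the size of that input space, we can count the decimal digits of 256^(256·256·3) with logarithms, rather than computing the number itself:

```python
import math

num_pixels = 256 * 256 * 3   # height × width × RGB channels
values_per_pixel = 256       # each channel takes a value in 0..255

# Number of decimal digits of 256 ** num_pixels, via log10.
digits = math.floor(num_pixels * math.log10(values_per_pixel)) + 1
print(digits)  # 473480 — a number with almost half a million digits
```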
To generate the noise, we use the Fast Gradient Sign Method (live demo), a white-box attack that requires full access to the system’s gradients. But what happens when we have only one chance to fool the system? We could build our own replica of the original model and generate the noise against it. There is a high probability that the attack will still work thanks to transferability, a property that is still under active study but that basically says two models trained for the same objective will rely on similar features to accomplish it.
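A minimal FGSM sketch against a toy logistic-regression “detector”: the weights, bias, and input below are all hypothetical values, and a real attack would differentiate through the full deepfake classifier rather than this linear stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16)           # hypothetical model weights
b = 0.1                           # hypothetical bias
x = rng.uniform(0, 1, size=16)    # hypothetical normalised input "image"

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    return sigmoid(w @ x + b)     # probability of "deepfake"

# For target label y = 1 ("deepfake"), the gradient of the binary
# cross-entropy loss with respect to the input is (p - y) * w.
p_orig = predict(x)
grad_x = (p_orig - 1.0) * w

# FGSM step: move each input value by eps in the sign of the gradient,
# which increases the loss and therefore lowers the "deepfake" probability.
EPS = 0.05
x_adv = np.clip(x + EPS * np.sign(grad_x), 0.0, 1.0)
p_adv = predict(x_adv)

print(p_orig, p_adv)  # the adversarial probability is lower
```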
How Can We Protect Ourselves from This Kind of Attack?
One solution may be to add a new neural network that works as a kind of IDS (SafetyNet) in our pipeline. If it detects that the image or video contains this type of attack, it can discard the video and classify it as malicious. Another solution is to generate these attacks ourselves and include them in our data sets during training, so that the network learns to tag them as malicious. However, this option is very costly due to the number of combinations over which attacks can be generated.
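The second defence, folding attacked samples back into the training set, can be sketched as below. The data set and linear model are toy stand-ins, and real adversarial training regenerates the attacks against the current model every epoch, which is part of what makes it so expensive:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy training set: 100 flattened "frames" with binary labels.
X_train = rng.uniform(0, 1, size=(100, 16))
y_train = rng.integers(0, 2, size=100).astype(float)

w = rng.normal(size=16)  # hypothetical current model weights
EPS = 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# FGSM-style perturbation for every sample: for a linear model, the
# input-gradient of the binary cross-entropy loss is (p - y) * w.
p = sigmoid(X_train @ w)
grad = (p - y_train)[:, None] * w[None, :]
X_adv = np.clip(X_train + EPS * np.sign(grad), 0.0, 1.0)

# Augment: train on both clean and attacked samples with their true
# labels, so the model learns to resist the perturbation.
X_aug = np.concatenate([X_train, X_adv])
y_aug = np.concatenate([y_train, y_train])
print(X_aug.shape)  # (200, 16)
```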
A very clever solution from an NVIDIA team, called BaRT (the Barrage of Random Transforms), proposes applying a barrage of randomly chosen transformations to the data the neural network is trained on, making it difficult for an attacker to perform a black-box attack and helping the network correctly classify a manipulated video as malicious.
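A rough sketch of the idea behind BaRT, using a few simple numpy transforms chosen at random; the real barrage draws from dozens of much richer image transformations:

```python
import numpy as np

rng = np.random.default_rng(7)

# A small pool of simple image transforms (all toy stand-ins).
def add_noise(img):
    return np.clip(img + rng.normal(0, 5, img.shape), 0, 255)

def horizontal_flip(img):
    return img[:, ::-1, :]

def shift_brightness(img):
    return np.clip(img + rng.uniform(-20, 20), 0, 255)

TRANSFORMS = [add_noise, horizontal_flip, shift_brightness]

def barrage(img, k=2):
    """Apply k randomly chosen transforms in a random order."""
    for t in rng.choice(TRANSFORMS, size=k, replace=False):
        img = t(img)
    return img

frame = rng.uniform(0, 255, size=(256, 256, 3))
out = barrage(frame)
print(out.shape)  # (256, 256, 3)
```

Because the attacker cannot predict which transforms will be applied, noise crafted against one fixed input pipeline is much less likely to survive.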
Cleverhans, from TensorFlow, and ART (Adversarial Robustness Toolbox), from IBM, are libraries that provide a starting point, with examples, for learning more about these types of attacks on neural networks, as well as ways to defend our models and increase their robustness.
There are many places where attackers can exploit these techniques with significant impact: identity theft in facial recognition systems, tricking detectors of sexual or violent content on social networks, manipulating the traffic-sign recognition of autonomous vehicles, evading fake-news detectors, etc. Behind all these applications that we use daily are models that, like any system, can be vulnerable and have their behaviour disrupted.