Deep Learning vs Atari: train your AI to dominate classic videogames (Part III)

AI of Things    2 July, 2018

Written by Enrique Blanco (CDO Researcher) and Fran Ramírez (Security Researcher at Eleven Paths)

In this post, we detail the architectures chosen for our models, the logic that the agent follows during training, the results of the project, and our conclusions. This article concludes our Deep Learning and Reinforcement Learning experiment in games generated by OpenAI Gym. If you haven’t yet read the first two parts, we recommend doing so before continuing.

Convolutional Networks: The Architecture of our Model

As we have explained previously, our agent must use an appropriate control policy that allows us to satisfactorily approximate the Q(s, a) function in order to maximize the reward obtained from an action a in a state s. To deal with the complexity that results from combining many complex states, and to approximate the function, we need to apply Reinforcement Learning (RL) algorithms to Deep Neural Networks (DNNs). Networks trained in this way are known as Deep Q-Networks (DQNs).

For this type of training, the best neural networks to use are those called “Convolutional Neural Networks”. Throughout the history of Deep Learning, these networks have proven to be architectures that behave excellently when recognizing and learning patterns in images, as in the case of this White Paper. 

Deep Neural Networks take the pixel values of the frames they receive as input data. Generally, a Deep Neural Network begins with a layer whose dimensionality is similar to that of the input data and ends with a layer that reduces it to the size of the action space. The representations of the inputs become more abstract the deeper they go into the architecture, finishing in a dense final layer with a number of outputs equal to the size of the action space of the environment (4 in the case of Breakout-v0, 6 for SpaceInvaders-v0).

By having an architecture with several layers, we are able to extract structures that are difficult to identify in complex inputs. One should carefully choose the number of layers and the dimensionality of the architecture. As we will show later, increasing the number of layers in a model can become counterproductive if we are seeking to optimize the training time.

Although one would normally place pooling operations between the convolutional layers, note that in this case none have been included within the first three layers. Pooling makes the representations that the model learns insensitive to spatial position, so it becomes difficult for the network to determine the location of an object in the image. This characteristic is helpful when the exact location of the object is not important (as is the case when classifying images). However, in our case, the relative location of the objects in the game is a vital factor when determining which action to take in order to maximize the reward.

For Breakout-v0, we used an architecture with several convolutional layers, each of which takes the output of the previous layer and applies a rectified linear unit (ReLU) activation function.

  • The first convolutional layer is made up of 16 filters with a 3×3 kernel and a stride of 2
  • The second convolutional layer has 32 filters with a 3×3 kernel and a stride of 2
  • The third convolutional layer increases to 64 filters with a 3×3 kernel and a stride of 1

We then included the following layers, which use the same activation as the convolutional layers (except for the final layer):

  • A dense, fully-connected layer of 1024 units
  • Another dense, fully-connected layer of 516 units
  • A final output layer with 4 units (one unit for each possible action)

For SpaceInvaders-v0, we used a convolutional architecture similar to the one used in Breakout-v0:

  • The first convolutional layer is made up of 16 filters with a 3×3 kernel and a stride of 2
  • The second convolutional layer has 32 filters with a 3×3 kernel and a stride of 2
  • The third convolutional layer increases to 64 filters with a 3×3 kernel and a stride of 2

With the same activation function as the convolutional layers (except for the final layer), we also have:

  • A dense, fully-connected layer with 516 units
  • A final output layer with 6 units (one unit for each possible action)

The hyperparameters of this model are the same as those used for the Breakout-v0 environment.
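
To get a feel for how the strides above affect the size of each network, the following minimal sketch computes how many values reach the first dense layer. The 84×84 input size and the absence of padding are assumptions made purely for illustration (the article does not state either), and the function names are hypothetical.

```python
# Sketch only: the 84x84 input and the no-padding convention are assumptions.

def conv_output_size(size, kernel, stride):
    """Spatial size after one convolution without padding."""
    return (size - kernel) // stride + 1


def flattened_features(input_size, conv_layers):
    """Number of values handed to the first dense layer.

    conv_layers is a list of (filters, kernel, stride) tuples.
    """
    size = input_size
    filters = 0
    for filters, kernel, stride in conv_layers:
        size = conv_output_size(size, kernel, stride)
    return size * size * filters


# Layer specifications taken from the two lists above.
BREAKOUT_CONVS = [(16, 3, 2), (32, 3, 2), (64, 3, 1)]
SPACE_INVADERS_CONVS = [(16, 3, 2), (32, 3, 2), (64, 3, 2)]

INPUT_SIZE = 84  # hypothetical frame width and height, in pixels

breakout_features = flattened_features(INPUT_SIZE, BREAKOUT_CONVS)
space_invaders_features = flattened_features(INPUT_SIZE, SPACE_INVADERS_CONVS)
print(breakout_features, space_invaders_features)
```

Under these assumptions, the stride of 2 in the third SpaceInvaders-v0 layer cuts the flattened output to a quarter of the Breakout-v0 one, which, together with the single dense layer, makes the second model considerably smaller.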

Logic of the Training of the Agent

Broadly speaking, the algorithm we used proceeds as follows:

  1. Initialize all the Q-Values to around zero. To do so, we build the model by instantiating the neural network class dedicated to estimating the Q-Values for a given image of the game.
  2. Generate a game state from the class dedicated to preprocessing the images of the game environment. The structure of this class depends on the chosen strategy. With a single image of the environment, it is impossible to determine the direction of movement or the speed of the ball and the paddle. One alternative is to include a second processed image in the game state that shows the traces of the most recent movements in the game environment. Another, more direct alternative is to stack the previous four images of the game in the state, which allows the agent to deduce the direction, speed and acceleration of the elements in the game environment. Regardless of the chosen strategy, the aim of this class is to generate a state that can be fed to the model in order to obtain the Q-Values.
  3. Once the input is generated, it can be introduced into the model.
  4. Take either a random action with probability epsilon (ϵ) or one based on the highest Q-Value. This control policy is defined in the EpsilonGreedy class.
  5. Add the state obtained in the second step, the action taken, and the reward obtained to the ReplayMemory.
  6. When the ReplayMemory is full or sufficiently full, sweep backwards through the memory and update all the Q-Values according to the rewards obtained.
  7. Optimize the model by taking random batches from the ReplayMemory in order to improve the estimation of the Q-Values. Sampling at random helps to prevent overfitting during the initial phases of the training and encourages a good coverage of the possible states of the environment.
  8. Save a checkpoint of the model.
  9. Introduce a newly preprocessed image into the model and repeat from step 3.
Figure 1: Logic of the agent during the training stage in OpenAI Gym
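
Steps 4 to 7 above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the project's actual code: the class name ReplayMemory comes from the article, but its methods, the discount value and the handling of the last stored transition are all assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Step 4: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Steps 5 to 7: store transitions, back up rewards, sample batches."""

    def __init__(self, capacity, discount=0.97):
        self.capacity = capacity
        self.discount = discount  # hypothetical value, not from the article
        self.states = []
        self.q_values = []
        self.actions = []
        self.rewards = []
        self.episode_ends = []

    def add(self, state, q_values, action, reward, episode_end):
        """Step 5: remember what the agent saw, did and received."""
        self.states.append(state)
        self.q_values.append(list(q_values))
        self.actions.append(action)
        self.rewards.append(reward)
        self.episode_ends.append(episode_end)

    def is_full(self):
        return len(self.states) >= self.capacity

    def update_q_values(self):
        """Step 6: sweep backwards so rewards reach earlier states."""
        last = len(self.states) - 1
        for k in range(last, -1, -1):
            if self.episode_ends[k] or k == last:
                # No known successor state: the target is the reward alone.
                target = self.rewards[k]
            else:
                best_next = max(self.q_values[k + 1])
                target = self.rewards[k] + self.discount * best_next
            self.q_values[k][self.actions[k]] = target

    def random_batch(self, batch_size, rng=random):
        """Step 7: a random batch of (state, target Q-Values) pairs."""
        indices = rng.sample(range(len(self.states)), batch_size)
        return [(self.states[i], self.q_values[i]) for i in indices]
```

In a full training loop, the batches returned by random_batch would be used as inputs and regression targets for the neural network; that part is omitted here because it depends on the Deep Learning framework used.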

Results of the Training

To have a baseline against which to compare our agent, we took an alternative training run (with a modified architecture) carried out according to the first approach of [4]. This model was trained for up to 1.2e8 episodes and achieved very good agent performance.

For the second approach to the model input, we logged the following information for each episode of the training:

  • the reward obtained for each episode
  • the average reward of the last thirty episodes
  • the evolution of the Q-Values estimated by the model, including the maximum value of the Q-Values; the minimum value of the Q-Values; the average of the estimated Q-Values; the standard deviation of the Q-Values
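
The quantities listed above could be collected with a small helper such as the one below. It is a sketch under stated assumptions: the names RewardLog and q_value_summary are hypothetical, and the population standard deviation is used because the article does not say which variant was logged.

```python
from collections import deque
from statistics import mean, pstdev


def q_value_summary(q_values):
    """Maximum, minimum, mean and standard deviation of the Q-Values."""
    return {
        "q_max": max(q_values),
        "q_min": min(q_values),
        "q_mean": mean(q_values),
        "q_std": pstdev(q_values),  # population standard deviation (assumed)
    }


class RewardLog:
    """Keeps one row per episode plus a moving average of the reward."""

    def __init__(self, window=30):
        # The deque silently drops the oldest reward once it holds `window`.
        self.recent = deque(maxlen=window)
        self.rows = []

    def record(self, episode, reward, q_values):
        self.recent.append(reward)
        row = {
            "episode": episode,
            "reward": reward,
            "mean_reward": mean(self.recent),
        }
        row.update(q_value_summary(q_values))
        self.rows.append(row)
        return row
```

With window=30, the mean_reward field corresponds to the average reward of the last thirty episodes mentioned in the list.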

In the same way, we stored all the values of the relevant hyperparameters before starting to optimize the network:

  • ϵ
  • learning rate
  • the loss-limit allowed for each optimization of the network
  • the maximum number of epochs allowed for optimizing the model
  • the percentage of states in the ReplayMemory that produced a bad estimate of the Q-Values

The trend of the average Q-Values over the episodes is shown in the following graph, which plots the two models as separate lines so that they can be compared. As can be seen, the average Q-Values of the two models are similar, although they increase slightly more quickly with the “4 Frames Stacked” input model. The same happens with the evolution of the average reward and the training time. The learning of both models is similar, and they end up converging towards the end of our training.

Figure 2: Graph showing the average score of the previous 30 episodes (Breakout-v0)
Figure 3: Average Q-Values (Breakout-v0)

In terms of the scores for SpaceInvaders-v0, one can see how the agent learns during the first 2e4 episodes, but from then on the score plateaus at around 300 points on average.

The following graph shows the average Q-Values plotted against the number of episodes. One can see rapid growth up to 8e3 episodes. The growth continues after this first phase of the training, but less rapidly.

Figure 4: The average score of the previous 30 episodes (SpaceInvaders-v0)
Figure 5: Average Q-values (SpaceInvaders-v0)

The graph above (Figure 5) gives us a clearer idea of the difficulties that our agent had when learning to manage this environment. As can be seen, the evolution of the percentage of states from the ReplayMemory that led to an incorrect estimation of the Q-Values is not as good as expected. At the beginning of the training, this error percentage reached 72%, and it scarcely decreased as the agent explored the game. It is true that the relative drop from this point is drastic once the learning rate stabilizes and the decision-making policy becomes less random. However, the fact that the error rate does not fall below 50% does not inspire much confidence in the prediction capabilities of the model.

After letting our model train in the Breakout-v0 environment for almost 4e4 episodes, we considered the network to be sufficiently trained for us to test the capacity of the agent. We carried out 200 test episodes with ϵ = 0.05, in order to minimize the randomness of the actions taken. The maximum score that our AI obtained during the tests was 340 points, and the highest during the entire training was 361 points (obtained in episode 6983). These very high scores are achievable when our agent manages to open a tunnel in the layer of bricks, a strategy that would likely be used more often were we to continue the training. In the video below, you can watch the agent achieve its highest score.

For SpaceInvaders-v0, after leaving the model training for over 5e4 episodes, we decided to test the capabilities of the agent. We launched 200 test episodes with ϵ = 0.05, as in the case of Breakout-v0. The maximum score obtained was 715 points, and during the entire training the agent managed to score 1175 in episode 15187. Below, you can see an example of what our agent was capable of achieving in this environment.


Conclusions

We have shown how it is possible to train an AI or agent in the Breakout-v0 environment generated by OpenAI Gym. After a long training period, using 4 consecutive frames as the input data, we have been able to achieve an acceptable score for an AI in the environment, even surpassing the average scores obtained by other well-respected models.

We have reached the conclusion that it is advisable to carry out the training on a machine with a fair amount of memory and with a GPU.

We also trained the agent in the SpaceInvaders-v0 environment. For this final environment, we reduced the size of the architecture but followed the same input strategy that was used in Breakout-v0. In this case, the results were not bad, since the agent managed to score an average of 310 points, but the improvement over time was not as notable as it was in the case of Breakout-v0.

This brief project has left certain questions open, and some possible improvements have been noted, including:

  • It would be helpful to research architectures that are more efficient in order to accelerate the convergence of the solution. We have observed that accumulating an excessive number of layers slows down the training, despite equipping our machine with a GPU. One could explore the possibility of reducing the number of layers, in particular the dense layers at the end of the architecture
  • One could change the focus of the architecture, and make use of Double DQN (DDQN)
  • The modifications to the architecture and the values of the hyperparameters were minimal when testing the model in SpaceInvaders-v0. We ought not to forget that Space Invaders is a much more complex game than Breakout: the number of events that can cost you a life is higher, and you also start with two fewer lives
  • One could explore new hyperparameter values, e.g. reducing the batch size, the discount factor, the starting point of the learning rate, etc.
  • One could investigate the effect of slowing down the decay of the epsilon (ϵ) value
  • From the experience obtained in Breakout, it would be helpful to improve the control policy, removing the option to “FIRE” (action 1). This could save training time and avoid skewing the learning
  • For both environments, during the training, one could set a minimum percentage of actions that the agent may take
  • In SpaceInvaders-v0, one could eliminate the actions “NOOP”, “RIGHTFIRE” and “LEFTFIRE”, with the intention of improving the exploration of the environment and accelerating the learning process
  • One could attempt more aggressive pre-processing techniques, in particular for games that are more complex. There is a pending alternative to study, which could give good results and accelerate the processing speeds (even allowing us to include a larger number of frames in an individual state). This alternative is the use of Principal Component Analysis (PCA) to speed up the Machine Learning Algorithm. The application of this alternative would allow us to drastically reduce the dimensionality of the input, making it possible to reduce the number of layers and the size of each layer in the neural network.
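
As an illustration of one of the suggestions above, slowing down the decay of ϵ, the following sketch shows a linear decay schedule. The article does not specify which schedule was actually used, so the linear form, the function name and every numerical value here are hypothetical.

```python
def linear_epsilon(step, start=1.0, end=0.1, decay_steps=1_000_000):
    """Epsilon after `step` training steps, decaying linearly to `end`.

    All default values are hypothetical and chosen only for illustration.
    """
    # Fraction of the decay completed, clamped to the range [0, 1].
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return start + fraction * (end - start)


# Doubling decay_steps makes the agent explore for longer: at the same
# training step, the slower schedule still yields a larger epsilon.
faster = linear_epsilon(500_000, decay_steps=1_000_000)
slower = linear_epsilon(500_000, decay_steps=2_000_000)
print(faster, slower)
```

A slower schedule keeps the policy random for longer, which could help exploration in a more complex environment such as SpaceInvaders-v0, at the cost of a longer training.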
