Behavioral and Cognitive Robotics
An adaptive perspective

Stefano Nolfi

© Stefano Nolfi, 2021   |   How to cite this book   |   Send your feedback   |   Collaborate

Index Next Chapter

13. Learning to Train Robots through Evolutionary and Reinforcement Learning Methods

13.1 Introduction

This chapter explains how to train robots in simulation through evolutionary and reinforcement learning methods. The objective is that to enable the reader to get a hands-on experience with these methods and with the software tools used by researchers working in these areas.

A basic knowledge of the Python programming language is required. A basic knowledge of C++ is recommended. 

The chapter includes 15 exercises that can be completed in about 25 hours.

13.2 AI Gym

An useful tool which can be used to simulate the robot, the environment, and their interaction is AI-Gym ( Brockman et. al., 2016). The tool includes a library of ready-to-use adaptive problems (called environments), facilitates the implementation of new environments, and introduces a programming standard which permits to create algorithms that can be applied directly to any AI-Gym environment without modifications.

The ready-to-use environments include: (i) a set of classic control problems (e.g. the cart-pole problem, the pendulum, the acrobat, etc.), (ii) the MuJoCo locomotors problems involving multi-segment robots with actuated joints that can be trained for the ability to swim, jump or walk (Todorov, Erez and Tassa, 2012), (iii) a series of robotic environments based on the Fetch robotic arm and on the Shadow Hand robot, and (iv) the Atari 2600 games (Bellemare et. al., 2013). AI-Gym and the associated environments are available as free software with the exception of the MuJoCo dynamic simulator that is commercial (free one-year licenses are available for students only). Additional AI-Gym environments developed by third parties, including a variation of many of the problems based on MuJoCo, are available as free software (see below).

AI-Gym environments are implemented as sub-classes of the Gym environmental class that includes methods for determining the number and type of sensors and actuators, accessing the state of the sensors, setting the state of the actuators, and advancing the simulation of one step.

To start practicing with Gym, install the AI Gym package by following the instruction included in the page.

The principal commands available in Gym are summarized in the python script included below that: creates the CartPole-v0 environment (line 2), resets the position of the agent (line 3), and performs 200 simulation steps (line 4) during which renders graphically the agent and the environment (line 5) and sets the state of the actuator randomly (line 6).

The instructions included in the line 3-6 can be used to perform an evaluation episode. In concrete applications, however, the action should be determined by a neural network that receives as input the observation vector returned by the function env.step().

import gym
env = gym.make('CartPole-v0')
for _ in range(200):
    observation, reward, done, info = env.step(env.action_space.sample())

The “CartPole-v0” environment (indicated in line 2) consists of a cart that can move along the x axes with a pole attached on the top through a passive joint. The task of the agent consists in controlling the actuator of the cart, by applying a positive or a negative force with a fixed intensity along the x axis, to maintain the pole balanced. The agent is evaluated during episodes lasting up to 200 steps that should be terminated prematurely when the inclination of the pole exceeds a given maximum threshold. When this happens, the done variable returned from the function env.step() become True.

The gym.make(env-name) function creates the environment named env-name.

The reset() function initializes the position of the agent and eventually the properties of the environment. This function is usually called at the beginning of each evaluation episode. In the case of the “CartPole-v0” environment, it initializes the position of the cart in the center of the x axis with the pole inclined of an angle selected randomly within a given range. The function returns the observation vector.

The render() function displays graphically the agent and the environment.

The step(action-vector) function updates the state of the actuators of the agent on the basis of the values included in the action-vector, performs a simulation step, computes the reward for the step, and checks whether the episode should continue or should terminate. It returns the observation vector, the reward gained in the current step, a Boolean value that indicates whether the episode should continue or not, and a dictionary that might contain additional information. In the case of the “CartPole-v0” environment the observation includes the position and the velocity of the cart along the x axes, the angle of the pole, and the velocity of the tip of the pole. The reward is 1 when the inclination of the pole does not exceed the maximum threshold and 0 otherwise.

The action_space.sample() function returns a random action vector. In the case of the CartPole-v0 environment, the action vector includes a single number (1 or 2) that specifies one of the two possible actions: applying a torque to the cart toward the left or toward the right direction.

The type and the dimensionality of the observation and action vectors are stored in the observation_space and action_space variables. The observation and action spaces can be continuous (Box) or discrete (Discrete). For example, in the case of the  CartPole-v0 problem, the observation are continuous and are constituted by a vector of four real numbers. The action space is discrete and is constituted by an integer that can assume two alternative values. The number of elements forming a vector of continuous values or the number of alternative values that a discrete space can assume are indicated by the integers that follow the ‘Box’ or ‘Discrete’ labels.  The minimal and maximum values that the elements of the observation and action vectors can assume are stored in the observation_space.high, observation_space.low, action_space.high and action_space.low variables. See the example below:

import gym
env = gym.make('CartPole-v0')
#> Discrete(2)
#> Box(4,)
#> array([ 2.4       ,         inf,  0.20943951,         inf])
#> array([-2.4       ,        -inf, -0.20943951,        -inf])

The implementation of the Gym functions (i.e. step(), reset(), render() etc.) is contained in the Python file of the environment. For example, the implementation of the “CarPole-v0” environment is included in the file.

Exercise 1. Familiarize with AI-Gym commands by using the “Cartpole-v0” environment and the other classic control environments.  Exploit the possibility to execute one command at a time and to print variables to familiarize with the structure of the data. Inspect the and files and verify how the reward is calculated. Clearly agents performing random actions are not able to solve their task. The agents should determine their action appropriately on the basis of the current observation and eventually of the previous observation/internal/action states. We will see how to realize that in the next sections.

13.3 Implement a neural network policy

This section illustrates how to implement a neural network policy which determines the action of the agent on the basis of the agent’s observation. The network will be constituted by a 3-layers feed-forward neural network in which the first layer includes the sensory neurons encoding the observation, the second layer includes internal neurons, and the third layer includes the motor neurons encoding the action. Each sensory neurons is connected to each internal neuron and each internal neuron is connected to each motor neuron. The activation of the internal and motor neurons is computed with the tanh() function. We assume that the observation is constituted by a vector of real numbers. The number of sensory neurons correspond to the length of the observation vector. Moreover, we assume that the action is constituted by a vector of real numbers or by an integer in the range [1, amax]. In the former case the number of motor neurons correspond to the length of the action vector. In the latter case, the network includes amax motor neurons and the action is set to the index of the most activated motor neuron. In the case of the CartPole-v0 environment, the observation vector is composed of four floating point values and the action vector is constituted by an integer in the range [1, 2]. Consequently the network will include four sensory neurons and two motor neurons.

Exercise 2. Implement the neural network controller for an AI-Gym environment. Initialize the network with random parameters. You can use the following code to create the variables required and to initialize the connection weights and the biases parameters.

import gym
import numpy as np
env = gym.make('CartPole-v0')

pvariance = 0.1     # variance of initial parameters
ppvariance = 0.02   # variance of perturbations
nhiddens = 5        # number of internal neurons
# the number of inputs and outputs depends on the problem
# we assume that observations consist of vectors of continuous value
# and that actions can be vectors of continuous values or discrete actions
ninputs = env.observation_space.shape[0]
if (isinstance(env.action_space,
    noutputs = env.action_space.shape[0]
    noutputs = env.action_space.n

# initialize the training parameters randomly by using a gaussian
# distribution with average 0.0 and variance 0.1
# biases (thresholds) are initialized to 0.0
W1 = np.random.randn(nhiddens,ninputs) * pvariance      # first connection layer
W2 = np.random.randn(noutputs, nhiddens) * pvariance    # second connection layer
b1 = np.zeros(shape=(nhiddens, 1))                      # bias internal neurons
b2 = np.zeros(shape=(noutputs, 1))                      # bias motor neurons

During each step, the state of the first layer of neurons corresponds to the observation. The state of the second and third layer of neurons, instead, should be computed by propagating the activation through the connection weights from the sensory neurons to the internal neurons and then, from the internal neurons to the motor neurons. The activation state of the motor neurons is used to determine the action vector. You can use the following code to create the neural network and to compute the activation of the neurons.

# convert the observation array into a matrix with 1 column and ninputs rows
# compute the netinput of the first layer of neurons
Z1 =, observation) + b1
# compute the activation of the first layer of neurons with the tanh function
A1 = np.tanh(Z1)
# compute the netinput of the second layer of neurons
Z2 =, A1) + b2
# compute the activation of the second layer of neurons with the tanh function
A2 = np.tanh(Z2)
# if the action is discrete
#  select the action that corresponds to the most activated unit
if (isinstance(env.action_space,
    action = A2
    action = np.argmax(A2)

Finally, evaluate the agent for multiple evaluation episodes (e.g. 10 episodes) lasting up to 200 steps and compute the fitness by summing the reward collected during all steps and by averaging the fitness for the number of episodes. Episodes should terminate prematurely when the done value returned by the env.step() function is True.

13.4 Implement a steady state evolutionary strategy

This section illustrates how to implement the steady state evolutionary strategy (Pagliuca, Milano and Nolfi, 2018) illustrated in the pseudocode included below and how to use it to train the parameters of the neural network policy implemented in the previous section.

The algorithm starts by initializing the random seed (line 1) and the parameters of the population randomly (line 2). The population (θ) consists of a matrix of λ vectors of length p, where p corresponds to the number of parameters, i.e. the number of connection weights and biases of the neural network. Each vector thus encodes the parameters of a corresponding individual (genotype). Subsequently, for a certain number of generations, the algorithm measures the fitness of the individuals (line 5), ranks the individuals of the population on the basis of their fitness (line 6), and replaces the parameters of the worse λ2 individuals with a copy with variations of the λ2 fittest individuals (line 9). Variations consists of a vector of random numbers of length p generated with a normal distribution with average 0 and variance σ. The fitness is measured by: (i) creating the policy network of the robot with the parameters specified in the vector, (ii) allowing the robot to interact with its environment for one or more evaluation episodes, and (iii) storing the sum of the rewards collected during evaluation steps normalized for the number of episodes.

When the population includes only 2 individuals, the algorithm is equivalent to a stochastic hill-climber that operates on a single candidate solution by: (i) adding random variation to the parameters, and (ii) by retaining or discarding the variations depending on whether they produce an increase or a decrease of the fitness, respectively. The usage of larger populations, however, generally leads to better results.

Exercise 3. Implement the evolutionary strategy described above and use it to train the neural network policy of an agent on the CartPole-v0 environment. Initialize the population with random values extracted with a Gaussian distribution with average 0.0 and standard deviation 0.1. Initialize the parameters encoding the biases to 0.0. Run the evolutionary process few times, by varying the seed of the random numbers generator. Check whether the evolving robots develop an ability to solve the problem by rendering graphically the behavior of evolved agents and analyze how performance vary in different replications of the experiment. Finally, test your program on other classic problem, e.g. the Pendulum-v0.

13.5 Evorobotpy2: A software tool for evolving robots

Evorobotpy2 ( is a software library implemented by the author and by Paolo Pagliuca that permits to evolve robots for AI-Gym environments. It is an extended version of the Evorobot and Farsa software tools, developed over many years (Nolfi, 2000; Nolfi and Gigliotta, 2010; Massera et al., 2013; Massera et al., 2014). Evorobotpy2 is characterized by simplicity and computational efficiency, includes additional environments complementary to those available in AI-Gym, and run on Linux, Mac OS, and Windows.

Evorobotpy2 requires Python3, the cython, pyglet and matplotlib python packages, a command line C++ compiler, and the gsl GNU Scientific Library available from: To install the gsl libraries download and un-compress the installation file, enters in the directory, and configure, compile and install the libraries with the commands “./configure”, “make”, and “make install”. You need root privileges to install the library for all users. The default installation directory might vary depending on the operating system and on the version of the library. Consequently, keep note of the installation directory and be ready to correct the include_gsl_dir variable specified in the setup*.py files, when you compile C++ modules of evorobotpy2.

The ./evorobotpy2/bin folder contains the python scripts that can be used to run evolutionary experiments and to analyze the results. The ./evorobotpy2/lib folder includes a series of C++ libraries wrapped in Python. More specifically, the net module that permits to create and update neural network policies and a series of environments illustrated in the next sections. The evorobotpy2/pybullet folder includes files for customizing the Pybullet environments and other utilities illustrated in Section 13.11. The evorobotpy2/exercises folder includes source codes samples for the exercises. The folders starting with the letter x (e.g. evorobotpy2/xacrobot) include initialization files with recommended hyperparameters.

The source code of the net module (evonet.cpp, evonet.h, utilities.cpp, utilities.h, net.pxd, net.pyx, and can be compiled and installed from the evorobotpy2/lib folder with the following commands:

python3 build_ext --inplace
cp net*.so ../bin # or cp net*.dll ../bin in Windows

In case of compilation errors, please check and eventually correct the link to the folder that contains the gsl library in the file The compiled net*.so or net*.dll library files should be copied in the evorobotpy2/bin folder. For more information on the net library see Section 13.18

The evorobotpy2/bin/ script permits to evolve and test robots. It uses the net module and the following python scripts: (i) evorobotpy2/bin/ that creates and update the policy network, (ii) evorobotpy2/bin/ that contains an implementation of the OpenAI-ES algorithm (Salimans et al., 2017; Section 6.2.2), and (iii) evorobotpy2/bin/ that includes functions for saving and loading data. For an explanation of the implementation of the OpenAI-ES algorithm and for directions on the usage of the multi-core version of the algorithm see Section 13.9.

To evolve the policy network for one of the problems listed in the ‘x’ folders, e.g. the xacrobot folder, enter in the folder and run the command:

python3 ../bin/ -f acrobot.ini -s 11

The program parameter -f is mandatory and should be followed by the name of a text file containing the setting of the hyperparameters, i.e. the file acrobot.ini in this case. The program parameter -s specifies the seed of the random number generator. The seed number is also used to differentiate the name of the files containing evolved robots and associated performance data. To visualize the explanation of program parameters and hyperparameters execute the script without specifying any program parameter.

The program will start to evolve a policy for the Acrobot problem with the hyperparameters specified in the evorobotpy2/xacrobot/acrobot.ini file. During the evolutionary process, the program displays after each generation the fraction of evaluation steps performed to date, the best fitness obtained to date, the best fitness obtained during post-evaluation to date, the fitness of the best individual of the current generation, the average fitness of the population, and the average absolute value of parameters. Moreover, during the evolution process, the script saves the parameters of the individuals that achieved the best fitness during evaluations and post-evaluations in the files bestSs.npy and bestgSs.npy, respectively, where s corresponds to the number of the seed. Moreover, the program saves performance data across generations in the file statSs.npy and FitSs.txt.

The behavior of an evolved agent can be observed by running the following command from the ./evorobotpy2/xacrobot folder:

python3 ../bin/ -f acrobot.ini -g bestgS11.npy -p

where the program parameter -g is used to indicate the name of the file containing the genotype of the evolved robot and -p is used to post-evaluate the robot by graphically rendering the robot's behavior. To display the behavior of a robot at the beginning of the evolutionary process specify the program parameter -p only, i.e. do not specify the file containing the genotype to be loaded.

To display the fitness throughout generations and the average and standard deviation of performance among multiple runs, you can use the following commands:

python3 ../bin/
python3 ../bin/

Exercise 4. Run few replications of the acrobot experiment by using different seeds. You can use the evorobotpy2/acrobot.ini file to set the hyperparameters. While the program is running check the implementation of the Acrobot environment from to figure out the content of the observation vectors, the content of the action vector, and the way in which the reward is calculated. Plot the fitness across generations and then observe the behavior of the best evolved robot stored in the bestgSs.npy file, where s is the seed number. You might want to load this file instead than the BestSs.npy file since the robot that achieved the best performance during the post-evaluation is usually better than the robot that achieved the best performance during the normal evaluation.

13.6 Evolve robots with co-adapted body and brain through Evorobotpy2 and Box2D

In this section we will see how to evolve the morphology and the brain of robots selected for the ability to walk with evorbotpy2. Moreover, we will briefly introduce the PyBox2D dynamic simulator (

The co-evolution of body and brain can be realized in different manners (see Chapter 3). Here we use the method described in Pagliuca and Nolfi (2020) in which the body is formed by a fixed number of elementary body elements with variable properties. The genotype of the evolving individuals consists of a vector of floating point values with a fixed length which includes the morphological parameters and the connection weights and biases of the robot’s neural network.

Video 13.1. An example of robot with coevolved body and brain (,

More specifically, the body of the robot is composed by 11 cylindrical segments attached in sequence through actuated hinge joints (see Video 13.1). Each joint has a default angle, which determines the intermediate angular orientation of each element with respect to its previous element, and two angular limits, which determine the minimum and maximum rotation angles of the joint. The morphological parameters include, for each element, the length, the diameter, the default angle with respect to the previous element, and the clockwise and anti-clockwise maximum rotation angles with respect to the default angle.

The observation of the robot includes the position angle and the angular velocity of each joint, the state of a contact sensor located on each body element, and the velocity and the absolute angle of the intermediate element. The action vector encodes the torque applied to each joint.

The implementation of this environment is contained in the file evorobotpy2/bin/ and is based on the PyBox2D simulator which is normally installed together with AI-Gym. If not, see installation instructions in

The robot, the environment, and the interaction between the two can be simulated in different manners. In simple cases, one can implement a simulation from scratch by using the Python programming language. This is the method used to implement the classic control problems in AI-Gym, i.e. the cart-pole, the pendulum, the acrobat etc. Alternatively, the  simulation can be realized by using a more efficient programming language wrapped in python. This is the method used for the environments included evorobotpy2 that are implemented in C++ and wrapped in Python. In more complex cases, however, the usage of a physical simulation library is strongly recommended. These libraries take care of the simulation of Newtonian physics and of collisions. Consequently, they reduce the programming burden to the description of the body elements composing the robot and the environment.  Box2D is a fast and easy to use dynamical simulation library restricted to 2D environments. We will introduce a tool that permits to simulated 3D systems in Section 13.11.

Exercise 5. Run few replications of the ./evorobotpy2/xbodybrain experiment. Observe the behavior and the morphology of the evolved agents. Describe how they vary in different replications. Analyze whether the morphological and behavioral properties are co-adapted, i.e. whether the morphological characteristics of the evolved agents are adapted to the characteristics of the controller and vice versa. Check the file ./evorobotpy2/bin/ to verify how the agent is implemented by using the Box2D primitives and how the morphological parameters are encoded in the genotype of the evolving robots.

13.7 Evolving wheeled robots by using a fast kinematic simulator

In this Section we will see how to carry on experiments with wheeled robots by using the fast kinematic simulator included in evorobotpy2. More specifically, we will use this software to replicate the experiment of the foraging robot described in Section 4.4.

Dynamical simulators, like Box2D and Pybullet, are necessary for simulating accurately the dynamics of articulated robots and the effects of collisions. In the case of wheeled robots moving on a flat surface, instead, kinematic simulations can suffice. Kinematic are much faster than dynamic simulations and consequently permits to reduce considerably the time required to evolve the robots. For this reason, evorobotpy2 includes a set of environments involving wheeled robots that rely on kinematic simulations implemented in C++. The environment that we will consider here is ErDiscrim environment which involves a Khepera robot (Mondada, Franzi & Ienne, 1993), i.e. a small wheeled robot with infrared sensors and two motorized wheels. The robot is situated on a flat arena surrounded by walls which contains a cylindrical object located in a random selected position. The task of the robot consists in finding and remaining near the cylinder.

The source code of the experiment (discrim.cpp, discrim.h, utilities.cpp, utilities.h, robot-env.cpp, robot-env.h, ErDiscrim.pxd, ErDiscrim.pyx, and can be compiled and installed from the ./evorobotpy2//lib folder with the following commands:

python3 build_ext --inplace
cp ErDiscrim*.so ../bin # or cp ErDiscrim*.dll ../bin

In case of compilation errors, please check and eventually correct the name of the directory of your GSL library in the file

Give a look to the source files to understand how the environment is simulated. The discrim.cpp and discrim.h files include the definition of the gym functions (i.e. env-reset(), env.step(), env.render() ext.) and the calculation of the reward. The robot-env.cpp and robot-env.h files include a set of functions that simulate the translation and rotation movements of the robot over the plane, the sensors of the robot, and the occurrence of collisions. Evaluation episodes are terminated prematurely when a collision occur. Consequently, there is no need to model the effect of collisions in detail. The robot-env.cpp and robot-env.h files permit to simulate also two other wheeled robots: the ePuck (Mondada et al., 2009) and the MarXbot (Bonani et al., 2010).

Exercise 6. Run 10 replications of the experiment from the ./evorobotpy2/xdiscrim folder by using different seeds. To speed-up the training process you can run 2 instances of the program in parallel or eventually more, depending on the cores available on your computer. Inspect the behavior displayed by the best evolved robot of each replication. Group the evolved robots in families relying on similar strategies and describe how the different strategies approach the problem. In this experiment the robot is provided with a LSTM network (see Section 10.3). Describe how the evolved robots use memory to vary the way in which they react to the current observations.

13.8 Plot the phase portrait of an agent evolved to balance a pendulum

This Section describes how to plot the phase portrait of an actuated pendulum controlled by a neural network with random and evolved connection weights.

As in the case of Figure 5.6, we should plot a two-dimensional state space in which the horizontal and vertical axes indicate the angle and the velocity of the pendulum. To generate the data to be plotted we can use the AI-Gym simulator of the actuated pendulum included in the file ./gym/gym/envs/classic_control/ More precisely, we can generate the data by evaluating the pendulum with its neural network controller for few episodes and by storing the angular position and velocity of the pendulum during each step in a file. We can then plot the data by writing a python program that loads the data from the file and plots, for each couple of succeeding steps, an arrow that indicates the variation of the angular position and velocity of the pendulum. 

You can generate multiple plots displaying the phase portraits of the agent at different stages of the evolutionary process, by starting from the agent of generation 0 that has randomly generated connection weights. To evolve the pendulum agent you can use the .ini file included in the ./evorobotpy2/xpendulum folder.

When you run an evolutionary process, evorobotpy2 save the parameters of the agent that achieved the best performance during post-evaluation to date in the file bestgSs.npy. This implies that the best agents of previous generations are overwritten. To obtain the best agent after a certain number of generations, you can make a copy of the bestgSs.npy file immediately after evorobotpy2 processes the relevant generation or you can extend evorobotpy2 with instructions for saving the best genotypes every n steps or every n generations.

Exercise 7. Plot the phase portrait of the actuated pendulum. Compare the phase portrait of an agent at different stages of the evolutionary process. Describe how the dynamics displayed by the agent at the beginning of the evolutionary process is altered throughout generations in a way that permits to bring up and balance the pendulum.

13.9 The Evorobotpy2 implementation of the OpenAI-ES algorithm

In this section we will analyze the evorobotpy2 implementation of the OpenAI-ES algorithm (Salimans et al., 2017; Section 6.2.2).

The single thread implementation of the algorithm is included in the file ./bin/ The generation and evaluation of the population is performed in the evaluate() method shown below.

def evaluate(self):
        cseed = self.seed + self.cgen * self.batchSize  # Set the seed for current generation = np.random.RandomState(cseed)
        self.samples =, self.nparams)
        self.cgen += 1

        # evaluate samples
        candidate = np.arange(self.nparams, dtype=np.float64)
        for b in range(self.batchSize):
            for bb in range(2):
                if (bb == 0):
                    candidate = + self.samples[b,:] * self.noiseStdDev
                    candidate = - self.samples[b,:] * self.noiseStdDev
                self.policy.nn.normphase(0) # normalization data is collected during the post-evaluation of the best sample
                eval_rews, eval_length = self.policy.rollout(self.policy.ntrials, seed=(self.seed + (self.cgen * self.batchSize) + b))
                self.samplefitness[b*2+bb] = eval_rews
                self.steps += eval_length

        fitness, self.index = ascendent_sort(self.samplefitness)       # sort the fitness
        self.avgfit = np.average(fitness)                         # compute the average fitness

        self.bfit = fitness[(self.batchSize * 2) - 1]
        bidx = self.index[(self.batchSize * 2) - 1]
        if ((bidx % 2) == 0):                                     # regenerate the genotype of the best samples
            bestid = int(bidx / 2)
            self.bestsol = + self.samples[bestid] * self.noiseStdDev
            bestid = int(bidx / 2)
            self.bestsol = - self.samples[bestid] * self.noiseStdDev

        self.updateBest(self.bfit, self.bestsol)                  # Stored if it is the best obtained so far
        # postevaluate best sample of the last generation
        # in this is done the next generation, move this section before the section "evaluate samples"
        gfit = 0                                                                   # to produce identical results
        if self.bestsol is not None:
            self.tnormepisodes += self.inormepisodes
            for t in range(self.policy.nttrials):
                if self.policy.normalize == 1 and self.normepisodes < self.tnormepisodes:
                    self.normepisodes += 1  # we collect normalization data
                    self.normalizationdatacollected = True
                eval_rews, eval_length = self.policy.rollout(1, seed=(self.seed + 100000 + t))
                gfit += eval_rews
                self.steps += eval_length
            gfit /= self.policy.nttrials
            self.updateBestg(gfit, self.bestsol)

Line 117 creates the matrix of random numbers that constitutes the population. The sizes of the matrix is determined by the batchSize hyperparameter, that encodes the number of individual forming the population, and by nparams, that encodes the number of parameters of each individual. Lines 125 and 127 create symmetrical offspring, i.e. two copies of the parameters of the parent (center) perturbed by adding and by subtracting a vector of random numbers (samples[b]). The perturbation vectors consists of random numbers generated with a standard gaussian distribution normalized on the basis of the noiseStdDev hyperparameter (that is usually set to 0.02). The genotype of each individual is used to set the parameters of the corresponding neural network controller and eventually to set the parameters of the corresponding body through the function set_trainable_flat() (line 128). The individuals are then evaluated in line 130 through the policy.rollout() function. Finally, the vector containing the fitness of the population is sorted in an ascending order in line 134.

The remaining part of the evaluate method is used to post-evaluate the best offspring and eventually to update the vectors containing the best individual and the best post-evaluated individual obtained to date.

The calculation of the gradient and the update of the center of the population is performed in the optimize() method shown below.

def optimize(self):
        popsize = self.batchSize * 2      # compute a vector of utilities [-0.5,0.5]
        utilities = zeros(popsize)
        for i in range(popsize):
            utilities[self.index[i]] = i
        utilities /= (popsize - 1)
        utilities -= 0.5
        weights = zeros(self.batchSize)   # Assign the weights (utility) to samples on the basis of their fitness rank
        for i in range(self.batchSize):
            idx = 2 * i
            weights[i] = (utilities[idx] - utilities[idx + 1])    # merge the utility of symmetric samples

        g = 0.0
        i = 0
        while i < self.batchSize:         # Compute the gradient (the dot product of the samples for their utilities)
            gsize = -1
            if self.batchSize - i < 500:  # if the popsize is larger than 500, compute the gradient for each 500
                gsize = self.batchSize - i
                gsize = 500
            g += dot(weights[i:i + gsize], self.samples[i:i + gsize,:])
            i += gsize
        g /= popsize                      # normalize the gradient for the popsize
        if self.wdecay == 1:
            globalg = -g + 0.005 * # apply weight decay
            globalg = -g

        # adam stochastic optimizer
        a = self.stepsize * sqrt(1.0 - self.beta2 ** self.cgen) / (1.0 - self.beta1 ** self.cgen)
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * globalg
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * (globalg * globalg)
        dCenter = -a * self.m / (sqrt(self.v) + self.epsilon)
	 += dCenter                                    # move the center in the direction of the momentum vectors
        self.avecenter = np.average(np.absolute(

The instructions included in lines 170-175 create a vector of utilities of size batchSize ranging from -0.5 to 0.5. Lines 177-180 compute the weights that determine the contribution of each perturbation vector to the estimated gradient of the expected fitness. Notice that the weight is calculated for each perturbation vectors by summing the utilities associated to the two symmetrical individuals generated by adding and subtracting the perturbation vector, respectively. The gradient of the expected fitness is calculated in line 190 by performing the average dot product of the weights and of the perturbation vectors (samples). The center of the population is then updated in line 205 on the basis of the Adam stochastic optimizer that is implemented in lines 200-203. Notice that the center of the population is not updated directly on the basis of the gradient estimated in the current generation but rather on the basis of the m and v momentum vectors that are updated on the basis of the gradient estimated in each generation. Notice also how the stepsize used to update the center of the population (a) starts small and reaches the value indicated by the stepsize hyperparameter after 1000 generations. The initial reduction of the stepsize is used by Adam to reduce the effect of the bias caused by the fact that the momentum vectors are initialized to null values, i.e. to an arbitrary value. The weight decay normalization is performed in line 195. The virtual batch normalization, instead, is implemented in the net library, i.e. in the library that is used to set the activation of the sensory neurons and to calculate the activation of the internal and motor neurons on the basis of the parameters encoded in the genotype.

At each generation, the algorithm print a text line with the number of the seed, the generation, the number of steps performed so far, the fitness of the best individual of the current generation, the fitness obtained by the best individual during the post-evaluation, the best fitness achieved so far, the best fitness achieved during post-evaluations so far, and the average absolute size of the parameters (see line 235 of the file ./bin/

Exercise 8. Run few evolutionary experiments by using the classic control problems. Verify the stability of the Openai-ES evolutionary algorithm, i.e. whether the fitness keep increasing monotonically over generations, whether it increases over generations by recovering from temporary and moderate regression phases, or whether it increases over generation while eventually displaying severe regression phases. For this purpose use the data printed during each generation or the data stored in the statSn.npy files which can be visualized with the python ../bin/ command.

To use the parallel implementation of the algorithm you should run the program by using the -w program parameter that indicates the number of workers that you want to use. The workers evaluate the members of the population in parallel. This implies that you can reduce the time necessary to complete the evolutionary process approximately n times by using n workers.

To determine the number of workers you should consider: (i) the maximum number of cores available in your machine, (ii) the size of the population, (iii) the fact that one worker is used to post-evaluate the best individual of the previous generation. For example, if the size of the population is 40 and your machine has 6 cores, you can set the number of workers to 5 so that each of the four workers evaluate 10 individuals and one worker post-evaluate the best individual of the last generation. Instead, if your machine has 48 cores, you can set the number of workers to 41.

13.10 Setting Evorobotpy2 hyperparameters

Evolutionary algorithms generally require to set few hyperparameters (i.e. the population size, the mutation rate and few other hyperparameters) and are robust with respect to hyperparameter setting, i.e. tend to work reasonably well also when the value of the hyperparameters is not optimized. Setting-up an adaptive experiment, however, requires to set also additional hyperparameters that determine the architecture of the neural network and the conditions in which the adapting agents are evaluated. The setting of these additional parameters is often critical.

For many standard problems you will find recommended hyperparameters in the .ini text files included in the experimental folders of evorobotpy2, i.e. the folder starting with the letter ‘x’.

Evorobotpy2 hyperparameters are divided in three sections that include the name of the environment (EXP section), the characteristics of the policy network and the way in which agents are evaluated (POLICY section), the evolutionary algorithm used and the associated hyperparameters (ALGO section). You can print the list of all hyperparameters and the associated default values by executing the ./evorobotpy2/ script without specifying any program parameter.

The evolutionary algorithm can be specified with the hyperparameter algo. The algorithms available in evorobotpy2 are: OpenAI-ES, i.e. a single-thread and a parallel implementation of the OpenAI-ES algorithm (one of the most effective state of the art algorithms, see Salimans et. al. 2017, Pagliuca, Milano and Nolfi, 2020), SSS, i.e. the stochastic steady state algorithm described in Section 13.4 (see also Pagliuca and Nolfi, 1998), and coevo2, i.e. a competitive co-evolutionary algorithm (see Section 13.14 and Simione and Nolfi, 2020).

Commonly used policies include: feed-forward multilayer networks with one or more internal layers, recurrent networks, long short term memory network (LSTM), and convolutional networks. Feed-forward multilayer networks constitute the default choice. Recurrent and LSTM networks should be used in problems that benefit from the possibility to determine the current action vector also on the basis of previous observations/internal/action states. Unfortunately, as we discussed in Chapter 4, identifying whether a problem requires a recurrent network or not is far from trivial since problems often admit qualitatively different solutions that are difficult to identify from the experimenter in advance. Consequently, you should rely on your intuition or you should verify experimentally whether the usage of recurrent or LSTM networks is advantageous. In general LSTM networks works better than simple recurrent neural networks. Convolutional networks are generally useful in problems that require to process visual information. You should rely on your intuition also to determine the number of layers and the number of internal neurons per layer by considering that the optimal number of layers and of neurons correlates with the size of the action and observation vector and with the complexity of the problem. The type of the policy, the number of layers and the number of internal neurons per layer can be specified with the hyperparameters architecture, nlayers and nhiddens. The activation functions of the internal and motor neurons can be set with the afunction hyperparameter. You can specify the activation function of the motor neurons with the out_type hyperparameter, if it differs from the activation function of the internal neurons. The tanh function can be considered the default choice for the activation function. The relu function usually works better in combination with convolutionary networks. For the motor neurons, the linear function is often the better choice since it permits to compute a gradient for all activation ranges. The advantage of using the tanh function is that it naturally squash the activation of the motors within a bounded range. The winit hyperparameter determines the connection weights initialization method. The xavier method (Glorot & Bengio, 2010) is usually a suitable choice in combination with tanh neurons. The norm incoming method (see Salimans at al., 2017) works well in combination with linear motor neurons.

The action_noise and action_noise_range hyperparameters can be used to apply noise to the motor neurons. The addition of noise permits to increase the robustness of the evolving solutions. Usually the addition of a small amount of noise with a fixed distribution works well. However, the usage of a diagonal gaussian policy method, in which the distribution of noise applied to each motor neuron is encoded in a genetically encoded parameter, can be beneficial for problems in which exploration play an important role. This method implies that the amount of noise affecting each motor is initially high and can vary during the course of the evolutionary process.

The normalized hyperparameter normalizes the distribution of the value of sensors through the virtual batch normalization method (Salimans at al., 2017). This form of normalization is useful when the range of sensors is highly variable. An example of problem of this kind are the problems that include velocity sensors. Indeed, the activation values that velocity sensors can assume can vary widely during the course of the evolutionary process.

The maxsteps hyperparameter determines the maximal duration of evaluation episodes in steps. This value should be sufficiently high to evaluate the relative efficacy of learning agents including agents displaying poor performance.

The episodes hyperparameter determines the number of evaluation episodes. Evaluating the robots for multiple episodes permits to maximize the robustness of the robots with respect to environmental variations and is particularly important in problems in which the environmental conditions vary significantly among evaluation episodes. In the case of modern evolutionary strategies, like the OpenAI-ES, a similar effect can be obtained by using large population sizes.

The pepisodes hyperparameter determines the number of episodes used to post-evaluate the best individual of each generation. The higher the number of post-evaluation episodes, the higher the chance that the individual selected as the best is the truly best.

Finally, for what concerns the parameters of the evolutionary algorithm, the hyperparameter maxmsteps determines the duration of the evolutionary process in million steps. The minimal duration necessary to discover effective solution depends on the complexity of the problem. The duration can also be expressed in number of generations. However, the total number of steps correlates better with the actual duration and the computational cost of the evolutionary process.

The hyperparameters of the evolutionary algorithm depends on the algorithm chosen. In the case of the OpenAI-ES algorithm, the size of the population can be set with the sampleSize hyperparameter. Since each sample is used to generate two symmetrical offspring, the actual population size is the double of sampleSize.  A suitable value for this hyperparameter is 250 but it can be reduced up to 20 for problems that are not particularly difficult. The variance of perturbations is encoded in the noiseStdDev hyperparameter. The default value of this hyperparameter (0.02) works well for most of the problems. The amplitude of the step in the direction of the gradient is determined by the stepsize hyperparameters. Also in this case the default value of 0.01 works well in the majority of the cases. The wdecay hyperparameter determines the application of the weight decay normalization, that is usually beneficial.

In the case of the SSS algorithm, instead, the population size and the amplitude of perturbations can be set through the popsize and mutrate hyperparameters. The default value of 20 and 0.02 works well in the majority of the cases.

13.11 Pybullet Locomotors

In this section we introduce the Pybullet dynamic simulator (, Coumans & Yunfei, 2016-2019) and the locomotors problems, an interesting family of environments that are widely used to benchmark adaptive algorithms.

Pybullet is a powerful 3D dynamical simulator that comes as a free software and can be used to carry on experiments with several interesting robotic platforms including the Ghost Minitaur quadruped (, KUKA robotics arms (,  the Atlas humanoid robot (, the iCub Humanoid robot (Sandini et al., 2004), the Baxter robot (, the Franka Emika Panda manipulator (, the Cassie biped robot (, the Laikago quadruped robot (, the MIT racecar (, and the Turtlebot wheeled robot (

Pybullet also includes additional AI-Gym environments. A particularly interesting family of these environments is constituted by the locomotors, a series of robots with varying shapes (ant-like, quadruped, humanoid ext.) that can be trained for the ability to swim, jump, and walk. They are a more realistic implementation of the original locomotors implemented with the MuJoCO dynamic simulator (Todorov, Erez and Tassa, 2012) included in the AI-Gym library.

Follow the instructions included in to install the library.

You can familiarize with the morphology of the locomotors and with their actuators by running the following command from the ./evorobot2/pybullet/showrobot folder:


Follow the instructions displayed by the program to load one of the locomotors model and to manually control its actuators so to familiarize with the actuators and the morphology of the robots. Notice that some locomotors, e.g. the Ant and the Humanoid, operate in a real 3D environment while other, e.g. the Hopper and the Halfcheetah, are constrained on the third dimension through fake joints.

The version 0 of the locomotors (e.g. HopperBulletEnv-v0, AntBulletEnv-v0 ext.) included in the Pybullet installation has reward functions optimized for reinforcement learning algorithm. To use the Version 5  (e.g. HopperBulletEnv-v5, AntBulletEnv-v5 ext.) with reward functions optimized for evolutionary strategies (see Pagliuca, Milano and Nolfi, 2020) you should: (1) add the registration instruction included in the evorobotpy2/pybullet/ file in the <pybullet-folder>/pybullet_envs/ file,  and (ii) copy  the files evorobotpy2/pybullet/ and  evorobotpy2/pybullet/ in the <pybullet-folder>/pybullet_envs/ folder. The first modification is required to register the version 5 of the problems. The two additional files contains the implementation of the version 5 o the locomotors with reward functions optimized for evolutionary algorithms.

Observe the behavior displayed by robots pre-evolved with evorobotpy2 by loading the bestgSs.npy files included in the evorobotpy2/xhopper, evorobotpy2/xhalfcheetah, evorobotpy2/xant, evorobotpy2/xwalker and evorobotpy2/xhumanoid folders.

Exercise 9. Inspect the and files and/or check (Pagliuca, Milano and Nolfi, 2020) to identify the differences between the reward function optimized for evolutionary and reinforcement learning algorithms. Evolve the robots with the version 0 of the problems and then compare the behavior of the robots evolved by using version 5. Could you explain why the original rewards functions are not suitable for evolutionary strategies ?

13.12 Implementing a new AI-Gym environment

In this Section we will see how to create and a new AI-Gym based on the Pybullet dynamical simulator. The example that we will consider is an adapted version of a tutorial developed by Backyard Robotics. It will constitute the Exercise 10.

The environment to be implemented includes a simple wheeled robot that can be trained for the ability to balance. We can thus name the environment BalancebotBulletEnv-v0 and the folder containing the implementation of the environment balance-bot.

13.12.1 Structuring the balancing bot project

The project requires the following folders and files (for more detailed instructions see the AI-Gym documentation):


The file should include the libraries required, in this case gym and pybullet:

from setuptools import setup
      install_requires=['gym', 'pybullet'])

The file should include a register instruction with the name of the environment (the name that will be passed to the function gym.make() to create the environment) and the entry point of the environmental class in the format:  {base module name}.envs:{Env subclass name}:

from gym.envs.registration import register

Finally, the file /envs/ should include the instruction for importing the environment:

from balance_bot.envs.balancebot_env import BalancebotEnv

13.12.2 Creating the robot

The balancing bot is a robot with two motorized wheels on the same axis and a central body that should be maintained upright by controlling appropriately the motor of the wheels. The robot has three sensors that encode the orientation and angular velocity of its body and the angular velocity of its wheels. Moreover, it has an actuator that controls the target angular velocity of the wheels. The central body element consists of a cuboid. The two wheels are attached to one side of the cuboid.

The physical structure of the robot can be specified in a text file that follows the URDF (Universal Robot Description Format) introduced in ROS (Quigley et al., 2009). A prefilled version of the balancebot_simple.xml file is included in the ./evorobotpy2/exercises/folder. Analyze the file to understand how bodies and joints are specified. Pybullet includes the URDF files of many robotic platforms. Consequently, constructing an environment based on those platforms simply require to load the appropriate URDF file.

13.12.3 Implementing the environment class

The file should include the class that inherit from the gym.Env class and the implementation of the associated methods: reset(), step(), render(), and seed().

The head of the file should include the instructions for importing the components required and the definition of the environmental class:

import os
import math
import numpy as np

import gym
from gym import spaces
from gym.utils import seeding

import pybullet as p
import pybullet_data

class BalancebotEnv(gym.Env):
    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second' : 50

The __init__() method should allocate and initialize the observation and action vectors. The properties of the observation and action space can be initialized with the spaces.Box() method. In the case of the balance-bot environment, the output space includes a single continuous value which encodes the desired velocity of the wheels. The observation space include three continuous values which encode the inclination of the robot, the angular velocity of the robot, and the angular velocity of the wheels, respectively.

    def __init__(self, render=False):
        # action encodes the torque applied by the motor of the wheels
        self.action_space = spaces.Box(-1., 1., shape=(1,), dtype='float32')
        # observation encodes pitch, gyro, com.sp.
        self.observation = []
        self.observation_space = spaces.Box(np.array([-math.pi, -math.pi, -5]),
                np.array([math.pi, math.pi, 5]), dtype='float32')
        self.connectmode = 0
        # starts without graphic by default
        self.physicsClient = p.connect(p.DIRECT)
        # used by loadURDF

The seed() method sets the seed of the random number generator which is used to initialize of posture of the robot at the beginning of evaluation episodes.

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

The reset() method loads the files containing the description of the environmental plane and of the robot, initializes the position, velocity, and orientation of the robot, and computes and returns the observation. The orientation of the robot is set randomly within a given range.

    def reset(self):
        self.vt = 0      # current velocity pf the wheels
        self.maxV = 24.6 # max lelocity, 235RPM = 24,609142453 rad/sec
        self.envStepCounter = 0
        p.setGravity(0, 0,-10) # m/s^2
        p.setTimeStep(0.01)    # the duration of a step in sec
        planeId = p.loadURDF("plane.urdf")
        robotStartPos = [0,0,0.001]
        robotStartOrientation = p.getQuaternionFromEuler([self.np_random.uniform(low=-0.3, high=0.3),0,0])
        path = os.path.abspath(os.path.dirname(__file__))
        self.botId = p.loadURDF(os.path.join(path, "balancebot_simple.xml"),
        self.observation = self.compute_observation()
        return np.array(self.observation)

The step() method sets the state of the actuators, advances the simulation of one step, computes the new observation, computes the reward, checks whether the conditions for terminating prematurely the episode are satisfied, and increases the step counter. The method receives as input the action vector and returns the observation vector, the reward, a Boolean value that indicates whether the episode should terminate, and an empty dictionary.

    def step(self, action):
        self.observation = self.compute_observation()
        reward = self.compute_reward()
        done = self.compute_done()
        self.envStepCounter += 1
        status = "Step " + str(self.envStepCounter) + "  Reward " + '{0:.2f}'.format(reward)
        p.addUserDebugText(status, [0,-1,3], replaceItemUniqueId=1)
        return np.array(self.observation), reward, done, {}

The render() method initializes the graphic window that visualizes the robot and of the environment.

    def render(self, mode='human', close=False):
        if (self.connectmode == 0):
            # connect the graphic renderer
            self.physicsClient = p.connect(p.GUI)
            self.connectmode = 1

Finally, the following four methods: (i) update the desired velocity of the actuators, (ii) compute the observation on the basis of the current position, orientation and velocity of the robot, (iii) compute the reward, and (iv) check whether the episode should terminate.

    def set_actuator(self, action):
        deltav = action[0]
        vt = np.clip(self.vt + deltav, -self.maxV, self.maxV)
        self.vt = vt

    def compute_observation(self):
        robotPos, robotOrn = p.getBasePositionAndOrientation(self.botId)
        robotEuler = p.getEulerFromQuaternion(robotOrn)
        linear, angular = p.getBaseVelocity(self.botId)
        return (np.array([robotEuler[0],angular[0],self.vt], dtype='float32'))

    def compute_reward(self):
        # receive a bonus of 1 for balancing and pay a small cost proportional to speed
        return 1.0 - abs(self.vt) * 0.05
    def compute_done(self):
        # episode ends when the barycentre of the robot is too low or after 500 steps
        cubePos, _ = p.getBasePositionAndOrientation(self.botId)
        return cubePos[2] < 0.15 or self.envStepCounter >= 500

13.12.4 Installing and using the environment

You can install the environment with the following instructions:

cd balance-bot
pip install -e .

Finally, you can use BalanceBotBulletEnv-v0 as a standard gym environment after importing it with the following command:

import balance_bot

For example, you can train the robot with the evorobotpy2/bin/ script after adding the import statement in the script.

13.12.5 Creating C++ environments

For an example of an environment implemented in C++ and wrapped in Python, see the long double-pole balancing problem available from A version of this environment optimized for evorobotpy2 is included in the evorobotpy2 software.

13.13 Back-propagation

Reinforcement learning algorithms use back-propagation to reduce the loss, i.e. the error between the desired and actual state of the motor neurons. It is reduced by computing the gradient of the parameters with respect to the loss and by varying the parameters in the direction of the gradient. As illustrated in Chapter 11, back-propagation can also be used to train feature-extracting networks.

The implementation of back-propagation is facilitated by the availability of tools, like torch (, Collobert, Bengio, & Mariéthoz, 2002) and tensorflow (, Abadi et al., 2016), which provide ready-to-use methods. These tools also permit to use GPU processors to speed-up the process.

An example of a back-propagation algorithm implemented in pytorch which can be used to train a neural network to perform a regression problem, i.e. to map a list of input vectors into a corresponding list of desired output vectors, is included in the ./evorobotpy2/exercises/ script shown below:

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim
import numpy
import matplotlib.pyplot as plt

# hyper-parameters
TrainingEpochs = 5000                # the duration of the training process in epochs
nhiddens = 2                         # the number of internal neurons
DisplayEvery = 100                   # the frequency with which the program display data
                                     # the training set can be changed by modifying the inputs and targets matrices below

# define the input patters. They consists of a list of tensors. Tensors are special data type
# analogous to vectors and matrices used by pytorch which can be processed also on in parallel on GPU
inputs = torch.Tensor([[0,0],[0,1], [1,0], [1,1]])

# define and print the desired output patters. They consists of a list of tensors
targets = torch.Tensor([[0],[1],[1],[0]])

# define the class which implements the neural network
class Net(nn.Module):                         # the network class

    def __init__(self):                       # initialize the network
        super(Net, self).__init__()
        self.c1 = nn.Linear(inputs.size(1), nhiddens, True)  # creates the first layer of connection weights from 2 input to 3 internal neurons
        self.c2 = nn.Linear(nhiddens, targets.size(1), True) # creates the second layer of connection weights from 3 internal to 1 output neurons

    def forward(self, inp):                   # update the activation of the neurons
        hid = torch.sigmoid(self.c1(inp))     # computes the activation of the internal layer by using the sigmoid activation function
        out = self.c2(hid)                    # computes the activation of the output layer by using a linear activation function
        if ((epoch % DisplayEvery) == 0):
            for i in inp:
                print("%.2f " % (i.numpy()), end="")          # print the activation of the input neurons
            print("-> ", end="")
            for h in hid:
                print("%.2f " % (h.detach().numpy()), end="") # print the activation of the hidden neurons
            print("-> ", end="")
            for o in out:
                print("%.2f " % (o.detach().numpy()), end="") # print the activation of the output neurons
        return out

net = Net()                                      # creates the network
print("3-layers feed-forward neural network with %d input, %d internal, and %d output neurons created " % (inputs.size(1), nhiddens, targets.size(1)))

lossfunction = nn.MSELoss()                      # creates a Mean Square Error loss function
print("We use the Mean Squared Error Loss Function") # for other type of loss function see the pytorch documentation

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9) # create a Vanilla Gradient Descent Optimizer
print("We use the SGD Optimizer with learning rate set to 0.01 and momentum set to 0.9")

print("We train the network")
for epoch in range(0, TrainingEpochs):
    if ((epoch % DisplayEvery) == 0):
        print("Epoch %d, we print the activation of the neurons, the loss, the weights and the biases" % (epoch))
    epochloss = 0
    for inputp, targetp in zip(inputs, targets):
        optimizer.zero_grad()                  # initialize the gradient buffer to 0.0
        output = net(inputp)                   # activate the network and obtain the output
        loss = lossfunction(output, targetp)   # compute the loss
        epochloss +=         # compute the loss on the entire training set
        if ((epoch % DisplayEvery) == 0):
            print(" loss %.2f" % (
        loss.backward()                        # compute the gradient
        optimizer.step()                       # call the optimizer which updates the parameters by using the gradient to reduce the loss
    if ((epoch % DisplayEvery) == 0):          # display the parameters, i.e. the weights of the two layers and the biases of the internal and output layer
        w1 = net.c1.weight.detach().numpy()
        w2 = net.c2.weight.detach().numpy()
        b1 = net.c1.bias.detach().numpy()
        b2 = net.c2.bias.detach().numpy()
        print(numpy.concatenate((w1.flatten(), w2.flatten(),b1.flatten(), b2.flatten())))
    if (epochloss < 0.0001):                   # terminate the training when the loss is lower than a threshold

print("post-evaluation at the end of training")
for inputp, targetp in zip(inputs, targets):                       # post-evaluate the network at the end of training
    outputp = net(inputp)
    print("input: ", end="")
    for i in inputp:
        print("%.3f " % (i.detach().numpy()), end="")              # print the activation of the output neurons
    print("output: ", end="")
    for o in outputp:
        print("%.3f " % (o.detach().numpy()), end="")              # print the activation of the output neurons
    print("error: ", end="")
    for o, t in zip(outputp, targetp):
        print("%.3f " % (t.detach().numpy() - o.detach().numpy())) # print the error, i.e. the difference between the desired and actual outputs

In this example the regression problem consists in mapping the [1.0, 0.0] and [0.0, 1.0] observation vectors to [1.0] and to the [0.0, 0.0] and [1.0, 1.0] observation vectors to [0.0]. The training set, i.e. the list of observation and desired output vectors is defined in line 35 and 38. The size of the observation and action vectors is used to determine the size of the input and output layer of the network (line 45-46). The architecture of the neural network is defined in lines 45-49 by specifying the size of the connection layers (line 45-46) and the activation function of neurons (line 49). By default neurons are updated with a linear function. Line 65 and 68 specify the loss function and the optimizer to be used, i.e. the mean square error loss function and the SGD optimizer in this case. The calculation of the loss and of the gradient and the update of the parameters is performed automatically during each training epoch by calling loss.backward() and optimizer.step() methods (line 84-85).

You should install the torch and matplotlib python packages to run this script. The latter package is used to display the features extracted by the internal neurons, in case of problems with a bi-dimensional observation space. In case you do not want to install matplotlib, you can comment lines 108-128.

Exercise 11. Inspect closely the source code of the script. Run the program several times and observe the results. Notice that the script also illustrates how to access and print the parameters and the activation of sensory, internal, and motor neurons. This can require to convert the data from tensor vectors, the data type required by torch, to numpy vectors. Modify the regression problem by modifying the training set. You can change the number of patterns, the length of observation and action vectors, and the content of the vectors.

The efficacy of back-propagation can be improved by diving the training set in batches, by varying the vectors included in each batch randomly during each training epoch, by computing the gradient and updating the parameters after each batch, and by normalizing the activation of neurons.

13.14 Cross-entropy

Cross-entropy is a minimal reinforcement learning algorithm. It uses a stochastic policy network to generate variations, collects the observations and the actions experienced and produced in each step during evaluation episodes, and use this data to generate a training set that increases the probability to produce good actions. It simply operates by increasing the probability to produce the actions performed in the evaluation episodes in which the robot collected the highest total reward.

In the example illustrated in the ./evorobotpy2/exercises/ script shown below, the algorithm is applied to a problem with a discrete action space. The policy network includes N motor neurons that encode the probability to execute any possible action. During each step the action is selected randomly on the basis of the probability distribution encoded in the action vector of the policy network.

import gym
from collections import namedtuple
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# hyper-parameters
NINTERNAL = 128              # Number of internal neurons
NEPISODES = 16               # Number of episode performed in each iteration
PERCENTILE = 70              # Fraction of the best evaluation episode used to train the policy network

class Net(nn.Module):        # The policy network class
    def __init__(self, obs_size, ninternal, nactions):
        super(Net, self).__init__() = nn.Sequential(
            nn.Linear(obs_size, ninternal),
            nn.ReLU(),                        # The internal neurons use the ReLU activation function
            nn.Linear(ninternal, nactions)    # The output neurons encode the probability if execution of the n different possible actions

    def forward(self, x):

# define two tuple helper classes by using the collection package
Episode = namedtuple('Episode', field_names=['reward', 'steps'])               # store the total undiscounted reward and the steps, i.e. the list of observations and actions of an episode
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action']) # list of observations and action of an episode

def rollout(env, net, nepisodes): # carry on the evaluation episodes and store data (cumulative reward, observations and actions)
    rolldata = []           # initialize the rollout data list
    episode_reward = 0.0    # initialize the episode_reward
    episode_steps = []      # initialize the list containing the rollout data
    obs = env.reset()       # initialize the episode
    sm = nn.Softmax(dim=1)  # create a softmax class
    while True:
        obs_v = torch.FloatTensor([obs])                                    # convert the observation vector to a tensor vector
        act_probs_v = sm(net(obs_v))                                        # compute the action vector with the policy network and convert the action vector to a probability distribution with softmax
        act_probs =[0]                             # convert the action tensor vector in a numpy vector
        action = np.random.choice(len(act_probs), p=act_probs)              # choses the action randomly by using the action probabilities
        next_obs, reward, is_done, _ = env.step(action)                     # perform a step
        episode_reward += reward                                            # update the total undiscounted reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))   # append the observation and the action to the data
        if is_done:                                                         # at the end of the episode:
            rolldata.append(Episode(reward=episode_reward, steps=episode_steps))   # create the data of the episode, i.e. the summed undiscounted reward and the list of observation and actions
            episode_reward = 0.0                                                # re-initialize the reward
            episode_steps = []                                                  # re-initialize the steps
            next_obs = env.reset()                                              # re-initialize the episode
            if len(rolldata) == nepisodes:
                yield rolldata                                                  # return a list of data collected during the rollout
                rolldata = []
        obs = next_obs                                                       # set the next observation

def filter_rollout(rolldata, percentile): # filter out the data of the worse episode
    rewards = list(map(lambda s: s.reward, rolldata))     # extracts the list of comulative rewards collected in the corresponding episodes
    reward_bound = np.percentile(rewards, percentile)  # computes the minimum reward which episodes should have to be used for training the policy network
    reward_mean = float(np.mean(rewards))              # computes the reward mean

    train_obs = []                                     # initializes the matrix of observation to be used for training
    train_act = []                                     # initializes the matrix of actions to be used for training
    for episode in rolldata:
        if episode.reward < reward_bound:              # check whether the reward of the episode exceed the minimal bound
            continue                                                        # data of low performing episodes are discarded
        train_obs.extend(map(lambda step: step.observation, episode.steps)) # store the observation for training
        train_act.extend(map(lambda step: step.action, episode.steps))      # store the actions for training

    train_obs_v = torch.FloatTensor(train_obs)         # transform the observation matrix in a tensor
    train_act_v = torch.LongTensor(train_act)          # transform the action matrix in a tensor
    return train_obs_v, train_act_v, reward_bound, reward_mean

if __name__ == "__main__":
    print("CartPole-v0 trained through a crossentropy reinforcement learning algorithm")
    print("Nepisodes x epoch: %d fraction episode discarded %.2f" % (NEPISODES, PERCENTILE))
    env = gym.make("CartPole-v0")               # create the environment
    nsensory = env.observation_space.shape[0]   # extract the number of sensory neurons
    nmotor = env.action_space.n                 # extract the number of motor neurons by assuming that the action space of the environment is discrete

    net = Net(nsensory, NINTERNAL, nmotor)      # create the network policy
    print("The policy network has %d sensory, %d internal and %d motor neurons" % (nsensory, NINTERNAL, nmotor))
    objective = nn.CrossEntropyLoss()           # the Cross Entropy Loss function combine cross entropy and softmax. It works with raw values (digits) thus avoiding the need to add a softmax layer in the policy network
    optimizer = optim.Adam(params=net.parameters(), lr=0.01) # initialize the Adap stochastic optimizer

    for epoch, rolldata in enumerate(rollout(env, net, NEPISODES)):
        obs_v, acts_v, reward_b, reward_m = filter_rollout(rolldata, PERCENTILE)
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        print("epoch %d: loss=%.3f, reward_mean=%.1f, reward_threshould=%.1f" % (epoch, loss_v.item(), reward_m, reward_b))
        if reward_m > 199:

During each training epoch, the algorithm: (i) performs N evaluation episodes (lines 45-67), (ii) stores the total cumulative reward and the list of observations and actions produced during the episodes (line 58), (iii) eliminates the data of the episodes with the worst cumulative reward (lines 70-85), and (iv) uses the remaining data to train the policy, i.e. to increase the probability to produce the actions performed in the best episodes (lines 101-107).

Despite its simplicity, this method works well in simple environments and is robust with respect to the setting of the hyperparameters.

The aspects that are missing in this algorithm and which will be illustrated in the next section are: the computation and usage of the discounted expected reward, the implementation and usage of the critic network, and the calculation of the advantage.

The script requires the torch and collections python packages.

Exercise 12. Inspect closely the source code of the script. Run the program several times and observe the results. Modify the script to train a policy network for another environment of your choice.

13.15 The Stable Baseline3 implementation of the PPO algorithm

In this section we will analyze the StableBaseline3 ( implementation of the PPO algorithm (Shulman et al., 2017), a state-of-the-art reinforcement learning algorithm particularly effective for robotic applications. The PPO is an extended version of the vanilla policy gradient algorithm described in Section 6.3.3 which includes a mechanism for penalizing excessive variations of the policy during training.

The main loop of the algorithm is included in the train() method of the PPO class [file ./stable_baseline3/ppo/, line 166-285]:

    def train(self) -> None:
        Update policy using the currently gathered rollout buffer.
        # Update optimizer learning rate
        # Compute current clip range
        clip_range = self.clip_range(self._current_progress_remaining)
        # Optional: clip range for the value function
        if self.clip_range_vf is not None:
            clip_range_vf = self.clip_range_vf(self._current_progress_remaining)

        entropy_losses = []
        pg_losses, value_losses = [], []
        clip_fractions = []

        continue_training = True

        # train for n_epochs epochs
        for epoch in range(self.n_epochs):
            approx_kl_divs = []
            # Do a complete pass on the rollout buffer
            for rollout_data in self.rollout_buffer.get(self.batch_size):
                actions = rollout_data.actions
                if isinstance(self.action_space, spaces.Discrete):
                    # Convert discrete action from float to long
                    actions = rollout_data.actions.long().flatten()

                # Re-sample the noise matrix because the log_std has changed
                # TODO: investigate why there is no issue with the gradient
                # if that line is commented (as in SAC)
                if self.use_sde:

                values, log_prob, entropy = self.policy.evaluate_actions(rollout_data.observations, actions)
                values = values.flatten()
                # Normalize advantage
                advantages = rollout_data.advantages
                advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

                # ratio between old and new policy, should be one at the first iteration
                ratio = th.exp(log_prob - rollout_data.old_log_prob)

                # clipped surrogate loss
                policy_loss_1 = advantages * ratio
                policy_loss_2 = advantages * th.clamp(ratio, 1 - clip_range, 1 + clip_range)
                policy_loss = -th.min(policy_loss_1, policy_loss_2).mean()

                # Logging
                clip_fraction = th.mean((th.abs(ratio - 1) > clip_range).float()).item()

                if self.clip_range_vf is None:
                    # No clipping
                    values_pred = values
                    # Clip the different between old and new value
                    # NOTE: this depends on the reward scaling
                    values_pred = rollout_data.old_values + th.clamp(
                        values - rollout_data.old_values, -clip_range_vf, clip_range_vf
                # Value loss using the TD(gae_lambda) target
                value_loss = F.mse_loss(rollout_data.returns, values_pred)

                # Entropy loss favor exploration
                if entropy is None:
                    # Approximate entropy when no analytical form
                    entropy_loss = -th.mean(-log_prob)
                    entropy_loss = -th.mean(entropy)


                loss = policy_loss + self.ent_coef * entropy_loss + self.vf_coef * value_loss

                # Calculate approximate form of reverse KL Divergence for early stopping
                # see issue #417:
                # and discussion in PR #419:
                # and Schulman blog:
                with th.no_grad():
                    log_ratio = log_prob - rollout_data.old_log_prob
                    approx_kl_div = th.mean((th.exp(log_ratio) - 1) - log_ratio).cpu().numpy()

                if self.target_kl is not None and approx_kl_div > 1.5 * self.target_kl:
                    continue_training = False
                    if self.verbose >= 1:
                        print(f"Early stopping at step {epoch} due to reaching max kl: {approx_kl_div:.2f}")

                # Optimization step
                # Clip grad norm
                th.nn.utils.clip_grad_norm_(self.policy.parameters(), self.max_grad_norm)

            if not continue_training:

The train() method collects the rollout_data, i.e. the observations, the log_probs (i.e. the logarithmic probability of actions), the actions, and the values (i.e. the predicted discounted returns) experienced and produced by the policy network during multiple evaluation episodes. Moreover it computes the discounted returns and the advantages on the basis of the reward received in each step during the evaluation episodes and on the basis of the values computed by the critic network (see Section 6.3.3). This is realized by calling the rollout_buffer.get() method (line 188) which is defined the file ./stable_baseline3/commons/ Notice that the rollout_buffer.get() method divides the rollout data in multiple batches and returns the data as a list of batches. Then, for each batch, the algorithm:

Exercise 13. Install stable baseline 3 and run some experiment by following the documentation and the examples illustrated in Inspect closely the source code until you have a detailed understanding with particular reference to the portion illustrated above. Run the program several times and observe the results. In addition observe the feature extracted by the back-propagation example illustrated in Section 13.13.

13.16 Run reinforcement learning experiments with Stable Baseline

Stable Baseline (, Hill et al., 2018) is a library that includes the most powerful reinforcement learning algorithms, i.e. A2C, ACER, ACKTR, DDPG, DQN, GAIL, HER, PPO, SAC, TD3 and TRPO.  It is an improved and documented version of the OpenAI Baselines (Dhariwal et. al., 2017).

Stable Baseline can be used in combination with Stable Baselines Zoo (,, Raffin, 2018) a companion library which includes Python scripts for training and testing agents, a collection of collection of pre-trained agents, and a tool for optimizing hyperparameters. The hyperparameters optimized for different algorithms and environments can be inspected in the folder ./rl-baselines-zoo/hyperparameters (see also Section 13.13.14).

Another valuable resources that you might use is Spinning Up (, Achiam, 2018), a well-documented library designed for educational purposes. Both tools require Tensorflow (, Abadi et al., 2016). Stable Baseline3 ( is a version based on Pytorch ( which, however, is not yet complete as the original version.

13.16.1 Installing Stable Baselines and RL-Baselines-Zoo

To install stable baselines follow the directions included in Then clone the rl-baselines-zoo repository with the following command:

git clone
cd rl-baselines-zoo

13.16.2 Training and testing with the rl-baselines-zoo scripts

To train an agent, you can use the script from the ./rl-baseline-zoo folder by specifying with the --algo and --env program parameters the name of the algorithm and the name of the environment. For example, to train a policy network for the CartPole-v1 environment with the PPO algorithm you can use the following command from the rl-baseline-zoo folder:

python --algo ppo2 --env CartPole-v1 -f logs -tb logs/ppo2/cartpole1

where -f indicates the folder in which the program will save the trained policy, and -tb indicates the folder in which the program will save the monitoring data that can be displayed with tensorboard (see below). The trained policy will be saved in the file ./logs/ppo2/CartPole-v1_1/ where the _1 number corresponds to the ID number of the experiment. If you run a second experiment with the same algorithm and environment the ID number is automatically incremented.

To visualize the behavior of a trained agent you can use the script from the rl-baseline-zoo folder as shown below:

python --algo ppo2 --env CartPole-v1 -f logs --exp-id 1 --load-best

To visualize the behavior of a pre-trained agent you should only specify the algorithm and the environment, as shown in the example below. The program will automatically load a pre-trained agent from the ./rl-baseline-zoo/trained_agents folder.

python --algo ppo2 --env HumanoidBulletEnv-v0

To visualize the learning data with tensorboard you can run the command below from the rl-baseline-zoo folder and then follow the instruction to visualize the data on your browser. Tensorboard will update the graphic display automatically during the training process as soon as new data is generated.

tensorboard --logdir logs/ppo2/cartpole1

Finally, to create a video file with the behavior of the agent you can use the following command from the rl-baseline-zoo folder:

python -m utils.record_video --algo ppo2 --env CartPole-v1 -n 1000 -f logs -f logs --exp-id 2

where the -n program parameter indicates the length of the video in steps.

13.16.3 Monitoring the training process

Monitoring the course of the training process is particularly important in the case of reinforcement learning since the success of the learning process often depends on the setting of the hyperparameters and since monitoring the course of the learning process can provide hints on how to set the hyperparameters. 

The indicators that can be visualized with tensorboard include: the cumulative reward obtained in the current episode (episode reward), the discounted reward, the advantage, the probability distribution of actions, the policy loss, the value loss, the entropy loss, and the distribution divergence. Entropy indicates the predictability of the effect of the agents actions on the expected reward. The distribution divergence, that can be calculated with the kullback-leibler method, indicates the variation of the distribution of the actions performed by the policy before and after the update of the parameters.

The output text produced by the program include additional indicators: the average episodes reward (ep_reward_mean), the average episodes length (ep_len_mean), and the explained variance.

13.16.4 Setting hyperparameters

The ./rl-baseline-zoo/ script has two mandatory flags (--algo and --env) that should be used to specify the algorithm and the environment and the optional --hyperparameters or -params flags that can be used to set the other hyperparameters. The default hyperparameters are loaded automatically from the file ./hyperparams/algo.yml (where algo is the name of the algorithm chosen). The --hyperparameters flag can be used to overwrite the default parameters. The two examples below show how we can overwrite the default value of the learning_rate and n_timesteps hyperparameters or the number of internal neurons of the policy network.

python --algo ppo2 --env CartPole-v1 -params learning_rate:0.001 n_timesteps:30000
python --algo ppo2 --env CartPole-v1 -params policy_kwargs:"dict(net_arch=[32, 32])"

For what concern the choice of the algorithm, according to the benchmarking data included in rl-baselines-zoo, the ppo2, sac, td3 and trpo produce the best results on problems with continuous action space such as the Pybullet locomotors problems (AntBulletEnv-v0, HalfCheetahBulletEnv-v0, HopperBulletEnv-v0, HumanoidBulletEnv-v0, Walker2DBulletEnv-v0),  the BipedalWalker-v2, and the BipedalWalkerHardcore-v2. The a2c, acer, actkr, ddpg and dqn algorithms produce the best results on problems with discrete action spaces such as the BeamRiderNoFrameskip-v4, the MsPacmanNoFrameskip-v4, and the QbertNoFrameskip-v4 Atari games. In some cases the algorithms of the first group (e.g. the PPO) achieve the best performance also on problems with discrete action space such as the EnduroNoFrameskip-v4 and the SeaquestNoFrameskip-v4.

The policy hyperparameter can be used to specify the architecture of the policy. The multi-layer feed-forward policy (‘MlpPolicy’) can be considered the default option. By default this policy includes two layers with 64 neurons.  For problems that benefit from memory, you can use the multi-layer Short Term Long Term memory policy (‘MlpLstmPolicy’).  For problems that require to process visual information, such as the Atari problems, you can use the convolutional neural network policy (‘CnnPolicy’) or eventually the convolutional long term short term neural network policy (‘CnnLstmPolicy’). All these policy support actor-critic networks.

The policy_kwargs hyperparameter can be used to customize the characteristic of the policy network. For example, the number of layers and the number of neurons per layer can be specified with the string "dict(net_arch=[N1, N2])" where the length of the vector corresponds to the number of layers and the elements of the vector to the number of neurons per layer. For an exhaustive description of the hyperparameters see:

Eventually, you can declare your own custom policy as a sub-class of an existing policy. An example of custom policy is the CustomMlpPolicy shown below that is defined in the file ./rl-baseline-zoo/utils/ This feed-forward network policy, which includes a single layer composed of 16 neurons, is the policy recommended by default for the AntBulletEnv-v0 problem trained with the PPO2 algorithm.

class CustomMlpPolicy(BasePolicy):
    def __init__(self, *args, **kwargs):
        super(CustomMlpPolicy, self).__init__(*args, **kwargs,
register_policy('CustomMlpPolicy', CustomMlpPolicy)

Different algorithms have partially different hyperparameters. Here we refer in particular to the hyperparameters of the PPO2.

The number and duration of evaluation episodes, and consequently the length of data collected and used to optimize the policy, can be specified with the n_envs=1 and n_steps=128 hyperparameters, where the values indicated after the equal sign represent the default values for the PPO algorithm.  The number of mini batches in which the data is divided is indicated with the hyperparameter nminibatches=4. The number of epochs used to update the parameters of the policy on the basis of the current training data can be set with the hyperparameter noptepochs=4. The learning rate is specified with the learning_rate=0.00025 hyperparameter. Finally, the total duration of the training process in step can be set with the n_timesteps hyperparameter.

The n_steps hyperparameter should be sufficiently large to enable the agents to exhibit the desired behavior and to ensure that the agent keep behaving correctly over time. For example a value of 1000 steps can be appropriate for the robot locomotors problems to evaluate whether the robot manage to start walking and keep walking without falling down for a while. A much smaller value can be sufficient for problems that admit a quick solution, like the MountainCar-v0 environment. Please consider, however, that the time required to achieve the goal during the first stages of the learning process can be significantly longer than the time required by high-performing agents. The value of the n_env hyperparameter should matches the variation range of the environmental conditions among evaluation episodes. Consequently, it should be greater for problems in which the characteristics of the environments vary significantly among episodes, such as in the case of the BipedalWalkerHardcore-v3. Instead, in can be smaller in the case of the robot locomotors problems in which the environmental variations are restricted to minor modifications of the initial posture of the robots. The n_env hyperparameter can also be used to exploit parallel computation since some algorithms are able to evaluate multiple episodes in parallel. The minimal size of the n_timesteps hyperparameter depends on the complexity of the problem. It can be in the order of 1e5 in the case of simple classic control problem such as the Cartpole-v1, in the order of 2e6 in the case of the robot locomotors problems, in the order of 1e7 in the case of the Atari problems, and in the order of 1e8 in the case of particularly difficult problems such as BipedalWalkerHardcore-v3. In general a suitable value for the notepochs hyperparameters is 10. For the learning_rate hyperparameter the default value of 0.00025 is usually good. For some problems, however, linearly decreasing the learning rate during the training process leads to better performance.

The gamma=0.99 hyperparameter determines the importance of future rewards in the calculation of the expected reward. Consequently, one might want to set this value as large as possible. On the other hand, the utilization of smaller gammas can speed-up learning. A good practice is to use a value in the range [0.9, 0.99] and to choose a value closer to the minimum or to the maximum depending on whether one want to prioritize learning speed or learning efficacy. The lam=0.95 (lambda) hyperparameter determines how much the algorithm bootstrap on earlier learned value versus new experiences. This implies a trade-off between more bias (low lambda) and more variance (high lambda). The default value of this hyperparameter works well in most of the cases.

The normalize=False hyperparameter can be used to normalize the observation vector. Normalization is generally useful in particular when the distribution of the observation values vary widely during the course of training. It however produces lower performance in the case of the Atari problems and in the case of simple classic control problems such as the Cartpole-v1 and the Pendulum-v0. The ent_coef=0.01 (entropy coefficient) hyperparameter encourage the exploration of new strategies. It can be set to 0.1 in the case of the Atari problems but should be reduced to 0.01 in the case of the BipedalWalker-v3 and BipedalWalkerHardcore-v3 problems and to 0.0 in the case of  robot locomotors problems.

The PPO algorithm includes a method which reduces the amplitude of the variations of the parameters to avoid introducing changes that alter too much the current policy with respect to the previous policy. The reduction of the size of variation is realized by clipping the gradient to the threshold specified with the cliprange=0.2 hyperparameter. Eventually, you can also set a separate clipping threshold for the gradient of the value function network by assigning a value to the cliprange_vf=None hyperparameter. In general the default values of these hyperparameters work well. Smaller values can be beneficial in problems that lead to instabilities with the default values.

Exercise 14. Train robots on two problems of your choice with the PPO2 algorithm. Repeat the experiment by varying some of the parameters and describe the effects on the indicators illustrated in section 13.6.3.

13.17 Competitive Co-evolution

In this section we will see how to replicate and analyze the competitive co-evolutionary experiment involving predator and prey robots described in Section 12.4 (see also Simione & Nolfi, 2019).

Open the ./evorobotpy2/expredprey/ErPredprey.ini file that contains the default hyperparameters of this experiment. Notice how the file include an nrobots hyperparameter set to 2. Indeed, in this case the environment contains two robots: a predator robot that is evolved for the ability to catch the prey robot, and a prey robot that is evolved for the ability to avoid being caught from the predator robot. The parameter heterogeneous=1 indicates that the connection weights of the two robots are different. The parameter algo=coevo2 indicates that the robots are evolved with a competitive coevolutionary algorithm which handles two populations. The algorithm is implemented in the file ./evorobotpy2/expredprey/ The evaluation of each robot is carried out in a in a series of episodes in which the robot is evaluated in interaction with different competitors selected from the opponent population.

Notice also how the popsize and selsize hyperparameters are set to 80 and 10, respectively. This means that the population of each species of robot include 80 individuals. However, during each evolutionary phase, a sub-set of 10 individuals are evolved against a subset of 10 competitors for 20 generations. The utilization of part of the individuals enables the algorithm identify and discard opportunistic individuals by post-evaluating at the end of each evolutionary phase the 10 evolved individuals also against the 70 other opponents that they did not encounter during the evolutionary phase.

To run the experiment you should compile and install the predprey environment (predprey.cpp, predprey.h, utilities.cpp, utilities.h, robot-env.cpp, robot-env.h, ErPredprey.pxd, ErPredprey.pyx, and from the ./evorobotpy2/lib folder with the following commands:

python3 build_ext --inplace
cp ErPredprey*.so ../bin # or cp ErPredprey*.dll ../bin

You can then evolve the robots from the /evorobotpy2/xpredprey/ folder with the command:

python3 ../bin/ -f ERPredprey.ini -s 1

As you will see from the text printed by the program during the evolutionary process, the algorithm starts by evaluating each predator of the initial generation against each prey of the initial generation. Then it performs 1000 generations during which it: (i) selects 10 differentiated competitors, (ii) selects 10 evolving individuals by choosing those that perform better against the selected competitors, (iii) evolves the 10 selected individual against the 10 selected competitors, one at a time, for 20 sub-generations, (iv) post-evaluate the new 10 evolved individuals against the remaining 70 competitors, (v) discard the worse 10 individual from the 90 individuals composed of the 80 original individuals and the 10 additional evolved individuals. This implies that only the individuals which perform well against all competitors are selected.

Notice that since 20 sub-generations are performed during each phase, 1000 generations correspond to 20,000 sub-generations. The two populations are evolved in an alternated fashion. During even generations 10 selected predators are evolved against 10 selected prey. During odd generations, 10 selected prey are evolved against 10 selected predator.

The ./evorobotpy2/xpredprey folder includes the result of a pre-evolved evolutionary experiment performed by using 12 as seed. You can observe the behavior of evolved predators evaluated against evolved prey with the command:

../bin/ -f ErPredprey.ini -s 12 -g P-1000-5 -p

Notice that the coevolutionary algorithm interpret the -g program parameter in a different manner from the other algorithms. When the string start with the letter ‘P’ it interprets the following two numbers as the number of the generation and the number of individuals to be post-evaluated, respectively. This means that the command specified above will evaluate the 5 best predators of generation 1000 against the 5 best prey of generation 1000 by performing 5x5=25 evaluation episodes.

Finally, you can post-evaluate evolved robots against competitors of current and previous generations with the following command:

python ../bin/ -f ErPredprey.ini -s 12 -g m-1000-100 -p

A string starting with the letter ‘m’ followed by the 1000 and 100 numbers run a master tournament in which predators and then prey of generation 1000 are post-evaluated against competitor of the same generation and of previous generations, every 100 generations.

Exercise 15. Perform a master tournament analysis to verify whether predator of successive generations become stronger and stronger, i.e. achieve higher and higher performance when evaluated against ancient and more ancient competitors. Observe the behavior of the robots at different generations and describe how it changes during the evolutionary process.

13.18 The net library

The usage of the net library is summarized in the python script /bin/ shown below. The script shows how to import the compiled library, create a recurrent neural network policy, and update the network for 10 steps by setting the value of the sensory neurons randomly.

To save the computation time required to pass data from Python to C++ and vice versa, Evorobotpy2 allocates the vectors of the observation and action in Python and passes the pointer of the vectors to the C++ library only once, when the policy is created. The same technique is used to speed-up the communication between the training algorithms implemented in Python and the environments implemented in in C++, e.g. the ErDiscrim, the ErDpole, and the ErPredprey environments. Overall this permits to exploit the advantages of the Python programming language by maintaining an efficiency analogous to a pure C++ application.

The library also permits to create and update multiple networks with identical topologies. This permits to realize experiment with multiple robots, as the experiment illustrated in Section 13.14. A description of the parameters and associated default values can be obtained by executing the script without parameters.

import numpy as np
import net
# parameters to be set on the basis of the environment specifications
ninputs = 4            # number of input neurons
noutputs = 2           # number of output neurons
# set configuration parameters
nnetworks = 1          # number of networks
heterogeneous = 0      # whether multiple networks are heterogeneous or not
nlayers = 1            # number of internal layers
nhiddens = 10          # number of hidden units
nhiddens2 = 0          # number of hiddens of the second layer, if any
bias = 1               # whether we have biases
architecture = 2       # full recurrent architecture
afunction = 2          # activation function of internal neurons is tanh
out_type = 3           # activation function of output neurons is linear
winit = 0              # weight initializaton is xavier
clip = 0               # whether the activation of output neuron is clipped in the [-5.0, 5.0] range
normalize = 0          # whether the activation of input is normalized
action_noise = 0       # we do not add noise to the state of output neurons
action_noise_range = 0 # the range of noise added to output units
wrange = 0.0           # range of the initial weights (when winit=2)
nbins = 1              # number of outputs neurons for output values
low = 0.0              # minimun value for output clipping, when clip=1
high = 0.0             # maximum value for output clipping, when clip=1
seed = 101             # random seed
# allocate the array for inputs, outputs, and for neuron activation
inp = np.arange(ninputs, dtype=np.float32)
out = np.arange(noutputs, dtype=np.float32)
nact = np.arange((ninputs + (nhiddens * nlayers) + noutputs), dtype=np.float64)
# create the network
nn = net.PyEvonet(nnetworks, heterogeneous, ninputs, nhiddens, noutputs, nlayers, nhiddens2, bias, architecture, afunction, out_type, winit, clip, normalize, action_noise, action_noise_range, wrange, nbins, low, high)
# allocate an array for network paramaters
nparams = nn.computeParameters()
params = np.arange(nparams, dtype=np.float64)
# allocate the array for input normalization
if (normalize == 1):
    normvector = np.arange(ninputs*2, dtype=np.float64)
    normvector = None
# pass the pointers of allocated vectors to the C++ library
if (normalize == 1):
# You can pass to the C++ library a vector of parameters (weights) with the command:
#nn.copyGenotype(params)  # the type of the vector should be np.float64
# set the seed of the library
#initialize the parameters randomly
# Reset the activation of neurons to 0.0
# For 10 steps set the input randomly, update the network, and print the output
for step in range(10):
    # generate the observation randomly
    inp = np.random.rand(ninputs).astype(np.float32)
    # pass the pointer to the input vector to the C++ library
    # update the activation of neurons
    # print the activation of neurons
    print("step %d ,  print input, output, and input-hidden-output vectors" % (step))


Abadi M., Barham P., Chen J., Chen Z., Davis A., Dean J. et al. (2016). Tensorflow: A system for large-scale machine learning. In 12th Symposium on Operating Systems Design and Implementation (16): 265-283.

Achiam J. (2018). Spinning up in deep reinforcement learning. Github:

Bellemare M.G., Naddaf Y., Veness J & Bowling M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.

Bonani, M., Longchamp, V., Magnenat, S., Rétornaz, P., Burnier, D., Roulet, G., Vaussard, F., Bleuler, H., & Mondada, F. (2010). The marXbot, a miniature mobile robot opening new perspectives for the collective robotic research. In Proceedings of the 2010 IEEE/RSJ international conference on intelligent robots and systems (IROS 2010) (pp. 4187–4193). New York: IEEE Press.

Brockman G., Cheung V., Pettersson L, Schneider J., Schulman J., Tang J. and Zaremba W. (2016). OpenAI Gym. arXiv:1606.01540

Collobert R., Bengio S., & Mariéthoz J. (2002). Torch: a modular machine learning software library (No. REP_WORK). Idiap.

Coumans E. & Yunfei Bai (2016-2019). PyBullet, a Python module for physics simulation for games, robotics and machine learning.

Dhariwal P., Hesse C., Plappert M., Radford A., Schulman J., Sidor S., and Wu Y. (2017). Openai baselines.

Hill A., Raffin A., Ernestus M., Gleave A., Kanervisto A., Traore R., Dhariwal P., Hesse C.,  Klimov O., Nichol A., Plappert M., Radford A.,  Schulman J., Sidor S. and Wu Y. (2018). Stable baselines.

Glorot X. & Bengio Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249-256).

Massera G., Ferrauto T., Gigliotta O., and Nolfi S. (2013). FARSA: An open software tool for embodied cognitive science. In P. Lio', O. Miglino, G. Nicosia, S. Nolfi and M. Pavone (Eds.), Proceeding of the 12th European Conference on Artificial Life. Cambridge, MA: MIT Press.

Massera G., Ferrauto T., Gigliotta O., Nolfi S. (2014). Designing adaptive humanoid robots through the FARSA open-source framework. Adaptive Behavior, 22 (3):255-265

Mondada F., Bonani M., Raemy X. et al. (2009). The e-puck, a robot designed for education in engineering. Proceedings of the 9th Conference on Autonomous Robot Systems and Competitions.

Mondada, R., Franzi, E. & Ienne, P. (1993) Mobile robot miniaturization: A tool for investigation in control algorithms. In: Proceedings of the Third International Symposium on Experimental Robots, eds. T. Yoshikawa & F. Miyazaki. Kyoto, Japan.

Nolfi S. (2000). Evorobot 1.1 user manual. Technical Report, Roma, Italy: Institute of Psychology, National Research Council. 

Nolfi S. and Gigliotta O. (2010). Evorobot*: A tool for running experiments on the evolution of communication. In S. Nolfi & M. Mirolli (Eds.), Evolution of Communication and Language in Embodied Agents. Berlin: Springer Verlag.

Pagliuca P., Milano N. & Nolfi S. (2018). Maximizing adaptive power in neuroevolution. PLoS One, 13(7): e0198788.

Pagliuca P. & Nolfi S. (2020). The dynamic of body and brain co-evolution. arXiv preprint arXiv:2011.11440.

Quigley M., Conley K., Gerkey B., Faust J., Foote,T., Leibs J., ... & Ng A.Y. (2009). ROS: an open-source Robot Operating System. In ICRA workshop on open source software (Vol. 3, No. 3.2, p. 5).

Raffin A. (2018). RL Baselines Zoo. Github repository.

Salimans T., Ho J., Chen X., Sidor S & Sutskever I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864v2

Sandini G., Metta G. & Vernon D. (2004).  Robotcub: An open framework for research in embodied cognition, International Journal of Humanoid Robotics (8) 2: 18–31.

Schulman J., Wolski F., Dhariwal P., Radford A. & Klimov O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Simione L. & Nolfi S. (2019). Long-term progress and behavior complexification in competitive co-evolution. arXiv:1909.08303.

Todorov E., Erez T. & Tassa Y. (2012). Mujoco: A physics engine for model-based control. In Proceeding of the IEEE/RSJ Intelligent Robots and Systems Conference (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE.