Behavioral and Cognitive Robotics
An adaptive perspective

Stefano Nolfi

© Stefano Nolfi, 2021   |   How to cite this book   |   Send your feedback   |   Collaborate

Index Next Chapter

6. Adaptation

6.1 Introduction

In this chapter I will introduce the methods enabling robots to adapt to their task environment, i.e. to develop the behavioral and cognitive skills required to perform a desired function. I will introduce evolutionary methods in Section 6.2, reinforcement learning in Section 6.3 and learning by demonstration in Section 6.5. Each family is loosely inspired by a corresponding natural adaptation process. The three classes of methods should not be considered as alternatives. Indeed, they can be usefully combined.

The description is necessarily synthetic and does not attempt to be exhaustive. After briefly introducing each method, I will explain in details only a few state-of-the art algorithms selected among those that are particularly suitable for robotic applications.

6.2 Evolutionary ­robotics

A straightforward way to generate adaptive robots consists in evolving them through a process analogous to natural evolution, an approach that is referred as evolutionary robotics (Nolfi and Floreano, 2000; Nolfi, Bongard, Husband & Floreano, 2016). This can be realized by creating populations of artificial genotypes, which encode the characteristics of the robots (phenotypes), and by allowing the best genotypes to produce copies of themselves with variations. We already introduced the method in Section 2.4 and we already encountered several examples of evolved robots in Chapters 3-5.

The process is schematized in Figure 6.1. The genotypes of the initial generation are created randomly. Each genotype is used to produce a corresponding robot (phenotype) that interacts with the environment for one or more evaluation episodes. During the episode/s the performance of the robot is evaluated with a scalar value (fitness).  Eventually, the characteristics of the robots might also change while they interact with the environment as a result of development and/or learning. The best robots are allowed to reproduce, i.e. to produce copies of their genotype (offspring) including variations (introduced through mutations or other genetic operators). The offspring form the new generation. The evolutionary process continues until a total maximum number of evaluations is performed.

Figure 6.1. Schematization of the evolutionary robotics method. See text for explanation.

The role of the experimenter is limited to the design of a fitness function, i.e. the function that rates the ability of a robot. For example, in the case of robots evolved for the ability to walk, the fitness function might measure the distance traveled by the robot during an evaluation episode. We will illustrate the criteria that can be used to design effective fitness functions in Section 6.5.

The experimenter should also specify the developmental process, i.e. the rules that determine the relationship between the genotype and phenotype. These rules, however, are generally problem independent. Consequently, to evolve a robot for the ability to solve a given problem the experimenter should design the fitness function and select a developmental process among those proposed in the literature (see Section 6.2.3). The experimenter should also specify the characteristics of the robots that remain fixed and that are not subjected to the adaptive process.

6.2.1 Evolutionary algorithms

The evolutionary algorithm determines the way artificial genotypes are selected and varied. Several approaches have been proposed over the years.

A first family of algorithms known as genetic algorithm was developed by John Holland and collaborators (Holland, 1975, 1992). In typical implementations, these algorithms operate on genotypes constituted by vectors of binary values, select the reproducing genotypes stochastically with a probability proportional to their relative fitness, and generate variations through recombination and mutations. Recombination is usually realized by dividing the genotype of two reproducing individuals in two or three parts and by pasting (crossing over) some of the parts copied from the first genotype with the complementary parts copied from the second genotype. Mutations are usually realized by flipping each bit with a low probability, e.g. 1%. New generations are formed by offspring. Eventually, a non-varied copy of the best genotype of the previous generation can be included in the new generation (elitism).

A second family of algorithm known as evolutionary strategies was developed by Ingo Rechenberg (Rechenberg & Eigen, 1973) and Hand-Paul Schwefel (Schwefel, 1981, 1995). In typical implementations, these algorithms operate on genotypes formed by vectors of real numbers and generate variations by perturbing all real numbers with small randomly generated values. Often the best parents are allowed to remain in the new generation, i.e. only the worse parents are replaced with offspring.  

Modern evolutionary strategies are an extended version of evolutionary strategies that combine the matrix of random variations and the fitness received by offspring to estimate either the correlation among variations leading to high fitness (Hansen & Ostermeier, 2001) or the gradient of the expected fitness (Wierstra et al., 2014). This information is used to explore the most promising directions of the search space. In typical implementations, these methods use a single parent, generate population of individuals distributed in the multi-dimensional search space around the parent, and vary the genes of the parent slightly in the direction of the fitness gradient. A particularly effective instance of these methods is the OpenAI-ES algorithm (Salimans et al., 2017; see also Pagliuca, Milano & Nolfi, 2019) that will be illustrated in the next section.

The methods relying on multiple parents stress the advantages that can be gained by exploring different areas of the search space in parallel. These advantages can be maximized by introducing mechanisms that favor the diversification of the population like: (i) the preferential selection of parents that differ from other selected parents (Mouret & Doncieux, 2009; Pugh, Soros & Stanley K.O., 2016), or (ii) the subdivision of the population in sub-populations that evolve almost independently of each other (Whitley, Rana & Heckendorn, 1997; Skolicki & Jong, 2004). The methods that rely on a single parent, instead, tend to attribute more importance to the resolution adopted to explore a single portion of the search space. In principle one can use hybrid methods combining multiple parents and numerous offspring for each parent. In practice, however, hybrid solutions of this type are too computationally expensive.

6.2.2 The OpenAI-ES algorithm

The OpenAI-ES algorithm (Salimans et al., 2017) is a natural evolutionary strategy (Sehnke et al., 2010; Wierstra et al., 2014) that estimates the gradient of the expected fitness by using uniform distributed variations and updates the centroid of the population through the Adam stochastic gradient optimizer (Kingma & Ba, 2014).

The pseudo-code of the algorithm with standard parameters is included below. In a nutshell, it evolves a distribution over policy parameters centered on a single parent θ composed of λ × 2 individuals. At each generation, the algorithm generates the gaussian vectors ε that are used to perturb the parameters (line 4), and evaluates the offspring (line 4-5). To improve the accuracy of the fitness estimation the algorithm generates mirrored samples (Brockhoff et al., 2010), i.e. generates λ couples of offspring receiving opposite perturbations (line 4-5). The difference of the fitness obtained by each offspring couple is then ranked and normalized in the range [-0.5, 0.5]] (line 7). This normalization makes the algorithm invariant to the distribution of fitness values and reduces the effect of outliers. The estimated gradient g is then computed by averaging the dot product of the samples ε and of the normalized fitness values (line 8). Finally, the parameters of the parent θ are updated for one step by using the Adam optimizer (line 9). The update of the parameters can include a weight decay normalization operation.

To better clarify how the gradient is estimated, let's consider, for example, the case in which: (i) the parent (θ) includes only two parameters that are initially set to 0.0, (ii) the size of the population is 6, and (iii) the randomly generated samples are: σ * εi = ((-.0193, .0083) (-.0133, .0211) (-.0167, -.0212)). The resulting population will correspond to the red, green, and grey circles and to the three symmetric blue, cyan, and orange circles (Figure 6.2). The population is distributed around the parent (θ) which, in this example, is located at the origin of the axes. Let's now assume that the evaluation of the six individuals produce the fitness values displayed near the circles (first numbers). The normalized ranked fitness will correspond to the following values: gray=0.5, cyan=0.3, orange=0.1, blue=-0.1, green=-0.3, red=-0.5 (indicated by the second number near the dots in the figure). The contribution of each offspring to the estimated gradient will correspond to the colored arrows, where the module and the direction of the arrows correspond to the normalized fitness and their orientation corresponds to the relative positions of the offspring with respect to the parent in the parameter space. The estimated gradient is indicated by the black arrow which is the vector sum of the six colored arrows (Figure 6.2). Notice that, for graphical purpose, the length of the black arrow was reduced of 95%.

Figure 6.2. Visualization of gradient estimation. The x and y axes represent the value of the first and second parameter, respectively. The circles represent the population. The two numbers on the top-right of the circles indicate the natural and normalized value of the fitness, respectively. The colored arrows represent the contribution of each offspring to the estimated gradient. The black arrow indicates the gradient of the expected fitness. Please notice that the green and cyan arrows overlap. For visualization purpose the length of arrows has been reduced of 95%.

The parameters of the parent θ will then be modified in the direction of the black arrow and the process will be repeated for several generations. The exact modification of the parameters depends on the Adam stochastic optimizer (Kingma & Ba, 2014) that computes a moving average of the estimated gradient over time and uses the momentum and the squared momentum of each parameter to accelerate the movement in the directions of the gradients that are consistent over time and to decelerate the movements as a function of the steepness of the gradient.

The accuracy of the gradient estimation and the effectiveness of the algorithm can be improved by exposing symmetrical offspring to the same environmental conditions (Nolfi, unpublished result). Moreover, the effectiveness of the method can be improved by normalizing the observation vectors during the evolutionary process (Salimans et al., 2017; see also Pagliuca, Milano & Nolfi, 2020).

6.2.3 Development

The developmental process prescribes how the information encoded in the genotypes, i.e. the information structures that are subjected to variation and selection, determines the phenotypes, i.e. the robots that are situated in the environment.

The simplest way to realize the developmental process consists in using a direct encoding method in which each element of the genotype specifies a corresponding property of the phenotype. For example, as in the case of the experiment of Pagliuca & Nolfi (2020) reported in Section 3.4, the genotype can be formed by a vector of real numbers that specifies the connection weights of the robot’s neural network and the properties of the elements forming the body of the robot. Direct encoding methods operating on genotypes formed by vectors of real numbers can be used in combination with modern evolutionary strategies, such as the one illustrated in the previous section, which estimates the gradient of the expected fitness. However, direct encoding methods might not be suitable to encode phenotypes characterized by a variable number of features (e.g. a variable number of body elements with parametrized properties).

A second possible method consists of using genotypes formed by a list of tuples with a type and a set of associated parameters, as in the case of the experiment of Lipson & Pollack (2000) reviewed in Section 3.4. Each tuple encodes a corresponding phenotypical element (e.g. a body element or a connection among two neurons), as in the case of the direct encoding methods. However, the relation among elements can vary and can be encoded by using indexes pointing to other elements. For example, the tuples encoding neural connections can include two indexes pointing to the two neurons from which the connection originates and ends. As discussed in Chapter 3, this method can be used to evolve bodies formed by a variable number of elements assembled in variable topologies (as in Lipson & Pollack, 2000) or neural networks formed by a variable number of neurons connected in variable architectures (as in the case of the NEAT algorithm developed by Stanley & Miikkulainen, 2002). The length of the genotype and the number of elements forming the phenotype can be increased or decreased by using genetic variation operators which add or remove tuples.

A third method consists in encoding a developmental neural network in the genotype by using one of the two methods described above. The developmental neural network is then used to determines the presence/absence of the elements in the body or brain space and the property of the elements. An example of this method is the work of Stanley, D’Ambrosio & Gauci (2009) who evolved neuro-agents provided with neural network policies with variable architectures. This is realized by postulating the existence of a series of neurons arranged in successive planes embedded in an Euclidean space and by evolving a developmental network with six input neurons encoding the x, y, and z position of two neurons, and two output neurons encoding the presence/absence of the corresponding connection and the connection weight. Connections and the connection weights are then determined by querying the developmental network with the spatial coordinates of all possible couple of neurons. A similar method was used by Cheney, Clune & Lipson (2014) in the experiment reviewed in Section 2.3 to encode the body properties of the evolving agents. In that case, the input neurons of the developmental network encode the x,y and z position of a voxel in the 10x10x10 body space. The output neurons specify the type of cell that should be placed in the voxel or whether the voxel should remain empty. The network is then queried 1000 times, for each voxel position, to determine the content of each voxel.

Finally, the fourth method consists in encoding in the genotype the developmental rules that determine how an initial stem cell divide in two daughter cells that, in turn, subdivide multiple times generating 4, 8, 16 and more cells. During the division process the cells can differentiate and specialize. This produces a developmental process analogous to that of natural multicellular organisms. This method has been used by Joachimczak, Suzuki & Arita (2016) to evolve agents formed by 2D spherical cells connected through elastic springs.

Some authors explored the possibility to combine evolution and learning (Nolfi & Floreano, 1999). For example, in the model proposed by Floreano, Durr & Mattiussi (2008), some of the connection weights are genetically encoded while others are set randomly and are adapted on the basis of genetically encoded Hebbian rules while the robot interacts with its environment. We will describe more recent attempts to combine evolution with un-supervised and self-supervised learning in Chapter 11.

6.3 Reinforcement learning methods

Another way to build adaptive robots is reinforcement learning (Sutton & Barto, 2018), a form of trial and error learning analogous to that used by animals and humans to adapt during the course of their life. The process is driven by a reward function that rates how well or how badly the robot is doing.

Reinforcement learning differs from evolution in two respects: (i) it operates on a single individual, instead than on a population of individuals, and (ii) it varies the parameters of the robot on the basis of a reward that rates the performance of the robot on each single step, instead than on the entire evaluation phase. The overall objective of learning is that to maximize the cumulative reward, as in the case of evolutionary methods. Although, the method is usually used to maximize the discounted expected return (illustrated below) which is an approximation of the cumulative reward.

In the large majority of cases, reinforcement learning is used to adapt only the connection weights of the neural network policy of the robot. Recently, however, some authors used it also to adapt selected features of robots’ body (see for example Schaff et al., 2019; Luck, Amor & Calandra, 2019).

In general reinforcement learning methods are more sample efficient than evolutionary methods, i.e they are faster than evolutionary methods. On the other hand, they are more complex, depend on a larger set of hyperparameters, and are more subjected to instability, i.e. to degenerative processes that prevent progress and eventually produce regressions.

6.3.1 Basic concepts

Reinforcement learning usually introduce random variations in the actions performed. This is another difference with respect to evolutionary methods that instead introduce random variations in the parameters of the control network. The introduction of random variation in the action vector is realized by using stochastic policies. The two most common kinds of stochastic policies are categorical policies and diagonal Gaussian policies. The former are used in the case of discrete action spaces, i.e. in the case of problems where the agent can perform an action chosen within a finite set of possible alternative actions (like in the case of the Atari games [Mnih et al., 2015]). The latter are used for problems involving a continuous action space. Categorical policies use an output layer that encodes the probability to execute each possible alternative action followed by a softmax layer that normalizes the probabilities so that their sum equal 100%. Diagonal Gaussian policies use an output layer that encodes the action vector that is then perturbed through the addition of random values with mean 0.0. The distribution of the random values is encoded in a vector containing the diagonal elements of the covariance matrix representing the multivariate normal distribution of actions which is updated together with the other parameters of the policy,

The trajectory or rollout is the sequence of observations, actions performed, and rewards received during each step of evaluation episodes.

As specified above, the goal of the agent is to maximize some form of cumulative reward or return. One kind of return is the finite-horizon undiscounted return, which is just the sum of rewards obtained in a fixed window step. An alternative return is the infinite-horizon discounted return, which is the sum of all rewards later obtained by the agent discounted by how far they are in the future. The discount factor gives more importance to short term than to long term future reward.

Finally, the expected return is the return that an agent can expect to receive from a certain state on, under the assumption that the agent will act on the basis of a given policy.

Online learning indicates algorithms that modify the parameters of the control network by using the rollout data collected on the last performed evaluation episodes. This data is later discarded. Offline learning algorithms, instead, use all the data generated during the training process to update the control network. Experience replay refers to an intermediate case in which the data used is sampled from the rolling history.

On-policy algorithms work with a single policy network which determines the actions on the basis of the observations. Off-policy algorithms, instead, use two policy networks. A target policy that is learned and a behavioral policy that is used to determine the actions performed on the basis of the observations.  

6.3.2 Q-learning

Reinforcement learning comes in different varieties. A first variety, that is suitable for problems with discrete action space is action-value function learning or Q-learning (Sutton & Barto, 2018). In short, Q-learning approximate the function Q(s, a) that computes the expected return of each possible action a in each possible state s. The Q function is often approximated by a neural network that receives as input the observation and the action and produces in output the expected return. The action executed by a Q-learning agent is selected by using the current Q function to evaluate the expected return of each possible action in the current state and by selecting the action with the highest expected return.

One way to calculate the expected reward is the Montecarlo method that consists in waiting the end of the evaluation episode and then calculating, for each step, the sum of the reward received in successive steps.  The sequence of observations and associated expected rewards are then used as training set for training with back-propagation the neural network computing the Q(s, a) function. The process is normally performed offline, i.e. by using data collected not only during the last evaluation episode but also during recent evaluation episodes (experience replay) or during all episodes.

As an example of the Montecarlo method, consider an agent learning to play the Pong Atari game at an intermediate stage of the learning process. During an evaluation episode it will select actions that produced good returns in previous evaluation episodes but will also select random action with a probability ε where ε is a parameter that is progressively reduced during the course of the training process. Given the scoring rule of the Pong game, the agent will always receive a reward 0.0 during evaluation episodes and a reward of 1.0 and -1.0 at the end of won and lost episodes-games, respectively. Consequently, the expected undiscounted return will be 1 or -1. Overall, this implies that the training data generated in this way will alter the parameters of the policy so to increase and decrease the Q-value of actions produced in won and lost episodes, respectively. Clearly, not all actions selected in successful and unsuccessful episodes necessarily contributed positively or negatively to the outcome, respectively. However, over many episodes, only the actions that really contributed to produce the positive result will be associated with high Q values consistently.

An alternative method for estimating the expected return consists in the temporal difference method that permits to calculate a gradient after each step, without waiting the end of the episode. This is realized by using the Q-values computed by the policy itself as a proxy of the expected return. The Q-values computed by the policy are initially wrong but become more and more accurate as learning proceeds. The idea can be illustrated with the following example. Suppose that on Friday you want to predict the weather for Sunday and you have some model of weather forecast. According to the Montecarlo method, you have to wait until Sunday to update your model. However, on Saturday your model will provide a more accurate forecast than on Friday. This means that you can use the difference between the prediction made on Friday and the prediction made on Saturday to improve your prediction model already on Saturday.

The advantages of temporal difference is that it permits using a teaching signal that is less noisy than that computed with Montecarlo. The advantage of Montecarlo is that it is not biased by the inaccuracy of the value prediction used by the temporal difference method that necessarily drives the learning process toward partially wrong directions. Montecarlo methods are slower but tend to converge on good solutions in the long run. Temporal difference methods are usually faster but might diverge, i.e. might produce phases in which the quality of the policy decreases instead of increasing. A good compromise consists in using a hybrid approach in which the policy is updated every n steps, where n is greater than 1 and smaller that the length of the evaluation episodes.

A notable example of Q-learning algorithm is the Deep Q-Learning (DQN) which enabled to achieve human-level performance on many Atari games (Mnih et al., 2015).  Unfortunately, Q-learning does not perform well on problems involving continuous action spaces. Probably this is due to the fact that attempting to learn a function that estimate the expected value of all possible actions in all possible states is too complex.

6.3.3 Policy learning

Policy learning methods determine the actions on the basis of the observations directly, as in the case of evolutionary methods. Learning is normally performed online and on-policy, i.e. the update of the parameters of the policy is based on the data collected recently by performing the actions prescribed by the current version of the policy. In the case of actor-critic methods, the policy is also trained to approximate a value function, i.e. to predict the return expected at each step. As we will see below, this predicted value can be used to compute an advantage, i.e. how much better or worse the executed action is with respect to the average action executed in the current situation by the current policy. The advantage is then used to determine whether the probability to execute the actions performed should be increased or decreased.

An example of algorithm of this class is the Vanilla Policy Gradient (VPG) described in Achiam (2018). The pseudocode of the algorithm is included below. The policy and the value function are approximated by an actor and a critic neural networks with parameters θ and Φ, respectively, that receive as input the current observation. The policy network, which consists of a diagonal Gaussian policy, outputs the action vector. The critic network, instead, outputs the value (i.e. the predicted cumulative reward).

During each iteration k, the agent is evaluated for multiple episodes until a certain number of steps (e.g. 4000) are collected. The data collected at each step include: the observation vector, the action vector, the reward, the log probabilities of the action vector, and the value, i.e. the predicted cumulative reward (line 3). The rewards and the value data (produce by the critic) are then used to compute the cumulative reward from each step on, and the advantage of the actions executed (line 4 and 5).

The cumulative reward is computed by discounting future rewards by a parameter  set to 0.99 on the basis of the following equation:

where rt is the reward received at time t.

The advantage, i.e. how much better the discounted return of the action executed is with respect to the average return obtained in the state, is computed on the basis of the following equation:

Where the value V is the output of the critic network and λ is an additional discounting parameter set to 0.97. The vector of advantages is then normalized so to assume an average of 0.0 and normal standard deviation.

At this point the algorithm can estimate the gradient for the actor network that maximizes the advantages by using the log derivative trick (line 6), update the parameters of the policy network with the Adam stochastic optimizer (line 7), and update the parameters of the critic network based on the difference between the estimated discounted return computed by the critic network and the actual discounted return (line 8). The critic network is also updated by using the Adam stochastic optimizer.

Policy learning algorithms include the A2C/A3C (Min, Badia, Mirza et al., 2016), the TRPO (Schulman et al., 2015), and the PPO (Schulman et al., 2017). The last two methods are extensions of the vanilla policy gradient algorithm described above. The extension consists in the addition of a rule that reduces the amplitude of the step taken along the gradient to avoid the instability that can be caused by the introduction of changes which alter too much the behavior of the policy. In the case of the PPO this is achieved by penalizing excessive divergence from the previous policy or by clipping the gradient so to keep it’s size below a desired threshold.

Policy learning and Q-learning can be also combined. An example of hybrid method of this type is the DDPG algorithm (Lillicrap, Hunt, Pritzel et al, 2015) that concurrently learns a deterministic policy and a Q-function and use the former to improve the latter and vice versa.  

6.4 How to design the fitness or reward function

Designing the fitness function or the reward function that enables a robot to develop a certain skill can be challenging although it is certainly much less challenging than designing the robot directly, without relying on an adaptive process. Such difficulty concerns the design of incentive plans for animals and humans as well. The problem arises from the fact that incentives can often be maximized in many alternative ways including ways that are unsatisfactory from the point of view of the designer.

A first aspect to consider is that rewards can be sparse, i.e. can reward the accomplishment of a function without necessarily rewarding the steps that enable the achievement of the goal. For example, predator robots evolved for the ability to ‘catch’ prey robots can be rewarded only for reaching and touching the prey robot (Simione and Nolfi, 2020). A sparse reward of this type is sufficient for enabling the development of a series of behaviors that are instrumental with respect to this objective and that are not rewarded explicitly. For example, the ability of avoiding obstacles, pursuing a moving target, optimizing the motion trajectory with respect to multiple constraints, integrating sensory information over time, anticipating the behavior of the prey robot, managing appropriately available energy resources, adapting on the fly to the behavior of the current prey, etc. Similarly, robotic hands trained through reinforcement learning for the ability to move and rotate an object to a desired position and orientation (Andrychowicz et. al., 2018) can be rewarded only when the goal is achieved, without providing any feedback for the sequence of actions and behaviors that are necessary to achieve the goal. Consequently, the usage of fitness/reward functions that reward the robot for the ability to achieve its goal only should be considered the most straightforward solution.

In certain cases, however, sparse rewards of this type might create a bootstrap problem. The problem originates from the fact that the behavior displayed at the beginning of the training process is poor. The combination of a sparse reward with poor behaviors can lead to a situation where the robots keep receiving the same null rewards. In this condition, the adaptive process is unable to detect and retain adaptive variations and consequently to produce progress.

The best way to avoid the bootstrap problem consists in exposing the adaptive agents to variable environmental conditions, i.e. varying the characteristics of the environment in which the robot is situated during evaluation episodes and varying the initial position, orientation, and posture of the robot at the beginning of evaluation episodes. Indeed, the greater the variability of the learning conditions is, the greater the chance that the robot will get a reward is. A good example of this issue is the case of the robotic hand of Andrychowicz et. al. (2018) mentioned above, that we will review in more detail in the next chapter. The chance to be rewarded even once at the beginning of the training process, when the offset between the initial and final position and orientation of the object is significant, is negligible. However, the chance to be rewarded when the offset is minimal is much higher, even at the beginning of the training process. By selecting the target position and orientation of the object randomly, the authors ensure the exposure to highly variable environmental conditions including particularly favorable conditions from which the chance to be rewarded is sufficiently high.

When the bootstrap problem cannot be solved by increasing the variability of the learning experiences, the experimenter can be forced to reward the robot also for the ability to achieve sub-goals that are instrumental to the achievement of the robot’s main goal or that discourage the exhibition of behaviors that are dysfunctional to the achievement of the robot’s goal. In this case, the fitness/reward function will include a primary reward component for the achievement of the goal and one or more secondary reward or punishment components. The utilization of additional components can solve the bootstrap problem and speed-up learning but also introduce biases in the learning process that can steer the process toward sub-optimal solutions. For example, consider the case of a robot trained to play tennis against a pre-programmed opponent player. A function that rewards the robot only when it manages to win a game might cause the bootstrap problem since the chance of winning a game at the beginning of the adaptive process is extremely low. The experimenter might thus decide to give a primary reward for winning a game, a secondary smaller reward for scoring a point, and a tertiary even smaller rewards for shooting balls inside the opponent court. After all, scoring points and complying with the game rules are necessary pre-requisite for winning a game. However, the addition of secondary reward components will encourage the achievement of the corresponding sub-goals independently of whether they are instrumental or not to the achievement of the primary goal. For example, they can favor the development of strategies that maximize the number of valid shots by using a prudent behavior that minimizes the risk of producing nets and outs but generate situations that are easy to handle from the opponent. In other words, progress with respect to secondary components can reduce, rather than increase, the chances to win games. This is due to the fact that the secondary components can be maximized in different manners, and only some of these manners are instrumental to the achievement of the robot’s primary goal. Overall this implies that the experimenter should limit as much as possible the usage of additional reward or punishment components and should formulate them so to minimize their biasing effect.

The problem is complicated by the fact that the effect of secondary reward components on the adaptive process vary as a function of the learning phase and of the behavior currently exhibited by the robot. Consequently, the utilization of additional components can be beneficial in some replications of the experiment or during a certain phase of the adaptive process but can have a negative effect in other replications or in other phases.

Finally, the fitness/reward function should also be tuned to the characteristics of the method used. In evolutionary methods the learning process is driven by the cumulative reward obtained during evaluation episodes. In reinforcement learning, instead, is usually driven by the expected return computed on a finite horizon. Consequently, functions that are suitable for methods of a class are not necessarily suitable for methods of a different class, see for example the analysis reported in Pagliuca, Milano & Nolfi (2020).

6.5 Learning by demonstration

Finally, a third and last class of adaptive approaches consists in learning by demonstration or imitation learning (Argalla et al., 2009; Billard & Grollman, 2013; Billard, Calinon & Dillmann, 2016). In these methods, the experimenter shows to the robot a behavior performing a given desired function and the robot learns to approximate the demonstrated behavior.

An example is constituted by the robot arm shown in Video 6.1 that is trained for the ability to prepare orange juice (Billard & Grollman, 2013). The training occurs in three phases: a demonstration phase in which the teacher demonstrate the appropriate behaviors in variable conditions, a training phase in which the training set generated during the demonstration phase is used to train a neural network policy through a supervised learning method, and a testing phase in which the ability of the robot to perform the task and to generalize to partially new conditions is evaluated.

Video 6.1. Top: demonstration phase. Bottom: post-evaluation phase. ([], Figure 1 and 2, Billard & Grollman, 2013).

The demonstration phase can be realized in different manners. In the case of kinesthetic teaching the actuators of the robot are set in a passive mode and experimenter physically manipulates the robot so to force it to produce the desired behavior. This is the modality used in the case of the experiment shown in Video 6.1. The demonstration phase is used to generate a training set containing the observations and the state of the actuators inferred from the passive movements of the robot on the basis of the state of proprioceptors. Kinesthetic teaching constitutes the most straightforward method to demonstrate the target behavior but is not free from drawbacks and limits. A drawback concerns the need to “clean” the training set from the stimuli generated by the presence of the demonstrator.  A limit is that the demonstrator can control only a limited number of DOFs during the demonstration.

In teleoperation the actuators of the robot are controlled by the experimenter through a joystick and/or a haptic device. The demonstrator teleoperates the robot on the basis of visual information gained from her/his perspective through her/his own sensing system. Alternatively, the demonstrator can teleoperate the robot on the basis of sensory information extracted from the robot’s own sensors and accessed through some form of virtual reality tool. An advantage of teleoperation is that the demonstrator does not need to stay close to the robot and consequently does not interfere with the data collected. The limit is that, as for kinesthetic teaching, controlling robots with many DOFs can be difficult and might require long practicing sessions.

In methods that use direct imitation of human behavior, the training set is generated from the observation of a human displaying the desired behavior. An example is constituted by the work of Ude, Atkeson & Riley (2004) in which a humanoid robot is trained to produce the walking behavior on the basis of a training set including the position and the velocity that the robot’s joints should have over time. This information is extracted from videos displaying a walking human through a standard machine vision tool that identifies the position and orientation of the human limbs over time. The advantage of this method is that it enables the demonstrator to move freely and naturally. A downside is that it can only be applied to robots with a human morphology that matches in detail the characteristics of a real human. Another difficulty is that the information extracted from the observation of the demonstrator can be inaccurate and incomplete due, for example, to parts visually occluded by other parts.

Approaches also vary with respect to the level of organization at which they are applied. Indeed, learning by demonstration can be applied to train the robot to display low-level behaviors capable of achieving a desired function, as in the examples illustrated above, or to learn to combine pre-existing behaviors to achieve new functions (Dilmman, 2004; Skoglund et al., 2007).

A general drawback of learning by demonstration is that minor differences in the action produced by the robot with respect to demonstrated actions can lead to significant differences over time which might prevent the robot from achieving the expected goal during the testing phase.

One way to limit this problem consists in using an iterated approach in which the demonstration, training and evaluation phases are repeated multiple times and in which the training set is extended with additional demonstrations selected with the purpose of improving the current unsatisfactory aspects of the robot’s behavior. Yamashita & Tani (2008) used this strategy to train a humanoid robot to hold and to move an object with two hands. After the first training phase the authors observed that the robot failed to hold the object since it did not use enough force to push the object within its hands. They thus created a second training set that was generated by allowing the robot to act on the basis of its trained controller and by pushing the robot’s hands on the object so to encourage the robot to learn to put more pressure on the object. We will describe this experiment in more details in Chapter 11.

Learning by demonstration methods can be usefully combined with evolutionary or reinforcement learning. The learning by demonstration component can be used to acquire a behavior that approximates a suitable effective behavior but which eventually fails to achieve the goal due to small errors accumulating over time. The evolutionary or reinforcement learning component can be used to refine the behavior acquired through the learning by demonstration method in a way that is functional to the achievement of the desired goal. Hybrid approaches of this type have been used to train a robot arm to flip pancakes (Kormushev et al 2010) and to solve the ball in a cup problem (Kober & Peters, 2011).

6.6 Learn how

Read section 13.9 and follow the instructions included in the Exercise 8 to understand in details the implementation of the OpenAI-ES evolutionary algorithm included in the evorobotpy2 software. Read section 13.10 which provides a guideline for setting hyperparameters in evolutionary experiments.

You can experiment with reinforcement learning algorithm after reading the next chapters.


Achiam J. (2018). Openai spinning up. GitHub, GitHub repository. See also

Andrychowicz M., Baker B., Chociej M. et. al. (2018). Learning dexterous in-hand manipulation. arXiv:1808.00177v5

Argalla B.D, Chernova S., Veloso M & Browning B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, (57) 5: 469-483.

Billard A. & Grollman D. (2013). Robot learning by demonstration. Scholarpedia, 8 (12): 3824.

Billard A.G., Calinon S. & Dillmann R. (2016). Learning from humans. In B. Siciliano & O. Khatib (Eds.), Handbook of Robotics, II Edition. Berlin: Springer Verlag.

Brockhoff D., Auger A., Hansen N., Arnold D.V. & Hohm T. (2010). Mirrored sampling and sequential selection for evolution strategies. In International Conference on Parallel Problem Solving from Nature. Berlin, Germany: Springer Verlag.

Cheney N., Clune J. & Lipson H. (2014). Evolved electrophysiological soft robots. In H. Sayama, J. Reiffel, S. Risi, R. Doursat & H. Lipson (Eds.), Proceedings of the Fourteenth International Conference on the Synthesis and Simulation of Living Systems (ALIFE 14). New York: The MIT Press.

Dillman R. (2004). Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems, (47) 2-3: 109-116.

Floreano D., Dürr P. & Mattiussi C. (2008). Neuroevolution: from architectures to learning. Evolutionary intelligence, 1(1): 47-62.

Hansen N. & Ostermeier A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, (9) 2: 159–195.

Holland J.H. (1975). Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.

Holland J.H. (1992). Adaptation in Natural and Artificial Systems, 2nd edition. MIT Press, Cambridge, MA, 1992.

Joachimczak M., Suzuki R. & Arita T. (2016). Artificial metamorphosis: evolutionary design of transforming, soft-bodied robots. Artificial life, 22 (3): 271-298.

Kingma, D. P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kober J. & Peters J. (2011). Policy search for motor primitives in robotics. Machine Learning, 84 (1-2): 171-203.

Kormushev P., Calinon S. & and Caldwell D. (2010). Robot motor skill coordination with EM-based reinforcement learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Lillicrap T.P., Hunt J.J., Pritzel A. et al. (2015). Continuous control with deep reinforcement learning. arXiv:1509.02971.

Lipson H. & Pollack J.B. (2000). Automatic design and manufacture of robotic lifeforms. Nature, 406 (6799): 974-978.

Luck K.S., Amor H.B. & Calandra R. (2019). Data-efficient co-adaptation of morphology and behaviour with deep reinforcement learning. arXiv:1911.06832v1

Min H., Badia A.P., Mirza M. et al. (2016).  Asynchronous methods for deep reinforcement learning. Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016.

Mnih V., Kavukcuoglu K., Silver D., Rusu A. A., Veness J., Bellemare M. G., Graves A., Riedmiller M., Fidjeland A. K., Ostrovski G. et al.  (2015). Human-level control through deep reinforcement learning. Nature, 518: 529–533.

Mouret J-B. & Doncieux,S. (2009). Overcoming the bootstrap problem in evolutionary robotics using behavioral diversity. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC-2009). Washington, DC: IEEE Press.

Nolfi S. & Floreano D. (1999). Learning and evolution. Autonomous robots, 7(1): 89-113.

Nolfi S. & Floreano D. (2000). Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. Cambridge, MA: MIT Press/Bradford Books.

Nolfi S., Bongard J., Husband P. & Floreano D. (2016). Evolutionary Robotics, in B. Siciliano and O. Khatib (eds.), Handbook of Robotics, II Edition. Berlin: Springer Verlag

Pagliuca and Nolfi (2020). The dynamics of body and brain co-evolution. arXiv:2011.11440.

Pagliuca P., Milano N. & Nolfi S. (2019). Efficacy of modern neuro-evolutionary strategies for continuous control optimization. arXiv:1912.05239.

Pugh J.K., Soros L.B. & Stanley K.O. (2016). Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3, 40.

Rechenberg I. & M. Eigen. (1973). Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Frommann-Holzboog Stuttgart.

Salimans T., Ho J., Chen X., Sidor S. & Sutskever I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864v2

Schaff C., Yunix D. Chakrabarti A. & Walter M.R. (2019).  Jointly learning to construct and control agents using deep reinforcement learning. International Conference on Robotics and Automation (ICRA) Palais des congres de Montreal, Montreal, Canada.

Schulman J., Levine S., Abbeel P., Jordan M.I. & Moritz P. (2015). Trust region policy optimization. In ICML, pp. 1889–1897.

Schulman J., Wolski F., Dhariwal P., Radford A. & Klimov O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Schwefel H.P. (1981). Numerical Optimization of Computer Models, Wiley, New York, 1981.

Schwefel H.P. (1995). Evolution and Optimum Seeking, Wiley, New York, 1995.

Sehnke F., Osendorfer C., Ruckstieb T., Graves A., Peters J., and Schmidhuber J. (2010). Parameter-exploring policy gradients. Neural Networks, 23(4):551–559.

Simione L. & Nolfi S. (2019). Long-term progress and behavior complexification in competitive co-evolution. arXiv:1909.08303.

Skoglund A., Iliev B., Kadmiry B. & Palm R. (2007). Programming by demonstration of pick-and-place tasks for industrial manipulators using task primitives. International Symposium on Computational Intelligence in Robotics and Automation (CIRA 2007).

Skolicki Z. & Jong K. D. (2004).  Improving evolutionary algorithms with multi-representation island models. In Parallel Problem Solving from Nature – PPSN VIII 8th International Conference. Springer-Verlag.

Stanley K.O. & Miikkulainen R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, vol. 10 (2): 99–127.

Stanley K.O., D’Ambrosio D.B. & Gauci J. (2009). A hypercube-based indirect encoding for evolving large-scale neural networks. Artificial Life, vol. 15 (2): 185–212.

Sutton R.S. & Barto A.G. (2018). Reinforcement learning: An introduction. 2nd Edition. MIT press.

Ude A., Atkeson C.G. & Riley M. (2004). Programming full-body movements for humanoid robots by observation, Robotics and Autonomous Systems, (47):2–3: 93–108.

Whitley D., Rana S., & Heckendorn R.B. (1997). Island model genetic algorithms and linearly separable problems. In Selected Papers from AISB Workshop on Evolutionary Computing, volume 1305 of Lecture Notes In Computer Science, pages 109-125. Springer-Verlag.

Wierstra D., Schaul T., Glasmachers T., Sun Y., Peters J. & Schmidhuber J. (2014). Natural evolution strategies. The Journal of Machine Learning Research, 15(1), 949-980.

Yamashita Y. & Tani J. (2008). Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment. PLoS Computational Biology, 4 (11): e1000220.