Behavioral and Cognitive Robotics
An adaptive perspective

Stefano Nolfi

© Stefano Nolfi, 2021

Abstract

This book describes how to create robots capable of developing the behavioral and cognitive skills required to perform a task through machine learning methods. It focuses on model-free approaches with minimal human intervention, in which the behavior used by the robots to solve their task and the way in which such behavior is produced are discovered automatically by the adaptive process, i.e. they are not specified by the experimenter. The book, which is targeted toward researchers, PhD and Master students with an interest in machine learning and robotics: (i) introduces autonomous robots, evolutionary algorithms, reinforcement learning algorithms, and learning by demonstration methods, (ii) uses concrete experiments to illustrate the fundamental aspects of embodied intelligence, (iii) provides theoretical and practical knowledge, including tutorials and exercises, and (iv) provides an integrated review of recent research in this area carried out within partially separate research communities.

Table of Contents

Preface
  1. Autonomous Robots
  2. From Braitenberg's Vehicles To Neuro-Robots
  3. Embodiment
  4. Situatedness
  5. Behavior And Cognition As Complex Dynamical Systems
  6. Adaptation
  7. Robustness, Plasticity and Antifragility
  8. Swarm Robotics
  9. Communication and Language
  10. Neural Basis of Cognition
  11. Cognition
  12. Long-Term and Open-Ended Learning
  13. How to Train Robots Through Evolutionary And Reinforcement Learning Algorithms

References



Preface

This book describes how to create robots capable of developing the behavioral and cognitive skills required to perform a task autonomously, while they interact with their environment, through evolutionary and/or learning processes. It focuses on model-free approaches with minimal human-designed intervention, in which the behavior used by the robot to solve its task and the way in which such behavior is produced are discovered automatically by the adaptive process, i.e. they are not specified by the experimenter.

The first objective of the book is to introduce autonomous robots and adaptive methods: evolutionary robotics, reinforcement learning, and learning by demonstration. The book cannot and does not aim to be exhaustive in that respect. It focuses on the most effective methods at the current state of the art and on the relation between methods that are closely related but are normally studied independently within separate research communities.

A second objective is to use the analysis of the behavioral and cognitive solutions discovered by adaptive robots in concrete experiments to illustrate the fundamental aspects of embodied intelligence: the relation between the body and the “brain” of the robot, the role of sensory-motor coordination, the consequences of under-actuation, the implications of the dynamical and multi-level nature of behavior, the importance of robustness, the role of emergence and self-organization, the impact of the learning experiences on the adaptive process, the role of anticipation and world models, the role of cooperation and competition among robots, and the factors that can promote continuous and open-ended learning.

Finally, the third objective is to enable the reader to acquire hands-on knowledge by experimenting with adaptive robots. This final objective is realized by introducing the reader to easy-to-use and powerful software tools that make it possible to create adaptive robots, replicate representative state-of-the-art experiments, and acquire the practical skills required to carry out high-quality research in this area. Moreover, it is realized by providing directions on how to set the hyperparameters and how to implement the reward function. This last objective is covered in the last chapter of the book and in the “learn how” sections, which introduce the exercises suggested after each Chapter. For examples of experiments which can be performed, see Video 1.

The electronic format of the book allows the usage of videos to illustrate the experiments reviewed, an aspect that is particularly important for robotics. Moreover, it enables the author to update the content of the book more easily. This is crucial to keep the book updated with respect to the theoretical and technical progress of the field.

Video 1. Examples of experiments which can be performed following the guidelines included in Chapter 13.

Acknowledgments: The author thanks Paolo Fazzini for detailed feedback on the manuscript.

Copyright: © 2021 Stefano Nolfi. This is an open access book distributed under the terms of the Creative Commons Attribution License 4.0 (CC BY) which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.




1. Autonomous Robots

1.1 Introduction

This book examines autonomous robots, namely systems capable of operating without human intervention for prolonged periods of time in partially unknown environmental conditions. An instance of this class are vacuum cleaner robots capable of operating autonomously in a natural domestic environment for hours, regardless of the type and position of the furniture present in the house. Autonomous robots differ significantly from industrial robots, which perform pre-determined sequences of movements and operate in well-defined environments structured to support the robots’ operation (Figure 1.1). Moreover, they differ from tele-operated robots, which are controlled manually by humans.

Figure 1.1. An example of industrial robots used for car manufacturing. (Wikipedia commons. https://commons.wikimedia.org/wiki/File:KUKA_Industrial_Robots_IR.jpg)

In this chapter we will introduce autonomous robots and their constituent hardware and software components.

1.2 Robots

For the purpose of this book, whose focus is on behavioral and cognitive robots, we can define a robot as an artificial system that: (i) has a physical body that includes actuators, sensors, and a brain, (ii) is situated in a physical environment and possibly in a social environment including other robots and/or humans, and (iii) exhibits a behavior performing a function.

1.3 The body

Robots’ bodies vary considerably depending on the environment in which they should operate and on the task that they should perform. For example, robots that need to move on irregular terrains can include legs. Robots operating in water can be characterized by hydrodynamic fish-like shapes. Robots interacting with humans can have a humanoid morphology, etc.

Robots’ bodies also vary with respect to their material. Traditionally, robots were made of rigid materials enabling precise movements. More recently, however, robots are often made also with soft materials that do not enable precise motion but provide other important advantages. These advantages include compliance, i.e. the ability of the body to spontaneously adapt its shape during the interaction with the environment, and safeness, i.e. a reduced risk of inadvertently harming humans. Exploiting compliance makes it possible to build robots capable of performing complex operations with simple control rules. For example, it allows building robot manipulators capable of performing effective pick and place operations without knowing the exact position and the exact shape of the object to be grasped (Video 1.1).

Video 1.1. A soft gripper realized by Soft Robotics Inc (https://www.youtube.com/watch?v=czNY4pTjYuM, https://www.softroboticsinc.com/)

Finally, robots’ bodies vary with respect to the fine-grained characteristics of the elements that compose them, e.g. the length, the mass, the shape, and the elasticity of body parts. As we will see in Chapter 3, these aspects can also play an important role.

1.4 The sensors

Sensors allow the robot to extract information from its environment by measuring continuous physical properties, e.g. the frequency of sound or light waves, air temperature, the velocity of the robot, etc. Proprioceptive sensors measure properties of the robot’s internal environment, e.g. the speed of the robot’s wheels, the angle of the joints connecting different body parts, the voltage of the battery, the position and/or velocity of actuated joints, etc. Exteroceptive sensors, instead, measure properties of the external environment. Frequently used exteroceptive sensors include contact sensors (possibly distributed over an artificial skin), distance sensors (infrared, sonar or laser), cameras, microphones, and depth cameras.

Sensors also differ with respect to their passive/active nature. Passive sensors, such as microphones and cameras, measure physical quantities present in the environment. Active sensors, instead, also emit energy. Examples of active sensors are sonar and laser sensors, which extract information from the environment by emitting ultrasound or light waves and by measuring the time required by the wave to be reflected back. Active sensors can be more informative than passive sensors but consume more energy and can interfere with each other.

Sensors can be combined with actuators and/or other passive parts to extract more, or more accurate, information. For example, laser sensors can be mounted on a rotating platform to measure distance over multiple directions. Cameras can be combined with conic mirrors to enable omnidirectional vision.

The state of the sensors in a particular moment can be indicated with the term observation.

1.5 The actuators

Actuators enable the robots to move and possibly to modify their environment by converting stored energy into movements.

The most widely used actuators are electric motors. They perform rotational movements that can be converted into linear movements when needed. Alternative options include hydraulic actuators, pneumatic actuators, and active materials (e.g. electroactive polymers, which expand their volume as a result of electric stimulation).

Actuators are often combined with rigid or elastic supporting structures like wheels, tendons, joints, that co-determine their effect.

The most straightforward way to control actuators consists in directly controlling the torque exerted (torque control). In the case of electric actuators, for example, this can be done by varying the voltage of the electric current powering the motor. Alternatively, it is possible to control the desired velocity of the motor or the position of the joint actuated by the motor, leaving the determination of the voltage to a software routine that computes it from the offset between the desired and actual position or velocity (velocity or position control). Variable stiffness actuators allow controlling both the position and the stiffness of actuated joints.
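As a rough illustration, position control can be sketched as a proportional loop that converts the offset between the desired and the actual joint angle into a motor voltage. The gain, the voltage range, and the function name below are illustrative assumptions, not the interface of any specific robot.

    def position_control_step(desired_angle, actual_angle, gain=5.0, max_voltage=12.0):
        """One step of a minimal proportional position controller (illustrative sketch)."""
        error = desired_angle - actual_angle                 # offset between desired and actual position
        voltage = gain * error                               # proportional correction
        return max(-max_voltage, min(max_voltage, voltage))  # clip to the admissible voltage range

Velocity control works in the same way, with the offset computed between the desired and the measured velocity instead of the position.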

The state of the actuators in a particular moment, e.g. the voltage applied to electric motors, can be indicated with the term action.

1.6 The brain

The brain of the robot regulates the state of the actuators on the basis of the current and previous states of the sensors. In robotics the brain is usually indicated with the term controller. I prefer to use the term brain because the term controller suggests that the brain determines the behavior of the robot while in reality the brain only contributes to determining it. Alternatively, the brain can be indicated with the term policy. As we will see in the next chapters, the brain orchestrates the interaction between the robot and the environment. The behavior of the robot is the result of the robot/environment interaction orchestrated by the brain --- it is not a direct product of the brain.

In principle the brain of a robot could consist only of wires and electronic components like transistors. In practice, however, the brain normally consists of a computer, embedded in the body of the robot and connected with the sensors and the actuators through wires, running software. Alternatively, the computer can be located outside the robot and can communicate with sensors and actuators wirelessly. The software running on the computer includes a standard operating system, the drivers to communicate with the sensors and the actuators, and a high-level software layer that determines the state of the actuators (action vector) on the basis of the current and previous states of the sensors (observations).

The organization of this software and the way in which it is designed depend on the approach chosen. The main approaches used in robotics are illustrated in the next three Sections.

1.6.1 The deliberative approach

The deliberative approach was elaborated and became dominant during the classic AI period, i.e. during the 60s and the 70s, following the advent of digital computers and under the influence of Cognitive Psychology.

The approach assumes that to achieve a given goal a robot (or more generally an intelligent system) should elaborate a mental solution (plan) and then execute the sequence of actions forming the plan. In other words, it assumes that the system should think and then act. Planning consists in searching for a sequence of actions that achieves the goal and is realized by looking ahead at the outcome of the possible actions that the robot can perform. The method requires: (i) an internal representation of the external environment (extracted from observations or compiled manually by the experimenter), (ii) a description of the goal, and (iii) a description of the actions that the system can perform (including a description of the prerequisites and outcomes of each possible action in each possible context).
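As a minimal illustration of this idea, the sketch below searches a hand-coded transition table for a sequence of actions that reaches the goal using breadth-first search. The states, actions, and transition table are invented for illustration and are far simpler than the representations used by real deliberative systems.

    from collections import deque

    # Hypothetical transition model: (state, action) -> resulting state
    TRANSITIONS = {
        ("corridor", "enter-room"): "room",
        ("room", "approach-box"): "at-box",
        ("at-box", "push-box"): "box-moved",
    }

    def plan(start, goal):
        """Breadth-first search for a sequence of actions that achieves the goal."""
        frontier = deque([(start, [])])
        visited = {start}
        while frontier:
            state, actions = frontier.popleft()
            if state == goal:
                return actions
            for (s, action), next_state in TRANSITIONS.items():
                if s == state and next_state not in visited:
                    visited.add(next_state)
                    frontier.append((next_state, actions + [action]))
        return None  # no plan found

    print(plan("corridor", "box-moved"))  # ['enter-room', 'approach-box', 'push-box']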

Brains realized following this approach include three rather independent modules: (i) a perceptual module that is responsible for extracting the internal representation of the external world from observations, (ii) a reasoning or planning module that is responsible for elaborating a plan capable of achieving the current goal, and (iii) an action module that is responsible for executing the sequence of actions that compose the plan (Figure 1.2).

Figure 1.2. Deliberative architectures include three main modules: a perception module, a planning or reasoning module, and an action module.

The most remarkable example of the application of the deliberative approach is the robot Shakey (Nilsson, 1984) developed at the Stanford Research Institute. Shakey was able to perform a series of tasks. It operated by relying on a representation of the environment compiled manually by the experimenters. The robot accomplished the current task by elaborating a suitable plan. The output of the planning system consisted of a sequence of macro actions that were further translated into a longer sequence of micro actions by the action module. These micro actions were finally executed and modulated on the basis of the robot’s observations. Shakey, however, was able to operate effectively only in a static and carefully engineered environment made of large empty rooms populated with few objects of uniform colors. Later attempts to generalize the approach to more dynamic and/or natural environments were much less successful.

The advantage of the deliberative approach is that it is general at the level of its core component: the planning system. In principle, in fact, the same planning system can be used to perform any type of task and to control any type of robot.

The deliberative approach, however, has several drawbacks. A first problem derives from the fact that elaborating a plan might require too much time, especially when the state space and the number of possible actions available to the robot are large. This constitutes a serious problem in robotic applications, in which decision delays often have serious consequences. A second problem is that manually elaborating or automatically extracting an exhaustive representation of the external environment is not feasible. The inevitable inaccuracy and incompleteness of the internal representation imply that the execution of a plan that leads to the desired outcome in the robot’s mental simulation does not necessarily lead to the same outcome in the real world. A third problem is that plans can become obsolete during their own execution due to intervening variations of the environmental conditions. These problems can be mitigated, but apparently not solved, through the use of hybrid methods that combine deliberative approaches with alternative approaches. For these reasons, pure deliberative approaches are rarely used today in robotics. They are rather used in domains such as chess playing or robot surgery in which it is possible to operate on the basis of a complete, accurate, and updated representation of the environment and in which time is not a constraint.

1.6.2 The behavior-based approach

A radically different approach, known as behavior-based robotics, was proposed by Rodney Brooks in the 80s (Brooks, 1991).

This approach looks at robots as systems that are situated and embodied. Situatedness means that they “are situated in the world --- they do not deal with abstract description, but with the ‘here’ and ‘now’ of the environment that directly influence the behavior of the system.” (Brooks, 1991, p. 1227). Embodiment means that “robots have bodies and experience the world directly --- their actions are part of a dynamic with the world, and the actions have immediate feedback on the robot’s own sensations”. (Brooks, 1991, p. 1227).

Behavior-based robots do not rely on an internal model of the external world and do not mentally simulate the effect of actions before acting in the environment. They determine how to act on the basis of the current and possibly previous observations. They can use internal states to store information extracted from observations. However, the function of those internal states is not to represent the external environment, as in deliberative methods, but rather to regulate the actions produced by the robot.

The brain architecture of behavior-based robots includes multiple modules called layers that are responsible for the production of the behaviors enabling the robot to perform its task/s. For example, the brain of a robot might include four modules capable of generating an obstacle avoidance behavior, a wandering behavior, a grasping behavior, and a navigation behavior (see Figure 1.3). Each module incorporates the perceptual and action-decision capabilities relevant for the production of the corresponding behavior and has direct access to sensors and actuators.

Figure 1.3. An example of behavior-based architecture. Rectangles represent the layers responsible for the production of the behavior indicated. Each module has direct access to sensors and actuators. The dotted lines represent the fact that higher layers can subsume lower layers.

The designer of a behavior-based robot starts by identifying the behaviors that the robot should produce to perform its task and then proceeds by implementing the corresponding behavioral layers. The latter is realized incrementally, by starting with the most elementary behaviors and by later proceeding with higher and higher-level behaviors. Once the first elementary layer has been implemented, tested, and refined, the designer proceeds with the implementation of the second layer and then of the other layers dedicated to the production of progressively more complex behaviors.

The higher layers can rely on the contribution of the lower layers. For example, the navigate layer, which enables the robot to travel toward a target destination, can rely on the avoid-obstacle layer, which enables the robot to turn to avoid nearby obstacles (see Figure 1.3).

Behavioral modules can be triggered directly by observations that invite (afford) the execution of the corresponding behavior, i.e. that indicate the opportunity of executing a given behavior. For instance, the observation of a nearby obstacle by means of distance sensors, which suggests the opportunity to perform an obstacle avoidance behavior, can directly trigger the avoid-obstacle layer responsible for the production of the corresponding behavior. This implies that observations affording multiple behaviors can trigger the execution of the corresponding behavioral layers in parallel. In some cases, however, the execution of a certain afforded behavior might conflict with the execution of another afforded behavior. For example, the execution of a grasping behavior can conflict with the execution of an avoidance behavior. In these cases, the designer can resolve the conflict by enabling the higher-level layer to subsume, i.e. to inhibit, the conflicting lower-level layer (Brooks, 1986). In our case, this implies that the designer should include in the grasping layer a command that subsumes the obstacle-avoidance layer during the execution of the grasping behavior.
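A minimal sketch of this arbitration scheme, assuming that each layer maps the current observation to a proposed action or abstains, is shown below. The layer names and the observation fields are invented for illustration; a real subsumption architecture would run the layers asynchronously rather than in a single loop.

    def grasp(obs):
        # Propose a grasping action when an object is within reach.
        return "close-gripper" if obs["object_in_reach"] else None

    def avoid_obstacle(obs):
        # Propose a turn when an obstacle is detected nearby.
        return "turn-left" if obs["obstacle_distance"] < 0.2 else None

    def wander(obs):
        # Default behavior: keep moving forward.
        return "move-forward"

    # Layers ordered from highest to lowest priority: the first layer that
    # proposes an action subsumes (inhibits) all the layers below it.
    LAYERS = [grasp, avoid_obstacle, wander]

    def select_action(obs):
        for layer in LAYERS:
            action = layer(obs)
            if action is not None:
                return action

    print(select_action({"object_in_reach": False, "obstacle_distance": 0.1}))  # 'turn-left'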

A strength of the behavior-based approach is that it can operate effectively in real time and in dynamic environments. A limit of the approach, however, is that it does not scale easily to problems that involve the production of a rich set of behaviors. This happens because controlling the conflicts among behaviors becomes exponentially more complex as the number of behavioral modules increases. A second limit is that identifying the control rules that can be used to produce a given desired behavior can be challenging for the designer. Finally, a third limit concerns the difficulty of realizing effective transitions among behaviors. The realization of effective transitions often requires adapting the way in which the preceding behavior is produced to the characteristics of the following behavior, and vice versa. However, this possibility is prevented by the incremental modality with which behavioral layers are designed.

1.6.3 The adaptive approach

Finally, the adaptive approach aims to create robots capable of developing the behavioral and cognitive skills required to perform a task autonomously, while they interact with their environment, through evolutionary and/or learning processes. It focuses on model-free approaches with minimal human-designed intervention, in which the behavior used by the robot to solve its task and the way in which such behavior is produced are discovered automatically by the adaptive process, i.e. they are not specified by the experimenter.

The role of the experimenter consists instead in the design and implementation of the learning algorithm and of the reward function (i.e. a function that estimates how well or how badly a robot is doing). We will discuss adaptive approaches in detail in Chapter 6.

The first pioneering attempts to design adaptive robots were carried out by cyberneticists in the 50s, before the digital computer era and before the development of the deliberative approach. The first systematic studies, instead, were carried out in the 90s after the development of evolutionary, reinforcement learning, and regression learning algorithms.

Adaptive methods present several advantages. A first advantage is that they free the designer from the need to identify a solution to the task. Identifying the behavior that a robot should produce to perform a task and the characteristics that the robot should have to produce such behavior can be challenging (we will discuss this aspect in more detail in Chapter 5). Moreover, the solutions possibly identified by the experimenter might turn out to be ineffective or unrealistic. Designing the reward function can also be challenging, as we will discuss in Chapter 6, but it is certainly less challenging than designing a complete solution.

A second benefit is that the adaptive approach allows synthesizing integrated solutions in which each characteristic is co-adapted to every other characteristic. This is due to the fact that the adaptive process is driven by a reward function that rates the overall effectiveness of the robot’s behavior. Moreover, it is due to the fact that all characteristics are subjected to variations during the adaptive process. This property clashes with the incremental approach used by behavior-based robotics, in which the characteristics of a layer can be modified during the implementation of that layer only and remain fixed during the implementation of the successive layers.

The downside of the adaptive approach is that the training process is time-consuming and does not guarantee the development of effective solutions. We will discuss how these problems can be handled in the next chapters.

1.7 Learn how

To learn how to create adaptive robots go to Sections 1 and 2 of Chapter 13 and do Exercise 1. You will be introduced to AI-Gym, a software library developed by OpenAI that permits experimenting with robots in simulation and allows comparing adaptive algorithms on a large set of problems.



2. From Braitenberg's Vehicles to Neuro-Robots

2.1 Introduction

The first systematic attempts to build autonomous robots were carried out at the end of the first half of the last century by cyberneticists, an international group of researchers interested in combining ideas and theories from control theory, information theory, biology and neuroscience.

An example of these pioneering robots is the machina speculatrix (Figure 2.1) built by William Grey Walter, one of the most influential exponents among the cyberneticists. Walter’s robot has contact, light and battery sensors. The sensors are connected to two motors, that control the rotation speed of the driving wheel and the steering gear, by means of electric wires and relays. The state of the motors varies as a direct function of the state of the sensors. These simple components enable the robot to appropriately alternate a wandering and a recharging behavior. The latter behavior is produced when the battery is low and is realized by approaching the light located over the recharging station, waiting until the battery is recharged, and moving away from the recharging station to resume the wandering behavior.

Interest in these studies unfortunately declined in the 1960s as a consequence of the boom of Artificial Intelligence and of the popularity of the deliberative approach. However, the interest in the cybernetic approach, which looked at natural organisms as an important source of inspiration and considered intelligent behavior as a property arising from the direct interaction between the agent and its environment, resumed in the 1980s. Indeed, the contributions of Valentino Braitenberg, who published a short but influential book called Vehicles: experiments in synthetic psychology (Braitenberg, 1984), and of Rodney Brooks, the visionary roboticist who invented the behavior-based approach, can be considered as a continuation of the cybernetic approach.

Figure 2.1. The machina speculatrix of William Grey Walter (1953).

2.2 Braitenberg’s vehicles

Valentino Braitenberg, who had a background in medicine and psychiatry, was interested in the study of the human brain. However, like the cyberneticists, he liked the synthetic approach. Namely, he believed that building artificial agents capable of exhibiting natural behavior would contribute to deepening our understanding of natural intelligence. Given the difficulties of constructing robots with the technology of the time, he decided to carry out thought experiments imagining robots of increasing complexity that he called vehicles.

Braitenberg’s vehicles include sensors measuring continuous properties of the environment (e.g. light intensity or temperature), and motors, which enable the vehicles to move. The brain of the vehicles is realized by connecting the sensors to the motors through wires, either directly or indirectly through internal neurons. The way in which sensors and motors are positioned and wired is chosen by taking inspiration from general properties of natural nervous systems such as symmetry, cross-lateral connection, excitation and inhibition, and nonlinearity.

The first vehicle hypothesized by Braitenberg, called vehicle 1, is the simplest possible vehicle that can be built since it includes a single motor, a single sensor, and a single linear excitatory wire that connects the sensor to the motor. The sensor measures the intensity of a certain environmental quantity, for example the temperature. The motor controls the forward speed of the driving wheel.

Figure 2.2. Vehicle 1. The large circle, the small circle, and the rectangle indicate the body, the sensor, and the motor of the vehicle. The green line indicates the excitatory wire.

Wires come in two varieties: they can be excitatory or inhibitory. The presence of a wire connecting a sensor to a motor implies that the activity of the motor depends on the state of the sensor. In the case of excitatory wires, the higher the activation of the sensor is, the higher the speed of the motor will be. Vice versa, in the case of inhibitory wires, the higher the activation of the sensor is, the lower the speed of the motor will be. In the case of linear wires, the relation between the activation of the sensor and the speed of the motor is linear. In the case of non-linear wires, the relation between the two varies according to some non-linear function.

Vehicle 1 will move in the direction it points, whatever it is. However, it will move slowly in cold regions and faster in warm regions. From the point of view of an external observer, therefore, this vehicle appears to “like” cold regions, in which it spends most of its time, and to “dislike” warm regions, from which it moves away at higher speed. The behavior of this vehicle will also depend on the characteristics of its environment and, more specifically, on the distribution of the temperature in space and on the characteristics of the terrain. Indeed, it will move exactly straight only in an idealized, perfectly flat environment. On more realistic surfaces, instead, it will deviate from its course due to the effects of the irregularities of the terrain. In the long run, it will produce a complicated trajectory, curving one way or another without apparent reason.

Figure 2.3. Vehicle 2a and 2b (left and right, respectively). The vehicles include two sensors and two motors on the left and right portions of their body. The yellow disk indicates a light bulb suspended over the floor. The dashed lines indicate the trajectories produced by the vehicles.

Let’s now imagine a slightly more complex vehicle, vehicle 2, that has two sensors and two motors located in the left and right portions of its body. Let’s assume that the sensors measure ambient light and that the environment contains a light bulb suspended over the floor (Figure 2.3). We can create three versions of this vehicle by using excitatory wires, depending on whether we connect: (a) each sensor to the motor on the same side, (b) each sensor to the motor on the opposite side, or (c) both sensors to both motors. Case (c) will behave like vehicle 1, so we will consider (a) and (b) only. In the case of vehicle 2a, the motor located nearer the light will run faster than the other motor. Consequently, the vehicle will steer away from the light and will then move straight away from it at decreasing speed once the light becomes located on its rear side. In the case of vehicle 2b, instead, the motor located nearer the light will run slower than the other motor. Consequently, the vehicle will steer in the direction of the light and will later move toward it at increasing speed, once the light becomes located on its frontal side. In principle, vehicle 2a might also move toward the light source, if it happens to be oriented exactly toward it. In practice, however, small dis-alignments originating during motion, resulting from friction and/or from noise, will amplify over time as a consequence of a positive feedback mechanism (i.e. as a consequence of the fact that small dis-alignments produce actions that increase the dis-alignment). The “temperament” of vehicles 2a and 2b thus looks quite different. Both vehicles seem to “dislike” light sources. However, vehicle 2a “fears” the light and runs away from it, slowing down only when the intensity of the light diminishes. Vehicle 2b instead is “aggressive”. It resolutely faces the light and moves toward it at increasing speed, as if it “wants” to destroy it.
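The behavior of vehicles 2a and 2b can be reproduced with a few lines of code. In the sketch below the light sensors return a value that decays with the squared distance from the bulb and the wheel speeds are set directly by the excitatory wires; the decay function and the constants are illustrative assumptions.

    def light_intensity(sensor_pos, light_pos):
        # Illustrative light sensor: activation decays with the squared distance from the bulb.
        dx, dy = sensor_pos[0] - light_pos[0], sensor_pos[1] - light_pos[1]
        return 1.0 / (1.0 + dx * dx + dy * dy)

    def vehicle2_step(left_sensor_pos, right_sensor_pos, light_pos, crossed=False):
        """Return the (left, right) wheel speeds of vehicle 2a (straight wires) or 2b (crossed wires)."""
        s_left = light_intensity(left_sensor_pos, light_pos)
        s_right = light_intensity(right_sensor_pos, light_pos)
        if crossed:
            # Vehicle 2b: each sensor excites the motor on the opposite side, so the wheel
            # on the darker side spins faster and the vehicle turns toward the light.
            return s_right, s_left
        # Vehicle 2a: each sensor excites the motor on the same side, so the wheel
        # on the brighter side spins faster and the vehicle turns away from the light.
        return s_left, s_right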

Figure 2.4. Vehicle 3a and 3b (left and right, respectively). The red lines indicate inhibitory wires.

By using inhibitory instead of excitatory wires, we can create two other vehicles: vehicle 3a and 3b (Figure 2.4). Vehicle 3a has straight connections between sensors and motors. Vehicle 3b, instead, has crossed connections. Both vehicles will slow down near the light and therefore will spend time near it. Vehicle 3a will steer toward the light since the motor located nearer the light will slow down more than the other motor. The vehicle will later move toward the light with decreasing speed, finally stopping in front of the light. Vehicle 3b, instead, will steer away from the light at decreasing speed and will later move straight away from the light at increasing speed. Both vehicles thus “like” the light, although they do so in different manners. Vehicle 3a “loves” it in a permanent way. It turns toward the light, approaches it, and then stops in front of it “to admire its beauty”. Vehicle 3b “loves” the light but also likes “to keep an eye open” for other possible lights around.

Clearly, we cannot continue a systematic analysis of all the possible vehicles that can be built since the number of possibilities explodes when the number of sensors and/or motors increases. A slightly more complex case, which is worth mentioning, is a vehicle including two types of sensors, e.g. light sensors measuring the intensity of light and infrared sensors measuring the proximity of nearby obstacles. Let’s consider, for example, the vehicle shown in Figure 2.5 that has two light sensors and two infrared sensors connected to the motors through crossed excitatory and inhibitory wires, respectively. The vehicle will approach the light (by steering toward it and by accelerating as it becomes closer) and will avoid obstacles (by steering away from obstacles and by decelerating near obstacles). If the conductivity of the wires (i.e. the intensity of the effect of excitatory and inhibitory wires) is equal, the vehicle can stall in the attempt to steer in one direction to approach the light and in the opposite direction to avoid the obstacle. Instead, if the conductivity of the wires originating from the infrared sensors is greater than that of the wires originating from the light sensors, the vehicle will correctly prioritize the obstacle avoidance behavior over the light approaching behavior. In other words, the vehicle will avoid proximal obstacles even when this conflicts with the need to approach the light and will steer toward the light only when this does not conflict with the need to avoid obstacles. Notice also how this simple architecture with four sensors and four wires enables the vehicle to display sequential behaviors, e.g. an obstacle avoidance behavior followed by a light approaching behavior.

Figure 2.5. A vehicle provided with two infrared sensors (IR) measuring the proximity of nearby obstacles and two light sensors (L) situated in an environment that contains an obstacle and a light bulb suspended over the floor. The thickness of the green and red lines represents the conductivity of the corresponding excitatory and inhibitory wires.

Incidentally, the vehicle in Figure 2.5 exemplifies a method of behavior arbitration that is alternative to the subsumption method introduced by Brooks (1986). The method consists in enabling the modules that are responsible for the production of the different behaviors to operate in parallel while ensuring that the contribution of the module that should be prioritized is significantly greater than the contribution of the other modules (Arkin, 1988). The priority levels of behaviors can be fixed, as in this example, or can be regulated in a context-dependent manner. In the latter case, the relative contribution of the different modules can be varied depending on the context. The relevant contexts can be discriminated by the robot on the basis of information extracted from the robot’s observations.
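Under the assumption that each behavioral module outputs a pair of wheel-speed contributions and that conductivities are represented as fixed weights, this form of arbitration can be sketched as follows; the module bodies and the numerical values are illustrative.

    def approach_light(obs):
        # Crossed excitatory wires: each light sensor increases the speed of the opposite motor.
        return obs["light_right"], obs["light_left"]

    def avoid_obstacles(obs):
        # Crossed inhibitory wires: each infrared sensor decreases the speed of the opposite motor.
        return -obs["ir_right"], -obs["ir_left"]

    # The conductivity (weight) of the avoidance module is greater than that of the
    # light module, so obstacle avoidance prevails when the two behaviors conflict.
    MODULES = [(avoid_obstacles, 2.0), (approach_light, 1.0)]

    def combined_speeds(obs, base_speed=0.5):
        left = right = base_speed
        for module, conductivity in MODULES:
            dl, dr = module(obs)
            left += conductivity * dl
            right += conductivity * dr
        return left, right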

Finally, a thought experiment that is worth mentioning concerns the analysis of how the behavior produced by a vehicle can vary as a result of the characteristics of the environment in which it is situated. Consider for example a vehicle 2a (Figure 2.3, left) with infrared proximity sensors situated near an obstacle or near a maze-like structure (Figure 2.6). When placed near the obstacle, the vehicle will steer to avoid it and will then move straight away from it. When placed at the beginning of a maze-like structure, instead, the vehicle will display an articulated behavior that will enable it to navigate in the maze. This example shows how the behavior produced by a robot depends crucially on the local characteristics of the environment. Moreover, the example shows how a robot can produce multiple behaviors on the basis of a single controller. More generally, it shows how the number and the type of behaviors produced by a robot do not depend only on the brain of the robot but also on the environment.

Figure 2.6. The vehicle 2a with infrared proximity sensors situated near an obstacle (left) and near a maze-like structure (right).

In summary, Braitenberg’s vehicles illustrate in a simple and vivid way several important aspects characterizing robots and more generally embodied and situated agents. Firstly, robots can display purposeful behaviors, i.e. behaviors allowing the achievement of a goal, without possessing any representation of their goal. As we will see in Chapter 12, possessing an explicit representation of the goal can provide advantages, but is not a pre-requisite for producing goal-directed behaviors. Secondly, the characteristics of the environment co-determine the behavior produced by a robot. The environment is as important as the brain for the determination of the behavior exhibited by the robot. Thirdly, the complexity of the behavior produced by a robot can exceed the complexity of the robot’s brain. In other words, robots can produce complex behaviors without necessarily possessing complex brains.

2.3 Neuro-Robots

The electric wires and the relays used by the cyberneticists or imagined by Braitenberg can be replaced conveniently with artificial neural networks formed by interconnected sensory, internal, and motor neurons. Modern robots of this type are indicated with the term neuro-robots.

Artificial neural networks are composed of simple computational units called neurons interconnected by directional weighted links called connections. Neurons produce as output a real number that is a function of the inputs received from incoming connections or, in the case of sensory neurons, from the sensors. The input received from an incoming connection depends on the output of the neuron from which the connection originates (pre-synaptic neuron) and on the connection weight, which is encoded as a real number. Connections with positive and negative weights increase and decrease the output of the post-synaptic neuron, respectively, and are referred to as excitatory and inhibitory connections.

The way in which the neurons are wired together determines the architecture of the neural network. Usually neurons are organized in layers: a sensory layer, one or more internal layers, and a motor layer. In fully connected feed-forward architectures each neuron of a layer is connected to each neuron of the following layer (Figure 2.7, left). Information thus flows in one direction only, from sensory neurons to motor neurons. In recurrent neural networks, instead, neurons can also be connected to neurons of the same or of previous layers (Figure 2.7, right).

Figure 2.7. Left: a feed-forward neural network with two internal layers. Right: a fully-recurrent neural network with one internal layer. Circles represent neurons. Rectangles represent layers. Arrows represent connections from all pre-synaptic neurons to all post-synaptic neurons. The outputs of the neurons and the connection weights are not shown.

In the case of sensory neurons, the output corresponds to the state of the corresponding sensor. For example, the activation state of 16 distance sensors can be encoded in 16 corresponding sensory neurons. The 640x480 image perceived by a color camera can be encoded in 921,600 sensory neurons encoding the RGB values of the 307,200 pixels. The activation of the sensors should be normalized within a suitable range (e.g. [-1.0, 1.0]).

In the case of the internal and motor neurons, the output $y_i$ is a function $f()$ of the sum of all incoming inputs $x_j$ weighted by the connection weights $W_{ij}$ plus a bias term $b_i$:

$$y_i = f\left(\sum_j W_{ij} x_j + b_i\right)$$

where $f()$ is the activation function, $j$ is the pre-synaptic neuron, $i$ is the post-synaptic neuron, $x_j$ is the output of the pre-synaptic neuron, $W_{ij}$ is the connection weight between the pre-synaptic and post-synaptic neurons, and $b_i$ is the bias term of the post-synaptic neuron. The bias is equivalent to a connection weight originating from a neuron that always produces 1.0 as output. Commonly used activation functions include the tanh nonlinear function, the linear function, and the ReLU function.
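A minimal feed-forward implementation of this update, assuming tanh internal units and linear motor units, might look as follows. It is only a sketch of the kind of network the reader is asked to implement in the exercises, not the code distributed with the book.

    import numpy as np

    def forward(observation, weights, biases):
        """Propagate an observation through a fully connected feed-forward network.

        weights and biases contain one weight matrix and one bias vector per layer
        (the internal layers followed by the motor layer)."""
        x = np.asarray(observation, dtype=float)
        for W, b in zip(weights[:-1], biases[:-1]):
            x = np.tanh(W @ x + b)               # internal layers: tanh activation
        return weights[-1] @ x + biases[-1]      # motor layer: linear activation

    # Example: 16 sensory neurons, one internal layer of 10 neurons, 2 motor neurons.
    rng = np.random.default_rng(0)
    weights = [rng.normal(0.0, 0.1, (10, 16)), rng.normal(0.0, 0.1, (2, 10))]
    biases = [np.zeros(10), np.zeros(2)]
    action = forward(rng.uniform(-1.0, 1.0, 16), weights, biases)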

The output of motor neurons is used to set the state of the robot’s actuators. For example, in the case of a robot arm with 7 actuated degrees of freedom, the torque exerted by the 7 motors controlling the 7 corresponding actuated joints can be set on the basis of the output of 7 motor neurons normalized in a [-100, 100] Newton-meter range.

The state of the neurons is normally updated with a fixed frequency (e.g. 10 Hz) and in a fixed order, i.e. first the sensory neurons, then the internal neurons of the first layer, then the internal neurons of additional layers if present, and finally the motor neurons. In the case of recurrent neural networks, the input of recurrent connections (i.e. connections that originate from the same or from successive layers) is calculated on the basis of the output of the pre-synaptic neuron at time t-1. We will discuss recurrent networks in more detail in Chapter 10.

The action of the neural network, i.e. the vector of outputs of the motor neurons, depends on the observation (i.e. the state of the sensory neurons) and on the connection weights. In the case of recurrent networks, it also depends on the output of the internal neurons at time t-1, which depends on the previous states of the neurons. This implies that feed-forward networks always respond in the same manner to the same observation (as long as the connection weights do not change) while recurrent neural networks can vary their response as a function of previous observations.

The connection weights are initially set randomly and are then modified by means of an adaptive algorithm.  

The brain of the robot can be realized by using alternative formalisms to neural networks such as fuzzy rules or programming languages. Artificial neural networks, however, represent the most popular choice for the following reasons: (i) they provide a natural way to encode quantitative information, (ii) they degrade gracefully as a consequence of variations, and (iii) they generalize, i.e. they respond to new observations by producing actions similar to those produced for similar observations. Moreover, they constitute an ideal substrate for the realization of an adaptation process thanks to the quantitative nature of their parameters and to the properties described above. Finally, they make it possible to exploit the theoretical and practical knowledge developed over the years by the large community of researchers who studied them.

2.4 Adaptation

Braitenberg managed to correctly predict the behavior produced by his simple vehicles situated in simple environments. Indeed, his predictions were later confirmed by researchers who built the first three vehicles described in his book and reviewed above. Predicting the behavior exhibited by a robot and designing a robot capable of exhibiting a given desired behavior, however, become more and more challenging as the complexity of the robot and/or of the environment increases.

Braitenberg himself realized this difficulty and proposed to overcome the problem by using a form of artificial selection that could be realized by repeating the following phases several times: (i) placing over a table a bunch of hand-designed vehicles (the table might include lights, sound sources, small cliffs, etc.); (ii) periodically creating copies of vehicles selected among those that remained on the table without falling down; (iii) introducing random mistakes in the copying process (e.g. including an inhibitory instead of an excitatory wire, adding or eliminating a sensor or a wire, etc.). This process can lead to the evolution of vehicles displaying behaviors which increase their ability to “survive”, e.g. moving away from the border of the table, pushing other vehicles out, resisting aggressive behaviors displayed by the other vehicles. In other words, it can lead to the evolution of vehicles displaying useful behaviors. Indeed, the mistakes produced during the copying process can occasionally produce vehicles capable of surviving longer. These vehicles will later proliferate since they will have a greater chance to be selected as blueprints for new vehicles.

This idea was realized in concrete experiments a few years later within Evolutionary Robotics (Nolfi and Floreano, 2000; Nolfi et al., 2006), a research area that elaborated methods for evolving robots. A standard evolutionary robotics method that can be used to evolve the brain of a robot involves the following steps:

  1. The experimenter chooses a task and a robotic platform suitable for the problem. For example, the experimenter might decide to evolve a legged robot for the ability to walk on an irregular terrain.
  2. The experimenter designs and implements the fitness function, i.e. a criterion that rates with a scalar value the extent to which a robot accomplishes the task. For example, to evolve walking robots the experimenter can decide to use as fitness the distance between the initial and final positions of the robot during one or more evaluation episodes.
  3. The experimenter evolves the connection weights of the brain of the robot in simulation by using an evolutionary algorithm like that illustrated below.
  4. The experimenter embeds the best evolved brain in the physical robot and verifies that the solution evolved in simulation transfers successfully to the real world.

A possible evolutionary algorithm (Pagliuca, Milano and Nolfi, 2018) that can be used is illustrated below. The procedure starts by creating a population of vectors that encode the parameters of a corresponding population of robots. In this case, the parameters encode the connection weights of the robots' brain and consist of a vector of real numbers of length p, where p is the number of connection weights and biases of the robots’ neural network. The population θ thus consists of a matrix including λ vectors of length p. The population can be initialized with random Gaussian numbers with mean 0 and variance τ or by using a standard initialization method that varies the distribution on the basis of the number of pre-synaptic and post-synaptic neurons (Glorot & Bengio, 2010). Then, for a certain number of generations, the algorithm evaluates the fitness of the individuals forming the population, ranks the individuals of the population on the basis of their fitness, and replaces the parameters of the worst λ/2 individuals with varied copies of the best λ/2 individuals. Variations are introduced by adding to the copied parameters a vector of Gaussian numbers with mean 0 and variance σ. The evaluation of an individual is realized by creating a robot with the parameters specified in the vector θi and by measuring the fitness while the robot interacts with its environment for a given amount of time.
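A minimal sketch of this kind of procedure is given below, under the assumption that evaluate() runs one or more evaluation episodes and returns the fitness of a robot whose brain uses the given parameter vector. The listing is illustrative and does not reproduce every detail of the algorithm of Pagliuca, Milano and Nolfi (2018).

    import numpy as np

    def evolve(evaluate, nparams, lam=20, generations=100, tau=0.1, sigma=0.02, seed=0):
        """Minimal evolutionary algorithm operating on a population of weight vectors (sketch)."""
        rng = np.random.default_rng(seed)
        pop = rng.normal(0.0, tau, (lam, nparams))                # population of lambda parameter vectors
        for generation in range(generations):
            fitness = np.array([evaluate(ind) for ind in pop])    # evaluate each individual
            pop = pop[np.argsort(-fitness)]                       # rank from best to worst
            half = lam // 2
            # Replace the worst half with copies of the best half perturbed by Gaussian noise.
            pop[half:] = pop[:half] + rng.normal(0.0, sigma, (half, nparams))
        return pop[0]                                             # best individual of the last generation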

We will illustrate evolutionary algorithms and other adaptive algorithms in more details in Chapter 6.

2.5 Learn how

To acquire hands-on knowledge on the topics discussed in this Chapter, you can read Sections 3 and 4 of Chapter 13 and do Exercises 2 and 3. You will learn how to implement a neural network policy and the evolutionary algorithm illustrated in Section 2.4 from scratch. You will also learn how to implement an algorithm compatible with the AI-Gym library that can be applied to all problems available in the AI-Gym library.



3. Embodiment

3.1 Introduction

The term embodiment is used to indicate agents that possess a physical body. In the case of robots, the body is composed of parts made of different materials and arranged in a given morphology. The role of the body was largely underestimated during the 70s, when the deliberative view of intelligence was dominant. The importance of the body started to be appreciated only later, when a series of studies demonstrated the strict interdependence between the body and the brain both in natural systems (Thelen & Smith, 1994; Spivey, 2007) and in robots (Pfeifer and Bongard, 2016).

This drastic change of perspective led to the elaboration of a new scientific paradigm that is indicated with the term embodied mind (Varela, Thompson and Rosch, 1991) or embodied cognitive science (Pfeifer and Scheier, 1999; Pfeifer and Bongard, 2016). According to this paradigm, intelligent behavior is the emergent product of many decentralized interactions occurring between the brain of the agent, the body of the agent, and the environment. The function of each of these elements is interdependent and cannot be separated from the function of the other elements. The body plays a role as important as the role of the brain.

3.2 On the importance of fine-grained morphological properties

Clearly, the overall morphology of a robot constrains the behavior that it can produce. Morphologies suitable to fly might not be suitable to swim and vice versa. However, the fine-grained characteristics of the body matter as well. Indeed, robots can exploit the morphological properties of their body to produce effective behaviors. For example, they can exploit the size, the shape and the elasticity of their body parts and the effects that these properties have on the interaction between the robot and the environment to produce a desired behavior.

The importance of detailed characteristics of robots’ bodies can be illustrated by passive walking robots (McGeer, 1990; Collins et al., 2005), i.e. brainless bipedal robots capable of walking on an inclined slope without actuators. These machines consist only of rigid bars linked by passive hinge joints (see Figure 3.1). Despite their simplicity, they are capable of producing a walking behavior, they are energy efficient, and they are robust to environmental disturbances. Moreover, they display a natural walking behavior (Video 3.1).

Figure 3.1. Schematization of a 2D passive walker. (Figure adapted from McGeer, 1990)

The walking behavior of these machines emerges from the interaction between the body of the robot and the slope, mediated by the effect of gravity and collisions. Indeed, the force generated by gravity on the body parts of the robot and the force originating from the collisions of the feet with the ground modify the position of the hip and of the other parts forming the legs. The modification of the positions and the orientations of the body parts, in turn, alters the way gravity acts on the body parts. The sequence of these bi-directional interactions, in which gravity and collisions modify the state of the robot and the new state of the robot combined with its morphological characteristics regulates the effects of gravity and collisions, generates a periodic dynamical process (the walking behavior) that stays stable as long as the machine is on the slope.

Video 3.1. Passive Dynamic Walker. (https://www.youtube.com/watch?v=WOPED7I5Lac; McGeer, 1990).

Clearly, possessing articulated legs constitutes a prerequisite for the production of the walking behavior in these robots. However, the detailed characteristics of the body, i.e. the length of the upper and of the lower segments, the mass of each part, and the rigidity/elasticity of the material matter as well. Indeed, even small variations of these properties can compromise the walking ability. In other words, the walking ability of a passive walker depends on the fine-grained characteristics of its body.

For the sake of analysis, we can divide the periodic walking behavior exhibited by a passive walker robot into four phases:

  1. The robot starts swinging its right leg forward by exploiting the effect of gravity and inertia while maintaining its left leg extended thanks to the support of its left foot (that is in contact with the ground) and thanks to the support of the fully extended left knee joint.
  2. Once the right leg becomes fully extended, the robot touches the ground with its right foot and keeps moving its hip forward as a result of inertia. These movements, combined with the pivot action of the right leg, produce an elevation of the left leg from the ground.
  3. The robot starts swinging its left leg forward by exploiting the effect of gravity and inertia while maintaining its right leg extended thanks to the support of the right foot, in contact with the ground, and thanks to the support of the fully extended right knee joint.
  4. Once the left leg becomes fully extended, the robot touches the ground with its left foot and keeps moving its hip forward as a result of inertia. These movements, combined with the pivot action of the left leg, produce an elevation of the right leg from the ground.
  5. Repeat the above four phases.

The sub-behavior produced during each phase crucially depends on the fine-grained characteristics of the body of the robot. For example, the mass of the shank codetermines the swinging speed during phases 1 and 3. The swinging speed, in turn, codetermines the angle that the knee joints assume at the end of the swinging phase, when the foot touches the ground. The angle of the knee at the end of the swinging phase, combined with the angular limit of the knee joints, determines the pivot action that starts the swinging of the other leg. Overall, this implies that the mass of the parts of the body, the length of the segments, and the angular limits of the joints are crucial for the production of the walking behavior. They should be set carefully to ensure that the interaction between the body and the environment produces a walking behavior.

3.3 Morphological computation

As we stated in the previous Chapter, the role of the brain is to codetermine the actions performed by the robot in a context-dependent manner. The same role, however, can be played by the body of the robot. Indeed, as we have seen in the previous Section, the velocity with which a passive walker moves its legs during a swinging phase depends on the mass of its shanks. The mass of the shank, therefore, can play the same role of an integrated set of sensors, neurons and actuators able to detect information about the current position of the body parts of the robot, compute the state of the actuators, and exert a torque on the joint so as to appropriately alter the speed of the legs. The ability of the robot’s body to perform the same function of the brain has been named morphological computation (Pfeifer, Iida & Gómez, 2006).

In the case of the passive walkers illustrated above, the role of regulating the actions performed by the robot in a context-dependent manner is entirely played by the body. In other cases, this function is carried out by both the body and the brain. Examples of the latter cases include quasi-passive walkers capable of walking also on flat terrain (Omer et al., 2009) and compliant quadruped robots (Spröwitz et al., 2013).

The possibility for a robot to produce a given desired behavior thanks to a proper body and/or to a proper brain has two important implications. The first is that the robot can use morphological computation to perform regulations that would be difficult or impossible to perform through the brain, e.g. very fast regulations. This implies that robots exploiting morphological computation can outperform robots lacking this property. The second implication is that robots can produce “complex” behavior with “simple” brains, or even without a brain at all, as in the case of the passive walkers.

Morphological computation also implies that the ability to codetermine the robot’s actions in a context dependent manner can be distributed between the brain and the body of the robot. This implies (i) that the functions played by the body and by the brain are not separated, and (ii) that the role played by the brain cannot be explained without considering the role of the body and vice versa.

3.4 Body adaptation

The body of the robot can be adapted so as to enable the robot to perform a desired function.

An interesting example of body adaptation is constituted by the work of Cheney, Clune & Lipson (2014), who evolved robots formed by cubic cells made of different materials to move as fast as possible over a planar surface. The cells are distributed within a grid of 10x10x10 voxels. Each voxel can be empty or can include a cell of one of the following types: a bone cell made of rigid material, a tissue cell made of soft material, a muscle cell that periodically contracts and expands, or a muscle cell that periodically expands and contracts (Video 3.2).

Video 3.2. Evolving robots formed by cubic cells of different type. Rigid, soft, contracting-and-expanding, expanding-and-contracting cells are indicated in blue, cyan, red and green, respectively (https://www.youtube.com/watch?v=z9ptOeByLA4; Cheney, Clune & Lipson, 2014).

In the example shown in the Video, the evolving robots are sensor-less. In other related models, however, the expansion and contraction behavior of the muscle cells is regulated on the basis of the state of local sensors (see Cheney, Clune & Lipson, 2014).

The content of each voxel is determined through an evolved developmental network that receives as input the Cartesian coordinates of each voxel and produces as output the content of the voxel (for details, see Cheney, Clune & Lipson, 2014). Evolving robots are situated over a flat surface. The fitness corresponds to the distance covered by the robots during an evaluation episode.
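
To make the idea of a developmental encoding more concrete, the following Python sketch maps the coordinates of each voxel of a 10x10x10 grid to one of the possible cell types through a small neural network whose weights constitute the genotype. The network architecture and the cell labels are illustrative assumptions and are much simpler than the encoding actually used by Cheney, Clune & Lipson (2014).

    import numpy as np

    # Possible contents of a voxel (labels are placeholders).
    CELL_TYPES = ["empty", "bone", "tissue", "muscle_a", "muscle_b"]

    class DevelopmentalNetwork:
        def __init__(self, hidden=16, seed=0):
            rng = np.random.default_rng(seed)
            # The weights constitute the genotype manipulated by the evolutionary algorithm.
            self.w1 = rng.normal(0, 1, (3, hidden))
            self.w2 = rng.normal(0, 1, (hidden, len(CELL_TYPES)))

        def query(self, x, y, z):
            # Normalize the voxel coordinates to [-1, 1] and propagate them through the network.
            inp = (np.array([x, y, z]) / 9.0) * 2.0 - 1.0
            h = np.tanh(inp @ self.w1)
            out = h @ self.w2
            return CELL_TYPES[int(np.argmax(out))]   # content of the voxel

        def develop(self):
            # Query the network once per voxel to obtain the body of the robot.
            grid = np.empty((10, 10, 10), dtype=object)
            for x in range(10):
                for y in range(10):
                    for z in range(10):
                        grid[x, y, z] = self.query(x, y, z)
            return grid

    body = DevelopmentalNetwork().develop()
    print(sum(cell != "empty" for cell in body.flat), "non-empty voxels")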

As illustrated in Video 3.2, the morphology of the robots varies across generations. The evolutionary process produces robots with more and more adapted bodies capable of walking better and better, until the process converges on a given morphological solution that remains relatively stable in successive generations. The morphology, the behavior, and the performance exhibited by evolved robots vary considerably in different replications of the experiment.

These robots represent another clear illustration of the principle of morphological computation. They are brainless, like passive walkers. Consequently, their ability to walk comes exclusively from their morphological characteristics.

A following work demonstrated the possibility of using a similar approach to evolve creatures made of living cells capable of swimming in a liquid environment (Kriegman et al., 2020). This was realized: (i) by evolving in simulation creatures formed by rigid and contractile cells for the ability to swim in a liquid environment, and (ii) by creating living agents formed by cardiomyocyte and epidermal stem cells assembled to create morphologies similar to those of the evolved creatures (Video 3.3). These creatures display behaviors similar to those displayed by the corresponding evolved creatures in simulation (Video 3.3). Remarkably, they are capable of exploring their aqueous environment for days without human intervention.

Video 3.3. Evolved robots made of passive and contractile cells evolved in simulation and then manufactured by using living cells. (https://www.youtube.com/watch?v=1zVG8wptOOI; Kriegman et al., 2020).

3.5 Co-evolving the body and the brain

The robots described in the previous section are brainless. However, one can also evolve the body of robots that include a neural network controller. In other words, one can set up an experiment in which the body and the brain of the robots co-evolve.

This can be done by using a fixed number of morphological and control parameters and by using a direct encoding method in which each morphological and control parameter is encoded in a corresponding element of the genotype. Alternatively, it can be realized by using an indirect encoding method and genotypes with varying length.

The direct encoding method has been used by Pagliuca & Nolfi (2020) to adapt the brain and the detailed morphological properties of robots with a predetermined bauplan and to evolve robots in which both the general and detailed characteristics of the body are subjected to variations. In both cases the robots are formed by a fixed number of cylindrical elements attached through actuated hinge joints (Figure 3.2). However, in the first case the robots are characterized by a hand-designed morphological layout that is substantially preserved, since the range of variation of the morphological parameters is limited. This is the case, for example, of the HalfCheetah robot (Todorov, Erez & Tassa, 2012; Coumans & Bai, 2016) with variable morphological parameters shown in Figure 3.2 (top). In the second case, instead, the robots are constituted by a collection of identical elements assembled in a uniform bauplan that can be transformed substantially. This is the case of the robots shown in Figure 3.2 (bottom), formed by 11 segments attached to each other through actuated hinge joints. In these robots the number of morphological parameters subjected to variation and the range of variation are larger.

Figure 3.2. Top: A HalfCheetah robot with morphological properties that can vary up to 20%. Bottom: A robot formed by 11 segments attached to each other through actuated joints in which the morphological parameters are allowed to vary within larger ranges.

The robots are provided with feed-forward neural network controllers. The sensory neurons encode: the velocity and the orientation of the central segment, the angular position and velocity of the actuated joints, and the state of contact sensors located on the segments. The motor neurons encode the torque applied to the actuated joints. The morphological parameters that are subjected to variation are the length and the diameter of the segments, the angular limits of the joints, and, in the case of the second set of experiments, the default angle of each segment with respect to the previous segment. The range of variation of the morphological parameters is limited to 20% in the case of the first set of experiments and is larger in the case of the second set of experiments (see Pagliuca and Nolfi, 2020).
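
The following sketch illustrates how a direct encoding of this kind can be organized: the genotype is a flat vector whose first part encodes the morphological parameters, allowed to vary within ±20% of hand-designed default values, and whose second part encodes the connection weights of the feed-forward controller. The numbers of segments, sensors and motors used below are placeholders, not the values used by Pagliuca & Nolfi (2020).

    import numpy as np

    N_SEGMENTS = 7          # hypothetical number of body segments
    N_OBS, N_ACT = 26, 6    # hypothetical numbers of sensory and motor neurons

    def decode(genotype):
        n_morph = N_SEGMENTS * 3          # e.g. length, diameter and joint limit per segment
        morph_genes = genotype[:n_morph]
        weight_genes = genotype[n_morph:]

        # Morphological genes in [-1, 1] vary each default value by at most +/- 20%.
        defaults = np.ones(n_morph)
        morphology = defaults * (1.0 + 0.2 * np.clip(morph_genes, -1.0, 1.0))

        # The remaining genes are the weights of a single-layer feed-forward policy.
        weights = weight_genes.reshape(N_OBS, N_ACT)
        policy = lambda obs: np.tanh(obs @ weights)   # torques in [-1, 1]
        return morphology, policy

    genotype_len = N_SEGMENTS * 3 + N_OBS * N_ACT
    genotype = np.random.default_rng(0).normal(0, 0.5, genotype_len)
    morphology, policy = decode(genotype)
    print(morphology[:3], policy(np.zeros(N_OBS))[:3])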

Video 3.4 shows the behavior of typical evolved robots. The obtained results indicate that the possibility of varying the morphology of the robot within limits permits achieving better performance with respect to control experiments where the morphology of the HalfCheetah robots remains fixed (Pagliuca and Nolfi, 2020). The results of a second set of experiments allowing the morphological properties to vary within larger ranges indicate that evolution successfully discovers suitable morphological bauplans from scratch (Pagliuca and Nolfi, 2020). The analysis reported by the authors also indicates that the advantage gained by co-evolving the body and the brain derives from the possibility of adapting the features of the body to the current features of the brain and vice versa.

Video 3.4. Top: Behavior of a typical evolved HalfCheetah robot in which the parameters of the morphology are allowed to vary up to 20%. Bottom: Behavior of a typical robot when the general bauplan of the robot’s body is evolved as well.

Other works have been realized by using indirect encoding methods and genotypes with variable length (Sims, 1994; Lipson and Pollack, 2000; Auerbach et al., 2014). Lipson and Pollack (2000), for example, proposed evolving robots composed of a variable number of elementary elements (cylinders, actuated telescopic joints, and neurons) arranged in variable configurations. The actuated telescopic joints permit varying the length of the cylinders while the robot interacts with its environment. The genotypes of the evolving robots are formed by a list of tuples that encode the elements forming the body and the brain of the robot as well as the relation between these elements. More specifically, each tuple encodes an element belonging to one of the following categories: vertices (i.e. points in a Euclidean space), cylinders, neurons, and actuators. Moreover, each tuple encodes the parameters of the corresponding element (e.g. the coordinates of vertices, the connection weights of neurons) and its relations with the other elements.

Variations are introduced by using genetic operators that: (i) randomly perturb the real-valued parameters (e.g. the coordinates of vertices, the connection weights of neurons, etc.) or replace the integer parameters with a new integer in the available range (e.g. the index of cylinders and neurons), (ii) add a new random tuple or remove an existing tuple, (iii) replace a tuple encoding a cylinder with two tuples encoding two cylinders extending between the two original vertices.
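
The sketch below illustrates these three classes of operators on a genotype represented as a list of tuples. The tuple fields are simplified placeholders rather than the exact encoding used by Lipson and Pollack (2000).

    import random

    def mutate(genotype, rng=random):
        g = [dict(t) for t in genotype]          # copy the list of tuples
        op = rng.choice(["perturb", "add_remove", "split"])
        if op == "perturb":
            # (i) perturb a real-valued parameter or resample an integer index
            t = rng.choice(g)
            key = rng.choice([k for k in t if k != "kind"])
            if isinstance(t[key], float):
                t[key] += rng.gauss(0.0, 0.1)
            else:
                t[key] = rng.randrange(len(g))
        elif op == "add_remove":
            # (ii) add a new random tuple or remove an existing one
            if rng.random() < 0.5 or len(g) < 2:
                g.append({"kind": "cylinder", "v1": rng.randrange(len(g)),
                          "v2": rng.randrange(len(g)), "length": rng.random()})
            else:
                g.pop(rng.randrange(len(g)))
        else:
            # (iii) split a cylinder into two cylinders spanning the original vertices
            cylinders = [i for i, t in enumerate(g) if t.get("kind") == "cylinder"]
            if cylinders:
                old = g.pop(rng.choice(cylinders))
                mid = {"kind": "vertex", "x": rng.random(), "y": rng.random(), "z": rng.random()}
                g.append(mid)
                g.append({"kind": "cylinder", "v1": old["v1"], "v2": len(g) - 1,
                          "length": old["length"] / 2})
                g.append({"kind": "cylinder", "v1": len(g) - 2, "v2": old["v2"],
                          "length": old["length"] / 2})
        return g

    genotype = [{"kind": "vertex", "x": 0.0, "y": 0.0, "z": 0.0},
                {"kind": "vertex", "x": 1.0, "y": 0.0, "z": 0.0},
                {"kind": "cylinder", "v1": 0, "v2": 1, "length": 1.0}]
    print(mutate(genotype))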

The robots are evolved for the ability to move as much as possible. The fitness is computed by measuring the distance between the initial and final positions of the center of mass of the robot during an evaluation episode.

Video 3.5. Robots with co-evolved body and brain. (https://www.youtube.com/watch?v=qSI0HSkzG1E; Lipson & Pollack, 2000).

Video 3.5 shows the behavior and the morphology of typical evolved robots.  As shown in the Video, the solutions evolved in simulation can be manufactured by creating the body parts with a 3D printer and by manually assembling the motors and the electric wires.

3.6 Defining embodiment

I began this chapter by saying that the term embodiment refers to the fact that robots have physical bodies and that the properties of the body co-determine the behaviors displayed by the robots. From the perspective of this initial definition, embodiment is a categorical property owned by robots and missing in other agents, e.g. web bots. It is a constituting property of all robots.

After having illustrated the way in which the body influences the behavior of a robot, we can introduce a more specific definition of embodiment that has a quantitative nature. Embodiment indicates the extent to which the body of a robot is adapted to the other constituting elements: the brain of the robot, the environment, and the task that the robot should accomplish. I believe that this definition describes the true essence of embodiment. Moreover, I believe that this definition has a practical utility since it permits differentiating robots that are capable of exploiting their morphological properties from robots lacking this ability.

On the basis of this definition we could state that a robot, with certain morphological properties, is more embodied than a second robot, with different morphological properties. We can say that the higher the level of embodiment, the greater the chances that the robot will operate effectively. Moreover, we can say that the higher the level of embodiment, the greater the amount of computation performed by the body (and consequently the smaller the amount of computation that needs to be performed by the brain).

3.7 Learn how

Read Section 13.5 to familiarize yourself with evorobotpy2, a simple but powerful software tool that enables you to evolve robots in simulation. Use evorobotpy2 to evolve an agent capable of solving the acrobot problem by following the instructions included in Exercise 4. Then read Section 13.6 and follow the instructions included in Exercise 5 to learn how to co-evolve the body and the brain of a robot formed by multiple segments and adapted for the ability to locomote.



4. Situatedness

4.1 Introduction

Another fundamental property of robots is situatedness, i.e. being situated in an external environment. The presence of an external environment combined with the fact that the behavior of robots results from the continuous interaction between the robots and the environment has important consequences.

At each step: (i) the robot determines the state of its actuators on the basis of the current and previous observations, (ii) the action of the robot alters the robot/environment relation and possibly the environment itself (Figure 4.1). Consequently, the action produced by the robot co-determines the sensory states that the robot experiences later.

This implies that the actions performed by a robot have two effects: (1) they enable the robot to achieve its goals, (2) they alter the perceptual environment of the robot and/or the physical environment. Ideally the second effect should be instrumental to the first, i.e. the robot should choose actions that alter the environment in a way that facilitates the achievement of the robot’s goal. In this Chapter we will show that the possibility to alter the environment plays a key role in adaptive robots.
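
The following minimal sketch makes this loop explicit. The environment and the policy are toy placeholders; the point is only that the action computed at each step feeds back into the observations experienced at the following steps.

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(0, 0.5, (4, 2))           # a hypothetical reactive policy

    def policy(observation):
        return np.tanh(observation @ weights)       # action from the current observation

    def environment_step(state, action):
        # The action alters the robot/environment relation (here, a toy state update),
        # and the new state determines the next observation.
        new_state = state + 0.1 * np.concatenate([action, -action])
        observation = np.tanh(new_state)
        return new_state, observation

    state = np.zeros(4)
    observation = np.tanh(state)
    for step in range(100):
        action = policy(observation)                           # (i) act on the current observation
        state, observation = environment_step(state, action)   # (ii) the action alters what is observed next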

Figure 4.1. The behavior of a robot is the result of the continuous interaction between the robot and the environment.

4.2 Sensory-Motor Coordination

The term sensory-motor coordination indicates the ability of an agent to act so as to ensure that it will later experience useful observations, i.e. sensory states enabling the robot to achieve its goals.

It is important to consider that coordinating the sensory and motor processes is not simply an exploitable possibility but a necessity, since the actions of the robot will inevitably alter the following observations. Indeed, the observations later experienced by the robot always depend on the actions performed by the robot previously. The perceptual environment of the robot is a relational property that depends both on the environment and on the robot.

The fact that the actions performed by a robot have long-term effects on further observations, which determine the actions that the robot will perform next, is also one of the reasons that explain why robots are difficult to design. To identify the right action that the robot should choose in a given moment, the designer has to predict the long-term effect that actions have on forthcoming observations. This is challenging since, as we will see in the next Chapter, the behavior of a robot is a dynamical system originating from continuous non-linear interactions between the robot and the environment. The behavior of dynamical systems of this kind cannot be predicted, even with a complete knowledge of the robot and of the environment.

4.3 The Foraging Animat

To illustrate the role of sensory-motor coordination, let’s consider an animat (i.e. an abstracted robot) living in a circular world divided into 40 cells (Figure 4.2; Nolfi, 2005). The animat is rewarded or punished with scores 1 and -1 for each step taken in the left and in the right side of the environment, respectively. The animat is initially placed in a randomly selected cell. Consequently, to maximize its fitness the animat should stay in the left side and should move from the right to the left side in case it finds itself located in the right side.

The animat observes only the color of the cell in which it is currently located. There are 20 different colors indicated by 20 corresponding numbers. Each color is present once in the left and once in the right side of the environment, in different positions. At any step, the animat can move one cell clockwise, move one cell anti-clockwise or remain still.

Figure 4.2. The circular environment of the foraging animat. The environment includes 40 cells of 20 different colors. The color of the cell is indicated by the number. A cell of each color is included both in the left and in the right side of the environment. Consequently, the color of the cell where the animat is located does not provide any clue on whether the animat is in the good or bad side. (Nolfi, 2005).

Now let’s try to guess a policy that could enable the animat to maximize its fitness by moving away from the right side and by remaining in the left side. Take your time before continuing to read.

When I ask this question during lectures or seminars, listeners usually identify solutions in which the agent determines its actions on the basis of the current and previous observations, i.e. solutions that require storing information from previous observations in memory and later using that information to decide how to act. For example, the animat can solve the problem by circling clockwise through all cells and stopping on cell 10 after having observed cell 9. As there is a single cell 10 preceded by cell 9, in the clockwise direction, and since this cell is located in the left side, the animat will correctly reach the left side and stop there. The agent will erroneously abandon the left side when it happens to start from a succeeding cell, but it will re-enter and stay in the left side later on.

Now let’s try to imagine a solution that does not require memory, i.e. a policy relying only on the color of the cell where the animat is currently located. Is such a solution conceivable? Again, take your time before continuing to read.

I have asked this question more than fifty times to small and large audiences and nobody ever identified a solution that does not require memory. Indeed, such solutions exist. They actually sound trivial after seeing one of them (see Figure 4.10). Reactive solutions of this type can be encoded by using only 20 bits specifying the clockwise or anti-clockwise reaction of the animat to each of the 20 colors (“stay still” actions are not useful).

After understanding one of the possible solutions, generating similar solutions becomes easy. What is crucial is that actions are chosen so as to avoid the generation of attractors in the right side of the environment. In principle, actions should also be chosen so as to produce attractors in the left side of the environment, but this will occur naturally as a consequence of the fact that the sequences of colors in the two sides differ. Attractors correspond to pairs of adjacent cells in which the animat reacts by moving clockwise in the first cell and anti-clockwise in the second cell, considering the clockwise order. Solutions of this type exist for other environments as well, independently of the specific sequence of colors, provided that the sequences of colors in the two sides differ.
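
The sketch below implements the animat problem and searches at random for a memory-less policy, i.e. a simple mapping from the 20 colors to clockwise or anti-clockwise movements. The cell layout is generated randomly rather than taken from Figure 4.2, and random search is used only as a coarse stand-in for the adaptive algorithm.

    import random

    random.seed(3)
    N_CELLS, N_COLORS, STEPS = 40, 20, 100
    left = random.sample(range(N_COLORS), N_COLORS)    # colors of the cells of the good (left) side
    right = random.sample(range(N_COLORS), N_COLORS)   # colors of the cells of the bad (right) side
    colors = left + right                              # cells 0..19 form the good side

    def average_reward(policy, starts=range(0, N_CELLS, 5)):
        # Average reward per step over episodes starting from a sample of cells:
        # +1 for each step spent in the good side, -1 for each step spent in the bad side.
        total = 0
        for start in starts:
            cell = start
            for _ in range(STEPS):
                cell = (cell + policy[colors[cell]]) % N_CELLS
                total += 1 if cell < N_CELLS // 2 else -1
        return total / (len(starts) * STEPS)

    best_policy, best_reward = None, -1.0
    for _ in range(3000):
        policy = [random.choice((+1, -1)) for _ in range(N_COLORS)]   # 20-bit reactive policy
        reward = average_reward(policy)
        if reward > best_reward:
            best_policy, best_reward = policy, reward
    print("best memory-less policy, average reward per step:", round(best_reward, 3))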

We can indicate solutions of this type as emergent solutions to emphasize the fact that the ability of the animat to solve the problem emerges from the combined effect of how the animat reacts to many observations. The way the animat reacts to a specific color assumes a function only in combination with the way the animat reacts to the other colors.

4.4 The Foraging Robot

I was not able to imagine emergent solutions of this type either. However, I observed them in experiments that I performed by evolving robots. For example, I observed a solution of this type in an experiment in which a Khepera robot (Mondada, Franzi & Ienne, 1993), situated in an arena surrounded by walls and containing a cylindrical object, was evolved to find and remain near the cylinder (Video 4.1; Nolfi, 1996).

Video 4.1. A foraging robot evolved for the ability to find and remain near the cylindrical object. The video shows the behavior produced by robots evolved in 3 different replications of the experiment. (https://youtu.be/ikXvhT6I4wQ; Nolfi, 1996).

The robot has 6 sensory neurons encoding the state of the 6 frontal infrared proximity sensors, 5 internal neurons, and 2 motor neurons controlling the desired speed of the two wheels, which can rotate forward and backward. The sensory neurons enable the robot to detect obstacles up to a distance of 4 cm. The evolving robots are initially placed in randomly chosen positions and orientations. Evaluation episodes are terminated after 500 time steps, corresponding to 50 s. The fitness of the robots corresponds to the fraction of steps the robot spends near the cylinder.

We might reasonably think that to solve this problem the robot should display the following behaviors: explore the environment, discriminate whether a nearby obstacle is a wall or a cylinder, move away from walls, approach the cylinder, and stop near the cylinder. The evolved robots do indeed explore the environment and move away from walls (see Video 4.1). However, they do not stop near the cylinder. Instead they remain near the cylinder by alternating turn-left and turn-right movements and/or move-forward and move-backward movements.

By analyzing the sensory patterns experienced by the robot I observed that those experienced near walls and those experienced near cylinders largely overlapped in sensory space. In other words, as in the case of the foraging animat described in the previous section, a single observation does not allow discriminating a good location (near the cylinder) from a bad one (near a wall). In principle, these robots could discriminate the class of the obstacle by analyzing how observations vary over time while they approach the obstacle (a strategy that requires memorizing previous observations in internal states and elaborating the current and the previous observations). However, the evolved robots do not rely on this strategy. Indeed, they solve the problem with the strategy described above regardless of whether their brain has recurrent connections or not.

The evolved robots rely on an emergent solution analogous to that illustrated in the previous section. This is appreciable in Figure 4.3, which shows the translational and rotational movements produced by a typical evolved robot near the walls and near the cylinder. As shown in the top part of the Figure, when the robot approaches a wall it keeps moving toward the object up to a distance of about 25 mm and then avoids it by turning right. The robot encounters and avoids three walls in this episode before approaching the cylinder. When the robot approaches the cylinder, instead, it keeps moving toward the object up to a distance of about 10 mm and then starts oscillating by alternating move-forward and move-backward, and turn-left and turn-right actions. In other words, the evolved robots solve the problem by reacting to observations so as to generate behavioral attractors near cylinders and not near walls. The presence or the absence of attractors depends on the way the state of the sensors varies near walls and cylinders and on the way in which the robot reacts to observations. By carefully regulating the reaction to observations, the robot ensures that the robot-environment interaction, mediated by the chosen reaction rules, yields behavioral attractors near cylinders and not near walls.

In other words, the robots can produce different behaviors near walls and cylinders, without varying the way in which they react to observations near the two objects, by exploiting the fact that the distribution of observations in space varies near walls and cylinders. As we have seen for the Braitenberg vehicle illustrated in Figure 2.6, a single control rule can produce qualitatively different behaviors in different environmental circumstances. Consequently, the problem of producing differentiated behaviors in different environmental circumstances admits solutions that rely on a single set of control rules.

Figure 4.3. The arrows indicate the translational and rotational movements of a typical evolved robot near a wall and a cylinder (top and bottom graph, respectively) during a typical evaluation episode. (Nolfi, 1996).

Indeed, the solution discovered by the evolved robots is substantially similar to Braitenberg’s Vehicle 3b, i.e. the robot sets the speed of the two motors to a high positive value when it is far from obstacles and reduces the speed of the left and right wheels proportionally to the activation of the right and left proximity sensors, respectively. These simple control rules enable the robots to avoid the obstacles, move away from walls, and explore the environment by alternating “moving straight” behaviors when far from obstacles and “obstacle avoidance” behaviors near obstacles. By carefully tuning the intensity of the inhibitory effect of the 6 infrared sensors on the two motors, the robots also manage to alter the robot/environment dynamics near walls and cylinders so as to generate a behavioral attractor only near cylinders.
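
This control rule can be summarized with the following sketch, in which the inhibition gains are illustrative values rather than the parameters found by evolution.

    import numpy as np

    def control(ir):
        """ir: activations of the 6 frontal infrared sensors in [0, 1], ordered left to right."""
        base = 1.0                                   # forward speed far from obstacles
        left_sensors, right_sensors = ir[:3], ir[3:]
        k = np.array([0.4, 0.7, 1.0])                # per-sensor inhibition gains (assumed)
        left_motor = base - float(k @ right_sensors)   # right sensors inhibit the left wheel
        right_motor = base - float(k @ left_sensors)   # left sensors inhibit the right wheel
        return np.clip([left_motor, right_motor], -1.0, 1.0)

    print(control(np.zeros(6)))                            # far from obstacles: move straight
    print(control(np.array([0, 0, 0, 0.2, 0.6, 0.9])))     # obstacle on the right: turn left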

The solutions found in these experiments generalize to different environmental conditions. Indeed, the evolved robots showed the ability to find and stay near the cylindrical object also in post-evaluation tests with larger and smaller cylinders (Nolfi, 1996).

4.5 Object Manipulation and Discrimination

The ability to alter the following observations through actions plays a key role in most of the solutions discovered by adaptive robots, regardless of the nature and the complexity of the problem.

An example involving a more complex problem is the experiment concerning a simulated iCub robot (Metta et al., 2008) evolved for discriminating spherical and ellipsoid objects on the basis of rough tactile information (Figure 4.4, Tuci, Massera & Nolfi, 2010).

Figure 4.4. A simulated iCub robot provided with tactile and joint-position sensors is trained for the ability to discriminate spherical and ellipsoid objects located over a table. (Tuci, Massera & Nolfi, 2010).
Figure 4.5. The neural network of the robot. For sake of clarity, only one internal neuron is drawn. (Tuci, Massera & Nolfi, 2010).

The neural network of the robot includes 22 sensory neurons, 8 internal neurons with recurrent connections, and 18 motor neurons (Figure 4.5). 7 sensors encode the angular position of the seven corresponding DOFs of the left arm and wrist, 5 sensors encode the extension/flexion of the fingers, and 10 sensors encode the state of the tactile sensors located on the palm, on the second phalange of the thumb, and on the distal phalange of each other finger (each sensor binarily encodes the presence/absence of pressure). The first 14 motor neurons encode the torque produced by seven couples of antagonistic muscles controlling the seven DOFs of the arm and of the wrist. The following 2 motor neurons encode the desired extension/flexion of the thumb and of the four fingers. Finally, the last 2 motor neurons are used to indicate the category of the object. The robot is free to determine the convention used to indicate spheres and ellipsoids, provided that it produces differentiated activation patterns for the two types of object (see Tuci, Massera & Nolfi, 2010 for details).

The robots are rewarded for discriminating the category of the current object. Moreover, they receive a smaller reward for touching the object. This second fitness component encourages the robot to enter into contact with the object, which constitutes a prerequisite for discriminating its category. The fitness does not specify how the object should be manipulated. To solve the discrimination problem the robots should develop relatively complex abilities: they should appropriately control an articulated arm with many DOFs, and they should discriminate similar objects on the basis of low-resolution tactile sensors.

Video 4.2 shows how evolved robots manage to correctly discriminate the category of the object in the large majority of cases. The ability acquired is also robust with respect to variations of the relative positions of the object to be discriminated.

Video 4.2. The behavior and the discrimination output produced by the robot during a series of post-evaluation episodes in which the robot interacts with ellipsoid and spherical objects. The discrimination output is indicated by the blue arrow. At the beginning of each episode the arrow points up since the robot is not yet producing a pattern belonging to the first or the second category. Later on, the arrow indicates the discrimination response, which is usually maintained until the end of the evaluation episode. The top and bottom videos show how the robot successfully discriminates objects located in different relative positions (https://youtu.be/vKCc1wNngMk; https://youtu.be/9XYHtUyxvRk; Tuci, Massera & Nolfi, 2010).

The analysis of the evolved robots indicates that the categorization process involves three phases. In the first phase, the robot manipulates the object by wrapping it with its fingers and by moving the object until a suitable hand/object posture is reached. The information contained in the tactile stimuli experienced during this phase increases progressively, finally reaching a high informative value when the hand/object couple converges on a posture that remains substantially stable during the rest of the episode (Figure 4.6). During the second phase, the robot starts to indicate whether it believes that the object is an ellipsoid or a sphere, and keeps integrating the information extracted from observations to eventually revise its discrimination decision. The probability that the discrimination answer is modified, however, becomes progressively lower over time. This leads to the third and final phase, in which the discrimination output remains fixed regardless of the nature of the new observations.

Figure 4.6 shows the probability that the observations collected during a typical episode lasting 400 steps are experienced interacting with an ellipsoid or with a sphere. At the beginning of the episode the observations do not contain any information. Later on, while the robot manipulates the object, the amount of information provided by the observations increases. During the second phase, which in this case starts approximately at step 200, the observations indicate that the object belongs to one of the two categories with a confidence of about 80%. This means that single observations provide information on the category of the object but can be misleading. The robots overcome this problem by carrying out the discrimination process over time, i.e. by summing the partially conflicting evidence extracted from observations over time until the evidence collected allows performing the categorization with a sufficiently high confidence. For a discussion of the role of continuous processes of this type in the context of human behavior see Spivey (2007).

Figure 4.6. Geometric separability index of observations over time during a typical trial. The index measures the level of separation in sensory space of the tactile observations experienced by interacting with objects of the two categories. Values close to 0.5 indicate observations experienced equally likely when the object is an ellipsoid or a sphere. Consequently, these observations do not provide any cue about the category of the object. Values close to 1.0 indicate observations experienced only when interacting with an object of a specific category. (Tuci, Massera & Nolfi, 2010).
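
The sketch below shows one common way to compute a separability index of this kind: the fraction of observations whose nearest neighbor in sensory space belongs to the same category. The tactile data are random placeholders and the exact index used in the original analysis may differ.

    import numpy as np

    def separability_index(observations, labels):
        # Fraction of observations whose nearest neighbor has the same category label.
        observations = np.asarray(observations, dtype=float)
        labels = np.asarray(labels)
        n = len(observations)
        same = 0
        for i in range(n):
            dist = np.linalg.norm(observations - observations[i], axis=1)
            dist[i] = np.inf                      # exclude the point itself
            same += labels[np.argmin(dist)] == labels[i]
        return same / n

    rng = np.random.default_rng(0)
    ellipsoid_obs = rng.normal(0.0, 1.0, (50, 10))    # placeholder 10-dimensional tactile patterns
    sphere_obs = rng.normal(0.5, 1.0, (50, 10))
    obs = np.vstack([ellipsoid_obs, sphere_obs])
    labels = np.array([0] * 50 + [1] * 50)
    print(separability_index(obs, labels))    # ~0.5 = uninformative, ~1.0 = fully separable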

Overall, this analysis indicates that the evolved robots produce actions enabling them to later experience observations that include cues on the category of the object. The reliability of the discrimination process is ensured by summing over time the evidence collected from multiple observations.

4.7 Cognitive Offloading and Anticipation

We will now see how adaptive robots can use actions to store relevant information in the environment and later exploit such information to achieve their goals. Storing information in the environment, instead of in the robot’s brain, represents a form of cognitive offloading since it relieves the brain from the need to store, update and retrieve information. Moreover, we will see how sensory-motor coordination can produce solutions that anticipate future needs, i.e. solutions in which the execution of the actions that need to be performed at a certain point is preceded by the execution of preparatory actions.

An example of cognitive offloading performed by Tetris players consists in rotating the pieces to check the perceptual match with unfilled spaces rather than rotating them mentally, which is slower and less reliable. Another example consists in crossing the fingers to remember that something needs to be done (Gilbert, 2015). An example of anticipatory behavior, in the context of a reaching and grasping behavior, consists in assuming the right hand posture before the beginning of the grasping phase (Hofsten & Ronnqvist, 1988).

The experiment that can be used to illustrate these aspects involves a simulated MarXbot robot (Bonani et al., 2010) evolved to navigate in a double T-maze environment (Figure 4.7; Carvalho & Nolfi, 2016). The robot, which is initially located in an area at the bottom of the central corridor with a random position and orientation, should travel toward the blue cylinder located in one of the four possible ends of the double T-maze. The correct destination is marked by two green beacons located in the central corridor. For example, if the first green beacon is located on the right and the second on the left, as in the case of Figure 4.7, the robot should turn right and then left at the two junctions of the T-maze to reach the correct destination. The robot thus faces a time delay problem in which the sensory information experienced at time t should influence the action of the robot at time t+n.

Figure 4.7. The double T-maze environment. The simulated MarXbot robot is initially located at the bottom of the central corridor in a random position and orientation. The position of the blue cylinder that represents the target destination correlates with the relative positions of two green beacons located in the central corridor. The positions of the two beacons (right-right, right-left, left-left, or left-right) indicate the direction the agent should take at the first and the second junction, respectively, to reach the target destination. (Carvalho & Nolfi, 2016).

The brain of the robot includes 8 sensory neurons encoding the state of the infrared sensors, 8 visual neurons encoding the percentage of green and blue pixels detected in the 4 frontal sectors of the camera, and 2 motor neurons controlling the speed of the robot’s wheels. The robot can visually detect the green beacons and the blue target only when in their proximity due to the visual occlusions caused by the walls. The experiment has been repeated in two conditions in which the robot is provided with a feed-forward or with a recurrent neural network controller. In the first condition the robot cannot preserve information about previous observations in internal states. The robot is rewarded with 1 point for every successful episode.  

Figure 4.8. Trajectories of a typical evolved robot post-evaluated for 300 episodes. The trajectories indicated in magenta, blue, yellow, and cyan are produced by the robot during the trials in which it should navigate toward the destination shown with the corresponding color. The position of green beacons is not indicated. The figure displays the trajectory produced by a typical evolved robot with a feed-forward neural network, i.e. a robot that determines the action on the basis of the current observation only. Robots provided with recurrent neural network produce similar behaviors. (Carvalho & Nolfi, 2016).

As shown in Figure 4.8, the evolved robot manages to navigate successfully to the correct destination most of the time. The analysis of the trajectories produced by varying the position of the beacons and the target destinations permits understanding the strategy used by the robots to solve the problem. The trajectories first converge in the initial portion of the central corridor and then diverge when the robot passes near the beacons. The initial convergence enables the robot to reduce the differences caused by the fact that the initial position and orientation of the robot vary randomly. The divergence, instead, enables the robot to enter into one of four separate basins of attraction of the robot/environment dynamics leading to the correct destination.

The observed behavior implies that the robot does not need to store the position of the green beacons in memory and exploit the stored information to decide whether to turn left or right at the junctions. Moreover, it implies that the robot does not need to recognize whether it is located at the first or at the second junction. The robot “stores” the information conveyed by the position of the green beacons in its relative position with respect to the maze walls. Specifically, it “stores” the information by assuming a relative distance from the left and right walls of the corridor that varies depending on the position of the beacons and that anticipates the direction in which the robot should turn at the next junction. Consequently, when the robot arrives at a junction, it can decide the direction of turning on the basis of its position and orientation with respect to nearby walls. In other words, the robot reacts to the perception of the green beacons by varying its relative position in the corridor, maintains the assumed position, and later uses that position to turn in the direction indicated by the position of the green beacons experienced earlier.

At this point we might wonder whether robots can discover the possibility of storing information in internal states and later using this information to act appropriately. The results of a second set of experiments, in which the robots were equipped with recurrent networks and the position of the robot was perturbed during motion, indicate that evolving robots are indeed able to develop cognitive strategies relying on memory. Consequently, the development of the strategy described above cannot be ascribed to the inability of the robots to develop alternative strategies.

In this new set of experiments, the robots are suddenly moved to their left or to their right with a given low probability. In the presence of this form of perturbation, the robots can no longer solve their problem by using the strategy described above, or at least by using only the strategy described above, since the information stored in the robot’s position is cancelled by the displacement of the robot.

As shown in Figure 4.9, the robot managed to solve the problem also under these conditions. The analysis of the trajectory shows that it keeps using the cognitive offloading strategy. It however extends the strategy with the ability to store information about the beacons in internal states and to use this information to re-enter into the right basin of attraction after position perturbations.

Figure 4.9. The trajectories of a typical evolved robot subjected to position perturbations in the area preceding the first junction. The four pictures show the trajectories of the robot during evaluation episodes in which it should navigate toward one of the four different target destinations. The black and colored lines indicate the trajectories of the episodes in which the robot failed or succeeded, respectively. As shown, the robot normally re-enters the correct basin of attraction after position perturbations. (Carvalho & Nolfi, 2016)

These results demonstrate how reactive strategies, which use sensory-motor coordination to store the relevant information in the environment, and cognitive strategies, which store relevant information in internal states, do not represent alternative solutions and can rather be combined to maximize the effectiveness of the robot’s behavior.

4.8 Situatedness

I like to conclude this Chapter by going back to the notion of situatedness.

As for embodiment, we can define situatedness as a binary property that discriminates systems that are situated in an external environment, with which they interact, from systems that are not. This binary definition is useful since it allows distinguishing qualitatively different systems, e.g. autonomous robots from expert systems.

The term situatedness, however, can also be used to capture another property that is quantitative in nature. It can be used to indicate to what extent a situated system exploits the possibility of altering its perceptual and/or physical environment through actions. From this perspective, autonomous robots can be more or less situated. As for embodiment, the higher the level of situatedness, the greater the chance that the robot will operate effectively. Moreover, the higher the level of situatedness, the greater the amount of computation that is performed as a result of the interaction between the robot and the environment and, consequently, the smaller the amount of computation that needs to be performed internally by the brain.

Figure 4.10. One of the possible reactive solutions to the foraging animat problem. The arrows indicate the action produced in each cell. The agent produces the same action in cells marked with the same number, independently of whether the cell is located on the good or bad side of the environment. Consequently, the agent can determine its action on the basis of the current observation only, without storing and using information about previously experienced cells. This solution ensures that the agent moves away from the bad side and remains in the good side by oscillating over one of the four attractors (indicated with a1-a4). (Nolfi, 2005)

4.9 Learn how

Read Section 13.7 and familiarize yourself with the evorobotpy2 environments that permit evolving wheeled robots. Follow the instructions included in Exercise 6 to replicate the experiment described in Section 4.4. Analyze the behaviors obtained by robots that have two additional infrared sensors on the rear side and a recurrent neural network policy.



5. Behavior and Cognition as Complex Dynamical Systems

5.1 Introduction

Behavior is a dynamical process that unfolds in time and originates from the continuous interaction between the robot’s body, the robot’s brain, and the environment. In the previous two chapters we stressed the fact that behavior is not generated by the brain of the robot only. The body of the robot and the environment play an important role as well. Moreover, the behavior of the robot is not simply the result of the contribution of the three constituting components (the brain, the body and the environment). It is the result of their interactions.

In this chapter we will focus on the dynamical system nature of behavior. The term dynamical systems indicates systems that change over time. It is a wide class that includes pendulums, planetary systems, weather, growing plants, human behaviors, stock markets, and human languages. More specifically, we will illustrate the implications of the dynamical system nature of behavior for robotics. Dynamical systems display a series of important properties. Exploiting these properties is essential to produce effective behaviors.

5.2 Dynamical system theory

Dynamical system theory is the set of ideas and mathematical tools that describe how dynamical systems change over time (Strogatz, 2001). It originated as a branch of physics and then became an interdisciplinary subject. The field started in the 1600s, when Newton invented differential equations, discovered his laws of motion, and combined them to explain planetary motion.

Unfortunately, the analytical method used by Newton can be applied only to very special cases. For this reason, the second crucial progress in dynamical system theory was the introduction of the qualitative method developed by Poincaré, which can be applied to any dynamical system.

The invention of high-speed computers, which enabled simulation studies, represented the third crucial step.

One of the first notable applications of dynamical system theory was made by Edward Norton Lorenz, who modelled rolling fluid convection in the atmosphere with a set of 3 non-linear differential equations. By simulating how the state of the variables governed by the equations varies over time, Lorenz observed a chaotic dynamic (Figure 5.1). The state of the system continues to oscillate in an irregular, aperiodic fashion without ever settling down to an equilibrium or to a periodic state. The line making up the curve, representing how the state of the system varies over time, never intersects itself and loops around forever, spending a variable time on each wing. Moreover, Lorenz observed that the behavior produced was highly influenced by the initial state. Minor differences in the initial state produced totally different behaviors later on. This implies that the system is inherently unpredictable since tiny errors in the measure of the initial state have large effects in the long term. Finally, Lorenz observed that the chaotic dynamic exhibited by his system included a form of structure. Indeed, as shown in Figure 5.1, the plot of the state of the system over time produces a butterfly-shaped trajectory.

As it will later become clear, these peculiar properties do not characterize the rolling fluid convection only. They are general properties characterizing many different systems.

Figure 5.1 Lorenz’s chaotic attractor. The state of the system is characterized by three variables represented in a Cartesian 3-dimensional space. The yellow line represents the state of the system over time. (https://commons.wikimedia.org/wiki/Category:Lorenz_attractors?uselang=it#/media/File:Lorenz_system_r28_s10_b2-6666.png).

5.3 A brief introduction to dynamical systems

In this section I will briefly summarize the essential concepts and terminology of dynamical systems theory. For a more detailed account see Strogatz (2001).

To describe the behavior of a dynamical system we should first identify the mutable parts or properties. This permits describing the system with a set of state variables χ, which indicate the current state of the varying components. The dynamical law specifies how the state variables change over time. The dynamical law can be represented by a set of differential equations or iterated functions.

Figure 5.2. The pendulum.

For example, in the case of a pendulum the variables that change over time are the angle (θ) and the angular velocity (θ̇) of the pendulum. The dynamical law that determines how the state of the pendulum changes over time can be modelled with the following differential equation (see Tedrake, 2019):

    θ̈ = -(g / l) sin(θ) - b θ̇

where g is the acceleration produced by gravity, l is the length of the pendulum, and b is a damping factor. For simplicity we assume that the rope is massless.

By solving the differential equation, we would be able to calculate the state of the pendulum at any arbitrary time t+n on the basis of the state of the pendulum at time t. However, the presence of a non-linearity in the equation, i.e. the sin component, prevents solving the equation analytically. This, however, does not prevent analyzing the geometrical and/or topological structure of the dynamics, as we will see below.
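
For example, the equation can be integrated numerically, which is sufficient to observe the convergence of the damped pendulum toward the resting state [θ = 0, θ̇ = 0]. The parameter values below are illustrative.

    import numpy as np

    g, l, b = 9.81, 1.0, 0.5      # illustrative gravity, length and damping values
    dt = 0.01                     # integration time step

    def step(theta, theta_dot):
        theta_ddot = -(g / l) * np.sin(theta) - b * theta_dot
        return theta + theta_dot * dt, theta_dot + theta_ddot * dt   # Euler integration

    theta, theta_dot = 2.5, 0.0   # start far from the resting position
    for _ in range(5000):
        theta, theta_dot = step(theta, theta_dot)
    print(round(theta, 3), round(theta_dot, 3))   # approaches (0.0, 0.0)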

The state space of the system includes all the values that the variables can assume. A phase portrait is a geometrical representation of the dynamics of the system in the state space. It can include a curve that visualizes how the state of the system varies over time from an initial state (as in Figure 5.1). Alternatively, it can include the vectors (represented with arrows) that display how the state of the system changes in a single time step from any possible initial state (Figure 5.3). The phase portrait permits analyzing the behavior of the system qualitatively. For example, it permits understanding whether the system converges to a stable state, which remains constant from a certain point on, or to a chaotic attractor, as in the case of the rolling fluid convection system illustrated above.

Figure 5.3. Phase portrait of the damped pendulum. The horizontal and vertical axes indicate the angular position and velocity of the pendulum. The blue arrows indicate how the angle and the velocity vary in a time step from any possible state.

The long-term behavior of a dynamical system can diverge or converge on a specific set called an attractor, i.e. a state or a set of states toward which the system tends to evolve from all starting conditions or from a large set of different starting conditions. In the case of equilibrium point attractors, the state of the system converges toward a specific state. This is the case of the damped pendulum, which converges on the state [θ = 0, θ̇ = 0], i.e. on a resting position in which the pendulum lies still. The fact that the long-term dynamics of the damped pendulum converges on this equilibrium point attractor is clearly shown by the phase portrait (Figure 5.3). In the case of limit cycle attractors, the state of the system converges on a closed trajectory in the state space that repeats over and over (for example the closed curve shown in red in Figure 5.4). Finally, in the case of strange attractors the system converges on a fractal structure like that shown in Figure 5.1.

The convergence is caused by the fact that the trajectories passing through nearby states are oriented toward the attractor. The area of the state space that converges on the attractor is called the basin of attraction. The basin of attraction implies that the system will converge on the attractor, after a transient phase, from any initial state situated in the basin. In the case of equilibrium point and limit cycle attractors, the system converges on the same states independently of the initial conditions. In the case of strange attractors, instead, the system is sensitive to the initial condition in the long term. This implies that any two arbitrarily close initial points in the state space can lead to states that are arbitrarily far apart in the long term.

Figure 5.4 An example of limit cycle attractor. (https://commons.wikimedia.org/wiki/File:Van_der_Pol_stable_limit_cycle_mu%3D1.svg).

In general, the state space of a dynamical system can contain multiple attractors surrounded by associated basins of attraction. These basins are separated by unstable manifolds called separatrices. This implies that similar states located far from the separatrices are situated in the same basin of attraction and converge on the same attractor. Instead, initial states located near the separatrices can belong to different basins of attraction and converge on different attractors.

Finally, the dynamics of a system are influenced by parameters. For example, the dynamics of a pendulum are influenced by the value of gravity, and the dynamics of a robot by the battery level. This means that the behavior of the system and the phase portrait of the system change when a parameter is modified. A parametrized dynamical law defines a family of dynamical systems, with any particular flow corresponding to a single set of parameters.

In general, dynamical systems display structural stability, i.e. the dynamics of the system changes only slightly as a result of small variations of the parameters. The values of the limit sets and the shapes of the basins of attraction can vary slightly, but the number and type of attractors remain stable. However, at certain parameter values, dynamical systems can show structural instability. This means that minimal changes in the value of the parameters can produce drastic qualitative changes. These qualitative changes are referred to as bifurcations.

A simple example illustrating the sensitivity to parameters is the logistic map system introduced by Robert May (1976), which has a state space characterized by a single dimension (x) and is governed by the following iterated map equation:

    x(t+1) = m · x(t) · (1 - x(t))

Figure 5.5 shows the long-term state or states produced by the system for different values of the m parameter. When the parameter m is in the range [1.0, 3.0], the system converges on an equilibrium point attractor (see Figure 5.5). When m is in the range [3.0, 3.44949], the system converges on a limit cycle attractor in which the state oscillates between two values (from most of the initial conditions). When m is in the range [3.44949, 3.56995], the system converges on a limit cycle attractor formed by 4, 8, 16, 32, and more values. Finally, when m is greater than 3.56995, the system converges, for most values of the parameter, on a strange attractor.
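
This behavior can be verified directly by iterating the map and counting the number of distinct states visited after a transient, as in the following sketch.

    def long_term_states(m, x=0.2, transient=1000, keep=64):
        # Iterate the logistic map, discard a transient, and collect the states visited afterwards.
        for _ in range(transient):
            x = m * x * (1.0 - x)
        states = set()
        for _ in range(keep):
            x = m * x * (1.0 - x)
            states.add(round(x, 6))
        return states

    for m in (2.8, 3.2, 3.5, 3.9):
        print(m, "->", len(long_term_states(m)), "distinct long-term state(s)")
    # Typical output: 1 state (fixed point), 2 and 4 states (limit cycles), and ~64
    # distinct states for m = 3.9 (chaotic regime).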

Figure 5.5. Long-term state/states produced by the logistic map for different values of the parameter m. The horizontal and the vertical axes indicate the value of the parameter m and the state x assumed by the system in the long term. (https://commons.wikimedia.org/wiki/Category:Bifurcation_theory#/media/File:Diagrama_de_bifurcaci%C3%B3n.png)

This ends this brief introduction to dynamical systems. In the next sections we will discuss the implications of the dynamical system nature of behavior for robotics.

5.4 Fully-actuated versus under-actuated systems

The tools of dynamical system theory permit clarifying more precisely in which sense the behavior of a robot is not simply a manifestation of the robot’s brain but is rather the result of the interaction between the brain of the robot, its body, and the environment.

For this purpose, it is useful to distinguish between fully actuated and under-actuated robots. The former can be commanded to follow arbitrary trajectories in configuration space by controlling all the degrees of freedom and thanks to the availability of unlimited energy resources. The latter, instead, cannot perform as the former due to the lack of actuators with respect to the degrees of freedom and/or due to the presence of constraints and limits on the actuators.

As an example of a fully actuated system let’s consider a pendulum with a high-powered actuated joint. The system is fully actuated since the number of actuators equals the number of DOFs and since the actuator does not have limitations, e.g. power limitations. For a system of this type the effect of gravity, and consequently the dynamics arising from the interaction between the pendulum and the environment, can be cancelled. This can be obtained by using the motor to apply a torque equivalent to the gravitational force acting on the pendulum, i.e. equivalent to the tangential component of gravity (assuming a rigid rod). Once the dynamics originating from the interaction with the environment is cancelled, controlling the pendulum becomes trivial. To move the pendulum in a given direction, the controller should simply apply an additional torque in the right direction until the pendulum assumes the desired target position. In a system of this kind, the behavior is generated by the robot controller. The interaction between the system and the environment does not play any role since the effect of gravity is neutralized and since the system does not interact with external objects.
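
The following sketch illustrates the idea: the controller first applies a torque that cancels the tangential component of gravity and then adds a simple proportional-derivative term that drives the pendulum toward an arbitrary target angle. Gains and parameters are illustrative and unlimited torque is assumed.

    import numpy as np

    g, l, m, dt = 9.81, 1.0, 1.0, 0.01
    kp, kd = 20.0, 5.0

    def controller(theta, theta_dot, target):
        gravity_compensation = m * g * l * np.sin(theta)       # cancels the natural dynamics
        feedback = kp * (target - theta) - kd * theta_dot      # drives the pendulum toward the target
        return gravity_compensation + feedback                 # unlimited torque assumed

    theta, theta_dot, target = 0.0, 0.0, np.pi / 3
    for _ in range(2000):
        torque = controller(theta, theta_dot, target)
        theta_ddot = (torque - m * g * l * np.sin(theta)) / (m * l ** 2)
        theta, theta_dot = theta + theta_dot * dt, theta_dot + theta_ddot * dt
    print(round(theta, 3), "target:", round(target, 3))        # converges to the target angle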

As an example of a highly actuated system we can consider ASIMO®, the humanoid robot released by Honda in 1996 that represented the state of the art of walking robots for the following 20 years. Although ASIMO displayed impressive walking capabilities, its behavior looked unnatural. This was due to the fact that, similarly to the fully actuated pendulum described above, ASIMO uses a lot of torque to cancel the natural dynamics and to follow a planned trajectory. Indeed, the robot consumes at least an order of magnitude more energy than a human to walk. I used the term highly actuated instead of fully actuated since robots are always characterized by a certain level of under-actuation. Controlling all DOFs is practically impossible and relying on unlimited energy is unfeasible in most cases. Still, for the reasons described above, ASIMO constitutes a highly actuated system.

Under-actuated systems, on the contrary, are energy-efficient. They do not waste energy in the attempt to cancel the natural dynamics originating from the interaction with the environment. Examples of underactuated systems are the passive or quasi-passive robots described in Section 3.2. Most observers would agree that the walking style of these machines is more natural than ASIMO’s.

Another example of an under-actuated system is a low-powered actuated pendulum. The limitation of the actuator prevents cancelling out the natural dynamics of the system, which is illustrated by the phase portrait of a passive pendulum (Figure 5.3). The actuator can only be used to vary slightly the direction and the magnitude of the vectors illustrated in the phase portrait (Figure 5.6).

Figure 5.6. Left: Schematization of the actuated pendulum. The red arrow indicates the actuator. Right: Schematization of the phase portrait of an underactuated pendulum. The vertical and horizontal axes represent the angular position and velocity of the pendulum. The black arrows on the horizontal and vertical axes represent the tangential component of gravity and the inertia of the pendulum bob, respectively. Due to its limitation in power, the actuator can only modify slightly the direction and the magnitude of the vectors, as exemplified by the red arrows. This implies that the dynamics of the underactuated pendulum will remain qualitatively similar to that of the passive pendulum illustrated in Figure 5.3. Figure adapted from Tedrake (2019).

In the case of underactuated systems, therefore, the role of the brain consists in altering as slightly as possible the vector field that originates naturally from the interaction between the robot and the environment. The goal is to vary the natural behavior of the system within the limits imposed by the constraints. In other words, the role of the brain is to orchestrate the robot/environmental dynamics in a way that enables the production of functionally useful behavior.

5.5 Stability and flexibility

The long-term stability of dynamical systems and their sensitivity to critical conditions also have important implications for robotics.

The possibility to produce behaviors that are robust to environmental variations is essential for robots since the physical world is full of uncertainties and almost no parameter, dimension, or property remains exactly constant. We will discuss this aspect in more detail in Chapter 7.

Dynamical systems provide a structural solution to the problem of generating behaviors that are effective and robust to perturbations: attractors with associated basins of attraction. Attractors cannot exist in isolation. Their existence depends on the presence of transient trajectories that converge to the attractor, forming the associated basin of attraction. The basin of attraction thus plays both the roles mentioned above: it ensures that the system converges to a desired steady-state dynamic and that it converges back to that dynamic after deviations caused by variations or perturbations.

To clarify the concept, let’s consider a counter-example, i.e. a system exhibiting a behavior that is not structurally stable (Tani, 2016): a frictionless spring-mass system (Figure 5.7, left) described by the following equations:

\dot{x} = v, \qquad \dot{v} = -\frac{k}{m} x

where x is the one-dimensional position of a mass m, v is its velocity, and k is the spring coefficient. A system of this kind produces a periodic oscillatory behavior with an amplitude that depends on the initial extension of the spring. This behavior, however, is not structurally stable. Indeed, if we apply a force to the mass of the oscillator, the amplitude of the oscillation changes immediately and does not converge back to the original amplitude. The reason why the behavior of the system is unstable is that it does not correspond to a limit-cycle attractor surrounded by an associated basin of attraction. Indeed, the phase portrait of the system in the 2-dimensional state space constituted by the x and v variables does not show a steady-state set surrounded by a basin of attraction. It rather shows a set of concentric circles lacking any convergent flow (Figure 5.7, right).

Figure 5.7. Left: Schematization of a frictionless spring-mass system with spring coefficient k, mass m and position x. Right: phase portrait of the system. Adapted from Tani (2016).
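To make the lack of structural stability concrete, the following minimal sketch (illustrative, not the book’s code) integrates the frictionless spring-mass system with a semi-implicit Euler scheme and applies a single push to the mass halfway through the simulation. The new oscillation amplitude persists: there is no basin of attraction pulling the system back to the original amplitude.

# Frictionless spring-mass system: dx/dt = v, dv/dt = -(k/m) x
k, m, dt = 1.0, 1.0, 0.001
x, v = 1.0, 0.0                      # initial spring extension and velocity
trace = []
for step in range(200000):
    if step == 100000:
        v += 0.5                     # transient perturbation applied to the mass
    v += -(k / m) * x * dt           # semi-implicit Euler avoids numerical energy drift
    x += v * dt
    trace.append(abs(x))
print("amplitude before the push:", max(trace[:100000]))
print("amplitude after the push :", max(trace[100000:]))   # a new amplitude that persists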

We already encountered examples of functional behaviors corresponding to fixed-point or limit-cycle attractors documented by phase portrait analyses: the behavior of the foraging robot and of the robot navigating in the T-maze illustrated in Sections 4.4 and 4.7. Presumably, the behavior generated by the passive walker illustrated in Section 3.2 is also the result of a limit-cycle attractor that drives the system toward a specific sequence of states that repeats over and over. Indeed, although a phase-portrait analysis of this system is missing, the ability to spontaneously recover from divergences occurring during motion can be attributed to the presence of a basin of attraction that brings the system back to the limit cycle.

The second important structural property that is relevant for robotics is flexibility, namely the possibility to drastically modify the behavior exhibited by the system in specific contextual situations requiring the production of qualitatively different behaviors. The combination of stability and flexibility over time can produce a form of chaotic itinerancy characterized by quasi-stable behaviors interrupted by sudden transitions leading to other quasi-stable behaviors (Iizuka & Ikegami, 2004). The possibility to realize drastic behavioral changes is enabled, as we have seen, by the possibility to generate bifurcations in the system dynamics through variations of critical parameters. We will see an example of a phase transition in behavior in Section 5.7.

5.6 Multi-level and multi-scale organizations

Finally, the last fundamental property of dynamical systems (which has implications for robotics) is the possibility to generate dynamics with multi-level and multi-scale organizations. As an example of an organization of this type we can consider human beings, dynamical systems formed by heterogeneous entities (molecules, cells, individuals, social groups) spanning widely different levels of organization. The elements forming each level are, to some extent, independent from the other elements of their level and from other levels, but they are also part of the same dynamical system.

From the perspective of this book, a relevant issue is the fact that the behavior exhibited by a robot manifests itself at different levels of organization. It usually displays a hierarchical multi-level and multi-scale organization that includes an initial level, formed by elementary behaviors extending over limited time periods, and higher organization levels, including higher and higher-level behaviors extending over longer time periods (Nolfi, 2009). The behaviors forming each level are partially independent (West-Eberhard, 2003; Carvalho and Nolfi, 2016). They play different functions and extend over different periods of time. However, they are not completely independent. They are part of the overall behavior produced by the robot and are not separated by clearly identifiable boundaries. As we will see in the following chapters, the multi-level and multi-scale organization of behavior plays important functions. For example, it makes it possible to generate a large number of behaviors by recombining in a compositional manner a smaller set of elementary behaviors, and it facilitates the development of complex behaviors.

5.7 Affordances and behavior arbitration

An example that illustrates the importance of displaying behaviors organized in functionally specialized sub-behaviors, and the mechanism that can regulate the transition between sub-behaviors, is the cleaning robot described in Carvalho and Nolfi (2016).

The experiment involves a simulated MarXbot robot (Bonani et al., 2010) that is evolved for the ability to vacuum clean the floor of an indoor environment. The robot is provided with 8 infrared proximity sensors, which encode the average activation state of eight groups of three adjacent sensors, a time sensor that encodes the amount of time passed since the start of a cleaning episode, 3 internal neurons, and 2 motor neurons that control the desired speed of the two wheels. The environment consists of an indoor area including a large central room and a series of corridors and smaller rooms (Figure 5.9). The fitness corresponds to the fraction of 20x20 cm non-overlapping areas visited by the robot at least once during a cleaning episode. At the beginning of each episode, the length of the walls surrounding the environment is varied randomly by up to 10% and the robot is placed in a randomly selected position and orientation within the large room. Episodes are terminated after 7500 steps, corresponding to 6 minutes and 15 seconds.

Figure 5.9. Typical trajectories displayed by evolved robots. The black lines represent the walls. The blue, red, and green lines display the trajectory of the robot during three consecutive phases lasting 2500 steps (i.e. 250s). Top: robot displaying a single uniform behavior. Bottom: robot producing a behavior composed of two sub-behaviors specialized for cleaning the large room and the peripheral areas. (Carvalho & Nolfi, 2016)

As reported by the authors, robots lacking the time sensor evolved sub-optimal solutions characterized by the production of a single exploration behavior (Figure 5.9, top). This behavior is generated by alternating obstacle-avoidance phases near obstacles and straight motion phases far from obstacles. These robots fail to clean part of the peripheral areas. The robots provided with the time sensors, instead, develop a better solution that involves the production of two qualitatively different behaviors: a spiraling behavior to clean the large central room, and a wall following behavior to clean the peripheral portions of the environment (Figure 5.9, bottom). This behavior actually resembles that produced by the first versions of the iRobot Roomba® vacuum cleaning robot.

In the case of the robot shown in Figure 5.9 (top) the role played by the evolved brain can be summarized with two control rules that make the robot turn left or move straight when the frontal infrared sensors are activated or not, respectively. By moving straight far from obstacles, the robot keeps experiencing similar observations and consequently keeps moving straight until it approaches an obstacle. At that point, the activation of the frontal infrared sensors elicits the second rule that makes the robot turn left until its frontal side is free from obstacles. This, in turn, re-triggers the execution of the first control rule that generates a straight movement behavior, again.

In the case of the robot shown in Figure 5.9 (bottom), instead, the role of the evolved brain can be summarized with two control rules that are responsible for the spiraling and wall-following behaviors. The first control rule makes the robot move forward while turning left with an angle that decreases progressively as time passes. This rule thus produces a spiraling behavior characterized by larger and larger spirals that finally becomes a move-forward behavior in which the robot also turns slightly to the right. This final behavior, combined with the second control rule (which makes the robot turn slightly to the left with an angle proportional to the activation of the right infrared sensors), enables the robot to produce a wall-following behavior during the second part of the episode.

The transition from the spiraling to the wall-following behavior is regulated by the state of the time sensor, which depends on the time passed, and by the state of the infrared sensors, which depends on the relative position and orientation of the robot in the environment. Indeed, the transition occurs within an appropriate time window when the right infrared sensors are activated. This ensures that the transition occurs when the robot is located near a wall with the wall on its right, i.e. in a position from which the wall-following behavior can be initiated. Moreover, it ensures that the transition occurs at the right time, i.e. after the robot has spent enough time cleaning the central portion of the environment and still has sufficient time left to clean the peripheral portions of the environment.

We can indicate the state that triggers the execution of a behavior, and consequently that produces the transition to the new behavior, with the term affordance (Gibson, 1979; see also Chemero, 2011). Affordances can be described as states, extracted from observations, which indicate the opportunity to exhibit a certain behavior. From a dynamical system perspective, as we said above, affordances consist of critical values of parameters that produce a bifurcation in the robot/environmental dynamics.

Affordances can be provided naturally by the environment, without the intervention of the robot, or can be co-generated by the robot through actions and/or internal processing. For example, the activation of the frontal infrared sensors “affording” an obstacle-avoidance behavior occurs naturally. Every time the robot approaches or is approached by an obstacle, i.e. every time the robot should avoid an obstacle, it experiences an activated state of its frontal infrared sensors. Instead, the activation of the right infrared sensors that “affords” the wall-following behavior in the robot shown in Figure 5.9 (bottom) is co-generated by the robot through the exhibition of the left-spiraling behavior. Indeed, this behavior ensures that the robot later experiences the observation that triggers the wall-following behavior.

5.8 Learn how

Read section 13.8 and follow the instructions included in Exercise 7 to plot the phase portrait of an under-actuated pendulum at different stages of the evolutionary process.



6. Adaptation

6.1 Introduction

In this chapter I will introduce the methods enabling robots to adapt to their task environment, i.e. to develop the behavioral and cognitive skills required to perform a desired function. I will introduce evolutionary methods in Section 6.2, reinforcement learning in Section 6.3 and learning by demonstration in Section 6.5. Each family is loosely inspired by a corresponding natural adaptation process. The three classes of methods should not be considered as alternatives. Indeed, they can be usefully combined.

The description is necessarily synthetic and does not attempt to be exhaustive. After briefly introducing each method, I will explain in detail only a few state-of-the-art algorithms selected among those that are particularly suitable for robotic applications.

6.2 Evolutionary robotics

A straightforward way to generate adaptive robots consists in evolving them through a process analogous to natural evolution, an approach that is referred to as evolutionary robotics (Nolfi and Floreano, 2000; Nolfi, Bongard, Husbands & Floreano, 2016). This can be realized by creating populations of artificial genotypes, which encode the characteristics of the robots (phenotypes), and by allowing the best genotypes to produce copies of themselves with variations. We already introduced the method in Section 2.4 and we already encountered several examples of evolved robots in Chapters 3-5.

The process is schematized in Figure 6.1. The genotypes of the initial generation are created randomly. Each genotype is used to produce a corresponding robot (phenotype) that interacts with the environment for one or more evaluation episodes. During the episode(s) the performance of the robot is evaluated with a scalar value (fitness). The characteristics of the robots might also change while they interact with the environment, as a result of development and/or learning. The best robots are allowed to reproduce, i.e. to produce copies of their genotype (offspring) including variations (introduced through mutations or other genetic operators). The offspring form the new generation. The evolutionary process continues until a total maximum number of evaluations is performed.

Figure 6.1. Schematization of the evolutionary robotics method. See text for explanation.

The role of the experimenter is limited to the design of a fitness function, i.e. the function that rates the ability of a robot. For example, in the case of robots evolved for the ability to walk, the fitness function might measure the distance traveled by the robot during an evaluation episode. We will illustrate the criteria that can be used to design effective fitness functions in Section 6.4.

The experimenter should also specify the developmental process, i.e. the rules that determine the relationship between the genotype and phenotype. These rules, however, are generally problem independent. Consequently, to evolve a robot for the ability to solve a given problem the experimenter should design the fitness function and select a developmental process among those proposed in the literature (see Section 6.2.3). The experimenter should also specify the characteristics of the robots that remain fixed and that are not subjected to the adaptive process.

6.2.1 Evolutionary algorithms

The evolutionary algorithm determines the way artificial genotypes are selected and varied. Several approaches have been proposed over the years.

A first family of algorithms, known as genetic algorithms, was developed by John Holland and collaborators (Holland, 1975, 1992). In typical implementations, these algorithms operate on genotypes constituted by vectors of binary values, select the reproducing genotypes stochastically with a probability proportional to their relative fitness, and generate variations through recombination and mutation. Recombination is usually realized by dividing the genotypes of two reproducing individuals into two or three parts and by pasting (crossing over) some of the parts copied from the first genotype with the complementary parts copied from the second genotype. Mutations are usually realized by flipping each bit with a low probability, e.g. 1%. New generations are formed by offspring. Optionally, a non-varied copy of the best genotype of the previous generation can be included in the new generation (elitism).
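The following is a minimal sketch of a genetic algorithm of the kind described above (illustrative; the toy fitness function simply counts the number of 1-bits in the genotype):

import random

GENES, POPSIZE, GENERATIONS, MUTRATE = 50, 20, 100, 0.01

def fitness(genotype):
    return sum(genotype)                               # toy fitness: number of 1-bits

def crossover(parent1, parent2):
    cut = random.randint(1, GENES - 1)                 # single-point crossover
    return parent1[:cut] + parent2[cut:]

def mutate(genotype):
    return [1 - g if random.random() < MUTRATE else g for g in genotype]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POPSIZE)]
for generation in range(GENERATIONS):
    best = max(population, key=fitness)
    weights = [fitness(g) + 1e-6 for g in population]  # fitness-proportional (roulette-wheel) selection
    offspring = [mutate(crossover(*random.choices(population, weights, k=2)))
                 for _ in range(POPSIZE - 1)]
    population = offspring + [best]                    # elitism: keep the best genotype unchanged
print("best fitness:", fitness(max(population, key=fitness)))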

A second family of algorithms, known as evolutionary strategies, was developed by Ingo Rechenberg (Rechenberg & Eigen, 1973) and Hans-Paul Schwefel (Schwefel, 1981, 1995). In typical implementations, these algorithms operate on genotypes formed by vectors of real numbers and generate variations by perturbing all real numbers with small randomly generated values. Often the best parents are allowed to remain in the new generation, i.e. only the worst parents are replaced by offspring.

Modern evolutionary strategies are an extended version of evolutionary strategies that combine the matrix of random variations and the fitness received by the offspring to estimate either the correlation among variations leading to high fitness (Hansen & Ostermeier, 2001) or the gradient of the expected fitness (Wierstra et al., 2014). This information is used to explore the most promising directions of the search space. In typical implementations, these methods use a single parent, generate a population of individuals distributed in the multi-dimensional search space around the parent, and vary the genes of the parent slightly in the direction of the fitness gradient. A particularly effective instance of these methods is the OpenAI-ES algorithm (Salimans et al., 2017; see also Pagliuca, Milano & Nolfi, 2019) that will be illustrated in the next section.

The methods relying on multiple parents stress the advantages that can be gained by exploring different areas of the search space in parallel. These advantages can be maximized by introducing mechanisms that favor the diversification of the population, like: (i) the preferential selection of parents that differ from other selected parents (Mouret & Doncieux, 2009; Pugh, Soros & Stanley, 2016), or (ii) the subdivision of the population in sub-populations that evolve almost independently of each other (Whitley, Rana & Heckendorn, 1997; Skolicki & Jong, 2004). The methods that rely on a single parent, instead, tend to attribute more importance to the resolution adopted to explore a single portion of the search space. In principle one can use hybrid methods combining multiple parents and numerous offspring for each parent. In practice, however, hybrid solutions of this type are too computationally expensive.

6.2.2 The OpenAI-ES algorithm

The OpenAI-ES algorithm (Salimans et al., 2017) is a natural evolutionary strategy (Sehnke et al., 2010; Wierstra et al., 2014) that estimates the gradient of the expected fitness by using normally distributed variations and updates the centroid of the population through the Adam stochastic gradient optimizer (Kingma & Ba, 2014).

The pseudo-code of the algorithm with standard parameters is included below. In a nutshell, the algorithm evolves a distribution over policy parameters centered on a single parent θ by sampling, at each generation, a population composed of λ × 2 individuals. At each generation, the algorithm generates the Gaussian vectors ε that are used to perturb the parameters (line 4) and evaluates the offspring (lines 4-5). To improve the accuracy of the fitness estimation, the algorithm generates mirrored samples (Brockhoff et al., 2010), i.e. generates λ couples of offspring receiving opposite perturbations (lines 4-5). The fitness values obtained by the offspring are then ranked and normalized in the range [-0.5, 0.5] (line 7). This normalization makes the algorithm invariant to the distribution of fitness values and reduces the effect of outliers. The estimated gradient g is then computed by averaging the dot product of the samples ε and of the normalized fitness values (line 8). Finally, the parameters of the parent θ are updated for one step by using the Adam optimizer (line 9). The update of the parameters can include a weight decay normalization operation.
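The following is a minimal runnable sketch of the algorithm (illustrative; it is not the evorobotpy2 implementation, it omits weight decay and observation normalization, and it uses a simplified Adam-style update without bias correction). The comments indicate, approximately, the steps referenced by the line numbers mentioned in the text; fitness() is assumed to evaluate a parameter vector on the task and return a scalar.

import numpy as np

lam, sigma, lrate = 20, 0.02, 0.01           # number of sample couples, noise std, step size

def rank_normalize(f):
    """Map fitness values to normalized ranks in the range [-0.5, 0.5]."""
    ranks = np.empty(len(f))
    ranks[np.argsort(f)] = np.arange(len(f))
    return ranks / (len(f) - 1) - 0.5

def openai_es(theta, fitness, generations=100):
    m, v = np.zeros_like(theta), np.zeros_like(theta)               # Adam moment estimates
    for generation in range(generations):
        eps = np.random.randn(lam, len(theta))                      # Gaussian samples (line 4)
        f_pos = np.array([fitness(theta + sigma * e) for e in eps]) # evaluate the mirrored
        f_neg = np.array([fitness(theta - sigma * e) for e in eps]) # offspring (lines 4-5)
        ranks = rank_normalize(np.concatenate([f_pos, f_neg]))      # rank normalization (line 7)
        grad = np.dot(ranks[:lam] - ranks[lam:], eps) / (lam * 2)   # gradient estimate (line 8)
        m = 0.9 * m + 0.1 * grad                                    # Adam-style update (line 9)
        v = 0.999 * v + 0.001 * grad ** 2
        theta = theta + lrate * m / (np.sqrt(v) + 1e-8)
    return theta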

To better clarify how the gradient is estimated, let's consider, for example, the case in which: (i) the parent (θ) includes only two parameters that are initially set to 0.0, (ii) the size of the population is 6, and (iii) the randomly generated samples are: σ * εi = ((-.0193, .0083) (-.0133, .0211) (-.0167, -.0212)). The resulting population will correspond to the red, green, and grey circles and to the three symmetric blue, cyan, and orange circles (Figure 6.2). The population is distributed around the parent (θ) which, in this example, is located at the origin of the axes. Let's now assume that the evaluation of the six individuals produces the fitness values displayed near the circles (first numbers). The normalized ranked fitness values will correspond to the following: gray=0.5, cyan=0.3, orange=0.1, blue=-0.1, green=-0.3, red=-0.5 (indicated by the second number near the dots in the figure). The contribution of each offspring to the estimated gradient corresponds to the colored arrows, where the length of each arrow corresponds to the normalized fitness of the offspring and its direction corresponds to the relative position of the offspring with respect to the parent in the parameter space. The estimated gradient is indicated by the black arrow, which is the vector sum of the six colored arrows (Figure 6.2). Notice that, for graphical purposes, the length of the black arrow was reduced by 95%.

Figure 6.2. Visualization of gradient estimation. The x and y axes represent the value of the first and second parameter, respectively. The circles represent the population. The two numbers on the top-right of the circles indicate the original and the normalized value of the fitness, respectively. The colored arrows represent the contribution of each offspring to the estimated gradient. The black arrow indicates the gradient of the expected fitness. Please notice that the green and cyan arrows overlap. For visualization purposes the length of the arrows has been reduced by 95%.

The parameters of the parent θ will then be modified in the direction of the black arrow and the process will be repeated for several generations. The exact modification of the parameters depends on the Adam stochastic optimizer (Kingma & Ba, 2014) that computes a moving average of the estimated gradient over time and uses the momentum and the squared momentum of each parameter to accelerate the movement in the directions of the gradients that are consistent over time and to decelerate the movements as a function of the steepness of the gradient.

The accuracy of the gradient estimation and the effectiveness of the algorithm can be improved by exposing symmetrical offspring to the same environmental conditions (Nolfi, unpublished result). Moreover, the effectiveness of the method can be improved by normalizing the observation vectors during the evolutionary process (Salimans et al., 2017; see also Pagliuca, Milano & Nolfi, 2020).

6.2.3 Development

The developmental process prescribes how the information encoded in the genotypes, i.e. the information structures that are subjected to variation and selection, determines the phenotypes, i.e. the robots that are situated in the environment.

The simplest way to realize the developmental process consists in using a direct encoding method in which each element of the genotype specifies a corresponding property of the phenotype. For example, as in the case of the experiment of Pagliuca & Nolfi (2020) reported in Section 3.4, the genotype can be formed by a vector of real numbers that specifies the connection weights of the robot’s neural network and the properties of the elements forming the body of the robot. Direct encoding methods operating on genotypes formed by vectors of real numbers can be used in combination with modern evolutionary strategies, such as the one illustrated in the previous section, which estimates the gradient of the expected fitness. However, direct encoding methods might not be suitable to encode phenotypes characterized by a variable number of features (e.g. a variable number of body elements with parametrized properties).
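A minimal sketch of a direct encoding of this kind (illustrative; the network sizes are arbitrary assumptions): the genotype is a flat vector of real numbers mapped one-to-one onto the connection weights and biases of a small feed-forward policy network.

import numpy as np

n_inputs, n_hidden, n_outputs = 8, 10, 2
n_params = (n_inputs * n_hidden + n_hidden) + (n_hidden * n_outputs + n_outputs)

def develop(genotype):
    """Slice the genotype into the weight matrices and bias vectors of the phenotype."""
    i = 0
    w1 = genotype[i:i + n_inputs * n_hidden].reshape(n_inputs, n_hidden)
    i += n_inputs * n_hidden
    b1 = genotype[i:i + n_hidden]
    i += n_hidden
    w2 = genotype[i:i + n_hidden * n_outputs].reshape(n_hidden, n_outputs)
    i += n_hidden * n_outputs
    b2 = genotype[i:i + n_outputs]
    return w1, b1, w2, b2

def policy(observation, genotype):
    w1, b1, w2, b2 = develop(genotype)
    hidden = np.tanh(observation @ w1 + b1)
    return np.tanh(hidden @ w2 + b2)              # motor commands in [-1, 1]

genotype = np.random.randn(n_params) * 0.1        # a random genotype of the initial generation
action = policy(np.zeros(n_inputs), genotype)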

A second possible method consists of using genotypes formed by a list of tuples with a type and a set of associated parameters, as in the case of the experiment of Lipson & Pollack (2000) reviewed in Section 3.4. Each tuple encodes a corresponding phenotypical element (e.g. a body element or a connection among two neurons), as in the case of the direct encoding methods. However, the relations among elements can vary and can be encoded by using indexes pointing to other elements. For example, the tuples encoding neural connections can include two indexes pointing to the neuron from which the connection originates and the neuron in which it ends. As discussed in Chapter 3, this method can be used to evolve bodies formed by a variable number of elements assembled in variable topologies (as in Lipson & Pollack, 2000) or neural networks formed by a variable number of neurons connected in variable architectures (as in the case of the NEAT algorithm developed by Stanley & Miikkulainen, 2002). The length of the genotype and the number of elements forming the phenotype can be increased or decreased by using genetic variation operators which add or remove tuples.

A third method consists in encoding a developmental neural network in the genotype by using one of the two methods described above. The developmental neural network is then used to determine the presence/absence of the elements in the body or brain space and the properties of the elements. An example of this method is the work of Stanley, D’Ambrosio & Gauci (2009), who evolved neuro-agents provided with neural network policies with variable architectures. This is realized by postulating the existence of a series of neurons arranged in successive planes embedded in an Euclidean space and by evolving a developmental network with six input neurons, encoding the x, y, and z positions of two neurons, and two output neurons, encoding the presence/absence of the corresponding connection and the connection weight. The connections and the connection weights are then determined by querying the developmental network with the spatial coordinates of all possible couples of neurons. A similar method was used by Cheney, Clune & Lipson (2014) in the experiment reviewed in Section 2.3 to encode the body properties of the evolving agents. In that case, the input neurons of the developmental network encode the x, y and z position of a voxel in the 10x10x10 body space. The output neurons specify the type of cell that should be placed in the voxel or whether the voxel should remain empty. The network is then queried 1000 times, once for each voxel position, to determine the content of each voxel.
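A minimal sketch in the spirit of this third method (illustrative; it is not the implementation of the cited works): a small developmental network is queried with the coordinates of every couple of neurons and outputs whether the corresponding connection exists and its weight.

import numpy as np

rng = np.random.default_rng(0)
dev_params = (rng.normal(size=(6, 16)), rng.normal(size=(16, 2)))   # evolved weights (here random)

def developmental_network(coords, params):
    """coords: (x1, y1, z1, x2, y2, z2) of two neurons; returns (presence, weight)."""
    w1, w2 = params
    hidden = np.sin(coords @ w1)                  # periodic activation yields repeated motifs
    presence, weight = np.tanh(hidden @ w2)
    return presence > 0.0, weight

# develop the connectivity of a phenotype with neurons arranged on three 4x4 planes
positions = [(x, y, z) for z in range(3) for y in range(4) for x in range(4)]
connections = []
for p1 in positions:
    for p2 in positions:
        exists, w = developmental_network(np.array(p1 + p2, dtype=float), dev_params)
        if exists:
            connections.append((p1, p2, w))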

Finally, the fourth method consists in encoding in the genotype the developmental rules that determine how an initial stem cell divides into two daughter cells that, in turn, subdivide multiple times, generating 4, 8, 16 and more cells. During the division process the cells can differentiate and specialize. This produces a developmental process analogous to that of natural multicellular organisms. This method has been used by Joachimczak, Suzuki & Arita (2016) to evolve agents formed by 2D spherical cells connected through elastic springs.

Some authors explored the possibility to combine evolution and learning (Nolfi & Floreano, 1999). For example, in the model proposed by Floreano, Durr & Mattiussi (2008), some of the connection weights are genetically encoded while others are set randomly and are adapted on the basis of genetically encoded Hebbian rules while the robot interacts with its environment. We will describe more recent attempts to combine evolution with un-supervised and self-supervised learning in Chapter 11.

6.3 Reinforcement learning methods

Another way to build adaptive robots is reinforcement learning (Sutton & Barto, 2018), a form of trial and error learning analogous to that used by animals and humans to adapt during the course of their life. The process is driven by a reward function that rates how well or how badly the robot is doing.

Reinforcement learning differs from evolution in two respects: (i) it operates on a single individual, instead of on a population of individuals, and (ii) it varies the parameters of the robot on the basis of a reward that rates the performance of the robot at each single step, instead of over the entire evaluation phase. The overall objective of learning is to maximize the cumulative reward, as in the case of evolutionary methods, although the method is usually used to maximize the discounted expected return (illustrated below), which is an approximation of the cumulative reward.

In the large majority of cases, reinforcement learning is used to adapt only the connection weights of the neural network policy of the robot. Recently, however, some authors used it also to adapt selected features of robots’ body (see for example Schaff et al., 2019; Luck, Amor & Calandra, 2019).

In general, reinforcement learning methods are more sample efficient than evolutionary methods, i.e. they require fewer evaluations to reach a given level of performance. On the other hand, they are more complex, depend on a larger set of hyperparameters, and are more subject to instability, i.e. to degenerative processes that prevent progress and eventually produce regressions.

6.3.1 Basic concepts

Reinforcement learning usually introduces random variations in the actions performed. This is another difference with respect to evolutionary methods, which instead introduce random variations in the parameters of the control network. The introduction of random variations in the action vector is realized by using stochastic policies. The two most common kinds of stochastic policies are categorical policies and diagonal Gaussian policies. The former are used in the case of discrete action spaces, i.e. in the case of problems where the agent can perform an action chosen within a finite set of possible alternative actions (as in the case of the Atari games [Mnih et al., 2015]). The latter are used for problems involving a continuous action space. Categorical policies use an output layer that encodes the probability to execute each possible alternative action, followed by a softmax layer that normalizes the probabilities so that they sum to 1. Diagonal Gaussian policies use an output layer that encodes the mean action vector, which is then perturbed through the addition of random values with mean 0.0. The distribution of the random values is encoded in a vector containing the diagonal elements of the covariance matrix of the multivariate normal distribution of actions, which is updated together with the other parameters of the policy.
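A minimal sketch of the two kinds of stochastic policies (illustrative; the network outputs used here are random placeholders rather than the outputs of trained policies):

import numpy as np

rng = np.random.default_rng(0)

def categorical_sample(logits):
    """Categorical policy: softmax over the output layer, then sample one discrete action."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                              # the probabilities sum to 1
    return rng.choice(len(probs), p=probs)

def diagonal_gaussian_sample(mean_action, log_std):
    """Diagonal Gaussian policy: perturb the mean action vector with zero-mean noise."""
    std = np.exp(log_std)                             # the log-std vector is learned with the policy
    return mean_action + std * rng.standard_normal(mean_action.shape)

discrete_action = categorical_sample(rng.standard_normal(4))             # e.g. 4 alternative actions
continuous_action = diagonal_gaussian_sample(np.zeros(2), np.full(2, -0.5))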

The trajectory or rollout is the sequence of observations, actions performed, and rewards received during each step of evaluation episodes.

As specified above, the goal of the agent is to maximize some form of cumulative reward or return. One kind of return is the finite-horizon undiscounted return, which is just the sum of the rewards obtained in a fixed window of steps. An alternative is the infinite-horizon discounted return, which is the sum of all rewards later obtained by the agent, discounted by how far in the future they are collected. The discount factor gives more importance to short-term than to long-term future rewards.
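In symbols, indicating with r_t the reward received at step t and with γ ∈ (0, 1) the discount factor, the first return is the plain sum of the r_t over the window of steps, while the second is the sum of the γ^t r_t terms. A minimal sketch of the two computations:

def undiscounted_return(rewards):
    return sum(rewards)                                        # finite horizon, no discount

def discounted_return(rewards, gamma=0.99):
    return sum(gamma ** t * r for t, r in enumerate(rewards))  # later rewards count less

rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
print(undiscounted_return(rewards), discounted_return(rewards))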

Finally, the expected return is the return that an agent can expect to receive from a certain state on, under the assumption that the agent will act on the basis of a given policy.

Online learning indicates algorithms that modify the parameters of the control network by using the rollout data collected on the last performed evaluation episodes. This data is later discarded. Offline learning algorithms, instead, use all the data generated during the training process to update the control network. Experience replay refers to an intermediate case in which the data used is sampled from the rolling history.

On-policy algorithms work with a single policy network, which determines the actions on the basis of the observations. Off-policy algorithms, instead, use two policies: a target policy that is learned and a behavioral policy that is used to determine the actions performed on the basis of the observations.

6.3.2 Q-learning

Reinforcement learning comes in different varieties. A first variety, which is suitable for problems with a discrete action space, is action-value function learning or Q-learning (Sutton & Barto, 2018). In short, Q-learning approximates the function Q(s, a) that computes the expected return of each possible action a in each possible state s. The Q function is often approximated by a neural network that receives as input the observation and the action and produces as output the expected return. The action executed by a Q-learning agent is selected by using the current Q function to evaluate the expected return of each possible action in the current state and by selecting the action with the highest expected return.

One way to calculate the expected return is the Montecarlo method, which consists in waiting for the end of the evaluation episode and then calculating, for each step, the sum of the rewards received in the successive steps. The sequence of observations and associated expected returns is then used as a training set for training with back-propagation the neural network computing the Q(s, a) function. The process is normally performed offline, i.e. by using data collected not only during the last evaluation episode but also during recent evaluation episodes (experience replay) or during all episodes.
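A minimal sketch of the computation of Montecarlo targets (illustrative): at the end of an episode each step is labeled with the sum of the rewards received from that step on.

def montecarlo_targets(rewards, gamma=1.0):
    targets, future_return = [], 0.0
    for r in reversed(rewards):                   # walk the episode backwards
        future_return = r + gamma * future_return
        targets.append(future_return)
    return list(reversed(targets))

# e.g. a won Pong-like episode: reward 0.0 at every step and 1.0 at the end
print(montecarlo_targets([0.0, 0.0, 0.0, 1.0]))   # -> [1.0, 1.0, 1.0, 1.0]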

As an example of the Montecarlo method, consider an agent learning to play the Pong Atari game at an intermediate stage of the learning process. During an evaluation episode it selects actions that produced good returns in previous evaluation episodes but it also selects random actions with a probability ε, where ε is a parameter that is progressively reduced during the course of the training process. Given the scoring rule of the Pong game, the agent always receives a reward of 0.0 during evaluation episodes and a reward of 1.0 or -1.0 at the end of won and lost episode-games, respectively. Consequently, the expected undiscounted return will be 1 or -1. Overall, this implies that the training data generated in this way alters the parameters of the policy so as to increase and decrease the Q-values of actions produced in won and lost episodes, respectively. Clearly, not all actions selected in successful and unsuccessful episodes necessarily contributed positively or negatively to the outcome. However, over many episodes, only the actions that really contributed to producing the positive result will be associated with high Q-values consistently.

An alternative method for estimating the expected return is the temporal difference method, which permits calculating a gradient after each step, without waiting for the end of the episode. This is realized by using the Q-values computed by the policy itself as a proxy of the expected return. The Q-values computed by the policy are initially wrong but become more and more accurate as learning proceeds. The idea can be illustrated with the following example. Suppose that on Friday you want to predict the weather for Sunday and you have some model of weather forecast. According to the Montecarlo method, you have to wait until Sunday to update your model. However, on Saturday your model will provide a more accurate forecast than on Friday. This means that you can use the difference between the prediction made on Friday and the prediction made on Saturday to improve your prediction model already on Saturday.

The advantage of temporal difference is that it permits using a teaching signal that is less noisy than that computed with Montecarlo. The advantage of Montecarlo is that it is not biased by the inaccuracy of the value predictions used by the temporal difference method, which necessarily drive the learning process toward partially wrong directions. Montecarlo methods are slower but tend to converge on good solutions in the long run. Temporal difference methods are usually faster but might diverge, i.e. might produce phases in which the quality of the policy decreases instead of increasing. A good compromise consists in using a hybrid approach in which the policy is updated every n steps, where n is greater than 1 and smaller than the length of the evaluation episodes.
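A minimal sketch contrasting the three kinds of targets (illustrative; q_next stands for the value predicted by the current Q-network for the step at which bootstrapping occurs, an assumption of this sketch rather than a specific library call):

def montecarlo_target(rewards, gamma=0.99):
    return sum(gamma ** t * r for t, r in enumerate(rewards))    # wait for the end of the episode

def td_target(reward, q_next, gamma=0.99):
    return reward + gamma * q_next                               # bootstrap after a single step

def n_step_target(rewards, q_next, gamma=0.99):
    n = len(rewards)                                             # bootstrap after n steps
    return sum(gamma ** t * r for t, r in enumerate(rewards)) + gamma ** n * q_next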

A notable example of a Q-learning algorithm is Deep Q-Learning (DQN), which achieved human-level performance on many Atari games (Mnih et al., 2015). Unfortunately, Q-learning does not perform well on problems involving continuous action spaces. Probably this is due to the fact that attempting to learn a function that estimates the expected value of all possible actions in all possible states is too complex.

6.3.3 Policy learning

Policy learning methods determine the actions on the basis of the observations directly, as in the case of evolutionary methods. Learning is normally performed online and on-policy, i.e. the update of the parameters of the policy is based on the data collected recently by performing the actions prescribed by the current version of the policy. In the case of actor-critic methods, the policy is also trained to approximate a value function, i.e. to predict the return expected at each step. As we will see below, this predicted value can be used to compute an advantage, i.e. how much better or worse the executed action is with respect to the average action executed in the current situation by the current policy. The advantage is then used to determine whether the probability to execute the actions performed should be increased or decreased.

An example of an algorithm of this class is the Vanilla Policy Gradient (VPG) described in Achiam (2018). The pseudocode of the algorithm is included below. The policy and the value function are approximated by an actor and a critic neural network with parameters θ and Φ, respectively, which receive as input the current observation. The policy network, which implements a diagonal Gaussian policy, outputs the action vector. The critic network, instead, outputs the value, i.e. the predicted cumulative reward.
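As a partial substitute for the original pseudocode, the following is a minimal runnable sketch of one VPG update with a linear actor, a linear critic, and a diagonal Gaussian policy with fixed standard deviation (an illustrative simplification, not the implementation of Achiam, 2018; plain gradient steps replace the Adam optimizer). The batch of observations, actions, advantages and discounted returns is assumed to have been collected and computed beforehand, which corresponds to the data-collection and return/advantage computations referenced below as lines 3-5 (see also the sketch of those two computations further below).

import numpy as np

def vpg_update(W_actor, w_critic, obs, actions, advantages, returns,
               sigma=0.3, lr_actor=0.01, lr_critic=0.05):
    """One update of a linear Gaussian actor a ~ N(W_actor @ s, sigma^2 I), with
    W_actor of shape (n_act, n_obs), and of a linear critic V(s) = w_critic @ s."""
    # estimate the policy gradient with the log-derivative trick (line 6 of the
    # referenced pseudocode): grad of log pi(a|s) w.r.t. W_actor is ((a - W s) / sigma^2) outer s
    mean_actions = obs @ W_actor.T                                # (N, n_act)
    score = (actions - mean_actions) / sigma ** 2                 # (N, n_act)
    grad_actor = (score * advantages[:, None]).T @ obs / len(obs)
    W_actor = W_actor + lr_actor * grad_actor                     # gradient-ascent step (line 7 uses Adam)
    # fit the critic by regression on the mean-squared error with the actual returns (line 8)
    values = obs @ w_critic
    grad_critic = ((returns - values) @ obs) / len(obs)
    w_critic = w_critic + lr_critic * grad_critic
    return W_actor, w_critic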

During each iteration k, the agent is evaluated for multiple episodes until a certain number of steps (e.g. 4000) is collected. The data collected at each step include: the observation vector, the action vector, the reward, the log probability of the action vector, and the value, i.e. the predicted cumulative reward (line 3). The rewards and the value data (produced by the critic) are then used to compute the cumulative reward from each step on and the advantage of the actions executed (lines 4 and 5).

The cumulative reward is computed by discounting future rewards by a parameter γ set to 0.99 on the basis of the following equation:

\hat{R}_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}

where r_t is the reward received at time t.

The advantage, i.e. how much better the discounted return of the action executed is with respect to the average return obtained in that state, is computed with the generalized advantage estimation method on the basis of the following equation:

A_t = \sum_{k=0}^{\infty} (\gamma \lambda)^k \, \delta_{t+k}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

where the value V is the output of the critic network and λ is an additional discounting parameter set to 0.97. The vector of advantages is then normalized so as to have an average of 0.0 and unit standard deviation.

At this point the algorithm can estimate the gradient for the actor network that maximizes the advantages by using the log derivative trick (line 6), update the parameters of the policy network with the Adam stochastic optimizer (line 7), and update the parameters of the critic network based on the difference between the estimated discounted return computed by the critic network and the actual discounted return (line 8). The critic network is also updated by using the Adam stochastic optimizer.

Policy learning algorithms include A2C/A3C (Mnih, Badia, Mirza et al., 2016), TRPO (Schulman et al., 2015), and PPO (Schulman et al., 2017). The last two methods are extensions of the vanilla policy gradient algorithm described above. The extension consists in the addition of a rule that reduces the amplitude of the step taken along the gradient, to avoid the instability that can be caused by the introduction of changes which alter the behavior of the policy too much. In the case of PPO this is achieved by penalizing excessive divergence from the previous policy or by clipping the ratio between the probability of the actions under the new and the old policy so as to keep it within a desired range.
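As a concrete illustration of the clipping mechanism, the following sketch computes the PPO clipped surrogate objective for a batch of actions (illustrative variable names; the log probabilities and advantages are assumed to have been computed as in the previous section):

import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip=0.2):
    ratio = np.exp(new_log_probs - old_log_probs)        # probability ratio between new and old policy
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)     # ratios outside [1-clip, 1+clip] are clipped
    # the more pessimistic of the clipped and unclipped surrogate objectives is maximized
    return np.minimum(ratio * advantages, clipped * advantages).mean()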

Policy learning and Q-learning can also be combined. An example of a hybrid method of this type is the DDPG algorithm (Lillicrap, Hunt, Pritzel et al., 2015) that concurrently learns a deterministic policy and a Q-function and uses each to improve the other.

6.4 How to design the fitness or reward function

Designing the fitness function or the reward function that enables a robot to develop a certain skill can be challenging although it is certainly much less challenging than designing the robot directly, without relying on an adaptive process. Such difficulty concerns the design of incentive plans for animals and humans as well. The problem arises from the fact that incentives can often be maximized in many alternative ways including ways that are unsatisfactory from the point of view of the designer.

A first aspect to consider is that rewards can be sparse, i.e. they can reward the accomplishment of a function without rewarding the steps that enable the achievement of the goal. For example, predator robots evolved for the ability to ‘catch’ prey robots can be rewarded only for reaching and touching the prey robot (Simione and Nolfi, 2020). A sparse reward of this type is sufficient to enable the development of a series of behaviors that are instrumental with respect to this objective and that are not rewarded explicitly: for example, the ability to avoid obstacles, pursue a moving target, optimize the motion trajectory with respect to multiple constraints, integrate sensory information over time, anticipate the behavior of the prey robot, manage appropriately the available energy resources, adapt on the fly to the behavior of the current prey, etc. Similarly, robotic hands trained through reinforcement learning for the ability to move and rotate an object to a desired position and orientation (Andrychowicz et al., 2018) can be rewarded only when the goal is achieved, without providing any feedback on the sequence of actions and behaviors that are necessary to achieve the goal. Consequently, fitness/reward functions that reward the robot only for the ability to achieve its goal should be considered the most straightforward solution.

In certain cases, however, sparse rewards of this type might create a bootstrap problem. The problem originates from the fact that the behavior displayed at the beginning of the training process is poor. The combination of a sparse reward with poor behaviors can lead to a situation where the robots keep receiving the same null rewards. In this condition, the adaptive process is unable to detect and retain adaptive variations and consequently to produce progress.

The best way to avoid the bootstrap problem consists in exposing the adaptive agents to variable environmental conditions, i.e. varying the characteristics of the environment in which the robot is situated during evaluation episodes and varying the initial position, orientation, and posture of the robot at the beginning of evaluation episodes. Indeed, the greater the variability of the learning conditions, the greater the chance that the robot will get a reward. A good example of this issue is the robotic hand of Andrychowicz et al. (2018) mentioned above, which we will review in more detail in the next chapter. The chance of being rewarded even once at the beginning of the training process, when the offset between the initial and final position and orientation of the object is significant, is negligible. However, the chance of being rewarded when the offset is minimal is much higher, even at the beginning of the training process. By selecting the target position and orientation of the object randomly, the authors ensure the exposure to highly variable environmental conditions, including particularly favorable conditions in which the chance of being rewarded is sufficiently high.

When the bootstrap problem cannot be solved by increasing the variability of the learning experiences, the experimenter can be forced to reward the robot also for the ability to achieve sub-goals that are instrumental to the achievement of the robot’s main goal, or to punish the exhibition of behaviors that are dysfunctional to the achievement of the robot’s goal. In this case, the fitness/reward function includes a primary reward component for the achievement of the goal and one or more secondary reward or punishment components. The utilization of additional components can solve the bootstrap problem and speed up learning but also introduces biases in the learning process that can steer the process toward sub-optimal solutions. For example, consider the case of a robot trained to play tennis against a pre-programmed opponent player. A function that rewards the robot only when it manages to win a game might cause the bootstrap problem since the chance of winning a game at the beginning of the adaptive process is extremely low. The experimenter might thus decide to give a primary reward for winning a game, a secondary smaller reward for scoring a point, and a tertiary even smaller reward for shooting balls inside the opponent’s court. After all, scoring points and complying with the game rules are necessary prerequisites for winning a game. However, the addition of secondary reward components encourages the achievement of the corresponding sub-goals independently of whether or not they are instrumental to the achievement of the primary goal. For example, they can favor the development of strategies that maximize the number of valid shots by using a prudent behavior that minimizes the risk of producing nets and outs but generates situations that are easy for the opponent to handle. In other words, progress with respect to secondary components can reduce, rather than increase, the chances to win games. This is due to the fact that the secondary components can be maximized in different manners, and only some of these manners are instrumental to the achievement of the robot’s primary goal. Overall, this implies that the experimenter should limit as much as possible the usage of additional reward or punishment components and should formulate them so as to minimize their biasing effect.
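As a minimal illustration of the difference between a sparse reward and a shaped reward for the tennis example, consider the sketch below (the event flags and the weights are hypothetical placeholders, not values used in an actual experiment):

def sparse_reward(won_game):
    return 1.0 if won_game else 0.0                      # primary goal only

def shaped_reward(won_game, scored_point, valid_shot):
    # primary component plus smaller secondary and tertiary components; the extra
    # components can bias the search toward prudent strategies that maximize valid
    # shots without increasing the chance of winning games
    return 1.0 * won_game + 0.1 * scored_point + 0.01 * valid_shot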

The problem is complicated by the fact that the effect of secondary reward components on the adaptive process varies as a function of the learning phase and of the behavior currently exhibited by the robot. Consequently, the utilization of additional components can be beneficial in some replications of the experiment or during a certain phase of the adaptive process but can have a negative effect in other replications or in other phases.

Finally, the fitness/reward function should also be tuned to the characteristics of the method used. In evolutionary methods the learning process is driven by the cumulative reward obtained during evaluation episodes. In reinforcement learning, instead, it is usually driven by the expected return computed over a finite horizon. Consequently, functions that are suitable for methods of one class are not necessarily suitable for methods of a different class; see for example the analysis reported in Pagliuca, Milano & Nolfi (2020).

6.5 Learning by demonstration

Finally, a third and last class of adaptive approaches consists in learning by demonstration or imitation learning (Argall et al., 2009; Billard & Grollman, 2013; Billard, Calinon & Dillmann, 2016). In these methods, the experimenter shows the robot a behavior performing a given desired function and the robot learns to approximate the demonstrated behavior.

An example is constituted by the robot arm shown in Video 6.1, which is trained for the ability to prepare orange juice (Billard & Grollman, 2013). The training occurs in three phases: a demonstration phase in which the teacher demonstrates the appropriate behaviors in variable conditions, a training phase in which the training set generated during the demonstration phase is used to train a neural network policy through a supervised learning method, and a testing phase in which the ability of the robot to perform the task and to generalize to partially new conditions is evaluated.

Video 6.1. Top: demonstration phase. Bottom: post-evaluation phase. (www.scholarpedia.org/article/Robot_learning_by_demonstration, Figures 1 and 2; Billard & Grollman, 2013).
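A minimal sketch of the training phase described above, i.e. supervised learning on the observation-action pairs collected during the demonstrations (an illustrative linear least-squares version, not the method used in the cited work):

import numpy as np

def train_linear_policy(observations, actions):
    """observations: (N, n_obs) array; actions: (N, n_act) array recorded during the demonstrations."""
    X = np.hstack([observations, np.ones((len(observations), 1))])   # add a bias column
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)                   # supervised least-squares fit
    return W

def policy(observation, W):
    return np.append(observation, 1.0) @ W                            # reproduce the demonstrated action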

The demonstration phase can be realized in different manners. In the case of kinesthetic teaching the actuators of the robot are set in a passive mode and the experimenter physically manipulates the robot so as to force it to produce the desired behavior. This is the modality used in the experiment shown in Video 6.1. The demonstration phase is used to generate a training set containing the observations and the states of the actuators inferred from the passive movements of the robot on the basis of the state of the proprioceptors. Kinesthetic teaching constitutes the most straightforward method to demonstrate the target behavior but is not free from drawbacks and limits. A drawback concerns the need to “clean” the training set from the stimuli generated by the presence of the demonstrator. A limit is that the demonstrator can control only a limited number of DOFs during the demonstration.

In teleoperation the actuators of the robot are controlled by the experimenter through a joystick and/or a haptic device. The demonstrator teleoperates the robot on the basis of visual information gained from her/his perspective through her/his own sensing system. Alternatively, the demonstrator can teleoperate the robot on the basis of sensory information extracted from the robot’s own sensors and accessed through some form of virtual reality tool. An advantage of teleoperation is that the demonstrator does not need to stay close to the robot and consequently does not interfere with the data collected. The limit is that, as for kinesthetic teaching, controlling robots with many DOFs can be difficult and might require long practicing sessions.

In methods that use direct imitation of human behavior, the training set is generated from the observation of a human displaying the desired behavior. An example is constituted by the work of Ude, Atkeson & Riley (2004) in which a humanoid robot is trained to produce the walking behavior on the basis of a training set including the position and the velocity that the robot’s joints should have over time. This information is extracted from videos displaying a walking human through a standard machine vision tool that identifies the position and orientation of the human limbs over time. The advantage of this method is that it enables the demonstrator to move freely and naturally. A downside is that it can only be applied to robots with a human morphology that matches in detail the characteristics of a real human. Another difficulty is that the information extracted from the observation of the demonstrator can be inaccurate and incomplete due, for example, to parts visually occluded by other parts.

Approaches also vary with respect to the level of organization at which they are applied. Indeed, learning by demonstration can be applied to train the robot to display low-level behaviors capable of achieving a desired function, as in the examples illustrated above, or to learn to combine pre-existing behaviors to achieve new functions (Dillmann, 2004; Skoglund et al., 2007).

A general drawback of learning by demonstration is that minor differences in the action produced by the robot with respect to demonstrated actions can lead to significant differences over time which might prevent the robot from achieving the expected goal during the testing phase.

One way to limit this problem consists in using an iterated approach in which the demonstration, training and evaluation phases are repeated multiple times and in which the training set is extended with additional demonstrations selected with the purpose of improving the currently unsatisfactory aspects of the robot’s behavior. Yamashita & Tani (2008) used this strategy to train a humanoid robot to hold and move an object with two hands. After the first training phase the authors observed that the robot failed to hold the object since it did not use enough force to push the object within its hands. They thus created a second training set that was generated by allowing the robot to act on the basis of its trained controller and by pushing the robot’s hands on the object so as to encourage the robot to learn to put more pressure on the object. We will describe this experiment in more detail in Chapter 11.

Learning by demonstration methods can be usefully combined with evolutionary or reinforcement learning. The learning by demonstration component can be used to acquire a behavior that approximates a suitable effective behavior but which may fail to achieve the goal due to small errors accumulating over time. The evolutionary or reinforcement learning component can then be used to refine the behavior acquired through the learning by demonstration method in a way that is functional to the achievement of the desired goal. Hybrid approaches of this type have been used to train a robot arm to flip pancakes (Kormushev et al., 2010) and to solve the ball-in-a-cup problem (Kober & Peters, 2011).

6.6 Learn how

Read section 13.9 and follow the instructions included in Exercise 8 to understand in detail the implementation of the OpenAI-ES evolutionary algorithm included in the evorobotpy2 software. Read section 13.10, which provides guidelines for setting hyperparameters in evolutionary experiments.

You can experiment with reinforcement learning algorithms after reading the next chapters.



7. Robustness, Plasticity and Antifragility

7.1 Introduction

The physical world is highly uncertain. Almost no characteristic, dimension, or property remains constant. Variations affect both the external environment and the body of the robot. For example, the light conditions, the temperature, and the distribution of dust vary over time. The body parts of the robot break down or wear out, the quality of the materials changes, batteries run out, etc. Moreover, the relative position/orientation of the robot in the environment varies during different operation periods, the quantities measured by sensors are noisy, the effect of actuators is uncertain, the measures used to calculate the fitness or the reward are noisy.

To operate effectively, robots should be robust to these forms of variation, i.e. they should be able to perform effectively in varying conditions. Possibly, robots should be more than robust: they can be antifragile (Taleb, 2012), i.e. capable of gaining from the exposure to variations. Finally, robots should adapt on the fly to variations that cannot be handled properly by means of a single robust strategy. We will use the term plasticity to indicate the ability to adapt on the fly to variations of the environmental conditions.

7.2 Robustness

The problem of dealing with environmental variations can be solved by developing solutions that are robust, i.e. that perform effectively in varied environmental conditions. The development of robust solutions can be promoted simply by exposing the robots to variable environmental conditions during the adaptive process. When the training is performed in simulation this can be realized by: (i) varying the position, orientation and/or posture of the robot at the beginning of each evaluation episode, (ii) varying the characteristics of the external environment at the beginning of each evaluation episode, e.g. the position and the texture of the objects present in the environment, and (iii) perturbing with random values the state of the sensors and actuators of the robot, and possibly the state of the body parts forming the robot, during each step. Usually, varying only some properties of the environment is sufficient. On the other hand, a complete lack of variation usually leads to the development of brittle solutions that are overfitted to the conditions experienced and that do not generalize to different environmental conditions.
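A minimal sketch of this kind of randomization in simulation (illustrative; the simulator object and its attributes are assumptions, not the API of a specific tool):

import random

def randomize_episode(sim):
    # (i) random initial position and orientation of the robot
    sim.robot_position = (random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0))
    sim.robot_orientation = random.uniform(0.0, 360.0)
    # (ii) random variation of the environment, e.g. wall length varied by up to 10%
    sim.wall_length = sim.nominal_wall_length * random.uniform(0.9, 1.1)

def noisy_observation(observation, noise_range=0.05):
    # (iii) sensor readings perturbed with random values at every step
    return [o + random.uniform(-noise_range, noise_range) for o in observation]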

In the case of evolutionary methods this implies that the environmental conditions should vary among evaluation episodes (when the robots are evaluated for multiple episodes), among the robots forming the population, and across generations. In the case of robots trained with reinforcement learning, the environmental conditions should vary among evaluation episodes and across learning epochs.

When we think of environmental variations we usually refer to dynamic environments, i.e. environments embedding features that vary over time. However, static environments varying over space, i.e. including areas with diversified characteristics, are functionally equivalent to dynamic environments from the point of view of a situated robot. From the perspective of a situated robot, what matters most is the local portion of the environment, which changes as a result of the robot's movements. Consequently, variations of the environment over space can be equivalent to variations of the environment over time. Overall, this implies that the variability of the environmental conditions can be increased either by training the robot in a dynamic environment or by training it in a spatially diversified environment.

The adaptive process also exposes the robots to other forms of variation that are necessary to discover how to improve the robots’ skills: variations of the parameters introduced during the reproduction process, in the case of evolutionary methods, and variations of the actions introduced at every step, in the case of reinforcement learning methods. This naturally promotes the synthesis of solutions that are robust with respect to these forms of variation. Indeed, adapted robots are generally more robust than hand-designed robots for this reason. The development of solutions that are robust to these forms of variation generally also provides a certain level of robustness with respect to environmental variations.

The need to evaluate adaptive robots in variable environmental conditions complicates the identification of the best solution. The best solution obtained during an adaptive process does not necessarily correspond to the solution obtained at the end of the evolutionary or training process. When the environmental conditions do not vary, the best solution can be identified simply by choosing the solution that obtained the highest fitness or the highest cumulative reward during the adaptive process. When the environmental conditions vary, instead, this is not sufficient, since the performance obtained also depends on the specific environmental conditions experienced. The solution that achieved the highest performance can correspond to a solution that was lucky, i.e. that encountered easy environmental conditions, or that is brittle, i.e. that achieved a high performance in the specific conditions experienced but performs much more poorly in different conditions. The truly best solution should rather be identified by post-evaluating the most promising candidate solutions for multiple episodes in new environmental conditions and by choosing the solution that achieved the highest performance during such post-evaluation (Pagliuca & Nolfi, 2019).
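
A minimal sketch of this post-evaluation step, under the same assumptions as the previous snippet: evaluate() is the noisy evaluation routine sketched above and policy_from_params() is a hypothetical helper that builds a policy from a parameter vector.

```python
def select_best(candidates, env, policy_from_params, post_episodes=50):
    """Re-evaluate promising candidates in fresh conditions and keep the best.

    `candidates` is a list of parameter vectors saved during the run
    (e.g. the best-of-generation solutions).
    """
    best_params, best_score = None, float("-inf")
    for params in candidates:
        score = evaluate(policy_from_params(params), env, episodes=post_episodes)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```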

7.3 Plasticity

The second way that permits dealing with environmental variations is plasticity, i.e. the ability of a robot to adapt to changing conditions on the fly. As mentioned above, in situations where robust solutions are not available, plasticity represents the only possible way to handle variations. The adaptation of the robot’s behavior to the current environmental conditions can be realized through behavioral or neural plasticity.

Behavioral plasticity consists in the ability to display multiple behaviors and to select the behavior that best suits the current environmental circumstances. As we have seen in Chapter 5, behavioral plasticity does not necessarily require specialized machinery. It can arise simply as a consequence of a bifurcation of the robot/environmental dynamics triggered by the variation of critical parameters correlated to observation states. For example, it may arise as a bifurcation triggered by the state of the time sensor that leads to the production of an explorative behavior, suited for cleaning open areas, or a wall-following behavior, suited for cleaning peripheral areas (see Section 5.7). As an alternative example, consider a predator robot that aims to catch prey. Some prey run away as soon as they perceive the predator. Others hide, stay still, and run away only if the predator approaches them. This problem does not admit a single strategy. To capture both types of prey the predator should display two different behaviors and select the behavior that suits the current prey: it should approach the hiding prey and anticipate the trajectory of the fleeing prey without necessarily approaching it. The predator can extract from the observations an internal state encoding the category of the current prey and use this state to select the appropriate behavior.

Behavioral plasticity permits adapting to the current environmental conditions immediately. However, it only permits adapting to environmental conditions that already occurred repeatedly during previous phases of the adaptive process. Indeed, it is the exposure to those environmental conditions that creates the adaptive pressure for the development of the appropriate associated behaviors. Variations leading to new conditions never experienced before can only be handled through neural plasticity (as discussed below) or through behavioral generalization (as discussed in the next chapter).

Neural plasticity can be realized simply by continuing the adaptive process. However, standard adaptive methods can be too slow to cope with fast environmental variations. Consequently, one might consider specialized adaptive methods which can be faster.

A possible way to speed up the adaptation to environmental variations consists in enabling the robots to vary their parameters on the basis of genetically encoded unsupervised learning rules (Floreano & Urzelai, 2000). This can be performed by encoding the properties of the robot’s brain in a vector of tuples determining for each connection: (i) whether it is fixed or plastic, (ii) the weight value, in the case of fixed connections, and (iii) the Hebbian rule type and the learning rate, in the case of plastic connections (Floreano & Urzelai, 2000; see also Figure 7.1). Hebbian learning is a form of unsupervised learning that varies the strength of a connection weight on the basis of the activation of the presynaptic and postsynaptic neurons. The utilization of this method in variable environmental conditions supports the evolution of robots capable of adapting on the fly to the current conditions. For example, in a co-evolutionary experiment involving predator and prey robots, adapted for the ability to catch prey and to avoid being caught by predators, respectively, it allows predator robots to adapt on the fly to the characteristics of their current prey (Floreano & Nolfi, 1997). For a related approach that combines Hebbian and back-propagation learning see Miconi (2016).

Figure 7.1. The first two bits of each tuple specify whether the connection is fixed or plastic and the sign of the connection. In the case of fixed connections, the next four bits encode the weight. In the case of plastic connections, instead, the next four bits specify the learning rule and the learning rate. The initial weight of plastic connections is set randomly at the beginning of evaluation episodes. There are four possible types of Hebbian learning rules, which were modeled upon neurophysiological data and are complementary to each other (see Floreano & Urzelai, 2000).
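
The idea can be illustrated with a simplified sketch. The snippet below implements only the plain Hebbian rule (dw = eta * pre * post) rather than the four rules used by Floreano & Urzelai, and the way the flags, weights, and learning rates are stored is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

class PlasticLayer:
    """One fully connected layer with a per-connection plasticity flag.

    plastic[i, j] == True  -> the weight is updated online with a plain
                              Hebbian rule, dw = eta * pre * post
    plastic[i, j] == False -> the weight stays at its genetically
                              specified value
    """
    def __init__(self, plastic, fixed_w, eta):
        self.plastic = plastic            # bool matrix, shape (n_out, n_in)
        self.eta = eta                    # learning rate(s), scalar or matrix
        # plastic weights start from small random values; in an adaptive
        # experiment the layer would be re-created at each episode
        self.w = np.where(plastic, rng.uniform(-0.1, 0.1, plastic.shape), fixed_w)

    def forward(self, pre):
        post = np.tanh(self.w @ pre)
        dw = self.eta * np.outer(post, pre)                     # plain Hebbian update
        self.w = np.where(self.plastic,
                          np.clip(self.w + dw, -1.0, 1.0),      # update plastic weights only
                          self.w)
        return post
```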

Another approach has been proposed by Cully et al. (2015). In their case the adaptation to environmental variations is realized by using a standard evolutionary algorithm. The speed of the training process is increased by applying the algorithm to a smaller set of parameters encoding the principal components of variation of the full set of parameters.

7.4 Antifragility

Antifragility is the ability to benefit from variability, disorder and stressors. Consequently, antifragility is the opposite of fragility and goes beyond robustness or resilience. Robust or resilient systems continue performing well in varied conditions. Antifragile systems benefit from the exposure to variations (Taleb, 2012).

Evolutionary and reinforcement learning systems are examples of antifragile systems. Systems subjected to an evolutionary process improve as a result of the random variations of their parameters introduced during reproduction. Similarly, systems trained through reinforcement learning algorithms improve as a result of the random variations introduced in their actions. Both methods benefit from variations: they exploit them to discover better solutions.

Adaptive robots can also benefit from the exposure to environmental variations. Consider, for example, a robot adapted for the ability to walk on flat ground from a given initial posture. The adapted robot will be fragile to uneven terrain and to variations of its initial posture. On the contrary, a robot adapted for the ability to walk on rugged terrain from a given initial posture will be robust to uneven terrain and also, to a certain extent, to variations of its initial posture, since the exposure to rugged terrain exposes it to a larger set of postures. Overall, this implies that the exposure to environmental variability of one kind can lead to solutions that generalize to other forms of variability.

Often the advantages gained by variation at one level of organization are paid in terms of fragility at other organization levels. In other words, antifragility for one is fragility for someone else: some parts of the system should be fragile to make the entire system antifragile (Taleb, 2012). For example, the gain obtained by an adaptive robot at the end of an adaptation phase in which it has been intentionally exposed to strong environmental variations is paid with a reduction of performance during the adaptation phase itself. Similarly, the increase in performance obtained at the level of a group by sharing information with other robots through communication can produce a reduction of performance at the level of the single cooperating individuals (see Chapter 8). However, the disadvantages affecting certain levels of organization can be worth paying when the associated advantages affect the level of organization we care about.

7.5 Crossing the reality gap

A form of robustness that is particularly important for adaptive robots concerns the differences between the simulated and the real environment. This matters when the training process is performed in simulation and the trained solution is then ported to the physical robot situated in the physical environment. The trained robot is moved from the simulated to the physical environment and consequently should be robust with respect to the variations that it encounters when it is moved from the environment in which it has been trained to the environment in which it should operate. The ability to keep performing effectively in the real environment is usually referred to as the ability to cross the reality gap.

The problem of crossing the reality gap can be avoided by performing the adaptive process directly in hardware in the real environment. However, this is generally time consuming and costly. For example, Levine et al. (2017) trained a manipulator robot in hardware with a reinforcement learning algorithm for the ability to grasp objects with varying shapes on the basis of visual information. To collect the training data in hardware the authors used 14 robotic manipulators that performed a total of 800,000 grasping attempts. Monitoring the operation of many robots for a prolonged period of time and solving the problems occurring during the process requires a quite significant amount of human work.

Carrying out the adaptive process in simulation, instead, presents several advantages. It enables simulating the behavior of robots at a rate that exceeds real time. It permits speeding up training by means of parallel computation (notably, evolutionary methods can achieve an almost linear speedup in the number of CPU cores [Salimans et al., 2017]). It permits accessing information that is not available in the physical environment and that can be used to drive the adaptive process (see below). Finally, it reduces the hardware cost and the cost of hardware maintenance.

The simulation, however, will never match the actual world perfectly. Consequently, the candidate solutions adapted in simulation should be sufficiently robust to cross the reality gap, i.e. robust to the differences between the simulated and the real environment.

A good illustration of the methods that can be used to promote the development of solutions able to cross the reality gap is the experiment of Andrychowicz et al. (2018), in which the Shadow robotic hand (ShadowRobot, 2005) was trained in simulation for the ability to manipulate objects and in which the solutions trained in simulation were successfully ported to hardware.

The robotic hand has 24 DoF in the wrist and fingers, actuated by 20 pairs of agonist–antagonist tendons. The policy network of the robot was trained through the PPO reinforcement learning algorithm (Schulman et al., 2017) for the ability to rotate and move a cubic object to a randomly generated target position and orientation (the goal). This problem is particularly hard to simulate accurately due to the need to model the collisions between the object and a robot with an articulated morphology.

The reward function rewards the robot with 5.0 every time the goal is achieved and punishes it with -20.0 every time the object falls from the hand. The target position and orientation that the cubic object should assume, i.e. the goal, is shown in the bottom-right part of Video 7.1. As you can see, the trained robot manages to manipulate and rotate the object as requested. Once the goal is achieved and the robot is rewarded, a new goal is selected randomly and the robot starts a new manipulation behavior.
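
The sparse terms of this reward can be written compactly. The tolerances used below to decide whether the goal pose has been reached or the object has fallen are illustrative guesses, not the values used by the authors.

```python
import numpy as np

def sparse_reward(object_pose, goal_pose, object_height,
                  pos_tol=0.01, rot_tol=0.4, fall_height=0.04):
    """Sketch of the sparse reward described in the text: +5 when the goal
    pose is matched, -20 when the object falls from the hand, 0 otherwise.
    Poses are assumed to be [x, y, z, orientation...]; tolerances are guesses."""
    if object_height < fall_height:
        return -20.0
    pos_err = np.linalg.norm(object_pose[:3] - goal_pose[:3])
    rot_err = np.abs(object_pose[3:] - goal_pose[3:]).sum()   # crude orientation error
    if pos_err < pos_tol and rot_err < rot_tol:
        return 5.0
    return 0.0
```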

Video 7.1. The dexterous in-hand manipulation experiment of Andrychowicz et al. (2018). The goal, i.e. the target position and orientation that the cubic object should assume, is shown in the bottom-right of the video. (Andrychowicz et al. 2018, https://youtu.be/jwSbzNHGflM)

The actor network of the robot, which is used both during training and testing, receives as input the target and current position and orientation of the object and the positions of the palm and of the fingertips. The position and orientation information is extracted from three external cameras pointing at the robotic hand. The images perceived by the cameras are simulated during training and are provided by the real cameras during testing in the real environment.

The critic network also receives as input the position and velocity of the hand joints and the velocity of the object. These additional data, available in simulation, could not be extracted in a reliable manner from real sensors. This, however, does not constitute a problem since the critic network is used only during training, which is performed in simulation.
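
This asymmetry between the actor and the critic inputs can be sketched as follows; the dictionary keys are hypothetical names for the quantities mentioned in the text, not the interface of the original implementation.

```python
import numpy as np

def actor_observation(sim):
    """Inputs available both in simulation and on the real robot
    (extracted from the external cameras at test time)."""
    return np.concatenate([
        sim["object_pos"], sim["object_rot"],      # current object pose
        sim["goal_pos"], sim["goal_rot"],          # target object pose
        sim["fingertip_pos"].ravel(),              # fingertip positions
        sim["palm_pos"],
    ])

def critic_observation(sim):
    """The critic additionally receives privileged information that is only
    available in simulation; this is acceptable because the critic is used
    during training only."""
    return np.concatenate([
        actor_observation(sim),
        sim["joint_pos"], sim["joint_vel"],        # hand joint state
        sim["object_vel"],                         # object velocity
    ])
```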

To ensure that the simulation is as accurate as possible, the authors avoided using sensors that are too hard to simulate accurately, such as the tactile sensors included in the Shadow hand. These sensors measure the pressure of a fluid contained in elastic containers, which varies as a result of the pressure caused by physical contacts.

Moreover, to promote the development of robust solutions, the authors perturbed the environmental conditions during training with correlated and uncorrelated noise, i.e., respectively, with variations that are selected randomly at the beginning of each evaluation episode and then kept constant for the length of the episode, and with variations that are selected randomly at every step. More specifically they perturbed: (i) the state of the sensors and of the actuators with correlated and uncorrelated noise, (ii) the refresh rate of sensors and actuators with uncorrelated noise, (iii) the position, orientation, and velocity of the cubic object and of the body parts of the robot with uncorrelated noise, (iv) friction and other physical parameters with correlated noise, and (v) the position of the cameras, the light conditions, and the texture of all the objects in the scene with correlated noise.
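
The distinction between correlated and uncorrelated noise can be sketched as follows (scales and names are illustrative assumptions).

```python
import numpy as np

class DomainRandomizer:
    """Illustrative sketch of correlated vs. uncorrelated perturbations.

    Correlated noise is sampled once at the beginning of the episode and kept
    constant for its whole duration (e.g. friction, camera pose); uncorrelated
    noise is re-sampled at every step (e.g. sensor readings).
    """
    def __init__(self, corr_scale=0.1, uncorr_scale=0.02, seed=None):
        self.rng = np.random.default_rng(seed)
        self.corr_scale = corr_scale
        self.uncorr_scale = uncorr_scale
        self.episode_offset = None

    def reset(self, n_sensors):
        # correlated component: one random offset per episode
        self.episode_offset = self.rng.normal(0.0, self.corr_scale, n_sensors)

    def perturb(self, observation):
        # uncorrelated component: fresh noise at every step
        step_noise = self.rng.normal(0.0, self.uncorr_scale, observation.shape)
        return observation + self.episode_offset + step_noise
```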

In summary, to obtain solutions capable of crossing the reality gap one should thus: (i) make the simulation as accurate as possible, avoiding sensors and physical properties that cannot be simulated accurately, (ii) expose the robots to variable environmental conditions during the adaptive process, (iii) perturb the sensors, the actuators, and the physical properties of the simulation with correlated and uncorrelated noise, and (iv) restrict the information that is available only in simulation to components, such as the critic network, that are not needed after training.

These actions can permit crossing the reality gap without necessarily using high-fidelity and computationally expensive simulations. For example, they can allow porting vision-based flying robots trained with low-quality 3D rendered images to real environments (Sadeghi and Levine, 2017).

7.6 Environmental variations, adaptive complexity and local minima

Solving an adaptive problem in varying environmental conditions is usually significantly more complex than solving it in non-varying or less varying conditions. Moreover, the wider the range of variations, the higher the complexity of the adaptive problem. The range of environmental variations is thus a crucial part of the problem faced by an adaptive robot.

This aspect clearly shows in simulation experiments, where the variability of the environmental conditions can be arbitrarily reduced and eventually eliminated. For example, the time required to evolve or train a robot to solve a certain problem in simulation is generally much lower when the robot always starts from the same identical position and orientation than when the initial position and orientation of the robot vary. The first condition leads to brittle solutions that are simpler and consequently easier to discover. The latter condition leads to solutions that are robust with respect to the initial conditions and that are usually more complex.

Another domain where this aspect manifests itself is competitive games. Learning to defeat a single opponent is generally much easier than learning to defeat a varied set of opponents.

The range of environmental variations also influences the probability that the evolutionary or learning process remains stuck in local minima. The smaller the variations of the environmental conditions, the higher the risk of remaining stuck in local minima.

Overall, this implies that introducing a sufficient amount of variation in the environmental conditions is essential to ensure the development of effective solutions.

7.7 Learn how

Familiarize yourself with the PyBullet 3D dynamics simulator and with the locomotion problems by reading section 13.11 and by doing Exercise 9.



8. Swarm Robotics

8.1 Introduction

In the previous chapters we discussed problems involving a single robot. In this chapter, instead, we will consider problems concerning multiple robots or robotic swarms, i.e. groups composed of several individuals. By interacting appropriately, multiple robots can achieve goals which isolated individuals would be unable to achieve (Dorigo, Birattari & Brambilla, 2014).

Multiple cooperating robots can display the following desirable properties (Camazine et al., 2001): robustness, i.e. the ability of the swarm to keep operating despite the failure or the loss of some of its members; scalability, i.e. the ability to operate with groups of different sizes; and flexibility, i.e. the ability to cope with a variety of environments and task variations.

For these reasons, swarm robotics is particularly interesting for dangerous applications such as demining or search and rescue, where the risk of losing robots is high.

The behavior of the individuals forming a swarm is not organized by a central entity that dictates instructions to individuals. Rather, it arises from the complex nonlinear dynamics of local interactions occurring in a distributed and decentralized way.

8.2 Reynold’s boids

Pioneering research in this area was conducted by Craig Reynolds (1987) who designed swarms of simple agents, called boids, capable of producing flocking behaviors analogous to those exhibited by bird flocks and fish schools.

Each boid has an independent brain and observes the position and orientation of nearby boids, i.e. those boids located within the maximum Euclidean distance indicated by the grey circle in Figure 8.1. Consequently, boids react only to nearby flockmates. Each boid is provided with an actuator that permits varying its orientation and moves at a constant speed.

Figure 8.1. Exemplification of the control rules used by Boids. Boids are represented by isosceles triangles oriented in the direction of the corner opposite to the base. The grey circle contains the flockmates that can be perceived by the black boid located at the center. The red arrows represent the steering behavior generated by the separation, alignment, and cohesion rules on the basis of the position and orientation of nearby boids, shown in blue. Distant boids, shown in white, are ignored. Each boid steers in the direction of the resultant of three vectors calculated with the three corresponding rules. (Adapted from https://www.red3d.com/cwr/boids/)

Each boid turns in the direction of the resultant of the three steering vectors calculated by the following rules: (i) separation: steer away from flockmates that are too close, so as to avoid crowding; (ii) alignment: steer toward the average heading of nearby flockmates; (iii) cohesion: steer toward the average position of nearby flockmates.

A large collection of boids operating on the basis of these three simple rules produces a flocking behavior similar to that displayed by flocks of birds (Video 8.1).
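
A minimal 2D sketch of these rules is shown below; the radii, weights, and the constant-speed update are illustrative choices, not Reynolds’ original parameters.

```python
import numpy as np

def boids_step(pos, vel, radius=2.0, sep_dist=0.5,
               w_sep=1.5, w_ali=1.0, w_coh=1.0, speed=0.1):
    """One update of a 2D boids swarm.

    pos, vel: arrays of shape (n, 2); each boid reacts only to flockmates
    closer than `radius` and steers in the direction of the resultant of
    the separation, alignment, and cohesion vectors.
    """
    new_vel = vel.copy()
    for i in range(len(pos)):
        d = np.linalg.norm(pos - pos[i], axis=1)
        mates = (d < radius) & (d > 0)
        if not mates.any():
            continue
        sep = np.sum(pos[i] - pos[mates & (d < sep_dist)], axis=0)  # move away from close mates
        ali = vel[mates].mean(axis=0) - vel[i]                      # match average heading
        coh = pos[mates].mean(axis=0) - pos[i]                      # move toward center of mass
        steer = w_sep * sep + w_ali * ali + w_coh * coh
        new_vel[i] = vel[i] + steer
        new_vel[i] *= speed / (np.linalg.norm(new_vel[i]) + 1e-8)   # constant speed
    return pos + new_vel, new_vel
```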

Video 8.1. The behavior produced by a swarm of boids (https://www.youtube.com/watch?v=nbbd5uby0sY, by Gavin Wood, see also https://github.com/GavWood/DOGE-Game-Engine)

Some characteristics of the overall behavior exhibited by the swarm can be tuned by varying the relative strength of the steering forces produced by the separation, alignment, and cohesion rules. For example, reducing the relative strength of the cohesion rule increases the probability that the swarm divides into multiple flocks that re-join into a single flock later on.

The behavior of the swarm can also be enriched by adding further steering rules. For example, the addition of an obstacle-avoidance rule and of a navigation rule, which consist respectively in steering away from nearby obstacles and toward the target destination, produces a flocking behavior in which the swarm of boids avoids obstacles and moves in a coordinated manner toward the target (Video 8.2).

Video 8.2. The boids approach a moving target in an environment including an obstacle (https://www.youtube.com/watch?v=GUkjC-69vaw, by Janne Karhu, see also https://github.com/GavWood/DOGE-Game-Engine)

Since the 1992 movie “Batman Returns”, directed by Tim Burton, the method has been commonly used to animate collective behaviors in animation movies.

8.3 Self-organization

The flocking behavior described above also illustrates another interesting property of collective systems: self-organization. We can define self-organization as the spontaneous formation of spatial, temporal, or spatiotemporal structures or functions in systems composed of many interacting elements. The structure (e.g. the flock) emerges from local interactions among the individuals without the intervention of external directing influences. Moreover, the emergent structure is typically robust with respect to external perturbations and capable of adapting spontaneously to environmental variations.

Self-organization arises as a result of multiple interactions among agents, fluctuations, and positive and negative feedbacks (Bonabeau, Dorigo & Theraulaz, 1999). Multiple interactions among agents are an obvious requirement. Fluctuations are necessary to explore new states or configurations. Positive and negative feedbacks are necessary to modulate fluctuations, and to promote coordination among individuals. More specifically, positive feedback plays the role of amplifying a trend that originates from a minor random deviation and grows as a result of a sort of snowball effect. Negative feedback, on the other hand, plays the role of damping and/or containing the amplification of the deviation. Overall fluctuations regulated by positive and negative feedbacks enable the swarm to explore new configurations by abandoning undesirable states and by prolonging desirable ones.

Self-organized behaviors are ubiquitous in social biological systems. Video 8.3, for example, shows a living bridge that emerges spontaneously from the behaviors of a colony of ants moving back and forth between a food source and a nest. The living bridge, which is constituted by ants supporting the passage of other ants, creates a convenient shortcut. The bridge is progressively transposed and elongated so as to minimize the travelling distance between the nest and the foraging area (Reid et al., 2015).

Video 8.3. Living bridge built by army ants to create a shortcut (https://www.youtube.com/watch?v=zXzYaf1mbFc, by Christopher R. Reid, Matthew J. Lutz, Simon Garnier, SwarmLab, New Jersey Institute of Technology, USA; Reid et al., 2015)

8.4 Self-organized path formation in a swarm of robots

Self-organizing properties are also commonly observed in swarms of robots evolved for the ability to perform some function.

Let’s consider, for example, the experiment illustrated in Figure 8.2, in which a group of 10 e-puck robots is evolved for the ability to forage by finding the nest and the foraging area and by moving back and forth between the two locations (Sperati, Trianni & Nolfi, 2011). The swarm of robots is rewarded with 1.0 point every time a robot enters one of the two areas for the first time or enters one area after visiting the other.

Figure 8.2. (a) The e-puck robot equipped with the colored LED communication turret. (b) A schematic representation of the field of view of the camera of the robot.  The white and grey circles indicate the positions of the LEDs which can be turned on in blue and red, respectively. (c) The neural network architecture. (d) Schematization of the environment. The two gray circles represent the areas. The distance between them is selected randomly within [70, 150] cm at the beginning of evaluation episodes (Sperati, Trianni & Nolfi, 2011)

The robots include infrared sensors, ground sensors, a color camera, two motorized wheels, and a ring of LEDs. In this experiment the frontal and rear LEDs can be turned on in blue and red, respectively. The camera has a view range of 144 degrees and can detect objects up to approximately 35 cm, due to its limited resolution and to the noise added to the state of the simulated sensors (Figure 8.2b).

Each robot has a neural network with 13 sensory neurons, 3 internal neurons, and 4 motor neurons (Figure 8.2c). The sensory neurons encode the state of 8 infrared sensors, 1 ground sensor, and 4 visual sensors encoding the intensity of red and blue colors perceived on the frontal-left and frontal-right portions of the robot’s visual field. The motor neurons encode the desired speed of the robot’s wheels and whether the blue and red LEDs should be turned on or off. The colored light emitted by the LEDs can be perceived by the robots through their camera and can be used by the robots to coordinate.
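
A possible sketch of such a controller is shown below. The text specifies the number of sensory, internal, and motor neurons; whether the internal neurons are recurrent and which activation functions are used are assumptions of this sketch.

```python
import numpy as np

class SwarmController:
    """Per-robot controller sketch: 13 sensors -> 3 internal neurons ->
    4 motors (two wheel speeds and two LED on/off commands).
    `params` is a flat vector of 67 weights and biases."""
    def __init__(self, params):
        n_in, n_hid, n_out = 13, 3, 4
        i = 0
        self.w_ih = params[i:i + n_hid * n_in].reshape(n_hid, n_in); i += n_hid * n_in
        self.w_hh = params[i:i + n_hid * n_hid].reshape(n_hid, n_hid); i += n_hid * n_hid
        self.b_h = params[i:i + n_hid]; i += n_hid
        self.w_ho = params[i:i + n_out * n_hid].reshape(n_out, n_hid); i += n_out * n_hid
        self.b_o = params[i:i + n_out]
        self.h = np.zeros(n_hid)                 # internal (assumed recurrent) state

    def step(self, sensors):
        self.h = np.tanh(self.w_ih @ sensors + self.w_hh @ self.h + self.b_h)
        out = 1.0 / (1.0 + np.exp(-(self.w_ho @ self.h + self.b_o)))   # logistic outputs
        wheels = out[:2] * 2.0 - 1.0             # desired wheel speeds in [-1, 1]
        leds = out[2:] > 0.5                     # blue / red LED on or off
        return wheels, leds
```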

The genotype of the evolving robots includes the connection weights and the biases of a single neural network controller that is duplicated N times. The N networks are then embedded in the N corresponding robots forming the swarm. This implies that the swarm is homogeneous, i.e. it is formed by identical individuals. As discussed in Section 9.2, the usage of homogeneous robots eliminates the conflict of interest among individuals and favors the development of cooperative behavior.

The areas are marked by the grey color of the ground and by a red beacon located at their center. Since no explicit map of the environment is available, and since the sensory range of the robot is limited, the robots should explore the environment to find the areas.

The analysis of the evolved robots indicates that the swarm solves the problem by creating dynamic lanes that are shaped, generated, and maintained by the robots themselves. These lanes enable the robots to move back and forth between the nest and the foraging area while minimizing the travel path.

Video 8.4 shows the behavior of a swarm composed of 10 robots. Video 8.5, instead, shows a post-evaluation in which the evolved network is embedded in 50 robots and in which the two areas are farther apart.

Video 8.4. The behavior of a swarm composed of 10 robots (https://youtu.be/XwOO4fBvtWc, Sperati, Trianni & Nolfi, 2011).

At the beginning of the evaluation episodes, when the robots are placed in randomly selected positions and orientations and the dynamic lanes are not yet formed, the robots move straight toward nearby target areas (if present), avoid collisions with other robots (if present), and move along semicircular trajectories in all other cases. Once they reach a target area, they invert their direction of motion by producing a U-turn. Moreover, the robots keep their frontal blue LEDs always on and dodge to the left when they perceive a blue light ahead.

Video 8.5. The behavior of a swarm composed of 50 robots (https://youtu.be/2ELCmcZuOEs, Sperati, Trianni & Nolfi, 2011).

The execution of these behaviors leads to the formation of robot lanes in which the first robot keeps producing the behavior described above while the other robots follow it in an ordered queue. The creation of these linear formations enables the robots in the queue to start moving in the direction of a target area as soon as the first robot of the line perceives it and, consequently, before they can perceive the area with their own sensors.

The behavior displayed by groups of robots moving in linear formations, combined with the production of the U-turn behavior around target areas, leads to the formation of double lanes moving in opposite directions along an ellipse-like trajectory with one of its focal points located in one of the target areas.

At the beginning these lanes are unstable and can fall apart after a while, as a result of the interference caused by robot–robot avoidance behaviors. However, new, more stable lanes are created later.

The progression toward the formation of a single lane connecting the two areas is ensured by the inherent instability of the dynamic lanes passing around a single target area and by the stability of those passing around both target areas. Moreover, this progression is also favored by the fact that the portion of the lane passing around a target area is more stable than the opposite portion when the latter does not yet pass around the other target area. This implies that a portion of the lane keeps exploring the environment until it finds the other area.

Once a dynamic lane rotating around the two target areas is formed, the problem is substantially solved. The solution, however, is further optimized by progressively eliminating unnecessary deviations from the shortest path and by equalizing the relative distances among robots so as to reduce the interference caused by obstacle-avoidance behaviors.

These dynamic lanes are clear examples of self-organizing spatiotemporal structures originating from the interaction of several robots operating on the basis of simple rules. Indeed, such spatiotemporal structures: (i) self-generate and self-maintain, (ii) are robust with respect to small environmental perturbations and to variations in the number of robots, and (iii) are scalable, e.g. they permit swarms composed of many robots to find and navigate between target areas that would be too hard for swarms composed of fewer individuals.

The formation of lanes moving in opposite directions in these experiments is analogous to the spontaneous formation of lanes in pedestrian flows studied by Helbing and collaborators (Helbing, 1991; Helbing et al., 2005, Video 8.6).

Video 8.6. Lane formation in pedestrian counter-flows (https://www.youtube.com/watch?v=e2WfvJXB__8, Helbing, 1991; Helbing et al., 1995).

Helbing and collaborators modelled moving pedestrians as agents operating with two simple rules: (i) move in the direction of your destination at a certain preferred velocity, and (ii) steer and/or slow down to avoid collisions. The interactions among several agents and their environment mediated by these rules generate an emergent property: well-defined lanes moving in opposite directions. These lanes enable the agents to flow in an ordered manner without the need to enforce such behavior with signs or other external interventions. The crowd appears equipped with a collective intelligence able to identify and follow the optimal path. The foraging robots and the groups of pedestrian agents thus constitute two examples of collective behaviors that emerge spontaneously from the interactions among single agents without the need for a centralized coordination mechanism.

8.5 Specialization

In some cases, the efficacy of robotic swarms can be improved through the division of labor. This implies that two or more subsets of individuals perform different complementary tasks that are functional to the realization of the overall objective of the swarm (Ferrante et al., 2015).

An example that illustrates how a functional specialization of this type can emerge in an experiment involving evolving robots is reported in Pagliuca & Nolfi (2019). The experiment involves a swarm of simulated MarXbots (Bonani et al., 2010). The task consists in collecting invisible food elements distributed in the environment and in periodically releasing the collected food into the nest, i.e. the circular gray area (Video 8.7). The swarm of robots is rewarded with 1.0 point for each food element collected in the environment and released in the nest. The nest cannot be detected through the camera, i.e. the robots can only detect whether they are located over the nest through their ground sensors.

The robots are provided with infrared sensors, ground sensors, a color camera, two actuated wheels, and frontal red and rear blue LEDs which can be turned on or off.

Video 8.7. The behavior of an evolved swarmbot in the case of a replication that converged over a non-specialized solution (https://youtu.be/aYbSvskVkE8, Pagliuca & Nolfi, 2019).

Video 8.7 shows the behavior displayed by an evolved swarmbot in the case of a replication that converged on a non-specialized solution. As can be seen, the robots display the ability to form lanes centered over the nest. These lanes enable the swarm of robots to explore the environment, to collect food, and to periodically return to the nest to release the collected food. More specifically, the robots systematically explore the portions of the environment surrounding the nest by producing ellipsoid-shaped lanes rotating around the nest.

Other replications of the experiment, instead, converge on specialized solutions. For example, the replication shown in Video 8.8 converges on a solution in which the swarm divides itself into two sub-groups: a larger group collecting food and a smaller group (eventually formed by a single robot) marking the position of the nest. We can refer to the robots belonging to the former and to the latter group as food collectors and nest markers.

Video 8.8. The behavior of an evolved swarmbot in the case of a replication that converged over a specialized solution (https://youtu.be/on44SWBFkW0, Pagliuca & Nolfi, 2019).

The larger group of food collectors consists of individuals moving independently in different directions to collect food. The smaller group of nest markers consists of individuals that remain in the nest to mark its position for the food collectors, which need to periodically return to the nest.

The role of the individual robots can change dynamically during evaluation episodes, since food collectors returning to the nest can become nest markers and nest markers can exit from the nest and become food collectors. The transition between the two roles is regulated so as to ensure that the number of robots playing the role of nest markers remains small.

8.6 Coordinated Locomotion in Self-Assembling Swarm-Bots

Baldassarre et al. (2006, 2007) studied how a swarmbot formed by physically assembled robots could evolve the capability to move in a coordinated manner. The robots are composed of a chassis and a turret that can rotate along the vertical axis (Figure 8.3). The chassis includes a track with two motorized wheels, light sensors, and a force sensor that detects the direction and the intensity of the force that the turret of the robot exerts on the chassis. The turret includes a gripper that enables the robots to attach to their peers. When assembled, the robots can perceive through this force sensor whether and how the other robots are pushing or pulling them. The robots can self-assemble to carry out missions that could not be performed by isolated robots, such as passing over a gap larger than a single robot. Once assembled, however, they need to coordinate to start moving together and to compensate for the misalignments originating during motion.

Fig 8.3. A swarmbot composed of 8 assembled robots (Baldassarre et al., 2007)

A group of four robots assembled in a linear structure was evolved for the ability to move in the direction of the light, if present, or in any direction otherwise (Video 8.9). The fitness consists of the distance travelled by the swarmbot toward the light, in the former case, and of the distance between the original and the final position of the swarmbot, in the latter case. As in the experiments reported in the previous sections, the swarm is homogeneous (i.e. the genotype encodes the parameters of a single neural network policy that is replicated four times and embedded into the four corresponding robots).

Video 8.9. The behavior displayed by a swarmbot evolved for the ability to locomote and, when possible, to move in the direction of a light source. The white lines over each robot indicate the direction and the intensity of the force exerted by the turret of the robot on the chassis (https://youtu.be/PDhPZdFuE_0, Baldassarre, Parisi & Nolfi, 2006)

As can be seen, the evolved robots manage to negotiate a common direction of motion, start moving along that direction, and compensate for the misalignments originating during motion. The direction of motion corresponds to the direction of the light, when present.

Interestingly, the ability to coordinate acquired by these robots is robust with respect to the topology of their ensemble, to the number of robots, and to whether the robots are assembled through rigid or flexible links (i.e. whether or not the gripper permits a limited degree of relative motion). Moreover, the evolved robots, post-evaluated in new environmental conditions, spontaneously display useful additional behaviors. We can indicate this property with the term behavioral generalization. The term generalization indicates the ability of the robots to react appropriately to new observations. The term behavioral generalization refers to the ability to react to new environmental conditions by producing novel behaviors that are appropriate for the new conditions.

This is illustrated by Video 8.10, which shows the behavior of 8 robots assembled in a circular topology through non-rigid links and situated in a maze-like environment. The robots are provided with the policy network evolved for four robots assembled into a linear structure situated in the empty environment shown in Video 8.9. This means that these robots never experienced obstacles before. As can be seen, the robots keep producing coordinated motion and coordinated light approaching in the new situation. In addition, they display an ability to collectively avoid obstacles, explore different portions of the environment, and rearrange their shape so as to fit through narrow passages. The alternation of these behaviors allows the swarmbot to reach a distant light target, initially occluded by obstacles, through the exhibition of several different behaviors appropriately alternated over time.

Video 8.10. The behavior produced by eight robots assembled into a circular structure in a maze environment including walls and cylindrical objects (represented by gray lines and circles). The robots start in the central portion of the maze and reach the light target located in the bottom-left side of the environment by exhibiting a combination of coordinated-movement behaviors, collective obstacle-avoidance, and collective light-approaching behaviors. (https://youtu.be/YtDhkx6sNkE, Baldassarre et al., 2006)

The ability of the robots to generalize to new conditions is also illustrated in Video 8.11, which shows the behavior produced by the same evolved neural network policy embedded in the physical robots. As can be seen, the swarmbots evolved in simulation transfer to hardware and maintain their ability to display coordinated motion regardless of the topology with which the robots are assembled and of the irregularity of the terrain.

Video 8.11. The behavior produced in the physical environment by swarmbots formed by 4 or 8 robots assembled in different topologies and located over flat and irregular surfaces (https://youtu.be/L0Nl8gTShzU, Baldassarre et al., 2007)

The overall behavior produced in the maze-like environment (Video 8.10) originates from the combination and alternation of several lower-level behaviors. Some of these behaviors correspond to those shown in the empty environment. Others originate from the combination of the control rules which permit the production of these behaviors and the new environmental conditions. Still others originate from the interaction and the alternation of the former behaviors over time.

Fig. 8.4. Schematization of the behavior exhibited by the swarmbot in the maze-like environment. The central portion of the figure represents the brains and the bodies of four robots (shown in green and pink) situated in an environment (shown in yellow). The external portion indicates the behaviors exhibited by the swarm of robots. The behaviors shown in red and light red are rewarded directly or indirectly by the fitness function. The behaviors indicated in black originate as a result of the interaction between the control rules responsible for the generation of the rewarded behaviors and the new environmental conditions, or as a result of the interaction between lower-level behaviors (Nolfi, 2009)

Indeed, if we analyze the robots in the maze-like environment, we can see how their behavior can be characterized at four different levels of organization (Figure 8.4). At the first level we can identify the following four elementary behaviors:

The three control rules enabling the production of these four behaviors also permit the production of the other behaviors described below. More specifically, the combination and the alternation of these four elementary behaviors produce the following higher-level behaviors, which generally extend over longer time spans:

The combination of and the interaction between these behaviors produce the following higher-level behaviors, which generally extend over still longer time spans:

The combination of all these behaviors leads to the overall behavior of the swarmbot:

It is worth noticing that only the behaviors shown in red in Figure 8.4 were rewarded during the evolutionary process. The behaviors shown in light red were rewarded indirectly, since they are instrumental to the production of the behaviors shown in red. All the other behaviors were not rewarded, either directly or indirectly. They emerge as a result of the interaction between the control rules responsible for the production of the behaviors shown in red and the new environmental conditions, or as a result of the interaction between lower-level behaviors.

This analysis demonstrates how adaptive robots tend to spontaneously display behaviors with a multi-level and multi-scale organization. Moreover, it demonstrates how adaptive robots are capable of displaying not only the behaviors that have been exhibited and rewarded during the adaptive process but also additional potential behaviors that did not manifest themselves in the environmental conditions experienced during the adaptive process. The control rules allowing the production of the rewarded behaviors can produce additional behaviors in new environmental conditions, enabling the robots to handle such new conditions immediately and appropriately, without the need to adapt to them.

8.7 Swarm of robotic boats operating in a natural environment

Duarte et al. (2016) demonstrated how a swarm of robotic boats can operate successfully in a natural environment. The robotic boats include differential drive motors, wireless communication, a compass, and GPS. The data collected by these sensors are used to update the state of a set of sensory neurons encoding the angle and position of a target destination and the distance of the other robots and of obstacles (in the frontal-left, frontal-right, rear-left, and rear-right directions). The motor neurons encode the desired translational and rotational speed of the robotic boats.

The swarm of robots was evolved in simulation for the ability to perform: (i) a homing behavior consisting in navigating toward a given target destination while avoiding obstacles, (ii) a disperse behavior consisting in moving while maintaining a predefined target distance from the nearest robot, (iii) a clustering behavior consisting in approaching the other robots by possibly forming a single cluster, and (iv) a monitoring behavior consisting in maximizing the number of portions of the environment visited recently.

The evolution of robust solutions that could be transferred successfully from the simulated to the real environment was promoted by adding uncorrelated noise to sensors and actuators in simulation and by evaluating the evolving agents for multiple episodes carried out in variable environmental conditions.

The evolved swarms were successfully post-evaluated in a semi-enclosed water area of 300 x 190 m located in Lisbon, Portugal, for the ability to sample the water temperature over an area of interest. This was realized by manually dividing the mission into the following five sequential phases with predetermined time durations: (i) collectively navigate from the base station to the center of the area of interest (homing), (ii) disperse, (iii) monitor the area of interest, (iv) aggregate by using the clustering behavior, and (v) return to the base station (homing).
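
The hand-designed sequencing of the mission can be sketched as a simple time-based schedule; the phase durations below are illustrative placeholders, not the values used in the experiment.

```python
# Mission phases and their (hypothetical) durations in seconds.
PHASES = [("homing", 600), ("disperse", 300), ("monitor", 1800),
          ("cluster", 300), ("homing", 600)]

def current_behavior(t):
    """Return the pre-programmed behavior active at mission time t (seconds)."""
    elapsed = 0
    for name, duration in PHASES:
        elapsed += duration
        if t < elapsed:
            return name
    return PHASES[-1][0]   # after the schedule ends, keep homing to the base
```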

8.8 Learn how

Learn how to implement a new AI-Gym environment based on the PyBullet 3D dynamics simulator from scratch by reading section 13.12 and by doing Exercise 10.



9. Communication and Language

9.1 Introduction

In this chapter we will see how groups of evolving robots can invent a communication system from scratch and use it to coordinate and cooperate. The evolved communication systems can include symbols, i.e. conventional signs analogous to the vocalizations or gestures used by humans and by some animals to indicate an entity or an event to another individual or to solicit an individual to perform an action. Moreover, we will see how a robot can acquire the ability to comprehend a restricted subset of a human language and use it to cooperate with a human user.

Disembodied neural networks can be trained to produce and/or comprehend human language. For example, they can be trained to translate text from one language to another or to describe the content of pictures or videos by producing sentences in human language. The level of linguistic competence that these systems can acquire, however, is constrained by the limits of their interaction modalities. A translation system, which experiences and produces written text only, would not be able, for example, to ground (Harnad, 1990) the meaning of the word dog in the perceptual features of dogs. Similarly, a disembodied system trained to describe images in natural language would not be able to ground the meaning of the word chair in the proprioceptive and tactile information that can be experienced by sitting on a chair. A rich meaning of this kind, analogous to the meaning that we humans associate with the word chair, can only be acquired by a system that is able not only to visually observe chairs but also to physically interact with them. For embodied systems of this type a chair is, above all, an object on which you can sit, i.e. something that supports your body when you sit on it and relax the muscles of your legs. Indeed, tree stumps or rigid boxes, which afford a sitting behavior, can be appropriately conceptualized as chairs despite lacking the visual features typical of canonical chairs. Overall, this implies that robots capable of interacting with the external environment can potentially acquire deeper linguistic competences than disembodied systems.

For this reason, in this chapter we will discuss not only how adaptive robots can develop communication skills but also how robots can develop integrated behavioral and communication skills. Moreover, we will analyze how the development of communication skills affects the development of behavioral skills and vice versa. For a review of research studying the evolution of communication and language in robots and the usage of robots to model natural evolution see Nolfi & Mirolli (2010) and Cangelosi & Parisi (2012).

In addition, we will use the experiments described in this chapter to better clarify the factors that promote the evolution of cooperative behaviors in groups of robots and the importance of the multi-level and multi-scale organization of behavior and cognition in robots.

9.2 Evolutionary conditions for the emergence of communication

Communication can evolve spontaneously in groups of robots that are rewarded for the ability to carry out tasks requiring coordination and/or cooperation and that have the possibility to alter the perceptual environment of the other robots. This can be illustrated by a simple experiment in which groups of 10 MarXBot robots (Bonani et al., 2010) are evolved for the ability to forage by approaching and remaining near the food source and by avoiding the poison source (Video 9.1; Floreano et al., 2007; Mitri, Floreano & Keller, 2009). The food and poison sources correspond to the red objects located in the bottom-left and in the top-right part of Video 9.1.

Video 9.1. The behavior of a group of robots evolved for the ability to forage. The red objects located on the bottom-left and top-right of the environment represent the food and poison sources, respectively. The robots can communicate by setting the color of their LEDs in green or blue. (https://www.youtube.com/watch?v=K3s5uh6-sXo, Mitri, Floreano & Keller, 2009)

The need to cooperate originates from the need to discriminate the food and poison sources from a distance. They can be discriminated by individual robots located over them on the basis of the color of the ground, but they are visually identical from a distance. The possibility to alter the perceptual environment of the other robots is ensured by the fact that the robots have a ring of LEDs which can emit green or blue light.

The robots are provided with a neural network controller with 12 sensory neurons, which encode the color of the ground and the intensity of the green and blue light perceived over the frontal-left, frontal-right, rear-left, and rear-right sectors of the omnidirectional camera. Moreover, they have 3 motor neurons that control the desired speed of the differential drive system and the color of the LEDs.

The robots are rewarded and punished with +1 and -1 for every time step spent close to the food and poison sources, respectively. They are not rewarded for varying the color of their LEDs or for varying their motor behavior as a function of the color signals that they perceive. In other words, the robots do not receive any reward for producing signals, for learning to react to signals appropriately, or for producing the behaviors that are necessary to carry out the task. This, however, does not prevent the evolution of the required communicative and behavioral skills. The required abilities are discovered by the adapting robots since they are instrumental to approaching the food and avoiding the poison.

Video 9.1 shows the behavior of the robots at the end of an evolutionary process. As you can see, the robots manage to solve the task by displaying suitable motor behaviors, by varying the color of their LEDs in a meaningful way, and by reacting to the perceived color signals appropriately. In particular, in the case shown in the video, the robots turn their LEDs blue near the food source and react to the perception of blue robots by approaching them. This enables the robots to infer the location of the food and to navigate toward it.

In other replications of the experiment, the robots turn their LEDs green near the food source and blue far from it. This happens since choosing the blue or the green color to communicate the position of the food is clearly arbitrary and functionally equivalent. In other experiments, instead, the robots assume one color near the poison source and a different color elsewhere. This communication strategy is less efficient than the previous one. Indeed, some of the runs that first discover the second communication strategy later switch to the first one.

As illustrated by the authors, however, an effective communication system of this kind evolves only when the group of robots is homogeneous and/or when the fitness is computed at the level of the group (Figure 9.1).

The evolution of a collective behavior can be realized in three possible ways. The first possibility consists in creating the group of robots from a single genotype and in computing the fitness of such genotype on the basis of the performance of the group (Figure 9.1, top). In this case the group is homogeneous, i.e. it is formed by individuals sharing the same characteristics, and the fitness is computed at group level, i.e. the fitness of the genotype is the sum of the fitness of the robots forming the group. The second possibility consists in creating the group of N individuals from N different genotypes and in attributing to each genotype the sum of the fitness of the robots forming the group (Figure 9.1, center). In this case the group is heterogeneous and the fitness is computed at group level. The third possibility consists in creating the group of N individuals from N different genotypes and in computing the fitness of each genotype on the basis of the performance achieved by the corresponding individual robot (Figure 9.1, bottom). In this case the group is heterogeneous and the fitness is computed at individual level.

Figure 9.1. Illustration of the three methods which can be used to evolve collective behaviors. The Figure exemplifies the case in which the population is formed by 8 genotypes and the group is composed of 3 individuals. The top part of the figure shows the case in which the group of robots is homogeneous and the fitness is computed at group level. The central part shows the case in which the group is heterogeneous and the fitness is computed at group level. The bottom part shows the case in which the group of robots is heterogeneous and the fitness is computed at individual level.
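
The three schemes can be summarized as follows; evaluate_group() is a hypothetical helper that runs one episode and returns the fitness obtained by each robot of the group.

```python
def homogeneous_group(genotype, evaluate_group, n_robots=3):
    """Homogeneous group, group-level fitness: one genotype is cloned into
    every robot and receives the sum of the individual scores."""
    scores = evaluate_group([genotype] * n_robots)
    return sum(scores)

def heterogeneous_group(genotypes, evaluate_group):
    """Heterogeneous group, group-level fitness: every genotype of the
    group receives the same summed score."""
    scores = evaluate_group(genotypes)
    return [sum(scores)] * len(genotypes)

def heterogeneous_individual(genotypes, evaluate_group):
    """Heterogeneous group, individual-level fitness: each genotype
    receives only the score obtained by its own robot."""
    return list(evaluate_group(genotypes))
```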

Effective communication emerges only in the first two cases, in which the group is homogeneous and/or the fitness is computed at group level. In the third case, in which the group is heterogeneous and the fitness is computed at individual level, there is a conflict of interest between the individuals composing the group. Consequently, the robots do not tend to evolve cooperative communication skills. On the contrary, in the third case the robots evolve deceptive communication, i.e. they learn to exploit the tendency of the other robots to approach or avoid a color by generating light signals that solicit the execution of the wrong behaviors (approaching the poison and avoiding the food). More precisely, these experiments lead to a continuous alternation of poor deceptive communication phases followed by poor cooperative communication phases (Mitri et al., 2009). Both phases are unstable, since the development of deceptive communication leads to the retention, in successive generations, of variations that eliminate the production of inappropriate behaviors in response to the deceptive signals. Moreover, the unintended production of signals that can be used to infer the position of the food or poison from a distance creates the adaptive conditions for the development of deceptive communication, i.e. for the retention, in successive generations, of variations producing deceptive signals.

9.3 Evolution and complexification of behavioral and communicative skills

We can now consider a slightly more complex experimental setting that permits observing how a communication system originates and evolves and how the development of communication abilities facilitates the development of behavioral abilities and vice versa.

The experiment (De Greef & Nolfi, 2009) involves two ePuck robots (Mondada et al., 2009) situated in an environment containing a foraging area and a nest. The robots are rewarded for the ability to find the two areas and to navigate back and forth between them (see Video 9.2). As in the previous experiment, the need to cooperate originates from the fact that the target areas can be perceived only in their proximity, by means of the color of the ground.

Video 9.2. The behavior displayed by the evolved robots in a first replication of the experiment. The top and bottom videos show the behavior in simulation and in hardware, respectively. The bottom video also includes the sound produced by the robots. (De Greef and Nolfi, 2009, https://youtu.be/VqpkH_dEWTY, https://youtu.be/6iL2I6jlI1E)

The neural network policy of the robots includes 2 motor neurons, encoding the desired speed of the two actuated wheels, and 1 motor neuron that controls the frequency of the sound emitted by the speaker of the robot.   

From the sensory point of view the robots are provided with: 8 infrared sensors distributed around the body allowing them to detect the presence of nearby obstacles, 2 ground sensors detecting the color of the ground under the robot, a sound sensor detecting the frequency of the sound produced by the other robot, and 3 visual sensors detecting the intensity of the red light perceived in the left, frontal, and right sector of the camera. The red color provides information on the relative positions of the other robot since the robots are surrounded with a stripe of red paper. Due to the limited capability of the ePuck robot processor, the sound is transmitted and received through the wireless connection. 

Overall, this implies that a robot can affect the perception of the other robot by moving in the environment (so as to alter the visual image perceived by the other robot's camera) and by emitting sounds of varying frequency (so as to alter the state of the other robot's sound sensor).

The robots are rewarded with 1 point every time they are concurrently located in the two areas, either for the first time or after exchanging areas, i.e. after the robot located in the white area has moved to the black area and vice versa. The group of robots is homogeneous and the fitness is computed at the group level. As in the previous experiment, the robots are not rewarded for communicating or for producing any specific behavior.

Video 9.2 (top) shows an example of evolved behavior in simulation. The robots are initially placed in random positions with random orientations outside the target areas. At the beginning of the evaluation episodes they explore the environment by avoiding the walls and moving straight until they find the black and white areas. Once they have reached the two areas, they coordinate their behavior so as to navigate toward each other and, consequently, toward the area previously occupied by the other robot.

Video 9.2 (bottom) shows the behavior of the same robots in hardware and includes an audio with the sounds produced. The robots generate three classes of sounds with different frequencies in the black, white, and neutral areas. The frequency of the sounds also varies slightly inside the black and white areas as a function of the time passed inside the area.

Video 9.3. The behavior displayed by the evolved robots in a second replication of the experiment. The top and bottom videos show the behavior in simulation and in hardware, respectively. The bottom video also includes the sound produced by the robots. (De Greef and Nolfi, 2010, https://youtu.be/Ox3r0A-RdZo, https://youtu.be/RUzC07wNSZg)

Video 9.3 shows the behavior obtained in another replication of the experiment. As in the previous case, the robots avoid obstacles and move straight outside the areas in order to explore the environment and find the target areas. In this replication, however, the robots move along the border of the areas after entering them. Moreover, they exchange their positions in two phases: first, the robot located in the white area exits and navigates toward the center of the black area; then, the robot located in the black area exits and navigates toward the center of the white area. Notice how the robots manage to navigate toward the center of the areas even though they cannot perceive the areas from a distance and cannot perceive their centers. In this second replication, the robots emit sounds with only two well-differentiated frequencies (Video 9.3, bottom). More specifically, they produce a low-frequency sound when they are located inside the black area and a high-frequency sound outside it. Differently from the solution shown in Video 9.2, the frequency of the sound produced by the robots remains almost constant, i.e. it does not depend on the time spent inside the areas. These two solutions are representative of the solutions found in the other replications of the experiment.

We can now analyze how the solution shown in Video 9.3 develops during the evolutionary process (see Figure 9.2).

The analysis of the behavior exhibited by the robots during the first generations shows that, at this stage, they display two simple behaviors consisting in moving forward and avoiding obstacles, produced when the infrared sensors are off and on, respectively. The combination of these two behaviors allows the robots to explore the environment and to occasionally receive a 1-point reward when the two robots happen to cross the two areas concurrently.

After a few more generations, the robots develop a new behavior consisting in turning to one side when the ground sensor detects the black area. This produces a remain-in-the-black-area behavior realized by moving along a circular trajectory inside the area. The new behavior increases the probability of being rewarded, since the reward is collected as soon as the other robot enters the white area.

Figure 9.2. Behavioral and communication capabilities of the evolving robots. The left side of the figure shows the behavioral and communication skills. The height represents the evolutionary phase in which each skill evolved: the behaviors developed during the first and last generations are shown at the bottom and at the top, respectively. Behaviors shown in black emerge from the combination over time of two lower-level behaviors; the black arrows indicate the lower-level behaviors from which they originate. The blue arrows indicate behaviors that are re-used to produce a higher-level behavior or that support the production of higher-level behaviors. The right part of the figure indicates (in blue) the conditions that trigger the execution of the corresponding behavior. In the case of behaviors emerging from the combination of lower-level behaviors (indicated in black), the conditions correspond to the ones triggering the lower-level behaviors from which they emerge. The underlined text refers to behaviors and conditions that were modified during the evolutionary process (see text).

The development of this new behavior naturally leads to the production of two differentiated sound signals: (i) a high-frequency sound produced when the robot moves straight outside the black area, and (ii) a low-frequency sound produced when the robot moves in circles inside the black area. This happens because the neural network policy tends to spontaneously produce different signals during the production of different behaviors (a phenomenon referred to as the production bias, see Mirolli and Parisi, 2008). The state of the internal neurons, which influences both the motor neurons controlling the wheels and the motor neuron controlling the sound, is necessarily different during the production of the two behaviors. Consequently, the sound produced also tends to differ during the production of the two behaviors.

The production of two different sounds inside and outside the black area does not have any utility initially, since the robots are not yet able to react to these sounds appropriately. However, it creates the adaptive conditions for the development of the following motor behaviors afforded by the signals: a remain-in-the-white-area behavior, an exit-from-the-white-area behavior, and an exit-from-the-black-area behavior, triggered by the signals produced by the other robot (see Figure 9.2).

The development of these three additional behaviors enables the robots to solve their adaptive task relatively well at an intermediate stage of the evolutionary process. They explore the environment until they find the black or white area, remain in the area in which they are located (unless they are both located in the same area) until the other robot enters the other target area, and then exit from the areas. However, the robots are not yet able to navigate directly to the other target area: after exiting an area, they simply resume their exploration behavior.

The problem of navigating directly toward the other area is challenging for robots which do not perceive the areas from a distance. On the other hand, the problem can be solved by re-using and adapting the behavioral and communicative skills developed in previous generations.

The first modification introduced for this purpose consists in exiting from the white area only when the robot visually perceives the other robot. The condition that triggers the exit-from-the-white-area behavior thus becomes the perception of the low-frequency signal combined with the visual perception of the other robot. This permits the robot to exit and navigate toward the other robot, which is located in the other area. The new strategy permits reaching the destination most of the time but fails occasionally, when the other robot is located on the border of the black area. Once again, however, the development of this new skill creates the conditions for the development of further skills. The relative position assumed by the robot located in the black area now matters in an evolutionary sense, since it determines the direction taken by the robot exiting from the white area. This leads to the second adaptive modification, which alters the way in which the robots remain in the black area. They now do so by moving along the border of the area and by stopping on one side when they visually perceive the other robot (see Video 9.3). This enables the robot located in the black area to stay on the left side of the area with respect to the other robot. This, in turn, enables the robot exiting from the white area to navigate toward the center of the black area.

The development of the move-on-the-border-and-look-robot behavior, obtained by modifying the remain-on-the-black area behavior developed in previous generations, also solves the problem of navigating directly from the black area to the white area. This happens since the execution of the move-on-the-border-and-look-robot behavior implies that the robot located in the black area remains oriented toward the center of the white area. In other words, the execution of this behavior prepares the robot located in the black area to exit toward the center of the white area.

Overall, the development of all these behaviors leads to the optimal solutions shown in Video 9.3.

There are several aspects that are worth noticing in this experiment. The first is that evolution produces a progressive expansion of the behavioral and communication skills of the robots. This is due to the fact that progress is often achieved through the development of new skills and to the fact that old skills tend to be preserved.

The tendency to preserve old skills can be explained by considering that they tend to assume new functions in addition to the one they were originally developed for. These additional functions derive from the fact that old skills often support new functional behaviors developed after them. As an example, consider the remain-on-black-area behavior and the associated production of low-frequency sounds in the black area. This behavior not only increases the chance that the robots are rewarded once, i.e. its original function. It also supports the remain-in-white-area, the exit-from-white-area, and the exit-from-black-area behaviors developed later, which permit obtaining multiple rewards during an evaluation episode. This implies that the probability of retaining variations that eliminate the remain-on-black-area behavior is extremely low. The only variations that can be retained are those that vary the way the behavior is realized while preserving its functions. This is the case of the variations leading to the development of the move-along-the-border-and-look-at-the-other-robot behavior, which preserves the original function (remaining in the black area) and acquires a new one (enabling the other robot to infer the position of the center of the black area).

A second important aspect is that the expansion of the behavioral and communicative skills of the robots increases their evolvability, i.e. their propensity to further improve their skills. This is due to the fact that new skills can often be produced by re-using old skills. Consequently, the larger the behavioral repertoire of the robots, the greater the chance that they will develop new skills. This positive feedback can potentially produce a snowball effect leading to a progressive acceleration of the growth of the robots' adaptive capabilities.

The third aspect to notice is the strict interdependence between behavioral and communication skills. As we have seen, the development of new signals often supports the development of new behaviors and the latter, in turn, might support the development of new signals. This produces solutions with strongly integrated behavioral and communicative skills, in which the development of new behavioral and communication abilities support each other, producing a form of mutual scaffolding (see also Cangelosi et al., 2010).

9.4 Compositionality and behavior generalization

Robots can also acquire the ability to comprehend and/or produce human language and use it to cooperate with a human agent. For instance, a robot can be trained to interact with a human in order to satisfy his or her requests expressed in natural language.

An example of how this can be realized is illustrated in an experiment reported by Tuci et al. (2011) in which an iCub humanoid robot (Metta et al., 2008) was trained for the ability to comprehend and accomplish natural language requests such as “touch the red object” or “grasp the blue object”.

Video 9.4. The behavior of the robot at the end of the training process. The top, central, and bottom videos show the behavior produced by the robot in response to the language commands “indicate red”, “touch yellow” and “grasp red”, respectively. (Tuci et al., 2011; https://youtu.be/HcW6_RZa4hc, https://youtu.be/i9aVlQH1SKQ, https://youtu.be/zCcCIqQEfG0).

As shown in Video 9.4, the robot is situated in front of a table containing three objects of different colors. The neural network policy of the robot receives as input the sensory information extracted from the robot’s camera, the proprioceptive information encoding the current positions of the joints of the left arm/hand, the tactile information collected by contact sensors located on the fingers and on the palm of the hand, and the command expressed in human language. The command is composed of an action word, selected from the set [“indicate”, “touch”, “grasp”], and an object word, selected from the set [“red object”, “green object”, “blue object”]. For the sake of simplicity, the words forming the linguistic commands are not recognized from speech but are encoded directly in the state of six input neurons, one for each of the six words. The motor neurons of the policy network encode the desired positions of the 7 DOF of the left arm, the 2 DOF of the head, the 2 DOF of the torso, and 2 DOF that control the extension/flexion of the thumb and the other four fingers.
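A plausible encoding of the two-word command into the six command input neurons is sketched below. This is an illustration, not the encoding used by Tuci et al. (2011): the word lists and the function name are assumptions.

```python
# Minimal sketch: encode a two-word command as a "two-hot" vector over the six
# command input neurons (three action words followed by three object words).
import numpy as np

ACTION_WORDS = ["indicate", "touch", "grasp"]
OBJECT_WORDS = ["red", "green", "blue"]

def encode_command(action_word, object_word):
    """Return a 6-dimensional vector with one active action unit and one active object unit."""
    command = np.zeros(6)
    command[ACTION_WORDS.index(action_word)] = 1.0
    command[3 + OBJECT_WORDS.index(object_word)] = 1.0
    return command

# Example: the command "grasp the red object"
print(encode_command("grasp", "red"))   # -> [0. 0. 1. 1. 0. 0.]
```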

The 3 action and the 3 object words can be combined to form 9 different commands. However, to verify the ability of the robots to generalize, the authors trained the robots with 7 commands only and then post-evaluated the trained robots with all the 9 commands. In addition, to verify whether the composition of the behavior set was relevant, the authors replicated the experiment in a control condition in which the action word “indicate” was substituted with “avoid”.

The connection weights of the policy network are encoded in a population of genotypes and evolved. During training, the robots were evaluated for 7 episodes experiencing the 7 different commands. The robots were rewarded for the ability to produce a behavior appropriate to the current command, i.e. for the ability to bring the hand over the right object in case of indicate commands, for the ability to touch the right object in case of touch commands, and for the ability to grasp the right object in case of grasp commands. In the control experiment, the robots were rewarded for moving the hand away from the object in case of avoid commands. In other words, the robots were rewarded for the ability to produce seven different behaviors and for the ability to comprehend the meaning of the sentences by executing the appropriate behavior.

The evolved robots manage to understand the 7 commands experienced during training by exhibiting 7 corresponding behaviors in all replications of the experiment (see Video 9.4 for an example of the behaviors produced). Moreover, in some of the replications, the robots demonstrate the ability to understand also the two commands that were not used during training by producing the appropriate corresponding behaviors. In other words, the robots display an ability to comprehend the meaning of the individual words and to combine these meanings in a compositional manner. This ability to generalize is never observed in the control experiments performed with the “avoid” action word.

The analysis of the solutions discovered by the robots in the different replications of the experiment shows that the robots which generalize solve the task by developing a set of elementary behaviors and by combining these elementary behaviors to produce the 9 different required behaviors.

More specifically, the robots which understand all 9 commands display the ability to produce the following 5 elementary behaviors: bringing the hand over the red object, bringing the hand over the green object, bringing the hand over the blue object, touching the object located below the hand, and grasping the object located below the hand.

The object word triggers the appropriate indicate behavior, which takes the hand of the robot over the appropriate object. Then the touch or grasp words trigger the corresponding behaviors. The indicate word does not trigger any additional behavior.

The reason why the adaptive process discovers combinatorial solutions of this type in the standard experimental condition is that these solutions require simpler control rules than non-modular solutions, which would treat each of the 7 language commands independently.

The reason why combinatorial solutions do not emerge in the control condition with the “avoid” action word is that in this case the level of overlap between the required behaviors is smaller. Consequently, the advantage that can be gained by generating the 7 required behaviors through the recombination of a smaller set of elementary behaviors is also smaller.

This experiment extends the work of Sugita & Tani (2005) who demonstrated for the first time the possibility to acquire compositional and integrated behavioral and language comprehension capabilities in robots.

9.5 On the complex system nature of robots’ behavior

Before concluding, let’s discuss some of the implications of the complex-system nature of robots’ behavior (Nolfi, 2009). The behavior exhibited by a robot often displays a modular organization with somewhat semi-discrete and semi-dissociable sub-behaviors playing specific functions. The organization of these behaviors extends over multiple levels, since higher-level behaviors lasting for longer time periods often originate from sequences of lower-level behaviors lasting for shorter time periods (Figure 9.3).

Figure 9.3. Schematization of the complex-system organization of behavior. The behavior of the robot, shown in the white disk, is the result of the high-frequency interactions between the brain of the robot, the body of the robot, and the environment (represented by the green, pink, and yellow disks). More specifically, the interactions between these three constituting elements produce a set of elementary behaviors, shown in the inner part of the white disk. The combination and alternation of these behaviors over time produce the higher-level behaviors shown in the central portion of the white disk. Finally, the combination and alternation of these higher-level behaviors produce the overall behavior of the robot, shown in the outer white disk. The properties at higher levels of organization originate from multiple non-linear interactions among the properties at lower levels of organization. The system is characterized both by bottom-up and top-down effects, since the high-level behaviors alter the relation between the robot and the environment and/or the environment itself. Consequently, the execution of high-level behaviors alters the future interactions between the three constituting elements, determining the future behavior of the system.

This multi-level and multi-scale organization has several advantages:

9.6 Learn how

Read Section 13.13 and do Exercise 11 to learn how to implement back-propagation.



10. The Neural Basis of Cognition

10.1 Introduction

The term cognition does not have a shared and uncontroversial definition. Some use it to indicate only the high-level faculties that characterize human intelligence, such as planning, language, imagination and consciousness. Others include additional faculties such as perception, memory and selective attention. Still others extend the meaning of the term to all the processes that regulate the interaction between the agent and the environment in a way that is instrumental to the achievement of the agent’s goals (Bourgine and Stewart, 2006).

According to the broadest definition, the examples reviewed in the previous chapters already involve cognitive robots. In the last part of this book, however, we will focus on complex cognitive faculties. More specifically, in this Chapter we will analyze the neural mechanisms and processes that can support the development of complex cognitive faculties. We will describe examples of cognitive robots, i.e. robots displaying high-level cognitive faculties, in the next Chapter.

10.2 Feature extraction

As detailed in the previous chapters, the role of the robot’s brain consists in varying the state of the actuators, on the basis of the current and possibly previous observations, in order to orchestrate the dynamics originating from the interaction between the robot and the environment. The goal of such orchestration is to ensure the achievement of the robot’s goal/s.

When the observations directly encode the features that can be used to determine the state of the actuators, the robot can solve the problem by connecting the sensory neurons directly to the motor neurons. This is the case, for example, of the Braitenberg vehicles illustrated in Chapter 2. In the other cases, instead, the required features must either be extracted from the sensory neurons by internal neurons or be generated by performing actions that permit to later experience observations including the required features.

Figure 10.1. Illustration of a hypothetical brain including two sensory neurons (s1 and s2) and a binary motor neuron (m1). The vertical and horizontal axes delimit the two-dimensional sensory space. The neural networks shown on the right of each sensory space represent the minimal networks that can perform the OR and XOR functions (left and right, respectively).

To illustrate the role of internal feature extraction, consider a minimal robot that should regulate the state of its single binary motor on the basis of the state of its two binary sensors. If the motor is on when the first or the second sensor is active, and is off otherwise, the brain of the robot performs an OR function (Figure 10.1, left). In this case, the state of each sensor can be used to determine the state of the motor independently from the state of the other sensor. In other words, the state of sensor s1 can be used to determine the state of motor m1, independently from the state of s2 and vice versa. Consequently, the sensor s1 can be connected directly to the motor m1 with a positive connection weight. Similarly, the sensor s2 can be wired directly to the motor m1 with a positive weight. From a geometrical point of view, problems of this kind are called linearly separable since the observations that should produce different actions can be separated with a single line (Figure 10.1, left).

Instead, if the actuator is on when the first sensor is active and the second sensor is not active or vice versa, and is off otherwise, the brain of the robot performs an XOR function. In this case, the state of each sensor cannot be used to determine the state of the motor independently from the state of the other sensor. Consequently, the sensors cannot be connected directly to the motor. From a geometrical point of view, the observations which produce different actions cannot be separated with a single line. They can be separated with two or more lines, e.g. a line that separates the observation [0, 0] from the other observations and a line that separates the observation [1, 1] from the other observations (see Figure 10.1, right). The problem requires a neural network including at least two internal neurons which separate the two sub-areas of the sensory space. The features extracted by these two internal neurons can then be used to determine directly the state of the motor, i.e. they can be connected directly to the motor neuron.
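As a concrete illustration, the sketch below (Python/NumPy, with hand-set weights chosen only for illustration) wires the two minimal networks of Figure 10.1: the OR network connects the sensors directly to the motor, while the XOR network needs two internal neurons that separate the two sub-areas of the sensory space.

```python
import numpy as np

def step(x):
    return (x > 0).astype(float)   # binary threshold activation

observations = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# OR: each sensor can be wired directly to the motor with a positive weight.
def or_motor(obs):
    w = np.array([1.0, 1.0])
    return step(obs @ w - 0.5)

# XOR: two internal neurons first extract the features "at least one sensor on"
# and "not both sensors on"; the motor fires only when both features are active.
def xor_motor(obs):
    W_hidden = np.array([[1.0, -1.0],
                         [1.0, -1.0]])      # columns: weights of the two internal neurons
    b_hidden = np.array([-0.5, 1.5])
    hidden = step(obs @ W_hidden + b_hidden)
    return step(hidden @ np.array([1.0, 1.0]) - 1.5)

print(or_motor(observations))    # -> [0. 1. 1. 1.]
print(xor_motor(observations))   # -> [0. 1. 1. 0.]
```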

Unfortunately, there are no precise rules for determining the appropriate number of neurons and layers. We can only use simple heuristics: the more complex the problem, the larger the number of neurons and layers should be; similarly, the larger the observation and action vectors, the larger the number of neurons and layers required.

In the next Chapter we will see how the extraction of useful features can be promoted through self-supervised learning.

For systems that are not embodied and situated, internal layers represent the only way to extract features. Embodied and situated systems, instead, can also extract features through actions, i.e. they can perform actions that enable them to later experience the required features in their future observations. We saw several examples of this kind of feature extraction in the previous chapters.

A first example is the humanoid robot evolved for the ability to discriminate spherical and ellipsoid objects on the basis of tactile information, illustrated in Section 4.5. The robot solves the problem by manipulating the objects so as to converge on two types of hand/object postures: a first type of posture after interacting with spheres and a second type after interacting with ellipsoids. In other words, the manipulation behavior permits to later experience observations which include the features required to discriminate the category of the object. A second example is constituted by the robot evolved for the ability to navigate the double T-maze described in Section 4.7. In this case, the way the robots respond to the perception of the green beacons ensures that they will later experience observations including the feature required to decide which way to turn at the T-junctions. A third example is constituted by the move-along-the-border-and-look-at-the-other-robot behavior displayed by the robots reviewed in Section 9.3. The execution of this behavior enables the companion robot to experience observations including the required feature, i.e. the direction of the target destination.

Clearly, features extracted through actions can be combined with features extracted internally. This is the case of the robots described in Sections 4.5 and 4.7, which combine features extracted externally with features extracted internally by integrating observations over time. The possibility to extract information from the variation of observations over time brings us to the topic of the next Section.

10.3 Feature extraction over time

In some cases, the relevant features cannot be extracted from the current observation but can be extracted from the way observations vary over time. A straightforward example is the extraction of velocity information from position sensors. Another example is the extraction of distance information from optic flow: while the robot moves forward, the rate of variation of the pixels located in a certain portion of the visual field provides a reliable indication of the proximity of the objects located in the corresponding direction. A third example concerns the anticipation of future states, for example the expected position, after a certain time period, of a target moving with a certain speed and direction.

The extraction of features from multiple observations experienced over time requires recurrent neural networks. In these networks internal neurons receive signals from other internal neurons that have not yet been updated in the current time step and consequently encode information about previous states. This property enables such networks to preserve into their internal states features extracted from previous observations.

Recurrent neural networks come in different varieties. Here we restrict our analysis to the two varieties that are particularly interesting for robotics: Continuous Time Recurrent Neural networks (CTRNN) and Long Short-Term Memory networks (LSTM).

CTRNN (Beer, 1995) use ordinary differential equations to update the activation of neurons. More specifically, the variation of the activation of a neuron i with action potential yi corresponds to:

$$\tau_i \frac{dy_i}{dt} = -y_i + \sum_j W_{ij}\,\sigma(y_j + \Theta_j)$$

where τi is the time constant of the post-synaptic neuron, yi is the activation of the post-synaptic neuron, dyi/dt is its rate of change, Wij are the connection weights between the pre-synaptic neurons and the post-synaptic neuron, σ() is the logistic activation function, and Θj are the biases of the pre-synaptic neurons. The retention of information over time, therefore, is also ensured by the fact that CTRNN neurons partially preserve their previous state. The rate of preservation, and consequently the maximum rate of variation, depends on the time constant parameter.
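A minimal discrete-time (Euler) implementation of this update rule might look as follows; the integration step dt and the example parameters are arbitrary choices, not part of the original formulation.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def ctrnn_step(y, W, theta, tau, dt=0.05):
    """One Euler step of tau_i * dy_i/dt = -y_i + sum_j W_ij * sigma(y_j + theta_j)."""
    dydt = (-y + W @ logistic(y + theta)) / tau
    return y + dt * dydt

# Example: three fully connected neurons with different time constants.
rng = np.random.default_rng(0)
y = np.zeros(3)
W = rng.normal(size=(3, 3))
theta = rng.normal(size=3)
tau = np.array([0.5, 1.0, 2.0])    # larger tau -> slower, more persistent neuron
for _ in range(200):
    y = ctrnn_step(y, W, theta, tau)
```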

LSTM (Hochreiter and Schmidhuber, 1997; Gers and Schmidhuber, 2001) use internal neurons constituted by complex units called cells. Each cell includes three logistic neurons and one hyperbolic tangent neuron and performs additive and multiplicative operations (Figure 10.2). The activation of the four neurons is determined conventionally on the basis of the input vector (it), of the incoming weights of each neuron (Wf, Wi, Wo, and Wc), and of the activation functions of the neurons. The three gating neurons (f, i and o) use the logistic activation function, which returns a value in the range [0.0, 1.0]. The cell neuron, instead, uses the hyperbolic tangent function, which returns a value in the range [-1.0, 1.0]. The state of the cell is set equal to the state of the cell at time t-1 multiplied by the activation of the forget gate neuron (f), summed with the activation of the cell neuron multiplied by the activation of the input gate neuron (i). This implies that the forget gate neuron controls to what extent the old state of the unit is preserved and that the input gate neuron controls to what extent the new input alters the state of the cell. The output of the cell is calculated by multiplying the new state of the cell, squashed through a hyperbolic tangent function, by the activation of the output gate neuron. This implies that the output gate controls to what extent the internal state of the unit influences the post-synaptic neurons.

Figure 10.2. A schematization of an LSTM cell. The dotted circle represents the cell neuron, which uses the hyperbolic tangent activation function. The full circles represent the three neurons that play the role of forget (f), input (i) and output (o) gates and use the logistic activation function. The bottom rectangles indicate the incoming connection weights of the four neurons. The top arrow represents the state of the unit over time, i.e. at time t-1 and time t. The bottom line indicates the input vector (it), which is multiplied by the four weight vectors (Wf, Wi, Wc and Wo) to generate the netinput of the four corresponding neurons. The black boxes indicate multiplicative (x), additive (+), and hyperbolic tangent (tanh) operations. Finally, the bottom-right arrow indicates the output of the cell (ot).
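The computation described above can be condensed into a few lines of NumPy. The sketch follows the simplified cell of Figure 10.2, so it omits the biases and the recurrent input from the previous cell output that most LSTM implementations also include; the names are illustrative assumptions.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, c_prev, Wf, Wi, Wo, Wc):
    f = logistic(Wf @ x_t)            # forget gate: how much of the old cell state is preserved
    i = logistic(Wi @ x_t)            # input gate: how much the new input alters the cell state
    o = logistic(Wo @ x_t)            # output gate: how much the cell state influences the output
    c_candidate = np.tanh(Wc @ x_t)   # candidate state computed by the tanh cell neuron
    c_t = f * c_prev + i * c_candidate
    output = o * np.tanh(c_t)
    return output, c_t

# Example with a 4-dimensional input and a 2-unit cell layer
rng = np.random.default_rng(0)
Wf, Wi, Wo, Wc = (rng.normal(size=(2, 4)) for _ in range(4))
output, c = lstm_cell_step(x_t=rng.normal(size=4), c_prev=np.zeros(2),
                           Wf=Wf, Wi=Wi, Wo=Wo, Wc=Wc)
```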

Integrating information over time requires performing three interdependent functions: storing relevant states in the activation of the internal neurons, preserving such internal states for the time necessary, and using the internal states to alter the activity of other neurons. The availability of gating neurons specialized for these three functions makes LSTM networks more effective than vanilla recurrent neural networks and CTRNN.

Robots can also integrate information over time and store information externally, in their body or in the relation between their body and the environment. We have seen an example of how a robot can store information in its relation with the environment and later use this information in Section 4.7. Robots can also extract features over time from their own body and store information in their posture. For example, a walking robot can infer information about previously encountered conditions from its current posture, since the posture might reflect the environmental conditions experienced previously. Moreover, a robot can store information for later use in its own posture. For example, a robot can slightly alter the position of its hands in response to a certain environmental stimulus that requires the execution of a specific action later on. The position of the hands is thus used to store information that can later be accessed through proprioceptive sensors.

10.4 Deep learning

The extraction of features, which can then be used to determine actions, is related to the notion of deep learning.

Learning or credit assignment consists in finding the weights that enable the robot to exhibit an effective behavior. Depending on the problem, this might require long causal chains of computational stages, where each stage transforms, usually in a non-linear way, the aggregate activation of the network. Deep learning refers to the possibility to assign credit across many stages of this kind (Schmidhuber, 2015).

Determining in advance whether a specific problem requires a deep or a shallow neural network is not really possible, since problems usually admit a large number of different solutions. Establishing whether a problem is deep would require verifying that none of the possible solutions can be implemented in a shallow network, and identifying all the possible solutions is generally not feasible. Indeed, as seen in previous chapters, certain solutions are hard to imagine from the point of view of a human observer. Determining whether a specific solution is deep or shallow, instead, can be difficult but is generally feasible: it implies determining the number of intermediate transformations performed by the solution network.

For feedforward neural networks that are not situated, i.e. that cannot alter their next observations through actions, the maximum length of the credit assignment path corresponds to the number of internal layers. In other words, un-situated feed-forward neural networks have a maximal depth bounded by the number of internal layers. This is the reason why, for these networks, the term deep is used to indicate networks with many internal layers.

Recurrent neural networks, instead, are unbounded, i.e. they can rely on a large (and potentially infinite) number of subsequent functional elaboration stages, regardless of the number of internal layers.

Situated neural networks are also unbounded, independently of the characteristics of their architecture, since they can alter their next observations through actions. In other words, they can use their actions to transform their observations several times in successive steps. For example, to access the observation x that permits achieving the robot’s goal through action y from state s, an embodied network can produce action 1, which permits to later experience observation 2, which triggers action 2, which permits to later experience observation 3, which triggers action 3, which finally permits experiencing observation x.

Problems requiring deep solutions are challenging.

For reinforcement learning methods that backpropagate a gradient over intermediate layers and/or over time, through recurrent connections, the main challenge is the vanishing/exploding gradient problem (Hochreiter, 1991). The problem, which is caused by cumulative backpropagated error signals that either shrink exponentially or grow out of bounds, can be alleviated by using LSTM networks and suitable activation functions.

Evolutionary methods do not rely on backpropagation and consequently are not affected by the vanishing/exploding gradient problem. They avoid the credit assignment problem by simply retaining the variations that produce useful outcomes, without attempting to reconstruct the causal chain of events that enabled the useful outcome. The main challenge for evolutionary methods is optimizing networks with a huge number of parameters. This problem can be alleviated by using effective stochastic optimizers, such as Adam (Kingma & Ba, 2014).

10.5 Neuromodulation

The response of biological neurons is not only determined by the signals received through incoming connections. It is also influenced by neuromodulators, i.e. substances packaged into vesicles and released by neurons. Neuromodulators permit to dynamically control, in a context-dependent manner, the properties that determine how neurons respond to incoming stimuli. In artificial neural networks, modulation mechanisms of this type can be realized by using special neurons that regulate the output of other neurons. We already encountered an example of regulatory neurons in the LSTM networks described above: within an LSTM cell, the state of the input gate neuron regulates the output of the cell neuron in a multiplicative manner.

Although the study of neuromodulation in adaptive robots is still in an initial exploratory phase, a few works have investigated its utility in problems that require adapting to varying environmental conditions on the fly. Husbands et al. (1998) and Philippides et al. (2005) demonstrated the utility of regulatory “gas” neurons, i.e. neurons diffusing a virtual gas that regulates the response of nearby neurons. Petrosino, Parisi & Nolfi (2013) demonstrated how the availability of logistic modulatory neurons, which regulate the output of associated standard neurons, facilitates the evolution of selective attention skills in evolving robots. Selective attention refers to the ability to respond to a first class of stimuli while ignoring the stimuli belonging to a second class in a certain context, and to respond to the stimuli of the second class while ignoring those of the first class in a different context.

10.6 The architecture of the neural network

Finally, another important factor is the architecture of the robot’s brain, namely the number and type of neurons, the way they are interconnected, and possibly the constraints on the connection weights.

For many problems, feed-forward networks with one or two internal layers constitute a suitable architecture. As mentioned above, there are no precise rules for determining the number of neurons and layers. In general, the larger the dimensionality of the observation and action vectors, the larger the number of internal neurons required; moreover, the greater the complexity of the problem, the greater the number of internal neurons and layers required. The network should include at least the minimal required number of neurons and layers. The usage of networks that are larger than required, on the other hand, does not usually hinder performance.

For problems that necessarily require integrating sensory-motor information over time, i.e. that require internal memory, recurrent neural networks are needed. As discussed in the previous chapters, determining whether a problem necessarily requires a recurrent network is not trivial, since problems admit several different solutions and identifying all possible solutions is not feasible. Consequently, the experimenter should rely on his or her intuition. On the other hand, using an architecture with recurrent connections for a problem that does not require memory does not usually hinder performance.

Problems that require processing visual information benefit from the usage of convolutional neural networks. These networks are formed by a stack of progressively smaller convolutional layers, with shared weights, and pooling layers (Fukushima, 1979; Weng, Ahuja & Huang, 1993). The usage of convolutional layers with shared weights permits extracting the same classes of features from the different spatial portions of the image. Moreover, the hierarchical organization of the architecture facilitates the extraction of progressively higher-level features.
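As an illustration, a small convolutional network for a 64x64 RGB camera image could be defined as follows. This is a PyTorch sketch; the number and size of the layers are arbitrary choices, not those of a specific experiment.

```python
import torch
import torch.nn as nn

vision_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),   # shared weights extract local features
    nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),  # deeper layers extract higher-level features
    nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 6 * 6, 32),                              # compact feature vector for the policy
)

features = vision_net(torch.zeros(1, 3, 64, 64))  # -> tensor of shape (1, 32)
```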

Problems requiring the production of rhythmic behaviors, e.g. walking and swimming, can benefit from the availability of neural oscillators. Neural oscillators are a special type of neuron capable of producing periodic oscillatory outputs also in the absence of rhythmic inputs (see for example Kamimura et al., 2005). Oscillatory neurons of this type play a role analogous to Central Pattern Generators (CPGs), i.e. neural circuits producing self-sustained oscillatory signals that support the production of rhythmic behavior in natural organisms.

A possible strategy to design the architecture of a network consists in creating architectures analogous to those possessed by animals and humans. An example is constituted by the Darwin X robot (Krichmar et al., 2005). The brain of this robot is constituted by a neural network with 90,000 neurons and 1.4 million connections distributed in 50 neural areas. The connectivity within and among areas reproduces the characteristics of 50 corresponding neural areas of the human brain.

The architecture of the network can also be evolved. This can be realized by evolving genotypes that encode directly or indirectly the connectivity among neurons and the connection weights (Stanley & Miikkulainen, 2002; Stanley, D’Ambrosio & Gauci, 2009; Stanley and Gauci, 2010). Alternatively, it can be realized by evolving the architecture through an evolutionary algorithm and by learning the connection weights through a different algorithm (see for example Real et al., 2017).

10.7 Learn how

Read Section 13.14 and do Exercise 12 to learn how to implement a minimal reinforcement learning algorithm.



11. Cognitive Robots

11.1 Introduction

As we have seen in the previous Chapters, robots adapted through evolutionary and reinforcement learning methods can develop behavioral and cognitive skills by relying only on the input provided by their sensors and on a scalar value, calculated automatically through a fitness or reward function, that rates how well they are doing. Robots adapted through learning by demonstration (Sections 6.5 and 9.4) require more information, i.e. a detailed description of the actions to be produced at each step, extracted from a demonstrated behavior.

In this Chapter, we will see how the availability of additional training feedback and/or of additional input information can enhance the adaptive process. By additional training feedback we mean information that can be used to determine how to change the parameters of the adapting robot/s. The next sections will focus in particular on information that can be extracted by the robot itself from its own sensory/internal/motor states and that can be generated without human intervention. By additional input information we mean additional input vectors provided by a human caretaker or produced by another robot. These additional input vectors, which complement the information contained in the observation vector, can consist of affordance and/or goal vectors. Affordance vectors represent the behavior afforded by the current robot/environmental context. Goal vectors represent the state that a robot should attain to achieve its goal.

For an analysis of the relation between the development of behavioral and cognitive skills in humans and robots see Cangelosi & Schlesinger (2015) and Asada & Cangelosi (in press).

11.2 Extracting features through self-supervised learning

Self-supervised learning indicates a form of supervised learning in which the desired output vectors are generated automatically on the basis of the input vectors, without human intervention. 

A first type of self-supervised network is the auto-encoder, i.e. a feed-forward network trained to produce output vectors identical to its input vectors, where the input vectors consist of observations. When the number of neurons forming an internal layer is lower than the size of the input vector, the information contained in the input vector is compressed into a smaller internal vector. The compression permits extracting features that encode the way in which the elements forming the input vector co-vary. Moreover, it permits filtering out noise and generating partially missing information.
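The sketch below (PyTorch; the sizes and training details are illustrative assumptions, not those of a specific experiment) shows the essence of this scheme: the desired output is simply the input observation, so no human-provided labels are needed.

```python
import torch
import torch.nn as nn

obs_size, feature_size = 32, 8          # the internal layer is smaller than the input

encoder = nn.Sequential(nn.Linear(obs_size, feature_size), nn.Tanh())
decoder = nn.Sequential(nn.Linear(feature_size, obs_size))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

observations = torch.rand(256, obs_size)  # stand-in for observations collected by the robot

for epoch in range(100):
    z = encoder(observations)                                     # compressed feature vector
    reconstruction = decoder(z)
    loss = nn.functional.mse_loss(reconstruction, observations)   # desired output = input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```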

A second type is constituted by predictor networks that are trained to produce as output the input vector that they will experience at time t+1. These networks can be used to extract features from input vectors that anticipate future states and to filter out noise.

A third type is constituted by forward-model networks, i.e. networks trained to predict the next observation on the basis of the current observation and of the action that the agent is going to perform. Like predictor networks, these networks extract in their internal neurons information about future states. More specifically, they permit predicting the effect that the actions of the robot have on the robot’s perceived environment.

Finally, a fourth type is constituted by sequence-to-sequence networks, i.e. networks that are trained to produce as output the last n observations experienced during the current and the previous n-1 steps. The network first receives a sequence of n observations, one vector at a time, and then reproduces the same sequence in its output, one vector at a time. These networks extract information on how observations change over time.

Predictors, forward models, and sequence-to-sequence networks are implemented with recurrent neural networks.

The efficacy of adaptive methods can be enhanced by combining a control network, trained with an evolutionary or reinforcement learning algorithm, with one or more feature-extracting networks trained through self-supervised learning (Lange, Riedmiller & Voigtländer, 2012; Mattner, Lange & Riedmiller, 2012; Ha & Schmidhuber, 2018; Milano & Nolfi, 2020). The feature-extracting networks are used to extract useful features from the observations. The control network is used to map the features extracted by the former network/s to appropriate actions.

A notable example is constituted by the work of Ha & Schmidhuber (2018), which combines a control network with an auto-encoder and a forward-model network (Figure 11.1). The model has been applied to the CarRacing-v0 problem (Klimov, 2016), which consists in driving a car on a race track in simulation on the basis of the images collected by a camera located above the track. The training is performed in four phases. The first phase is dedicated to the generation of the training set, which is formed by the observations o and the actions a experienced and performed by the car during 10,000 episodes in which the car moves by selecting random actions. The second phase is used to train the auto-encoder network, which receives in input and reproduces in output the observation vector. This training enables the auto-encoder to extract an abstract compressed representation z of the observations in its internal layer. During the third phase, the forward-model network is trained to predict zt+1 by receiving in input zt and at. The forward model extracts in its internal layer a vector of features h that permits predicting the next compressed representation of the observation. Both the auto-encoder and the forward-model networks are trained on the basis of the training set collected in the first phase. Finally, in the fourth phase, the control network is trained to drive the car so as to maximize the cumulative reward by receiving as input the z and h vectors extracted from the observations by the pre-trained auto-encoder and forward-model networks. The control network is trained through the Covariance Matrix Adaptation Evolution Strategy (CMA-ES, Hansen & Ostermeier, 2001).

Figure 11.1. The architecture of the agent’s brain. o indicates the observation vector. The auto-encoder network (A) receives as input the observation vector (o), produces as output the observation vector, and extracts in its internal neurons the feature vector z. The forward-model network (F) receives as input the feature vector z and the action a that the car is going to perform, produces as output the predicted feature vector z at time t+1, and extracts in its internal neurons the feature vector h. The control network (C) receives as input the feature vectors z and h and produces as output the action vector a. Adapted from Ha & Schmidhuber (2018).
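The four phases can be summarized by the following Python skeleton; every helper passed as an argument is a hypothetical placeholder standing for the corresponding training step, not the authors' code.

```python
# Schematic sketch of the four training phases described above.
def train_world_model_agent(env, collect_random_rollouts, train_autoencoder,
                            train_forward_model, evolve_controller):
    # Phase 1: collect observations o and random actions a
    observations, actions = collect_random_rollouts(env, episodes=10_000)

    # Phase 2: train the auto-encoder to reproduce the observations,
    # extracting the compressed representation z in its internal layer
    autoencoder = train_autoencoder(observations)
    z = autoencoder.encode(observations)

    # Phase 3: train the forward model to predict z(t+1) from z(t) and a(t),
    # extracting the feature vector h in its recurrent internal layer
    forward_model = train_forward_model(z, actions)

    # Phase 4: evolve the control network (e.g. with CMA-ES); it receives z and h
    # as input and outputs the action that maximizes the cumulative reward
    return evolve_controller(env, autoencoder, forward_model)
```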

The agents provided with the auto-encoder network outperform the agents lacking feature-extracting networks. Moreover, the agents provided with both an auto-encoder and a forward-model network outperform the agents provided with an auto-encoder network only (see Video 11.1). As shown in the videos, the agent that includes the two feature-extracting networks drives more smoothly and handles sharp corners more effectively than the other agents.

Video 11.1. The behavior produced by typical evolved agents provided with a control network only (top), a control network and an auto-encoder network (center), and a control network, an auto-encoder network, and a forward-model network (bottom). (https://worldmodels.github.io, Ha & Schmidhuber, 2018)

Video 11.2 illustrates the features extracted by the auto-encoder network. The left, central, and right portions of the video show examples of the 64x64 pixel images observed by the camera, the z vectors extracted by a trained auto-encoder network, and the 64x64 pixel images reconstructed by the auto-encoder, respectively. The fact that the reconstructed images include most of the information present in the original images demonstrates that the auto-encoder manages to compress most of the information contained in the 64x64 pixels into a vector composed of 15 values. The second portion of the video shows how the reconstructed image varies when the elements forming the z vector are altered manually. The effects of these variations indicate that some of the elements of the z vector encode abstract properties of the image, such as the presence of a straight path or of a curve ahead.

Video 11.2. Examples of input images, z vectors, and output images produced by a trained auto-encoder network that receives as input a set of typical images. The second part of the video shows the impact of manual modifications of the z vector on the output image reconstructed by the auto-encoder (https://worldmodels.github.io, Ha & Schmidhuber, 2018)

We can exploit the ability of the auto-encoder network to reconstruct the image from the z vector to look inside the brain of the agent and observe the visual perception of the agent while it is driving the car (Video 11.3).

Video 11.3. Comparison of the images received by the auto-encoder in input and of the images reconstructed by the auto-encoder while the car drives on the basis of an evolved control network that receives as input the z and h feature vectors. (https://worldmodels.github.io, Ha & Schmidhuber, 2018).

In a follow-up work, Milano & Nolfi (2020) introduced a method for continuing the training of the feature-extracting network/s during the training of the control network. Moreover, the authors compared the advantage provided by different feature-extracting methods on four problems: the HalfCheetahBullet, Walker2dBullet, BipedalWalkerHardcore, and MIT race car problems. The results demonstrate that all feature-extracting methods provide an advantage with respect to the standard method, in which the control network receives as input unprocessed observations, provided that the training of the feature-extracting networks is continued during the training of the control network. Moreover, the comparison indicates that the best results are obtained with the sequence-to-sequence method.

The necessity to continue the training of the feature-extracting network/s during the training of the control network can be explained by the fact that the problems considered by Milano & Nolfi (2020) involve agents operating on the basis of egocentric information, while the problems considered in the other studies cited above involve agents operating on the basis of allocentric information, i.e. of images collected by a camera detached from the agent that observes both the agent and the environment. Indeed, the difference between the observations experienced by producing random actions and those experienced by producing actions selected to maximize the expected reward is greater for agents that operate on the basis of egocentric information than for agents that operate on the basis of allocentric information.

11.3 Generating observations

The observations generated by a predictor network or by a forward model can be used as a proxy of the actual observation vector when the latter is missing or incomplete.

An example that illustrates how a predictor network can enable a robot to keep operating correctly during phases in which it is temporarily blind is described in Gigliotta, Pezzulo & Nolfi (2011). In this experiment, a pan-tilt camera robot located in front of a panel is evolved for the ability to foveate the different portions of the image located around the center of the panel, moving in the clockwise direction (Figure 11.2). The panel is colored with red and blue colors whose intensity varies linearly along the vertical and horizontal axes.

Figure 11.2. Left: A pan-tilt camera robot located in front of the colored panel. Center: The portions of the image to be foveated are located at the intersection between the white circle and the white lines. Right: the neural network controller of the agent. (Gigliotta, Pezzulo & Nolfi, 2011)

The robot is provided with: (i) two sensors (R and B) that detect the intensity of the red and blue color at the center of the visual field of the camera, (ii) two sensory neurons (R1 and B1) encoding the state of the sensors or the state of two additional internal neurons (H21 and H22) during normal and blind phases, respectively, (iii) a layer of internal neurons with recurrent connections (H1), and (iv) two motor neurons controlling the pan and tilt motors of the camera. A single network is therefore responsible for setting the state of the motors and for generating the states used as a proxy of the missing observation at time t+1 during blind phases.

The connection weights of the neural network are encoded in a vector of parameters and evolved by using a fitness function that rewards the agent for the ability to move the focus of the camera along the circular path in the clockwise direction (Figure 11.2, center). Unlike the models described in the previous section, therefore, the predictor network is not trained with self-supervised learning to minimize the offset between the predicted and the actual state of the sensors. The robot is provided with a single network that includes a prediction module. The connection weights of the entire network, including those of the prediction module, are evolved on the basis of the feedback provided by the fitness function that rates the robot for the ability to foveate along the desired circular trajectory.

The evolved robots show the correct behavior also during blind phases, i.e. when the camera does not provide input data for several steps. As mentioned above, during these phases the state of the sensory neurons is set on the basis of the state of the H21 and H22 internal neurons at time t-1.

Interestingly, the state generated by the predictor network does not match the state that the sensors would assume if the camera were not blinded. The network correctly predicts whether the intensity of red and blue will increase or decrease during the current step as a result of motion, but amplifies the rate of variation over time. These states enable the agent to produce a correct behavior even if the rate of variation is exaggerated. In other words, the predictor network generates states that are functionally equivalent to the missing observations but that differ significantly from them.

Similar results were reported by Freeman, Ha & Metz (2019), who evolved the neural network controller for a swing-up balancing problem. In this work, sensory deprivation was introduced by enabling the agent to access the observation in each step with a probability p and by replacing the state of the sensory neurons with the state of predictor neurons in the remaining cases.

By varying the probability, the authors observed that the agents manage to solve the problem even when p is set to 10%, i.e. when the agent is allowed to gather information from its sensors in only a minority of steps. The comparison of the observation generated by the neural network and the missing observation indicates that, much like in the previously discussed case, the two vectors differ significantly. The observation generated by the network accounts only for the features supporting the production of the effective behavior and neglects other features. After all, this is the function for which the predictor component of the network is rewarded.

Overall, these experiments demonstrate that an agent can acquire the ability to generate states that can be used as a proxy of missing sensory information even without the usage of self-supervision, thanks to the fact that the generation of these states is instrumental to the achievement of the agent’s goal. Moreover, these experiments demonstrate that the states generated by the agents do not necessarily need to account for all the information included in the missing observation. They should only incorporate the features that support the production of the correct behavior.

11.4 World models

Forward models can also be used to completely replace the interaction between the robot and the environment. In other words, they can be used by the agent to stop acting in the real environment and to start acting in an imaginary world simulated through a forward-model network (world model). During these dream-like phases the actuators of the robot are blocked, the state of the sensors is ignored, and the action vector produced by the neural network controller of the robot is given as input to the forward model, which generates the next observation vector and, possibly, the reward (Figure 11.3, right).

World models can be used to plan, i.e. to choose the actions to be executed after having mentally checked the expected outcomes of possible alternative actions, or to adapt mentally in an imagined world, i.e. to develop the skills required to achieve the robot’s goal without interacting with the actual environment. Mental training can be realized by using a standard evolutionary or reinforcement learning algorithm combined with a forward-model network trained through self-supervision. During the training process, the parameters of the control network are varied based on the imagined outcome of the hypothetical behavior of the agent, generated and modelled by the control and forward-model networks without involving the physical environment and the body of the robot (a minimal sketch of an imagined rollout is shown after Figure 11.3).

Figure 11.3. Illustration of embodied and mental training, left and right respectively. In the case of embodied training the control network receives as input an observation vector (o) measured by the sensors of the robot and a reward (r) computed through a reward function. The output vector produced by the control network is used to set the state of the actuators that alter the environment and/or the relation between the robot and the environment. In the case of mental training the next observation (ot+1) and the reward (r) are predicted by a forward-model network trained through self-supervision. The forward model network receives as input the action vector produced by the control network (a) and the observation at time t (o). Notice that in mental training the body of the robot (represented in pink) and the environment (represented in yellow) do not play any role.
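
To make the mechanism concrete, the following minimal sketch shows how an imagined rollout can be generated once a control network and a forward-model network are available. The function names (controller, world_model) and the rollout length are illustrative assumptions, not part of the works cited.

def imagined_rollout(controller, world_model, observation, steps=100):
    """Generate an episode entirely inside the learned world model.

    controller: hypothetical function mapping an observation to an action.
    world_model: hypothetical function mapping (observation, action) to the
    predicted next observation and reward. The physical robot and the
    environment are never involved.
    """
    total_reward = 0.0
    for _ in range(steps):
        action = controller(observation)
        # the next observation and the reward are imagined, not measured
        observation, reward = world_model(observation, action)
        total_reward += reward
    return total_reward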

Mental adaptation presents similarities with adaptation in simulation. Indeed, in both cases the brain of the robot interacts with a virtual environment simulated in a computer rather than with the physical environment. However, adaptation in simulation relies on a manually designed computer program which simulates the robot, the sensors and the actuators, the environment, and their interaction. Mental adaptation, instead, relies on a neural network trained through a self-supervised learning algorithm.

An example of mental adaptation is reported in Ha & Schmidhuber (2018). The experiment involves an agent trained to play the VizDoom game (Kempka et al., 2016). As in the case of the experiment on the CarRacing-v0 problem reported in Section 11.2: (i) the agent includes a control, an auto-encoder and a forward-model network, (ii) the control network receives as input the internal state of the auto-encoder and of the forward-model network, (iii) the auto-encoder is trained to re-generate the observation, and (iv) the forward model is trained to predict the next internal state of the auto-encoder network (see Figure 11.1). Moreover, as in the case of the experiment on the CarRacing-v0 problem, the training is realized in a series of phases dedicated to: (i) collecting the training set during episodes in which the agent moves by performing random actions, (ii) training the auto-encoder network, (iii) training the forward-model network, and (iv) training the control network. However, in the case of this experiment: (i) the forward-model network is trained to generate also the reward (which is +1 when the agent survives until the end of the episode, and -1 when the agent dies), (ii) the control network is trained mentally, without using the game software, and (iii) the trained agent is post-evaluated by using the game software, to verify that the skills acquired mentally generalize to the actual game environment.

The post-evaluation demonstrates that the skills acquired through mental training enable the agent to operate effectively also in the actual game environment.

Clearly, the possibility of successfully transferring mentally trained agents to the actual world depends on the accuracy of the world model and on its capacity to generate sufficiently varied experiences. Ha & Schmidhuber (2018) introduced two characteristics in their model that are crucial in this respect. The first is that the forward model is used to predict an abstract representation of the observation rather than the observation itself, i.e. the z vector rather than the o vector. This increases the accuracy of the predicted state and filters out features that are unpredictable. The second is that the forward-model network produces as output a density function p(z) rather than a deterministic prediction of z. A probabilistic output, combined with a high temperature, increases the variability of the conditions generated by the world model. This mechanism plays a role similar to the addition of noise in simulated environments discussed in Section 7.5.

11.5 Affordance vectors

The development of behavioral and cognitive skills can be facilitated and/or enhanced by the availability of additional input patterns that complement the observations extracted from the sensors of the robot. One kind of such additional input patterns is affordance vectors, which encode the opportunity for the execution of behaviors.

Like any other feature, affordances can be extracted from observations within the internal layers of the robot’s brain (see for example Section 5.7). Alternatively, affordance vectors can be generated by another cooperating robot (as illustrated in Section 9.3). In this section, instead, we will consider situations in which the affordance vectors are provided by a human caretaker.

Clearly, the possibility to receive affordance information directly as input can facilitate the adaptive process, especially in cases in which the relation between affordances and observations is indirect.

An example that illustrates how the availability of affordance patterns can support the development of behavioral skills is provided by the experiment of Nishimoto & Tani (2009), in which a Curio robot (Ishida, 2004) is trained through learning by demonstration to produce three articulated behaviors (see also Yamashita & Tani, 2008). The behaviors consist of: (i) holding a cubic object with the two hands, moving the object up and down and then left and right four times, and finally releasing the object and returning to the initial posture (Video 11.4, first part), (ii) touching the object with both hands three times, holding and moving the object along a circular trajectory three times, and finally releasing the object and returning to the original posture (Video 11.4, second part), and (iii) holding the object, moving the object forward and back three times, touching the object with the left and right hand three times, and returning to the initial posture.

Video 11.4. Behaviors produced by a trained robot that received the first and the second affordance pattern. (https://www.youtube.com/watch?v=czxMZL61MK8, Nishimoto & Tani, 2009).

The robot is trained through a learning by demonstration method realized through kinesthetic teaching and interactive tutoring. More specifically, during the demonstration phase, a human caretaker located behind the robot holds and moves the arms of the robot so as to make it perform the three behaviors several times, while the actuators of the robot are set in a passive compliant mode. This phase is used to collect a training set formed by the observation and action vectors. During the training phase, the neural network controller of the robot is trained to respond to the sequence of observations stored in the training set by producing the associated sequence of action vectors through back-propagation through time (Werbos, 1990). During the post-evaluation phase, the robot is evaluated for the ability to produce the demonstrated behaviors without the assistance of the caretaker. During the demonstration, the object is initially located in three different positions on the left, center, and right of the taskspace. During the post-evaluation, the object is initially placed in random positions selected within the same range used during the demonstration.

As reported by the authors, the post-evaluation phase indicates that the robot is not yet able to produce the requested behaviors. This is due to the fact that minor differences between the actions produced by the robot during post-evaluation and the actions demonstrated by the experimenter accumulate, producing significant differences over time. As a result of these differences, the trained robot fails to hold the object in its hands. This problem was solved by repeating the training process a second time in which: (i) the robot was allowed to control its actuators during the demonstration, and (ii) the experimenter exerted an external force on the arms of the robot during the holding phases to correct the robot’s behavior. In other words, the training was realized by iterating the learning by demonstration process multiple times and by using the successive demonstrations to correct only the aspects of the robot’s behavior that required further adjustment.

Figure 11.4. The architecture of the robot’s brain. The sensory, motor, affordance, and internal neurons are indicated with red, blue, green, and white circles. The state of the affordance neurons is set at the beginning of each episode on the basis of one of the three affordance vectors that corresponds to the behavior to be produced.  (Figure adapted from Nishimoto & Tani, 2009).

The neural network controller of the robot consists of a multi-level recurrent neural network with neurons varying at different time scales: a lower-level sub-network that includes the sensory neurons, the motor neurons, and fast-varying internal neurons, and a higher-level sub-network that includes the neurons encoding the affordance vector and slow-varying internal neurons (Figure 11.4). The time constant of the neurons, which determines their rate of variation, is set manually (see Section 10.3). The two sub-networks, which are connected through a few shared neurons, assume functionally different roles during training. The higher-level sub-network becomes responsible for triggering the production of one of the three behaviors and for triggering the appropriate sequence of elementary behaviors required to produce it. The lower-level sub-network, instead, becomes responsible for the production of the elementary behaviors. This functional subdivision can also be appreciated by inspecting the way slow and fast neurons vary during the production of the behaviors (see Video 11.4).

The three affordance vectors that are used to trigger the execution of the three corresponding behaviors are identical at the beginning of the training process (i.e. they are formed by vectors of identical 0.5 values). However, the values forming each affordance vector are varied during the course of the training in the direction of the error back-propagated from the motor neurons to the affordance neurons. In other words, the content of each affordance vector is varied in a way that facilitates the production of the corresponding demonstrated behavior.

Another example that demonstrates how the availability of affordance vectors facilitates the evolution of effective behaviors is reported in Massera et al. (2010). In this case, a humanoid iCub robot (Metta et al., 2008) was trained in simulation for the ability to reach, grasp and move an object. The authors compared the results obtained in a standard experiment and in an extended experiment in which the robot received from the caretaker a sequence of three affordance vectors indicating whether the current context affords a reach, grasp or move behavior. The observation vector includes the information extracted from the camera located on the robot’s head and from the position sensors located on the joints of the robot. In the case of the extended experiment, the three affordance vectors are received from the caretaker at the beginning of the evaluation episode, when the hand of the robot approaches the object, and when the object has been successfully grasped, respectively. As reported by the authors, the availability of the affordance vectors facilitates the development of the required behaviors.

11.6 Goal vectors

Another type of information that can usefully complement the observation vector is a representation of the robot’s goal, i.e. of the target state that should be attained by the robot. As we have seen, a robot can acquire the ability to achieve goals without possessing or receiving an explicit representation of them. However, the availability of a representation of the goal can facilitate the development of the goal achieving capability.

An example that illustrates the utility of goal representations is the algorithm introduced by Andrychowicz et al. (2017), which permits learning from errors.

To illustrate the idea, let’s consider a robotic arm with 7 DOFs that should learn to move a target object to a given target position indicated by the experimenter. The sensors of the robot detect the current angular position and velocity of the robot's joints and the position of the object. In addition, the robot receives as an input from the experimenter a vector that encodes the target position that the object should assume, i.e. the representation of the goal. The robot is rewarded with 1 and -1 at the end of episodes in which it manages or fails to achieve the goal, respectively.

Since the probability that a robot with random parameters solves this problem is extremely low, it will receive a positive reward only occasionally during the initial phase of the training process. Consequently, the progress produced by learning will be slow.

As we discussed in Section 6.4, this bootstrap problem can be solved by designing a reward function that also scores the robot for the ability to produce behaviors that are instrumental to the achievement of the goal, e.g. that also rewards the robot for the ability to reduce the distance between the position of the object and the goal. On the other hand, shaping the reward function can be challenging and can have unexpected undesirable consequences.

A better solution consists in exploiting the availability of the goal representation to learn from failures. This can be realized by replaying each learning episode a second time, replacing the original goal with the position that the object actually assumed at the end of the episode. This permits creating episodes in which the robot is positively rewarded, since the substituted goal corresponds to the position in which the object was actually moved. In other words, this replay-with-goal-modification technique exploits the knowledge that the actions performed by the agent are inadequate to achieve the intended goal but are adequate to achieve a different goal, i.e. the goal that corresponds to the actual outcome of the episode. For this reason the algorithm has been named hindsight experience replay (HER).
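
The core of the method is a goal-relabeling step that can be sketched as follows. The transition format and the compute_reward() callback are simplifying assumptions used for illustration; they do not reproduce the exact implementation of Andrychowicz et al. (2017).

def relabel_episode(transitions, compute_reward):
    """Create additional training transitions in which the original goal is
    replaced by the position actually reached at the end of the episode.

    Each transition is assumed to be a dict with the keys 'observation',
    'action', 'next_observation', 'goal' and 'achieved' (the position
    actually reached by the object at that step).
    """
    achieved_goal = transitions[-1]['achieved']      # final position of the object
    relabeled = []
    for t in transitions:
        new_t = dict(t)
        new_t['goal'] = achieved_goal                # pretend this was the goal all along
        # with the substituted goal the episode counts as a success
        new_t['reward'] = compute_reward(t['achieved'], achieved_goal)
        relabeled.append(new_t)
    return relabeled

The relabeled transitions are simply added to the replay buffer together with the original transitions and used by the off-policy learning algorithm in the usual way.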

Video 11.5. Behaviors displayed by robots trained without and with hindsight experience replay (left and right respectively) at different stages of the learning process. The red ball indicates the target position (goal). Results obtained on a pushing task, a sliding task, and a pick and place task. The final part of the video shows the behavior obtained by post-evaluating the robot trained in simulation in the physical environment. (https://www.youtube.com/watch?v=Dz_HuzgMxzo, Andrychowicz et al., 2017).

Video 11.5 shows the behaviors of a robot trained with a Deep Deterministic Policy Gradient algorithm (Lillicrap et al., 2015) without and with Hindsight Experience Replay, left and right respectively. As can be seen in the video, the vanilla version of the algorithm keeps failing most of the time while the algorithm with hindsight experience replay quickly converges on effective solutions.

An additional advantage of the HER algorithm is that it implicitly generates an incremental learning process, i.e. a process in which the complexity of the environmental conditions increases as the skills of the learning robot improve. Indeed, the offset between the initial and the target position of the object is small at the beginning, when the robot produces simple actions, and increases later on, when the actions produced by the robot become more complex.

11.7 Learn how

Read Section 13.15 to analyse in detail the Stable Baselines3 implementation of the PPO algorithm. Read Section 13.16 to learn how to train robots with reinforcement learning algorithms through the Stable Baselines tool. Do Exercises 12 and 13.



12. Long-term and Open-ended Learning

12.1 Introduction

In this chapter I will discuss long-term and open-ended learning. Long-term learning indicates adaptive processes that enable the robots to improve and expand their skills over a prolonged period of time. Open-ended learning refers to unbounded adaptive processes that keep producing progress and novelty indefinitely.

An important factor that determines whether progress continues in the long term or not is the adaptive pressure, i.e. the gap between the cumulative reward obtained by the robot at a given stage of the adaptive process and the reward that could be obtained by modifying the parameters of the robot appropriately. A second important factor is adaptability, the probability that variations of the adaptive parameters produce progress. The adaptive pressure and the adaptability should remain sufficiently high during the adaptive process to enable learning. Finally, a third crucial factor is the breadth of the robots’ skills: the greater the behavioral and cognitive repertoire of the robots, the greater the chance that they will manage to further expand and improve their skills.

Clearly, for problems admitting optimal solutions, the adaptive process should not and cannot continue in the long term. It should continue only for the time necessary to discover an optimal solution. However, complex problems often admit the possibility of improving indefinitely, independently of the skill level possessed.

12.2 Incremental learning

An aspect that makes long-term learning difficult is that both the level of adaptability and the level of adaptive pressure depend on the skills of the adaptive robots, which vary during the course of the process. At the beginning of the adaptive process, when the robots do not yet possess any skills, the adaptive pressure is high and the adaptability is low. The high adaptive pressure is due to the fact that the gap between the cumulative reward currently achieved and the reward potentially achievable by the robot is large. The low adaptability is due to the fact that the gap between the current behavior of the robot and the behavior that could enable it to achieve its goal (at least in part) is large and the chances for skill re-use are low. During the successive phases of the adaptive process, the adaptive pressure decreases and the adaptability increases. The fact that both the adaptive pressure and the adaptability vary implies that the attempt to solve a complex problem in stationary conditions can be problematic. This leads to situations in which the adaptive process fails to start producing progress, because the adaptability is too low, or stagnates at a certain point, when the adaptive pressure becomes too low. The problem can be solved by varying the complexity of the task or the environmental conditions experienced by the adaptive agents so as to maintain the adaptability and the adaptive pressure sufficiently high during the entire process.

Incremental learning methods approach the problem by varying the complexity of the problem, i.e. by evaluating the robots for the ability to solve progressively more complex versions of the problem. For example, the evolution of the ability to visually track target objects can be realized by exposing the evolving robots first to large immobile targets, then to small immobile targets, and finally to small moving targets (Harvey, Husbands and Cliff, 1994). Similarly, the evolution of agents selected for the ability to capture prey robots can be organized in a series of subsequent phases in which the speed of the escaping prey and the pursuit delay are progressively increased (Gomez & Miikkulainen, 1997).

An example that illustrates the efficacy of incremental learning in combination with reinforcement learning is reported in the work of Yu, Turk & Liu (2018), which describes humanoid robots trained to walk. In this case the complexification of the problem is realized by initially exposing the learning robot to simplified environmental conditions and by progressively removing the simplification as the skills of the robot increase.

This is realized by using an automated assistant that helps the robot to maintain lateral balance and to move forward. The assistant consists of a proportional-derivative controller that applies a lateral force to the pelvis of the robot along the left-right axis, helping the robot to avoid falling sideways, and a propelling force along the front-rear axis, helping the robot to move forward. The intensity of the forces is modulated by progressively reducing and finally zeroing the damping (Kd) and stiffness (Kp) coefficients. To avoid overfitting, the learning agents are evaluated for multiple episodes using Kd and Kp values extracted randomly from an interval. The center of the interval, which determines the average amount of assistance, is progressively reduced when the performance of the learning robot exceeds a given threshold.
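
The curriculum itself can be sketched as follows; the sampling interval, the decay factor, and the function names are illustrative assumptions rather than the exact schedule used by Yu, Turk & Liu (2018).

import numpy as np

def sample_assistance(center, width=0.2):
    """Draw the stiffness (Kp) and damping (Kd) coefficients of the assistant
    for one evaluation episode from an interval centered on `center`."""
    low = max(center * (1.0 - width), 0.0)
    high = center * (1.0 + width)
    Kp = np.random.uniform(low, high)
    Kd = np.random.uniform(low, high)
    return Kp, Kd

def update_assistance(center, average_performance, threshold, decay=0.8):
    """Reduce the average amount of assistance once the performance of the
    learning robot exceeds the threshold; when center reaches 0 the assistive
    forces vanish and the robot walks without help."""
    if average_performance > threshold:
        center *= decay
    return center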

The advantage provided can be appreciated by comparing typical behaviors obtained with and without incremental learning (Video 12.1, ECL versus PPO conditions). The video also illustrates the advantage obtained by encouraging the learning robots to produce symmetrical behaviors (Video 12.1, MSL conditions; see Yu, Turk & Liu, 2018 for details). The authors use the term curriculum learning to refer to their incremental condition. Instead, I restrict the usage of the term to methods in which the articulation of the learning phases is learned rather than handcrafted in advance by the experimenter (see the next section).

Video 12.1. Typical behaviors obtained by using a standard reinforcement learning algorithm (PPO), an algorithm empowered with curriculum learning (ECL), an algorithm that encourages the development of symmetrical behavior (MSL), and an algorithm that benefits from both curriculum learning and the symmetrical behavior bias (ECL + MSL). (https://www.youtube.com/watch?v=zkH90rU-uew, Yu, Turk & Liu, 2018)

Incremental approaches can be effective but have drawbacks. They require domain knowledge to define the learning phases and they introduce additional hyperparameters, e.g. the transition threshold/s, which can be hard to tune. Moreover, they presuppose that the problem can be divided into sub-problems ordered by difficulty along a single axis, when in reality the problem might vary along multiple axes of difficulty.

12.3 Curriculum learning

Curriculum learning, instead, attempts to keep both the adaptive pressure and the adaptability high by automatically manipulating the learning experiences of the robots.

In the context of evolutionary methods this can be done by extending the evolutionary algorithm with a curriculum learning component (Milano & Nolfi, 2021). This component estimates the difficulty level of the environmental conditions encountered by the evolving robots of recent generations and exploits this information to select environmental conditions with the appropriate level of difficulty. The difficulty level of the environmental conditions is estimated on the basis of the inverse of the relative fitness obtained by the robots recently evaluated in such conditions.

More specifically, the evolving robots are evaluated for η episodes in corresponding environmental conditions. During the first 1/10 of the evolutionary process the environmental conditions are chosen randomly, as in standard evolutionary methods. Then, they are chosen from among η subsets characterized by different levels of difficulty. The subsets are obtained by: (i) normalizing the performance ρ achieved on each environmental condition during the last five evaluations in the range [0.0, 1.0], (ii) calculating η performance intervals on the basis of a difficulty function (see below), and (iii) selecting one environmental condition randomly from each of the η subsets, i.e. selecting η environmental conditions characterized by η different levels of difficulty. The selection procedure is sketched below.
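
The following minimal sketch illustrates the selection step. The power exponent, the fallback branch, and the variable names are illustrative assumptions and do not reproduce the pseudocode of Milano & Nolfi (2021) exactly.

import numpy as np

def select_conditions(perf, nepisodes, exponent=2.0):
    """Select one environmental condition from each of nepisodes subsets
    grouped by difficulty.

    perf is assumed to be an array with the performance recently obtained on
    each available environmental condition (low performance = high difficulty).
    """
    # normalize the performance in the range [0.0, 1.0]
    p = (perf - perf.min()) / (perf.max() - perf.min() + 1e-8)
    # power difficulty function: the intervals covering easy conditions
    # (high normalized performance) are wider than those covering hard ones,
    # so hard conditions end up being selected more often
    bounds = np.linspace(0.0, 1.0, nepisodes + 1) ** exponent
    selected = []
    for i in range(nepisodes):
        subset = np.where((p >= bounds[i]) & (p <= bounds[i + 1]))[0]
        if len(subset) == 0:
            subset = np.arange(len(p))       # fall back to any condition
        selected.append(np.random.choice(subset))
    return selected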

This method permits exposing the evolving robots of any given generation to environmental conditions that have, overall, a similar level of difficulty. In other words, it reduces the stochasticity of the fitness measure due to the over- and under-estimation of the fitness of individuals that encountered easy and hard environmental conditions, respectively. Moreover, the algorithm permits selecting preferentially difficult environmental conditions through the usage of a power difficulty function. Adopting a linear difficulty function does not alter the overall difficulty of the selected environmental conditions with respect to a standard approach (in which the environmental conditions are selected randomly, with a uniform distribution, among all possible conditions). The usage of a power difficulty function, instead, implies that the subsets including easier environmental conditions are larger than the subsets including more difficult environmental conditions. Consequently, harder environmental conditions are selected more often than easier conditions.

The algorithm also permits selecting the environmental conditions that are challenging for the current evolving agents. This is due to the fact that the difficulty level of the environmental conditions is estimated on the basis of the performance displayed by recent evolving agents on those conditions. Notice also that the algorithm does not require domain knowledge and does not introduce additional hyperparameters.

The results collected on the long double pole and on the bipedal hardcore problems show that the curriculum learning algorithm speeds up learning and produces significantly better solutions than a standard algorithm (Milano & Nolfi, 2021).

12.4 Competitive learning and self-play

A third method that can promote long-term learning is competitive learning or self-play. In this method, the learning robot is situated in an environment containing one or more other learning robots that compete with it. This setting can automatically generate a form of incremental and curriculum learning, since the complexity of the task faced by the adaptive agents increases over time as a result of the progress made by the competitors, and since the competitors tend to generate environmental conditions which are challenging for the learning robots. Moreover, it can produce a form of open-ended learning, since the complexity of the task can increase in an unbounded manner.

A competitive scenario that has been widely studied through evolutionary methods concerns predator and prey robots, i.e. the concurrent evolution of two populations of robots rewarded for the ability to catch the competitor and to avoid being caught, respectively (Figure 12.1; Nolfi & Floreano, 1999). The prey is considered caught when touched by the predator. Each evolving robot is evaluated against multiple competitors selected among the best competitors of the current and, possibly, of previous generations.

Figure 12.1. Predator and prey robots (left and right, respectively) situated in an arena surrounded by walls. The cables have rotating contacts and provide the electric power to the two robots to enable long-term experiments (Floreano, Nolfi & Mondada, 1998)

Competitive evolutionary experiments, however, require special care to ensure that the co-evolutionary dynamics yields true progress and to verify whether such progress is occurring or not. To clarify this aspect we should distinguish among: local progress, i.e. progress against the current competitors; historical progress, i.e. progress against the competitors of previous generations; and global progress, i.e. progress against the competitors of previous and future generations (Miconi, 2008). Competitive evolution generally guarantees local progress but not historical and global progress. Indeed, local progress can be produced by re-adopting ancient solutions that are effective against contemporary competitors but ineffective against other competitors. In this situation, the co-adaptation process enters a limit-cycle dynamic characterized by the repeated rediscovery and abandonment of qualitatively similar strategies. This is indeed the dynamic observed in the experiments described in Nolfi & Floreano (1999). After a first phase producing true progress, the co-evolving robots keep producing local progress but achieve it by re-discovering and then abandoning the same set of strategies over and over again.

A method that can be used to favor historical progress consists in maintaining an archive including the best robots of previous generations (Rosin & Belew, 1997) which grows in size over generations or in maintaining an archive including only the best N non-identical robots of the previous generations (De Jong, 2005). The evolving individuals are evaluated against competitors selected randomly from the full archive, in the former case, or against all competitors included in the restricted archive, in the latter case.

An alternative approach is the generalist algorithm, which does not rely on archives but includes a procedure to discard opportunistic individuals, i.e. robots performing well against the competitors against which they were evaluated but poorly against other competitors (Simione & Nolfi, 2021). This is realized by organizing the evolutionary process in phases lasting N generations in which only the first or the second population evolves, in an alternating fashion. During each phase the algorithm operates by: (i) evolving a subset of one population against a subset of competitors for N generations, (ii) post-evaluating the evolved individuals at the end of the phase against all competitors, and (iii) discarding the individuals which achieved the worst performance against all opponents. The discarded set includes the opportunistic robots that progressed against the competitors encountered but regressed against the other competitors.

The algorithm was evaluated in a predator-prey setting (Simione & Nolfi, 2021). The observations of the robots include the state of the infrared sensors and the information extracted from an omnidirectional camera. The actions encode the desired speed of the two wheels. The maximum speed of the motors of the predator is lower than the prey’s. The predators are rewarded proportionally to the fraction of time required to capture the prey (0 in episodes in which the prey is not caught). The prey are rewarded proportionally to the fraction of time in which they manage to escape the predator (1.0 in episodes in which the prey is not caught).

The analysis of the robots evolved over 150,000 generations demonstrates that the generalist algorithm produces historical and global progress (Figure 12.2). Indeed, robots generally perform better against ancient competitors than against contemporary competitors, regardless of the specific generation considered. For example, predators of generation 150,000 achieve a fitness of about 0.6 against competitors of generation 10,000 and a fitness of about 0.25 against contemporary competitors. Prey of generation 150,000 achieve a fitness of about 0.9 against competitors of generation 10,000 and a fitness of 0.75 against contemporary competitors. Moreover, robots of any given generation generally perform better against opponents of future generations than robots of previous generations do. This implies that the strategies acquired by predators and prey of succeeding generations are not only more and more effective against previous opponents but also more and more effective in general, i.e. are more and more capable of defeating opponents.

Figure 12.2. Performance of predator and prey robots every 10,000 generations, post-evaluated against competitors of previous and following generations. Performance varies in the range [0.0, 1.0]. For each comparison, we indicate the performance of the predators. The performance of the prey corresponds to 1.0 minus the performance of the predators. The left and right figures display the results of the first replication and the average result of 5 replications, respectively. (Simione & Nolfi, 2021)

The visual inspection of the behavior displayed by the co-evolved robots indicates the possession of quite complex skills, including: avoiding fixed and moving obstacles, optimizing the motion trajectory with respect to multiple constraints, integrating sensory information over time and using this information to appropriately determine future actions, disorienting the competitor by producing irregular behaviors that are hard to anticipate (in the case of prey), and anticipating the behavior of the competitor and coping with the irregular behavior produced by the prey (in the case of predators). The analysis of the course of the evolutionary process also indicates that the complexity of the behaviors increases across generations, although progress in performance does not always lead to a complexification of the robots’ behavior (see Simione and Nolfi, 2021).

An interesting set of competitive experiments realized with reinforcement learning is described in Bansal et al. (2018). The authors considered both asymmetrical problems, like "kick and defend", in which a humanoid robot learns to shoot a ball toward a goal defended by a goalkeeper robot, and symmetrical problems, like "sumo", in which two robots learn to push the opponent out of the ring (see Video 12.2). Asymmetric problems are handled by training two network controllers which are embedded in the two corresponding robots. Symmetric problems, instead, are handled by training a single network controller which is embedded in both robots. The production of global progress is promoted by evaluating the learning robots against opponents selected randomly from archives containing older networks, i.e. copies of the control network saved after each training epoch.

The observation vector of the robots includes the position and velocity of the agent’s own joints, the contact forces acting on the agent’s body, the relative position of the opponent, and the angular position of the opponent’s joints. The robots are rewarded and punished with 1000 and -1000 at the end of won and lost episodes, respectively. To facilitate the learning process, the robots are also rewarded for walking forward and for being able to stand. However, these additional reward components are gradually annealed to zero during the initial phase of the training process. The training is continued for 5,000 to 10,000 epochs, depending on the problem. During each epoch the agent is evaluated against several competitors for a total of 409,600 steps.

As shown in Video 12.2, the method permits synthesizing rather effective behaviors. For example, the humanoids playing sumo display a stable fighting stance and knock the opponent with their head. The humanoids playing kick-and-defend display good kicking skills, use their feet to kick the ball high, and try to avoid the defender. The kicker also tries to fool the defender by moving left and right quickly once close to the ball. The defender responds to the motion of the kicker and uses both its hands and legs to obstruct the ball.

Video 12.2. Behaviors of robots trained to play "sumo" (push the opponent outside the ring or topple it over), "kick and defend" (kick the ball into the net and defend it), and "you shall not pass" (move forward and prevent the opponent from reaching the other side). The final part of the video shows the post-evaluation of a robot trained to play sumo situated alone on the ring and subjected to strong perturbation forces. (Bansal et al., 2018, https://www.youtube.com/watch?v=OBcjhp4KSgQ)

12.5 Intrinsic motivation

Finally, a fourth mechanism that can promote long-term and open-ended learning is intrinsic motivation (Schmidhuber, 2010), namely the realization of a learning process that encourages the robots to perform new behaviors and/or to experience new observations, independently of their effect on the extrinsic reward. This can be realized by using a reward function with two scalar real-valued components:

r(t) = g(rext(t), rint(t))

where rext(t) denotes the standard external reward linked to the ability to achieve a given goal, rint(t) denotes the intrinsic reward associated to new behaviors and/or new observations, and g maps pairs of real values to real values, e.g. g(a,b) = a + b.

The reason why intrinsic motivation can promote long-term and open-ended learning is that novel behaviors can later be reused to produce extrinsically rewarded behaviors. Similarly, novel observations can promote the development of the new behaviors afforded by them. In other words, the acquisition of novel abilities that are neutral with respect to the extrinsic reward increases the evolvability of the robots. For an analysis of the role of intrinsic motivation in natural systems and its implications for robotics, see Baldassarre & Mirolli (2013) and Cangelosi & Schlesinger (2015).

More specifically, the intrinsic reward should promote the production of novel behaviors and/or the access to novel observations which are predictable or compressible (i.e. which include regularities that can be used to predict or compress them). Accessing unpredictable observations, e.g. observations including irregular noise, and producing irregular actions, instead, is useless and should not be rewarded (Schmidhuber, 2010).

From the point of view of the learning robot, novel and potentially predictable observations and actions correspond to states including regularities that are not yet known but that can be discovered by a predictor or compressor network trained through self-supervision. This implies that the intrinsic reward can be set on the basis of the first derivative of the error of a predictor/compressor network trained on the sensory/motor states experienced by the learning robot (Schmidhuber, 1991; Oudeyer, Kaplan & Hafner, 2007). Indeed, the prediction or compression error will decrease during the production of behaviors and during the observation of states that are novel and that include learnable regularities. The error, instead, will not decrease during the production of irregular behaviors and/or the observation of irregular states. Moreover, the error will stop decreasing during the production of behaviors and/or the experience of observations that were novel but that lost their novelty after being produced or experienced several times.
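
A minimal sketch of an intrinsic reward based on the reduction of the prediction error is shown below. The smoothing coefficient and the class structure are illustrative assumptions, and the predictor itself is assumed to be trained elsewhere through self-supervision.

class LearningProgressReward:
    """Intrinsic reward proportional to the decrease of the (smoothed)
    prediction error, i.e. to its negative first derivative."""

    def __init__(self, smoothing=0.99):
        self.smoothing = smoothing
        self.avg_error = None

    def __call__(self, prediction_error):
        if self.avg_error is None:
            self.avg_error = prediction_error
            return 0.0
        previous = self.avg_error
        # smooth the error to avoid rewarding noisy fluctuations
        self.avg_error = self.smoothing * previous + (1.0 - self.smoothing) * prediction_error
        # positive while the predictor keeps improving, ~0 once the
        # experienced states are no longer novel or are unpredictable
        return max(previous - self.avg_error, 0.0)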

Additionally, one can encourage the adapting robots to maximize both long-term and short-term novelties, i.e. observations and behaviors that are novel with respect to the previous phases of the learning process and with respect to the previous phases of the evaluation episode, respectively (Badia et al., 2020). Moreover, one can usefully reduce the weight of the two corresponding intrinsic rewards at the corresponding time scales, i.e. reduce the weight of the long-term and short-term intrinsic rewards during the course of the adaptive process and during the course of the evaluation episodes, respectively (Badia et al., 2020). This enables the learning process to focus on the exploration of novelties in the initial phase and on the maximization of the extrinsic reward during the final phase. Moreover, it enables the robot to focus on the achievement of the goal in the final phase of the evaluation episode, after an initial phase dedicated to exploring the environment.

Unfortunately, learning to predict/compress new sensory/motor states while preserving the ability to predict/compress states experienced in previous phases of the learning process is challenging, especially when the adaptive process continues in the long term. Another challenge concerns the ability to separate predictable and unpredictable states/actions. Consequently, one might need to trade generality for simplicity and completeness for computational efficiency. An example of a simplified but computationally efficient method is random network distillation (Burda et al., 2018). The method relies on two feed-forward networks that receive as input the current observation (Figure 12.3). The target network is initialized with random weights and is not trained (Figure 12.3, left). The predictor network is trained to predict the output of the target network (Figure 12.3, right). The prediction error can be used as an intrinsic reward: the error will be high for novel observations that have not yet become familiar to the predictor network (a minimal sketch is shown after Figure 12.3). The model has a series of shortcomings: (i) it restricts itself to observations and ignores actions, (ii) it cannot detect regularities over time, and (iii) it is unable to discriminate between predictable and unpredictable observations. Despite this, it proves effective for problems with sparse rewards, e.g. the Atari games Freeway, Gravitar, Montezuma’s Revenge, Pitfall!, Private Eye, Solaris, and Venture, in which the probability of obtaining even a single positive reward at the beginning of the training process is minimal. Moreover, in combination with a method that automatically adapts the hyperparameters to the characteristics of the problem, it allowed achieving super-human performance on the full set of Atari games (Badia et al., 2020).

Figure 12.3. The distillation component includes a target network and a predictor network, shown on the left and right hand side respectively. The lower layer of the two networks encodes the observation. The number of output neurons of the two networks is arbitrary (adapted from Burda et al., 2018).
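
A minimal sketch of the mechanism, using two small feed-forward networks written with NumPy, is shown below. The architecture, the learning rate, and the update rule are illustrative assumptions rather than the original implementation of Burda et al. (2018).

import numpy as np

class RandomNetworkDistillation:
    """Intrinsic reward computed as the error of a predictor network trained
    to imitate a frozen, randomly initialized target network."""

    def __init__(self, ninputs, nhiddens=64, noutputs=32, lrate=0.001):
        self.Wt = np.random.randn(nhiddens, ninputs) * 0.1   # target network (frozen)
        self.Vt = np.random.randn(noutputs, nhiddens) * 0.1
        self.Wp = np.random.randn(nhiddens, ninputs) * 0.1   # predictor network (trained)
        self.Vp = np.random.randn(noutputs, nhiddens) * 0.1
        self.lrate = lrate

    def intrinsic_reward(self, observation):
        o = np.asarray(observation, dtype=float).reshape(-1, 1)
        target = np.dot(self.Vt, np.tanh(np.dot(self.Wt, o)))   # fixed random features
        hp = np.tanh(np.dot(self.Wp, o))
        prediction = np.dot(self.Vp, hp)
        error = prediction - target
        # one gradient descent step on the squared prediction error
        grad_Vp = np.dot(error, hp.T)
        grad_Wp = np.dot(np.dot(self.Vp.T, error) * (1.0 - hp ** 2), o.T)
        self.Vp -= self.lrate * grad_Vp
        self.Wp -= self.lrate * grad_Wp
        # the error is high for observations not yet familiar to the predictor
        return float(np.mean(error ** 2))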

As proposed by Lehman & Stanley (2011), one can also consider a radical form of curiosity-driven learning that operates on the basis of an intrinsic reward only. This means that the adaptive robots are rewarded exclusively for the ability to produce new behaviors. The behaviors produced are also evaluated on the basis of a conventional extrinsic reward, in order to identify the candidate solution that best suits a given desired function among all those generated during the adaptation process. The latter evaluation, however, does not affect the course of the adaptive process, which focuses exclusively on the production of behaviors that are novel with respect to those produced previously. As shown by the authors, this novelty search method outperforms conventional methods relying on extrinsic rewards in scenarios where the number of potential alternative solutions is limited. However, it is unlikely to scale to scenarios where the number of such solutions is huge or infinite.

12.6 Learn how

Read Section 13.17 and do Exercise 15 to familiarize yourself with competitive coevolution.



13. Learning to Train Robots through Evolutionary and Reinforcement Learning Methods

13.1 Introduction

This chapter explains how to train robots in simulation through evolutionary and reinforcement learning methods. The objective is to enable the reader to gain hands-on experience with these methods and with the software tools used by researchers working in these areas.

A basic knowledge of the Python programming language is required. A basic knowledge of C++ is recommended. 

The chapter includes 15 exercises that can be completed in about 25 hours.

13.2 AI Gym

A useful tool which can be used to simulate the robot, the environment, and their interaction is AI-Gym (https://gym.openai.com, Brockman et al., 2016). The tool includes a library of ready-to-use adaptive problems (called environments), facilitates the implementation of new environments, and introduces a programming standard which permits creating algorithms that can be applied directly to any AI-Gym environment without modifications.

The ready-to-use environments include: (i) a set of classic control problems (e.g. the cart-pole problem, the pendulum, the acrobot, etc.), (ii) the MuJoCo locomotion problems involving multi-segment robots with actuated joints that can be trained for the ability to swim, jump or walk (Todorov, Erez and Tassa, 2012), (iii) a series of robotic environments based on the Fetch robotic arm and on the Shadow Hand robot, and (iv) the Atari 2600 games (Bellemare et al., 2013). AI-Gym and the associated environments are available as free software, with the exception of the MuJoCo dynamics simulator, which is commercial (free one-year licenses are available for students only). Additional AI-Gym environments developed by third parties, including variations of many of the problems based on MuJoCo, are available as free software (see below).

AI-Gym environments are implemented as sub-classes of the Gym environment class, which includes methods for determining the number and type of sensors and actuators, accessing the state of the sensors, setting the state of the actuators, and advancing the simulation by one step.

To start practicing with Gym, install the AI-Gym package by following the instructions included in the https://gym.openai.com/docs/ page.

The principal commands available in Gym are summarized in the Python script included below, which: creates the CartPole-v0 environment (line 2), resets the position of the agent (line 3), and performs 200 simulation steps (line 4) during which it renders the agent and the environment graphically (line 5) and sets the state of the actuator randomly (line 6).

The instructions included in lines 3-6 can be used to perform an evaluation episode. In concrete applications, however, the action should be determined by a neural network that receives as input the observation vector returned by the env.step() function.


import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(200):
    env.render()
    observation, reward, done, info = env.step(env.action_space.sample())
env.close()

The “CartPole-v0” environment (created in line 2) consists of a cart that can move along the x axis with a pole attached on top through a passive joint. The task of the agent consists in controlling the actuator of the cart, by applying a positive or a negative force of fixed intensity along the x axis, so as to keep the pole balanced. The agent is evaluated during episodes lasting up to 200 steps, which are terminated prematurely when the inclination of the pole exceeds a given maximum threshold. When this happens, the done variable returned by the env.step() function becomes True.

The gym.make(env-name) function creates the environment named env-name.

The reset() function initializes the position of the agent and, possibly, the properties of the environment. This function is usually called at the beginning of each evaluation episode. In the case of the “CartPole-v0” environment, it initializes the position of the cart at the center of the x axis with the pole inclined at an angle selected randomly within a given range. The function returns the observation vector.

The render() function displays graphically the agent and the environment.

The step(action-vector) function updates the state of the actuators of the agent on the basis of the values included in the action vector, performs a simulation step, computes the reward for the step, and checks whether the episode should continue or terminate. It returns the observation vector, the reward gained in the current step, a Boolean value that indicates whether the episode should terminate or not, and a dictionary that might contain additional information. In the case of the “CartPole-v0” environment, the observation includes the position and the velocity of the cart along the x axis, the angle of the pole, and the velocity of the tip of the pole. The reward is 1 when the inclination of the pole does not exceed the maximum threshold and 0 otherwise.

The action_space.sample() function returns a random action vector. In the case of the CartPole-v0 environment, the action consists of a single integer (0 or 1) that specifies one of the two possible actions: applying a force to the cart toward the left or toward the right.

The type and the dimensionality of the observation and action vectors are stored in the observation_space and action_space variables. The observation and action spaces can be continuous (Box) or discrete (Discrete). For example, in the case of the CartPole-v0 problem, the observations are continuous and consist of a vector of four real numbers, while the action space is discrete and consists of an integer that can assume two alternative values. The number of elements forming a vector of continuous values, or the number of alternative values that a discrete space can assume, is indicated by the integer that follows the ‘Box’ or ‘Discrete’ label. The minimum and maximum values that the elements of the observation and action vectors can assume are stored in the observation_space.high, observation_space.low, action_space.high and action_space.low variables. See the example below:


import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)
print(env.observation_space.high)
#> array([ 2.4       ,         inf,  0.20943951,         inf])
print(env.observation_space.low)
#> array([-2.4       ,        -inf, -0.20943951,        -inf])

The implementation of the Gym functions (i.e. step(), reset(), render(), etc.) is contained in the Python file of the environment. For example, the implementation of the “CartPole-v0” environment is included in the https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py file.

Exercise 1. Familiarize yourself with the AI-Gym commands by using the “CartPole-v0” environment and the other classic control environments. Exploit the possibility to execute one command at a time and to print variables to familiarize yourself with the structure of the data. Inspect the cartpole.py and pendulum.py files and verify how the reward is calculated. Clearly, agents performing random actions are not able to solve their task. The agents should determine their actions appropriately on the basis of the current observation and, possibly, of previous observation/internal/action states. We will see how to realize that in the next sections.

13.3 Implement a neural network policy

This section illustrates how to implement a neural network policy which determines the action of the agent on the basis of the agent’s observation. The network consists of a 3-layer feed-forward neural network in which the first layer includes the sensory neurons encoding the observation, the second layer includes the internal neurons, and the third layer includes the motor neurons encoding the action. Each sensory neuron is connected to each internal neuron and each internal neuron is connected to each motor neuron. The activation of the internal and motor neurons is computed with the tanh() function. We assume that the observation is constituted by a vector of real numbers. The number of sensory neurons corresponds to the length of the observation vector. Moreover, we assume that the action is constituted by a vector of real numbers or by an integer in the range [0, amax-1]. In the former case the number of motor neurons corresponds to the length of the action vector. In the latter case, the network includes amax motor neurons and the action is set to the index of the most activated motor neuron. In the case of the CartPole-v0 environment, the observation vector is composed of four floating point values and the action is an integer that can assume two alternative values. Consequently, the network will include four sensory neurons and two motor neurons.

Exercise 2. Implement the neural network controller for an AI-Gym environment. Initialize the network with random parameters. You can use the following code to create the variables required and to initialize the connection weights and the biases parameters.


import gym
import numpy as np
env = gym.make('CartPole-v0')

pvariance = 0.1     # variance of initial parameters
ppvariance = 0.02   # variance of perturbations
nhiddens = 5        # number of internal neurons
# the number of inputs and outputs depends on the problem
# we assume that observations consist of vectors of continuous value
# and that actions can be vectors of continuous values or discrete actions
ninputs = env.observation_space.shape[0]
if (isinstance(env.action_space, gym.spaces.box.Box)):
    noutputs = env.action_space.shape[0]
else:
    noutputs = env.action_space.n

# initialize the training parameters randomly by using a gaussian
# distribution with average 0.0 and variance 0.1
# biases (thresholds) are initialized to 0.0
W1 = np.random.randn(nhiddens,ninputs) * pvariance      # first connection layer
W2 = np.random.randn(noutputs, nhiddens) * pvariance    # second connection layer
b1 = np.zeros(shape=(nhiddens, 1))                      # bias internal neurons
b2 = np.zeros(shape=(noutputs, 1))                      # bias motor neurons

During each step, the state of the first layer of neurons corresponds to the observation. The state of the second and third layers of neurons, instead, should be computed by propagating the activation through the connection weights from the sensory neurons to the internal neurons and then from the internal neurons to the motor neurons. The activation state of the motor neurons is used to determine the action vector. You can use the following code to compute the activation of the neurons and the resulting action.


# convert the observation array into a matrix with 1 column and ninputs rows
observation.resize(ninputs,1)
# compute the net input of the internal neurons
Z1 = np.dot(W1, observation) + b1
# compute the activation of the internal neurons with the tanh function
A1 = np.tanh(Z1)
# compute the net input of the motor neurons
Z2 = np.dot(W2, A1) + b2
# compute the activation of the motor neurons with the tanh function
A2 = np.tanh(Z2)
# if the action is discrete
#  select the action that corresponds to the most activated unit
if (isinstance(env.action_space, gym.spaces.box.Box)):
    action = A2
else:
    action = np.argmax(A2)

Finally, evaluate the agent for multiple evaluation episodes (e.g. 10 episodes) lasting up to 200 steps each and compute the fitness by summing the rewards collected during all steps and by averaging the result over the number of episodes. Episodes should terminate prematurely when the done value returned by the env.step() function is True.
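
For reference, the evaluation loop can be organized as in the following sketch, which simply reuses the env object, the weight matrices, and the forward pass defined in the previous code fragments (the number of episodes and of steps are the example values mentioned above):

nepisodes = 10                 # number of evaluation episodes
nsteps = 200                   # maximum number of steps per episode
fitness = 0.0
for e in range(nepisodes):
    observation = env.reset()
    for t in range(nsteps):
        # compute the action with the network, as shown above
        observation.resize(ninputs, 1)
        Z1 = np.dot(W1, observation) + b1
        A1 = np.tanh(Z1)
        Z2 = np.dot(W2, A1) + b2
        A2 = np.tanh(Z2)
        if (isinstance(env.action_space, gym.spaces.box.Box)):
            action = A2
        else:
            action = np.argmax(A2)
        observation, reward, done, info = env.step(action)
        fitness += reward        # sum the rewards collected during all steps
        if done:                 # terminate the episode prematurely when done is True
            break
fitness /= nepisodes             # average over the number of episodes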

13.4 Implement a steady state evolutionary strategy

This section illustrates how to implement the steady state evolutionary strategy (Pagliuca, Milano and Nolfi, 2018) illustrated in the pseudocode included below and how to use it to train the parameters of the neural network policy implemented in the previous section.

The algorithm starts by initializing the random seed (line 1) and the parameters of the population randomly (line 2). The population (θ) consists of a matrix of λ vectors of length p, where p corresponds to the number of parameters, i.e. the number of connection weights and biases of the neural network. Each vector thus encodes the parameters of a corresponding individual (genotype). Subsequently, for a certain number of generations, the algorithm measures the fitness of the individuals (line 5), ranks the individuals of the population on the basis of their fitness (line 6), and replaces the parameters of the λ/2 worst individuals with copies with variations of the λ/2 fittest individuals (line 9). Variations consist of a vector of random numbers of length p generated with a normal distribution with average 0 and variance σ. The fitness is measured by: (i) creating the policy network of the robot with the parameters specified in the vector, (ii) allowing the robot to interact with its environment for one or more evaluation episodes, and (iii) storing the sum of the rewards collected during the evaluation steps normalized by the number of episodes.

When the population includes only 2 individuals, the algorithm is equivalent to a stochastic hill-climber that operates on a single candidate solution by: (i) adding random variations to the parameters, and (ii) retaining or discarding the variations depending on whether they produce an increase or a decrease of the fitness, respectively. The usage of larger populations, however, generally leads to better results.
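
The following is a minimal sketch of the algorithm in Python. It assumes the availability of an evaluate(genotype) function, to be written by the reader, which creates the policy network from a vector of parameters and returns the average fitness collected over the evaluation episodes (e.g. the evaluation loop of the previous section); the hyperparameter values below are only indicative.

import numpy as np

nparams = 100        # number of parameters (connection weights and biases)
lam = 20             # population size (lambda)
sigma = 0.02         # amplitude of the perturbations
ngenerations = 100   # number of generations
seed = 11

np.random.seed(seed)                                  # initialize the random seed (line 1 of the pseudocode)
population = np.random.randn(lam, nparams) * 0.1      # initialize the population randomly (line 2)

for g in range(ngenerations):
    # measure the fitness of the individuals (line 5); evaluate() is assumed to be defined by the reader
    fitness = np.array([evaluate(population[i]) for i in range(lam)])
    rank = np.argsort(-fitness)                       # rank the individuals from the best to the worst (line 6)
    # replace the lambda/2 worst individuals with perturbed copies of the lambda/2 best individuals (line 9)
    for i in range(lam // 2):
        offspring = population[rank[i]] + np.random.randn(nparams) * sigma
        population[rank[lam // 2 + i]] = offspring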

Exercise 3. Implement the evolutionary strategy described above and use it to train the neural network policy of an agent on the CartPole-v0 environment. Initialize the population with random values extracted from a Gaussian distribution with average 0.0 and standard deviation 0.1. Initialize the parameters encoding the biases to 0.0. Run the evolutionary process a few times, varying the seed of the random number generator. Check whether the evolving robots develop an ability to solve the problem by rendering graphically the behavior of the evolved agents and analyze how performance varies in different replications of the experiment. Finally, test your program on other classic problems, e.g. Pendulum-v0.

13.5 Evorobotpy2: A software tool for evolving robots

Evorobotpy2 (https://github.com/snolfi/evorobotpy2) is a software library implemented by the author and by Paolo Pagliuca that permits evolving robots for AI-Gym environments. It is an extended version of the Evorobot and Farsa software tools, developed over many years (Nolfi, 2000; Nolfi and Gigliotta, 2010; Massera et al., 2013; Massera et al., 2014). Evorobotpy2 is characterized by simplicity and computational efficiency, includes additional environments complementary to those available in AI-Gym, and runs on Linux, Mac OS, and Windows.

Evorobotpy2 requires Python3, the cython, pyglet and matplotlib python packages, a command line C++ compiler, and the gsl GNU Scientific Library available from: https://www.gnu.org/software/gsl. To install the gsl libraries download and un-compress the installation file, enter the directory, and configure, compile and install the libraries with the commands “./configure”, “make”, and “make install”. You need root privileges to install the library for all users. The default installation directory might vary depending on the operating system and on the version of the library. Consequently, keep note of the installation directory and be ready to correct the include_gsl_dir variable specified in the setup*.py files when you compile the C++ modules of evorobotpy2.

The ./evorobotpy2/bin folder contains the python scripts that can be used to run evolutionary experiments and to analyze the results. The ./evorobotpy2/lib folder includes a series of C++ libraries wrapped in Python: more specifically, the net module, which permits creating and updating neural network policies, and a series of environments illustrated in the next sections. The evorobotpy2/pybullet folder includes files for customizing the Pybullet environments and other utilities illustrated in Section 13.11. The evorobotpy2/exercises folder includes source code samples for the exercises. The folders starting with the letter x (e.g. evorobotpy2/xacrobot) include initialization files with recommended hyperparameters.

The source code of the net module (evonet.cpp, evonet.h, utilities.cpp, utilities.h, net.pxd, net.pyx, and setupevonet.py) can be compiled and installed from the evorobotpy2/lib folder with the following commands:


python3 setupevonet.py build_ext --inplace
cp net*.so ../bin # or cp net*.dll ../bin in Windows

As mentioned above, if the compiler does not find the gsl library, you might need to correct the value of the include_gsl_dir variable in the setupevonet.py file. The compiled net*.so or net*.dll library files should then be copied into the evorobotpy2/bin folder. For more information on the net library see Section 13.18.

The evorobotpy2/bin/es.py script permits evolving and testing robots. It uses the net module and the following python scripts: (i) evorobotpy2/bin/policy.py, which creates and updates the policy network, (ii) evorobotpy2/bin/openaies.py, which contains an implementation of the OpenAI-ES algorithm (Salimans et al., 2017; Section 6.2.2), and (iii) evorobotpy2/bin/evoalgo.py, which includes functions for saving and loading data. For an explanation of the implementation of the OpenAI-ES algorithm and for directions on the usage of the multi-core version of the algorithm see Section 13.9.

To evolve the policy network for one of the problems listed in the ‘x’ folders, e.g. the xacrobot folder, enter the folder and run the command:


python3 ../bin/es.py -f acrobot.ini -s 11

The program parameter -f is mandatory and should be followed by the name of a text file containing the settings of the hyperparameters, i.e. the file acrobot.ini in this case. The program parameter -s specifies the seed of the random number generator. The seed number is also used to differentiate the names of the files containing the evolved robots and the associated performance data. To visualize an explanation of the program parameters and hyperparameters, execute the es.py script without specifying any program parameter.

The program will start to evolve a policy for the Acrobot problem with the hyperparameters specified in the evorobotpy2/xacrobot/acrobot.ini file. During the evolutionary process, the program displays after each generation the fraction of evaluation steps performed to date, the best fitness obtained to date, the best fitness obtained during post-evaluation to date, the fitness of the best individual of the current generation, the average fitness of the population, and the average absolute value of the parameters. Moreover, during the evolutionary process, the es.py script saves the parameters of the individuals that achieved the best fitness during evaluations and post-evaluations in the files bestSs.npy and bestgSs.npy, respectively, where s corresponds to the number of the seed. The program also saves performance data across generations in the files statSs.npy and FitSs.txt.

The behavior of an evolved agent can be observed by running the following command from the ./evorobotpy2/xacrobot folder:


python3 ../bin/es.py -f acrobot.ini -g bestgS11.npy -p

where the program parameter -g is used to indicate the name of the file containing the genotype of the evolved robot and -p is used to post-evaluate the robot by graphically rendering the robot's behavior. To display the behavior of a robot at the beginning of the evolutionary process specify the program parameter -p only, i.e. do not specify the file containing the genotype to be loaded.

To display the fitness throughout generations and the average and standard deviation of performance among multiple runs, you can use the following commands:


python3 ../bin/plotstat.py
python3 ../bin/plotave.py

Exercise 4. Run a few replications of the acrobot experiment by using different seeds. You can use the evorobotpy2/xacrobot/acrobot.ini file to set the hyperparameters. While the program is running, check the implementation of the Acrobot environment at https://github.com/openai/gym/blob/master/gym/envs/classic_control/acrobot.py to figure out the content of the observation vector, the content of the action vector, and the way in which the reward is calculated. Plot the fitness across generations and then observe the behavior of the best evolved robot stored in the bestgSs.npy file, where s is the seed number. You might want to load this file instead of the bestSs.npy file since the robot that achieved the best performance during the post-evaluation is usually better than the robot that achieved the best performance during the normal evaluation.

13.6 Evolve robots with co-adapted body and brain through Evorobotpy2 and Box2D

In this section we will see how to evolve the morphology and the brain of robots selected for the ability to walk with evorobotpy2. Moreover, we will briefly introduce the PyBox2D dynamic simulator (https://github.com/pybox2d/pybox2d).

The co-evolution of body and brain can be realized in different manners (see Chapter 3). Here we use the method described in Pagliuca and Nolfi (2020) in which the body is formed by a fixed number of elementary body elements with variable properties. The genotype of the evolving individuals consists of a vector of floating point values with a fixed length which includes the morphological parameters and the connection weights and biases of the robot’s neural network.

Video 13.1. An example of robot with coevolved body and brain (https://www.youtube.com/watch?v=8gn2Qstmaw8)

More specifically, the body of the robot is composed of 11 cylindrical segments attached in sequence through actuated hinge joints (see Video 13.1). Each joint has a default angle, which determines the intermediate angular orientation of each element with respect to the previous element, and two angular limits, which determine the minimum and maximum rotation angles of the joint. The morphological parameters include, for each element, the length, the diameter, the default angle with respect to the previous element, and the clockwise and anti-clockwise maximum rotation angles with respect to the default angle.

The observation of the robot includes the position angle and the angular velocity of each joint, the state of a contact sensor located on each body element, and the velocity and the absolute angle of the intermediate element. The action vector encodes the torque applied to each joint.

The implementation of this environment is contained in the file evorobotpy2/bin/envBodybrainCustom.py and is based on the PyBox2D simulator which is normally installed together with AI-Gym. If not, see installation instructions in https://github.com/pybox2d/pybox2d.

The robot, the environment, and the interaction between the two can be simulated in different manners. In simple cases, one can implement a simulation from scratch by using the Python programming language. This is the method used to implement the classic control problems in AI-Gym, i.e. the cart-pole, the pendulum, the acrobot, etc. Alternatively, the simulation can be realized by using a more efficient programming language wrapped in Python. This is the method used for the environments included in evorobotpy2, which are implemented in C++ and wrapped in Python. In more complex cases, however, the usage of a physical simulation library is strongly recommended. These libraries take care of the simulation of Newtonian physics and of collisions. Consequently, they reduce the programming burden to the description of the body elements composing the robot and the environment. Box2D is a fast and easy to use dynamical simulation library restricted to 2D environments. We will introduce a tool that permits simulating 3D systems in Section 13.11.

Exercise 5. Run a few replications of the ./evorobotpy2/xbodybrain experiment. Observe the behavior and the morphology of the evolved agents. Describe how they vary in different replications. Analyze whether the morphological and behavioral properties are co-adapted, i.e. whether the morphological characteristics of the evolved agents are adapted to the characteristics of the controller and vice versa. Check the file ./evorobotpy2/bin/bodybrain.py to verify how the agent is implemented by using the Box2D primitives and how the morphological parameters are encoded in the genotype of the evolving robots.

13.7 Evolving wheeled robots by using a fast kinematic simulator

In this Section we will see how to carry on experiments with wheeled robots by using the fast kinematic simulator included in evorobotpy2. More specifically, we will use this software to replicate the experiment of the foraging robot described in Section 4.4.

Dynamical simulators, like Box2D and Pybullet, are necessary for simulating accurately the dynamics of articulated robots and the effects of collisions. In the case of wheeled robots moving on a flat surface, instead, kinematic simulations can suffice. Kinematic simulations are much faster than dynamic simulations and consequently permit reducing considerably the time required to evolve the robots. For this reason, evorobotpy2 includes a set of environments involving wheeled robots that rely on kinematic simulations implemented in C++. The environment that we will consider here is the ErDiscrim environment, which involves a Khepera robot (Mondada, Franzi & Ienne, 1993), i.e. a small wheeled robot with infrared sensors and two motorized wheels. The robot is situated in a flat arena surrounded by walls which contains a cylindrical object located in a randomly selected position. The task of the robot consists in finding and remaining near the cylinder.

The source code of the experiment (discrim.cpp, discrim.h, utilities.cpp, utilities.h, robot-env.cpp, robot-env.h, ErDiscrim.pxd, ErDiscrim.pyx, and setupErDiscrim.py) can be compiled and installed from the ./evorobotpy2/lib folder with the following commands:


python3 setupErDiscrim.py build_ext --inplace
cp ErDiscrim*.so ../bin # or cp ErDiscrim*.dll ../bin

In case of compilation errors, please check and, if necessary, correct the name of the directory of your GSL library in the file setupErDiscrim.py.

Take a look at the source files to understand how the environment is simulated. The discrim.cpp and discrim.h files include the definition of the gym functions (i.e. env.reset(), env.step(), env.render(), etc.) and the calculation of the reward. The robot-env.cpp and robot-env.h files include a set of functions that simulate the translation and rotation movements of the robot over the plane, the sensors of the robot, and the occurrence of collisions. Evaluation episodes are terminated prematurely when a collision occurs. Consequently, there is no need to model the effect of collisions in detail. The robot-env.cpp and robot-env.h files also permit simulating two other wheeled robots: the ePuck (Mondada et al., 2009) and the MarXbot (Bonani et al., 2010).

Exercise 6. Run 10 replications of the experiment from the ./evorobotpy2/xdiscrim folder by using different seeds. To speed up the training process you can run 2 instances of the program in parallel, or more, depending on the cores available on your computer. Inspect the behavior displayed by the best evolved robot of each replication. Group the evolved robots into families relying on similar strategies and describe how the different strategies approach the problem. In this experiment the robot is provided with an LSTM network (see Section 10.3). Describe how the evolved robots use memory to vary the way in which they react to the current observations.

13.8 Plot the phase portrait of an agent evolved to balance a pendulum

This Section describes how to plot the phase portrait of an actuated pendulum controlled by a neural network with random and evolved connection weights.

As in the case of Figure 5.6, we should plot a two-dimensional state space in which the horizontal and vertical axes indicate the angle and the velocity of the pendulum. To generate the data to be plotted we can use the AI-Gym simulator of the actuated pendulum included in the file ./gym/gym/envs/classic_control/pendulum.py. More precisely, we can generate the data by evaluating the pendulum with its neural network controller for a few episodes and by storing the angular position and velocity of the pendulum during each step in a file. We can then plot the data by writing a python program that loads the data from the file and plots, for each pair of successive steps, an arrow that indicates the variation of the angular position and velocity of the pendulum.
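
For the plotting part, a possible sketch based on matplotlib is shown below. It assumes that the angle and angular velocity recorded at each step have been saved, one pair per row, in a text file (here called phasedata.txt, an arbitrary name):

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt("phasedata.txt")      # each row contains the angle and the velocity recorded at one step
angle = data[:, 0]
velocity = data[:, 1]
# draw an arrow from each state to the next state
# (if the file contains multiple episodes, the arrows connecting the end of one
# episode to the beginning of the next should be removed)
plt.quiver(angle[:-1], velocity[:-1],
           angle[1:] - angle[:-1], velocity[1:] - velocity[:-1],
           angles='xy', scale_units='xy', scale=1.0, width=0.003)
plt.xlabel("angle")
plt.ylabel("angular velocity")
plt.show()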

You can generate multiple plots displaying the phase portraits of the agent at different stages of the evolutionary process, by starting from the agent of generation 0 that has randomly generated connection weights. To evolve the pendulum agent you can use the .ini file included in the ./evorobotpy2/xpendulum folder.

When you run an evolutionary process, evorobotpy2 saves the parameters of the agent that achieved the best performance during post-evaluation to date in the file bestgSs.npy. This implies that the best agents of previous generations are overwritten. To obtain the best agent after a certain number of generations, you can make a copy of the bestgSs.npy file immediately after evorobotpy2 processes the relevant generation, or you can extend evorobotpy2 with instructions for saving the best genotype every n steps or every n generations.

Exercise 7. Plot the phase portrait of the actuated pendulum. Compare the phase portraits of the agent at different stages of the evolutionary process. Describe how the dynamics displayed by the agent at the beginning of the evolutionary process is altered throughout generations in a way that permits swinging up and balancing the pendulum.

13.9 The Evorobotpy2 implementation of the OpenAI-ES algorithm

In this section we will analyze the evorobotpy2 implementation of the OpenAI-ES algorithm (Salimans et al., 2017; Section 6.2.2).

The single thread implementation of the algorithm is included in the file ./bin/openaies.py. The generation and evaluation of the population is performed in the evaluate() method shown below.


def evaluate(self):
        cseed = self.seed + self.cgen * self.batchSize  # Set the seed for current generation
        self.rs = np.random.RandomState(cseed)
        self.samples = self.rs.randn(self.batchSize, self.nparams)
        self.cgen += 1

        # evaluate samples
        candidate = np.arange(self.nparams, dtype=np.float64)
        for b in range(self.batchSize):
            for bb in range(2):
                if (bb == 0):
                    candidate = self.center + self.samples[b,:] * self.noiseStdDev
                else:
                    candidate = self.center - self.samples[b,:] * self.noiseStdDev
                self.policy.set_trainable_flat(candidate)
                self.policy.nn.normphase(0) # normalization data is collected during the post-evaluation of the best sample
                eval_rews, eval_length = self.policy.rollout(self.policy.ntrials, seed=(self.seed + (self.cgen * self.batchSize) + b))
                self.samplefitness[b*2+bb] = eval_rews
                self.steps += eval_length

        fitness, self.index = ascendent_sort(self.samplefitness)       # sort the fitness
        self.avgfit = np.average(fitness)                         # compute the average fitness

        self.bfit = fitness[(self.batchSize * 2) - 1]
        bidx = self.index[(self.batchSize * 2) - 1]
        if ((bidx % 2) == 0):                                     # regenerate the genotype of the best samples
            bestid = int(bidx / 2)
            self.bestsol = self.center + self.samples[bestid] * self.noiseStdDev
        else:
            bestid = int(bidx / 2)
            self.bestsol = self.center - self.samples[bestid] * self.noiseStdDev

        self.updateBest(self.bfit, self.bestsol)                  # Stored if it is the best obtained so far
		
        # postevaluate best sample of the last generation
        # in openaiesp.py this is done the next generation, move this section before the section "evaluate samples"
        gfit = 0                                                                   # to produce identical results
        if self.bestsol is not None:
            self.policy.set_trainable_flat(self.bestsol)
            self.tnormepisodes += self.inormepisodes
            for t in range(self.policy.nttrials):
                if self.policy.normalize == 1 and self.normepisodes < self.tnormepisodes:
                    self.policy.nn.normphase(1)
                    self.normepisodes += 1  # we collect normalization data
                    self.normalizationdatacollected = True
                else:
                    self.policy.nn.normphase(0)
                eval_rews, eval_length = self.policy.rollout(1, seed=(self.seed + 100000 + t))
                gfit += eval_rews
                self.steps += eval_length
            gfit /= self.policy.nttrials
            self.updateBestg(gfit, self.bestsol)

Line 117 creates the matrix of random numbers that constitutes the population. The size of the matrix is determined by the batchSize hyperparameter, which encodes the number of perturbation vectors (each of which generates two symmetrical individuals), and by nparams, which encodes the number of parameters of each individual. Lines 125 and 127 create the symmetrical offspring, i.e. two copies of the parameters of the parent (center) perturbed by adding and by subtracting a vector of random numbers (samples[b]). The perturbation vectors consist of random numbers generated with a standard gaussian distribution scaled by the noiseStdDev hyperparameter (which is usually set to 0.02). The genotype of each individual is used to set the parameters of the corresponding neural network controller and, when the morphology is also evolved, the parameters of the corresponding body through the function set_trainable_flat() (line 128). The individuals are then evaluated in line 130 through the policy.rollout() function. Finally, the vector containing the fitness of the population is sorted in ascending order in line 134.

The remaining part of the evaluate() method is used to post-evaluate the best offspring and, when appropriate, to update the vectors containing the best individual and the best post-evaluated individual obtained to date.

The calculation of the gradient and the update of the center of the population is performed in the optimize() method shown below.


def optimize(self):
        popsize = self.batchSize * 2      # compute a vector of utilities [-0.5,0.5]
        utilities = zeros(popsize)
        for i in range(popsize):
            utilities[self.index[i]] = i
        utilities /= (popsize - 1)
        utilities -= 0.5
		
        weights = zeros(self.batchSize)   # Assign the weights (utility) to samples on the basis of their fitness rank
        for i in range(self.batchSize):
            idx = 2 * i
            weights[i] = (utilities[idx] - utilities[idx + 1])    # merge the utility of symmetric samples

        g = 0.0
        i = 0
        while i < self.batchSize:         # Compute the gradient (the dot product of the samples for their utilities)
            gsize = -1
            if self.batchSize - i < 500:  # if the popsize is larger than 500, compute the gradient for each 500
                gsize = self.batchSize - i
            else:
                gsize = 500
            g += dot(weights[i:i + gsize], self.samples[i:i + gsize,:])
            i += gsize
        g /= popsize                      # normalize the gradient for the popsize
		
        if self.wdecay == 1:
            globalg = -g + 0.005 * self.center # apply weight decay
        else:
            globalg = -g

        # adam stochastic optimizer
        a = self.stepsize * sqrt(1.0 - self.beta2 ** self.cgen) / (1.0 - self.beta1 ** self.cgen)
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * globalg
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * (globalg * globalg)
        dCenter = -a * self.m / (sqrt(self.v) + self.epsilon)
		
        self.center += dCenter                                    # move the center in the direction of the momentum vectors
        self.avecenter = np.average(np.absolute(self.center))

The instructions included in lines 170-175 create a vector of utilities of size batchSize*2 with values ranging from -0.5 to 0.5. Lines 177-180 compute the weights that determine the contribution of each perturbation vector to the estimated gradient of the expected fitness. Notice that the weight of each perturbation vector is calculated as the difference between the utility of the individual generated by adding the perturbation and the utility of the individual generated by subtracting it. The gradient of the expected fitness is calculated in line 190 by computing the dot product of the weights and of the perturbation vectors (samples) and by normalizing the result by the population size. The center of the population is then updated in line 205 on the basis of the Adam stochastic optimizer, which is implemented in lines 200-203. Notice that the center of the population is not updated directly on the basis of the gradient estimated in the current generation but rather on the basis of the m and v momentum vectors, which are updated on the basis of the gradient estimated in each generation. Notice also how the stepsize used to update the center of the population (a) starts small and approaches the value indicated by the stepsize hyperparameter after about 1000 generations. The initial reduction of the stepsize is used by Adam to reduce the effect of the bias caused by the fact that the momentum vectors are initialized to null values, i.e. to an arbitrary value. The weight decay normalization is performed in line 195. The virtual batch normalization, instead, is implemented in the net library, i.e. in the library that is used to set the activation of the sensory neurons and to calculate the activation of the internal and motor neurons on the basis of the parameters encoded in the genotype.
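
To make the rank-based transformation concrete, the toy example below applies the same computations to the fitness of three symmetric pairs of offspring, using np.argsort in place of the ascendent_sort() helper used in openaies.py. A positive weight indicates that adding the perturbation produced a better fitness than subtracting it, so that the center is pushed in the direction of that perturbation.

import numpy as np

samplefitness = np.array([3.0, 8.0, 1.0, 5.0, 9.0, 2.0])  # fitness of the offspring (+, -, +, -, +, -)
popsize = len(samplefitness)
index = np.argsort(samplefitness)          # indices of the offspring sorted from the worst to the best

utilities = np.zeros(popsize)              # rank-based utilities in the range [-0.5, 0.5]
for i in range(popsize):
    utilities[index[i]] = i
utilities /= (popsize - 1)
utilities -= 0.5

weights = np.zeros(popsize // 2)           # one weight for each perturbation vector
for i in range(popsize // 2):
    weights[i] = utilities[2 * i] - utilities[2 * i + 1]

print(utilities)                           # [-0.1  0.3 -0.5  0.1  0.5 -0.3]
print(weights)                             # [-0.4 -0.6  0.8]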

At each generation, the algorithm prints a text line with the number of the seed, the generation, the number of steps performed so far, the fitness of the best individual of the current generation, the fitness obtained by the best individual during the post-evaluation, the best fitness achieved so far, the best fitness achieved during post-evaluations so far, and the average absolute size of the parameters (see line 235 of the file ./bin/openaies.py).

Exercise 8. Run a few evolutionary experiments by using the classic control problems. Verify the stability of the OpenAI-ES evolutionary algorithm, i.e. whether the fitness keeps increasing monotonically over generations, whether it increases over generations by recovering from temporary and moderate regression phases, or whether it increases over generations while occasionally displaying severe regression phases. For this purpose use the data printed during each generation or the data stored in the statSs.npy files, which can be visualized with the python3 ../bin/plotstat.py command.

To use the parallel implementation of the algorithm you should run the program with the -w program parameter, which indicates the number of workers that you want to use. The workers evaluate the members of the population in parallel. This implies that by using n workers you can reduce the time necessary to complete the evolutionary process approximately by a factor of n.

To determine the number of workers you should consider: (i) the maximum number of cores available in your machine, (ii) the size of the population, and (iii) the fact that one worker is used to post-evaluate the best individual of the previous generation. For example, if the size of the population is 40 and your machine has 6 cores, you can set the number of workers to 5, so that each of the remaining four workers evaluates 10 individuals and one worker post-evaluates the best individual of the last generation. Instead, if your machine has 48 cores, you can set the number of workers to 41.
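
For example, assuming the acrobot configuration of Section 13.5 and a population of 40 individuals, a run distributed over 5 workers can be launched from the xacrobot folder with a command like the following:

python3 ../bin/es.py -f acrobot.ini -s 11 -w 5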

13.10 Setting Evorobotpy2 hyperparameters

Evolutionary algorithms generally require setting only a few hyperparameters (i.e. the population size, the mutation rate and few others) and are robust with respect to the hyperparameter setting, i.e. they tend to work reasonably well even when the values of the hyperparameters are not optimized. Setting up an adaptive experiment, however, also requires setting additional hyperparameters that determine the architecture of the neural network and the conditions in which the adapting agents are evaluated. The setting of these additional parameters is often critical.

For many standard problems you will find recommended hyperparameters in the .ini text files included in the experimental folders of evorobotpy2, i.e. the folders starting with the letter ‘x’.

Evorobotpy2 hyperparameters are divided into three sections that specify the name of the environment (EXP section), the characteristics of the policy network and the way in which the agents are evaluated (POLICY section), and the evolutionary algorithm used and the associated hyperparameters (ALGO section). You can print the list of all hyperparameters and the associated default values by executing the ./evorobotpy2/bin/es.py script without specifying any program parameter.

The evolutionary algorithm can be specified with the hyperparameter algo. The algorithms available in evorobotpy2 are: OpenAI-ES, i.e. a single-thread and a parallel implementation of the OpenAI-ES algorithm (one of the most effective state of the art algorithms, see Salimans et. al. 2017, Pagliuca, Milano and Nolfi, 2020), SSS, i.e. the stochastic steady state algorithm described in Section 13.4 (see also Pagliuca and Nolfi, 1998), and coevo2, i.e. a competitive co-evolutionary algorithm (see Section 13.14 and Simione and Nolfi, 2020).

Commonly used policies include: feed-forward multilayer networks with one or more internal layers, recurrent networks, long short-term memory (LSTM) networks, and convolutional networks. Feed-forward multilayer networks constitute the default choice. Recurrent and LSTM networks should be used in problems that benefit from the possibility of determining the current action vector also on the basis of previous observations, internal states, and actions. Unfortunately, as we discussed in Chapter 4, identifying whether a problem requires a recurrent network or not is far from trivial, since problems often admit qualitatively different solutions that are difficult for the experimenter to identify in advance. Consequently, you should rely on your intuition or verify experimentally whether the usage of recurrent or LSTM networks is advantageous. In general, LSTM networks work better than simple recurrent neural networks. Convolutional networks are generally useful in problems that require processing visual information. You should rely on your intuition also to determine the number of layers and the number of internal neurons per layer, by considering that the optimal number of layers and of neurons correlates with the size of the action and observation vectors and with the complexity of the problem. The type of the policy, the number of layers, and the number of internal neurons per layer can be specified with the hyperparameters architecture, nlayers and nhiddens. The activation function of the internal and motor neurons can be set with the afunction hyperparameter. You can specify the activation function of the motor neurons with the out_type hyperparameter, if it differs from the activation function of the internal neurons. The tanh function can be considered the default choice for the activation function. The relu function usually works better in combination with convolutional networks. For the motor neurons, the linear function is often the better choice since it permits computing a gradient over the entire activation range. The advantage of using the tanh function is that it naturally squashes the activation of the motors within a bounded range. The winit hyperparameter determines the connection weight initialization method. The xavier method (Glorot & Bengio, 2010) is usually a suitable choice in combination with tanh neurons. The norm incoming method (see Salimans et al., 2017) works well in combination with linear motor neurons.

The action_noise and action_noise_range hyperparameters can be used to apply noise to the motor neurons. The addition of noise permits increasing the robustness of the evolving solutions. Usually the addition of a small amount of noise with a fixed distribution works well. However, the usage of a diagonal gaussian policy method, in which the amount of noise applied to each motor neuron is encoded in a genetically specified parameter, can be beneficial for problems in which exploration plays an important role. This method implies that the amount of noise affecting each motor is initially high and can vary during the course of the evolutionary process.

The normalized hyperparameter normalizes the distribution of the values of the sensors through the virtual batch normalization method (Salimans et al., 2017). This form of normalization is useful when the range of the sensors is highly variable. Problems that include velocity sensors are an example of this kind, since the activation values that velocity sensors can assume vary widely during the course of the evolutionary process.

The maxsteps hyperparameter determines the maximal duration of evaluation episodes in steps. This value should be sufficiently high to evaluate the relative efficacy of learning agents including agents displaying poor performance.

The episodes hyperparameter determines the number of evaluation episodes. Evaluating the robots for multiple episodes helps to maximize the robustness of the robots with respect to environmental variations and is particularly important in problems in which the environmental conditions vary significantly among evaluation episodes. In the case of modern evolutionary strategies, like the OpenAI-ES, a similar effect can be obtained by using large population sizes.

The pepisodes hyperparameter determines the number of episodes used to post-evaluate the best individual of each generation. The higher the number of post-evaluation episodes, the higher the chance that the individual selected as the best is truly the best.

Finally, as far as the parameters of the evolutionary algorithm are concerned, the hyperparameter maxmsteps determines the duration of the evolutionary process in millions of steps. The minimal duration necessary to discover effective solutions depends on the complexity of the problem. The duration can also be expressed in number of generations. However, the total number of steps correlates better with the actual duration and the computational cost of the evolutionary process.

The hyperparameters of the evolutionary algorithm depend on the algorithm chosen. In the case of the OpenAI-ES algorithm, the size of the population can be set with the sampleSize hyperparameter. Since each sample is used to generate two symmetrical offspring, the actual population size is twice the sampleSize. A suitable value for this hyperparameter is 250, but it can be reduced down to 20 for problems that are not particularly difficult. The standard deviation of the perturbations is encoded in the noiseStdDev hyperparameter. The default value of this hyperparameter (0.02) works well for most problems. The amplitude of the step in the direction of the gradient is determined by the stepsize hyperparameter. Also in this case the default value of 0.01 works well in the majority of the cases. The wdecay hyperparameter determines the application of the weight decay normalization, which is usually beneficial.

In the case of the SSS algorithm, instead, the population size and the amplitude of the perturbations can be set through the popsize and mutrate hyperparameters. The default values of 20 and 0.02 work well in the majority of the cases.

13.11 Pybullet Locomotors

In this section we introduce the Pybullet dynamic simulator (https://pybullet.org/, Coumans & Yunfei, 2016-2019) and the locomotors problems, an interesting family of environments that are widely used to benchmark adaptive algorithms.

Pybullet is a powerful 3D dynamical simulator that comes as a free software and can be used to carry on experiments with several interesting robotic platforms including the Ghost Minitaur quadruped (https://www.ghostrobotics.io/), KUKA robotics arms (https://kuka.com),  the Atlas humanoid robot (https://www.bostondynamics.com/), the iCub Humanoid robot (Sandini et al., 2004), the Baxter robot (https://www.rethinkrobotics.com/), the Franka Emika Panda manipulator (https://www.franka.de/technology), the Cassie biped robot (https://www.agilityrobotics.com/), the Laikago quadruped robot (www.unitree.cc), the MIT racecar (https://mit-racecar.github.io/), and the Turtlebot wheeled robot (https://www.turtlebot.com/).

Pybullet also includes additional AI-Gym environments. A particularly interesting family of these environments is constituted by the locomotors, a series of robots with varying shapes (ant-like, quadruped, humanoid, etc.) that can be trained for the ability to swim, jump, and walk. They are a more realistic implementation of the original locomotors implemented with the MuJoCo dynamic simulator (Todorov, Erez and Tassa, 2012) included in the AI-Gym library.

Follow the instructions included in https://pybullet.org/ to install the library.

You can familiarize yourself with the morphology of the locomotors and with their actuators by running the following command from the ./evorobotpy2/pybullet/showrobot folder:


python3 showrobot.py

Follow the instructions displayed by the program to load one of the locomotor models and to manually control its actuators, so as to familiarize yourself with the actuators and the morphology of the robots. Notice that some locomotors, e.g. the Ant and the Humanoid, operate in a real 3D environment while others, e.g. the Hopper and the Halfcheetah, are constrained along the third dimension through fake joints.

The version 0 of the locomotors (e.g. HopperBulletEnv-v0, AntBulletEnv-v0, etc.) included in the Pybullet installation has reward functions optimized for reinforcement learning algorithms. To use version 5 (e.g. HopperBulletEnv-v5, AntBulletEnv-v5, etc.), with reward functions optimized for evolutionary strategies (see Pagliuca, Milano and Nolfi, 2020), you should: (i) add the registration instruction included in the evorobotpy2/pybullet/init.py file to the <pybullet-folder>/pybullet_envs/__init__.py file, and (ii) copy the files evorobotpy2/pybullet/gym_locomotion_envs2.py and evorobotpy2/pybullet/robot_locomotors.py into the <pybullet-folder>/pybullet_envs/ folder. The first modification is required to register version 5 of the problems. The two additional files contain the implementation of version 5 of the locomotors with reward functions optimized for evolutionary algorithms.

Observe the behavior displayed by robots pre-evolved with evorobotpy2 by loading the bestgSs.npy files included in the evorobotpy2/xhopper, evorobotpy2/xhalfcheetah, evorobotpy2/xant, evorobotpy2/xwalker and evorobotpy2/xhumanoid folders.

Exercise 9. Inspect the gym_locomotion_envs.py and gym_locomotion_envs2.py files and/or check Pagliuca, Milano and Nolfi (2020) to identify the differences between the reward functions optimized for evolutionary and reinforcement learning algorithms. Evolve the robots with version 0 of the problems and then compare their behavior with that of the robots evolved by using version 5. Can you explain why the original reward functions are not suitable for evolutionary strategies?

13.12 Implementing a new AI-Gym environment

In this Section we will see how to create a new AI-Gym environment based on the Pybullet dynamical simulator. The example that we will consider is an adapted version of a tutorial developed by Backyard Robotics. It constitutes Exercise 10.

The environment to be implemented includes a simple wheeled robot that can be trained for the ability to balance. We can thus name the environment BalanceBotBulletEnv-v0 and the folder containing the implementation of the environment balance-bot.

13.12.1 Structuring the balancing bot project

The project requires the following folders and files (for more detailed instructions see the AI-Gym documentation):


balance-bot/
  README.md
  setup.py
  balance_bot/
    __init__.py
    envs/
      __init__.py
      balancebot_env.py

The file setup.py should include the libraries required, in this case gym and pybullet:


from setuptools import setup
setup(name='balance_bot',
      version='0.0.1',
      install_requires=['gym', 'pybullet'])

The file __init__.py should include a register instruction with the name of the environment (the name that will be passed to the function gym.make() to create the environment) and the entry point of the environmental class in the format:  {base module name}.envs:{Env subclass name}:


from gym.envs.registration import register
register(
    id='BalanceBotBulletEnv-v0',
    entry_point='balance_bot.envs:BalancebotEnv',)

Finally, the file /envs/__init__.py should include the instruction for importing the environment:


from balance_bot.envs.balancebot_env import BalancebotEnv

13.12.2 Creating the robot

The balancing bot is a robot with two motorized wheels on the same axis and a central body that should be maintained upright by controlling appropriately the motor of the wheels. The robot has three sensors that encode the orientation and angular velocity of its body and the angular velocity of its wheels. Moreover, it has an actuator that controls the target angular velocity of the wheels. The central body element consists of a cuboid. The two wheels are attached to one side of the cuboid.

The physical structure of the robot can be specified in a text file that follows the URDF (Universal Robot Description Format) introduced in ROS (Quigley et al., 2009). A prefilled version of the balancebot_simple.xml file is included in the ./evorobotpy2/exercises/ folder. Analyze the file to understand how bodies and joints are specified. Pybullet includes the URDF files of many robotic platforms. Consequently, constructing an environment based on those platforms simply requires loading the appropriate URDF file.

13.12.3 Implementing the environment class

The balancebot_env.py file should include the class that inherits from the gym.Env class and the implementation of the associated methods: reset(), step(), render(), and seed().

The head of the file should include the instructions for importing the components required and the definition of the environmental class:


import os
import math
import numpy as np

import gym
from gym import spaces
from gym.utils import seeding

import pybullet as p
import pybullet_data

class BalancebotEnv(gym.Env):
    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second' : 50
    }

The __init__() method should allocate and initialize the observation and action vectors. The properties of the observation and action spaces can be initialized with the spaces.Box() method. In the case of the balance-bot environment, the action space includes a single continuous value which encodes the variation of the target velocity of the wheels. The observation space includes three continuous values which encode the inclination of the robot, the angular velocity of the robot, and the angular velocity of the wheels, respectively.


    def __init__(self, render=False):
        # the action encodes the variation of the target velocity of the wheels
        self.action_space = spaces.Box(-1., 1., shape=(1,), dtype='float32')
        # observation encodes pitch, gyro, com.sp.
        self.observation = []
        self.observation_space = spaces.Box(np.array([-math.pi, -math.pi, -5]),
                np.array([math.pi, math.pi, 5]), dtype='float32')
        self.connectmode = 0
        # starts without graphic by default
        self.physicsClient = p.connect(p.DIRECT)
        # used by loadURDF
        p.setAdditionalSearchPath(pybullet_data.getDataPath())
        self.seed()

The seed() method sets the seed of the random number generator, which is used to initialize the posture of the robot at the beginning of evaluation episodes.


    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

The reset() method loads the files containing the description of the environmental plane and of the robot, initializes the position, velocity, and orientation of the robot, and computes and returns the observation. The orientation of the robot is set randomly within a given range.


    def reset(self):
        self.vt = 0      # current velocity of the wheels
        self.maxV = 24.6 # max velocity, 235 RPM = 24.609142453 rad/sec
        self.envStepCounter = 0
        p.resetSimulation()
        p.setGravity(0, 0,-10) # m/s^2
        p.setTimeStep(0.01)    # the duration of a step in sec
        planeId = p.loadURDF("plane.urdf")
        robotStartPos = [0,0,0.001]
        robotStartOrientation = p.getQuaternionFromEuler([self.np_random.uniform(low=-0.3, high=0.3),0,0])
        path = os.path.abspath(os.path.dirname(__file__))
        self.botId = p.loadURDF(os.path.join(path, "balancebot_simple.xml"),
                           robotStartPos,
                           robotStartOrientation)
        self.observation = self.compute_observation()
        return np.array(self.observation)

The step() method sets the state of the actuators, advances the simulation of one step, computes the new observation, computes the reward, checks whether the conditions for terminating prematurely the episode are satisfied, and increases the step counter. The method receives as input the action vector and returns the observation vector, the reward, a Boolean value that indicates whether the episode should terminate, and an empty dictionary.


    def step(self, action):
        self.set_actuator(action)
        p.stepSimulation()
        self.observation = self.compute_observation()
        reward = self.compute_reward()
        done = self.compute_done()
        self.envStepCounter += 1
        status = "Step " + str(self.envStepCounter) + "  Reward " + '{0:.2f}'.format(reward)
        p.addUserDebugText(status, [0,-1,3], replaceItemUniqueId=1)
        return np.array(self.observation), reward, done, {}

The render() method initializes the graphic window that visualizes the robot and the environment.


    def render(self, mode='human', close=False):
        if (self.connectmode == 0):
            p.disconnect(self.physicsClient)
            # connect the graphic renderer
            self.physicsClient = p.connect(p.GUI)
            self.connectmode = 1
        pass

Finally, the following four methods: (i) update the desired velocity of the actuators, (ii) compute the observation on the basis of the current position, orientation and velocity of the robot, (iii) compute the reward, and (iv) check whether the episode should terminate.


    def set_actuator(self, action):
        deltav = action[0]
        vt = np.clip(self.vt + deltav, -self.maxV, self.maxV)
        self.vt = vt
        p.setJointMotorControl2(bodyUniqueId=self.botId,
                                jointIndex=0,
                                controlMode=p.VELOCITY_CONTROL,
                                targetVelocity=vt)
        p.setJointMotorControl2(bodyUniqueId=self.botId,
                                jointIndex=1,
                                controlMode=p.VELOCITY_CONTROL,
                                targetVelocity=-vt)

    def compute_observation(self):
        robotPos, robotOrn = p.getBasePositionAndOrientation(self.botId)
        robotEuler = p.getEulerFromQuaternion(robotOrn)
        linear, angular = p.getBaseVelocity(self.botId)
        return (np.array([robotEuler[0],angular[0],self.vt], dtype='float32'))

    def compute_reward(self):
        # receive a bonus of 1 for balancing and pay a small cost proportional to speed
        return 1.0 - abs(self.vt) * 0.05
		
    def compute_done(self):
        # episode ends when the barycentre of the robot is too low or after 500 steps
        cubePos, _ = p.getBasePositionAndOrientation(self.botId)
        return cubePos[2] < 0.15 or self.envStepCounter >= 500

13.12.4 Installing and using the environment

You can install the environment with the following instructions:


cd balance-bot
pip install -e .

Finally, you can use BalanceBotBulletEnv-v0 as a standard gym environment after importing it with the following command:


import balance_bot

For example, you can train the robot with the evorobotpy2/bin/es.py script after adding the import statement in the script.
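
Before launching a full training, it can be useful to verify that the environment is registered correctly and behaves as expected by running it for some steps with random actions, as in the following sketch:

import gym
import balance_bot

env = gym.make('BalanceBotBulletEnv-v0')
observation = env.reset()
for t in range(100):
    action = env.action_space.sample()                 # random action
    observation, reward, done, info = env.step(action)
    if done:                                           # restart the episode when the robot falls
        observation = env.reset()
env.close()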

13.12.5 Creating C++ environments

For an example of an environment implemented in C++ and wrapped in Python, see the long double-pole balancing problem available from https://github.com/snolfi/longdpole. A version of this environment optimized for evorobotpy2 is included in the evorobotpy2 software.

13.13 Back-propagation

Reinforcement learning algorithms use back-propagation to reduce the loss, i.e. the error between the desired and the actual state of the motor neurons. The loss is reduced by computing its gradient with respect to the parameters and by varying the parameters in the direction that reduces the loss. As illustrated in Chapter 11, back-propagation can also be used to train feature-extracting networks.

The implementation of back-propagation is facilitated by the availability of tools, like torch (https://pytorch.org/, Collobert, Bengio, & Mariéthoz, 2002) and tensorflow (https://github.com/tensorflow/tensorflow, Abadi et al., 2016), which provide ready-to-use methods. These tools also permit using GPU processors to speed up the process.

An example of a back-propagation algorithm implemented in pytorch which can be used to train a neural network to perform a regression problem, i.e. to map a list of input vectors into a corresponding list of desired output vectors, is included in the ./evorobotpy2/exercises/regression.py script shown below:

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim
import numpy
import matplotlib.pyplot as plt

# hyper-parameters
TrainingEpochs = 5000                # the duration of the training process in epochs
nhiddens = 2                         # the number of internal neurons
DisplayEvery = 100                   # the frequency with which the program display data
                                     # the training set can be changed by modifying the inputs and targets matrices below

# define the input patterns. They consist of a list of tensors. Tensors are a special data type,
# analogous to vectors and matrices, used by pytorch which can also be processed in parallel on a GPU
inputs = torch.Tensor([[0,0],[0,1], [1,0], [1,1]])

# define the desired output patterns. They consist of a list of tensors
targets = torch.Tensor([[0],[1],[1],[0]])

# define the class which implements the neural network
class Net(nn.Module):                         # the network class

    def __init__(self):                       # initialize the network
        super(Net, self).__init__()
        self.c1 = nn.Linear(inputs.size(1), nhiddens, True)  # creates the first layer of connection weights from the input neurons to the internal neurons
        self.c2 = nn.Linear(nhiddens, targets.size(1), True) # creates the second layer of connection weights from the internal neurons to the output neurons

    def forward(self, inp):                   # update the activation of the neurons
        hid = torch.sigmoid(self.c1(inp))     # computes the activation of the internal layer by using the sigmoid activation function
        out = self.c2(hid)                    # computes the activation of the output layer by using a linear activation function
        if ((epoch % DisplayEvery) == 0):
            for i in inp:
                print("%.2f " % (i.numpy()), end="")          # print the activation of the input neurons
            print("-> ", end="")
            for h in hid:
                print("%.2f " % (h.detach().numpy()), end="") # print the activation of the hidden neurons
            print("-> ", end="")
            for o in out:
                print("%.2f " % (o.detach().numpy()), end="") # print the activation of the output neurons
        return out

net = Net()                                      # creates the network
print("3-layers feed-forward neural network with %d input, %d internal, and %d output neurons created " % (inputs.size(1), nhiddens, targets.size(1)))

lossfunction = nn.MSELoss()                      # creates a Mean Square Error loss function
print("We use the Mean Squared Error Loss Function") # for other type of loss function see the pytorch documentation

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9) # create a Vanilla Gradient Descent Optimizer
print("We use the SGD Optimizer with learning rate set to 0.01 and momentum set to 0.9")
print("")

print("We train the network")
for epoch in range(0, TrainingEpochs):
    if ((epoch % DisplayEvery) == 0):
        print("Epoch %d, we print the activation of the neurons, the loss, the weights and the biases" % (epoch))
    epochloss = 0
    for inputp, targetp in zip(inputs, targets):
        optimizer.zero_grad()                  # initialize the gradient buffer to 0.0
        output = net(inputp)                   # activate the network and obtain the output
        loss = lossfunction(output, targetp)   # compute the loss
        epochloss += loss.data.numpy()         # compute the loss on the entire training set
        if ((epoch % DisplayEvery) == 0):
            print(" loss %.2f" % (loss.data.numpy()))
        loss.backward()                        # compute the gradient
        optimizer.step()                       # call the optimizer which updates the parameters by using the gradient to reduce the loss
    if ((epoch % DisplayEvery) == 0):          # display the parameters, i.e. the weights of the two layers and the biases of the internal and output layer
        w1 = net.c1.weight.detach().numpy()
        w2 = net.c2.weight.detach().numpy()
        b1 = net.c1.bias.detach().numpy()
        b2 = net.c2.bias.detach().numpy()
        print(numpy.concatenate((w1.flatten(), w2.flatten(),b1.flatten(), b2.flatten())))
    if (epochloss < 0.0001):                   # terminate the training when the loss is lower than a threshold
        break

print("post-evaluation at the end of training")
for inputp, targetp in zip(inputs, targets):                       # post-evaluate the network at the end of training
    outputp = net(inputp)
    print("input: ", end="")
    for i in inputp:
        print("%.3f " % (i.detach().numpy()), end="")              # print the activation of the output neurons
    print("output: ", end="")
    for o in outputp:
        print("%.3f " % (o.detach().numpy()), end="")              # print the activation of the output neurons
    print("error: ", end="")
    for o, t in zip(outputp, targetp):
        print("%.3f " % (t.detach().numpy() - o.detach().numpy())) # print the error, i.e. the difference between the desired and actual outputs
....
....

In this example the regression problem consists in mapping the [1.0, 0.0] and [0.0, 1.0] observation vectors to [1.0] and the [0.0, 0.0] and [1.0, 1.0] observation vectors to [0.0]. The training set, i.e. the list of observation and desired output vectors, is defined in lines 35 and 38. The size of the observation and action vectors is used to determine the size of the input and output layers of the network (lines 45-46). The architecture of the neural network is defined in lines 45-49 by specifying the size of the connection layers (lines 45-46) and the activation function of the neurons (line 49). By default, neurons are updated with a linear function. Lines 65 and 68 specify the loss function and the optimizer to be used, i.e. the mean squared error loss function and the SGD optimizer in this case. The calculation of the loss and of the gradient and the update of the parameters are performed automatically during each training epoch by calling the loss.backward() and optimizer.step() methods (lines 84-85).

You should install the torch and matplotlib python packages to run this script. The latter package is used to display the features extracted by the internal neurons in problems with a two-dimensional observation space. If you do not want to install matplotlib, you can comment out lines 108-128.

Exercise 11. Inspect closely the source code of the script. Run the program several times and observe the results. Notice that the script also illustrates how to access and print the parameters and the activation of the sensory, internal, and motor neurons. This can require converting the data from tensors, the data type used by torch, to numpy vectors. Modify the regression problem by modifying the training set. You can change the number of patterns, the length of the observation and action vectors, and the content of the vectors.

The efficacy of back-propagation can be improved by dividing the training set in batches, by randomly varying the vectors included in each batch during each training epoch, by computing the gradient and updating the parameters after each batch, and by normalizing the activation of the neurons. A minimal sketch of the mini-batch variant is shown below.
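
The following self-contained sketch illustrates only the division of the training set in randomly reshuffled mini-batches by using the torch.utils.data utilities. It uses a small network and an XOR-like training set analogous to those of the script above; the network architecture, the hyperparameters, and the variable names are illustrative choices, not part of the original script.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# XOR-like training set analogous to the one used in the script above
inputs = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
targets = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

net = nn.Sequential(nn.Linear(2, 5), nn.Sigmoid(), nn.Linear(5, 1))  # small feed-forward network
lossfunction = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9)

dataset = TensorDataset(inputs, targets)                  # pairs each observation with the desired output
loader = DataLoader(dataset, batch_size=2, shuffle=True)  # reshuffles the patterns and divides them in batches of 2

for epoch in range(5000):
    for batch_inputs, batch_targets in loader:            # one gradient computation and one parameter update per batch
        optimizer.zero_grad()
        loss = lossfunction(net(batch_inputs), batch_targets)
        loss.backward()
        optimizer.step()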

13.14 Cross-entropy

Cross-entropy is a minimal reinforcement learning algorithm. It uses a stochastic policy network to generate variations, collects the observations experienced and the actions produced in each step during the evaluation episodes, and uses these data to generate a training set that increases the probability of producing good actions. More precisely, it operates by increasing the probability of producing the actions performed in the evaluation episodes in which the robot collected the highest total reward.

In the example illustrated in the ./evorobotpy2/exercises/crossentropy.py script shown below, the algorithm is applied to a problem with a discrete action space. The policy network includes N motor neurons that encode the probability of executing each possible action. During each step the action is selected randomly on the basis of the probability distribution encoded in the action vector of the policy network.

import gym
from collections import namedtuple
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# hyper-parameters
NINTERNAL = 128              # Number of internal neurons
NEPISODES = 16               # Number of episodes performed in each iteration
PERCENTILE = 70              # Percentile threshold: only the episodes with a total reward above this percentile are used to train the policy network


class Net(nn.Module):        # The policy network class
    def __init__(self, obs_size, ninternal, nactions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, ninternal),
            nn.ReLU(),                        # The internal neurons use the ReLU activation function
            nn.Linear(ninternal, nactions)    # The output neurons encode the probability of executing the n different possible actions
        )

    def forward(self, x):
        return self.net(x)

# define two tuple helper classes by using the collection package
Episode = namedtuple('Episode', field_names=['reward', 'steps'])               # store the total undiscounted reward and the steps, i.e. the list of observations and actions of an episode
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action']) # list of observations and action of an episode


def rollout(env, net, nepisodes): # carry out the evaluation episodes and store the data (cumulative reward, observations, and actions)
    rolldata = []           # initialize the rollout data list
    episode_reward = 0.0    # initialize the episode_reward
    episode_steps = []      # initialize the list containing the rollout data
    obs = env.reset()       # initialize the episode
    sm = nn.Softmax(dim=1)  # create a softmax class
    while True:
        obs_v = torch.FloatTensor([obs])                                    # convert the observation vector to a tensor vector
        act_probs_v = sm(net(obs_v))                                        # compute the action vector with the policy network and convert the action vector to a probability distribution with softmax
        act_probs = act_probs_v.data.numpy()[0]                             # convert the action tensor vector in a numpy vector
        action = np.random.choice(len(act_probs), p=act_probs)              # chooses the action randomly by using the action probabilities
        next_obs, reward, is_done, _ = env.step(action)                     # perform a step
        episode_reward += reward                                            # update the total undiscounted reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))   # append the observation and the action to the data
        if is_done:                                                         # at the end of the episode:
            rolldata.append(Episode(reward=episode_reward, steps=episode_steps))   # create the data of the episode, i.e. the summed undiscounted reward and the list of observation and actions
            episode_reward = 0.0                                                # re-initialize the reward
            episode_steps = []                                                  # re-initialize the steps
            next_obs = env.reset()                                              # re-initialize the episode
            if len(rolldata) == nepisodes:
                yield rolldata                                                  # return a list of data collected during the rollout
                rolldata = []
        obs = next_obs                                                       # set the next observation


def filter_rollout(rolldata, percentile): # filter out the data of the worst episodes
    rewards = list(map(lambda s: s.reward, rolldata))     # extracts the list of cumulative rewards collected in the corresponding episodes
    reward_bound = np.percentile(rewards, percentile)  # computes the minimum reward which episodes should have to be used for training the policy network
    reward_mean = float(np.mean(rewards))              # computes the reward mean

    train_obs = []                                     # initializes the matrix of observation to be used for training
    train_act = []                                     # initializes the matrix of actions to be used for training
    for episode in rolldata:
        if episode.reward < reward_bound:              # check whether the reward of the episode exceeds the minimal bound
            continue                                                        # data of low performing episodes are discarded
        train_obs.extend(map(lambda step: step.observation, episode.steps)) # store the observation for training
        train_act.extend(map(lambda step: step.action, episode.steps))      # store the actions for training

    train_obs_v = torch.FloatTensor(train_obs)         # transform the observation matrix in a tensor
    train_act_v = torch.LongTensor(train_act)          # transform the action matrix in a tensor
    return train_obs_v, train_act_v, reward_bound, reward_mean


if __name__ == "__main__":
    print("CartPole-v0 trained through a crossentropy reinforcement learning algorithm")
    print("Nepisodes x epoch: %d fraction episode discarded %.2f" % (NEPISODES, PERCENTILE))
    env = gym.make("CartPole-v0")               # create the environment
    nsensory = env.observation_space.shape[0]   # extract the number of sensory neurons
    nmotor = env.action_space.n                 # extract the number of motor neurons by assuming that the action space of the environment is discrete

    net = Net(nsensory, NINTERNAL, nmotor)      # create the network policy
    print("The policy network has %d sensory, %d internal and %d motor neurons" % (nsensory, NINTERNAL, nmotor))
    objective = nn.CrossEntropyLoss()           # the Cross Entropy Loss function combines cross entropy and softmax. It works with raw values (logits), thus avoiding the need to add a softmax layer to the policy network
    optimizer = optim.Adam(params=net.parameters(), lr=0.01) # initialize the Adam stochastic optimizer
    print("")

    for epoch, rolldata in enumerate(rollout(env, net, NEPISODES)):
        obs_v, acts_v, reward_b, reward_m = filter_rollout(rolldata, PERCENTILE)
        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        print("epoch %d: loss=%.3f, reward_mean=%.1f, reward_threshould=%.1f" % (epoch, loss_v.item(), reward_m, reward_b))
        if reward_m > 199:
            print("Solved!")
            break

During each training epoch, the algorithm: (i) performs N evaluation episodes (lines 45-67), (ii) stores the total cumulative reward and the list of observations and actions produced during the episodes (line 58), (iii) eliminates the data of the episodes with the worst cumulative reward (lines 70-85), and (iv) uses the remaining data to train the policy, i.e. to increase the probability of producing the actions performed in the best episodes (lines 101-107).

Despite its simplicity, this method works well in simple environments and is robust with respect to the setting of the hyperparameters.

The aspects that are missing in this algorithm and which will be illustrated in the next section are: the computation and usage of the discounted expected reward, the implementation and usage of the critic network, and the calculation of the advantage.
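
To anticipate these notions, the following minimal sketch shows how the discounted expected reward (return) and the advantage can be computed from the rewards collected during an episode and from the values predicted by a critic network. The numbers and the variable names are purely illustrative and do not correspond to a specific library implementation.

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # computes, for each step, the sum of the future rewards discounted by gamma
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = np.array([0.0, 0.0, 1.0, 0.0, 1.0])  # rewards collected during one episode (illustrative)
values = np.array([0.3, 0.4, 0.9, 0.2, 0.8])   # corresponding predictions of a hypothetical critic network

returns = discounted_returns(rewards)
advantages = returns - values  # positive advantages indicate actions that produced better-than-predicted returns
print(returns)
print(advantages)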

The script requires the gym, numpy, and torch python packages (the collections module used to define the tuple helper classes is part of the Python standard library).

Exercise 12. Inspect closely the source code of the script. Run the program several times and observe the results. Modify the script to train a policy network for another environment of your choice.

13.15 The Stable Baselines3 implementation of the PPO algorithm

In this section we will analyze the Stable Baselines3 (https://github.com/DLR-RM/stable-baselines3) implementation of the PPO algorithm (Schulman et al., 2017), a state-of-the-art reinforcement learning algorithm that is particularly effective for robotic applications. PPO is an extended version of the vanilla policy gradient algorithm described in Section 6.3.3 which includes a mechanism for penalizing excessive variations of the policy during training.

The main loop of the algorithm is included in the train() method of the PPO class [file ./stable_baselines3/ppo/ppo.py, lines 166-285]:

    def train(self) -> None:
        """
        Update policy using the currently gathered rollout buffer.
        """
        # Update optimizer learning rate
        self._update_learning_rate(self.policy.optimizer)
        # Compute current clip range
        clip_range = self.clip_range(self._current_progress_remaining)
        # Optional: clip range for the value function
        if self.clip_range_vf is not None:
            clip_range_vf = self.clip_range_vf(self._current_progress_remaining)

        entropy_losses = []
        pg_losses, value_losses = [], []
        clip_fractions = []

        continue_training = True

        # train for n_epochs epochs
        for epoch in range(self.n_epochs):
            approx_kl_divs = []
            # Do a complete pass on the rollout buffer
            for rollout_data in self.rollout_buffer.get(self.batch_size):
                actions = rollout_data.actions
                if isinstance(self.action_space, spaces.Discrete):
                    # Convert discrete action from float to long
                    actions = rollout_data.actions.long().flatten()

                # Re-sample the noise matrix because the log_std has changed
                # TODO: investigate why there is no issue with the gradient
                # if that line is commented (as in SAC)
                if self.use_sde:
                    self.policy.reset_noise(self.batch_size)

                values, log_prob, entropy = self.policy.evaluate_actions(rollout_data.observations, actions)
                values = values.flatten()
                # Normalize advantage
                advantages = rollout_data.advantages
                advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

                # ratio between old and new policy, should be one at the first iteration
                ratio = th.exp(log_prob - rollout_data.old_log_prob)

                # clipped surrogate loss
                policy_loss_1 = advantages * ratio
                policy_loss_2 = advantages * th.clamp(ratio, 1 - clip_range, 1 + clip_range)
                policy_loss = -th.min(policy_loss_1, policy_loss_2).mean()

                # Logging
                pg_losses.append(policy_loss.item())
                clip_fraction = th.mean((th.abs(ratio - 1) > clip_range).float()).item()
                clip_fractions.append(clip_fraction)

                if self.clip_range_vf is None:
                    # No clipping
                    values_pred = values
                else:
                    # Clip the different between old and new value
                    # NOTE: this depends on the reward scaling
                    values_pred = rollout_data.old_values + th.clamp(
                        values - rollout_data.old_values, -clip_range_vf, clip_range_vf
                    )
                # Value loss using the TD(gae_lambda) target
                value_loss = F.mse_loss(rollout_data.returns, values_pred)
                value_losses.append(value_loss.item())

                # Entropy loss favor exploration
                if entropy is None:
                    # Approximate entropy when no analytical form
                    entropy_loss = -th.mean(-log_prob)
                else:
                    entropy_loss = -th.mean(entropy)

                entropy_losses.append(entropy_loss.item())

                loss = policy_loss + self.ent_coef * entropy_loss + self.vf_coef * value_loss

                # Calculate approximate form of reverse KL Divergence for early stopping
                # see issue #417: https://github.com/DLR-RM/stable-baselines3/issues/417
                # and discussion in PR #419: https://github.com/DLR-RM/stable-baselines3/pull/419
                # and Schulman blog: http://joschu.net/blog/kl-approx.html
                with th.no_grad():
                    log_ratio = log_prob - rollout_data.old_log_prob
                    approx_kl_div = th.mean((th.exp(log_ratio) - 1) - log_ratio).cpu().numpy()
                    approx_kl_divs.append(approx_kl_div)

                if self.target_kl is not None and approx_kl_div > 1.5 * self.target_kl:
                    continue_training = False
                    if self.verbose >= 1:
                        print(f"Early stopping at step {epoch} due to reaching max kl: {approx_kl_div:.2f}")
                    break

                # Optimization step
                self.policy.optimizer.zero_grad()
                loss.backward()
                # Clip grad norm
                th.nn.utils.clip_grad_norm_(self.policy.parameters(), self.max_grad_norm)
                self.policy.optimizer.step()

            if not continue_training:
                break
...
...

The train() method collects the rollout_data, i.e. the observations, the log_probs (the log probabilities of the actions), the actions, and the values (the predicted discounted returns) experienced and produced by the policy network during multiple evaluation episodes. Moreover, it computes the discounted returns and the advantages on the basis of the reward received in each step during the evaluation episodes and on the basis of the values computed by the critic network (see Section 6.3.3). This is realized by calling the rollout_buffer.get() method (line 188), which is defined in the file ./stable_baselines3/common/buffers.py. Notice that the rollout_buffer.get() method divides the rollout data in multiple batches and returns the data as a list of batches. Then, for each batch, the algorithm: (i) re-evaluates the stored observations and actions with the current policy to obtain the updated values, log probabilities, and entropy, (ii) normalizes the advantages, (iii) computes the ratio between the probability of the actions under the new and the old policy and uses it to compute the clipped surrogate policy loss, (iv) computes the value loss, optionally clipping the variation of the predicted values, and the entropy loss, (v) combines the three losses into a single loss, (vi) computes an approximation of the KL divergence between the old and the new policy and stops the training epoch early when it exceeds the threshold specified by the target_kl hyperparameter, and (vii) updates the parameters of the policy through back-propagation after clipping the norm of the gradient.

Exercise 13. Install Stable Baselines3 and run some experiments by following the documentation and the examples illustrated in https://github.com/DLR-RM/stable-baselines3. Inspect closely the source code until you have a detailed understanding of it, with particular reference to the portion illustrated above. Run the program several times and observe the results. In addition, observe the features extracted by the back-propagation example illustrated in Section 13.13.
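
As a starting point for the exercise, the following minimal sketch, based on the usage examples included in the Stable Baselines3 documentation, trains a PPO policy on the CartPole-v1 environment with default hyperparameters and then post-evaluates it.

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v1")             # create the environment
model = PPO("MlpPolicy", env, verbose=1)  # create a PPO model with a feed-forward policy
model.learn(total_timesteps=100000)       # train the policy

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)  # post-evaluate the trained policy
print("mean reward %.1f +/- %.1f" % (mean_reward, std_reward))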

13.16 Run reinforcement learning experiments with Stable Baselines

Stable Baselines (https://stable-baselines.readthedocs.io/en/master/, Hill et al., 2018) is a library that includes implementations of several of the most powerful reinforcement learning algorithms, i.e. A2C, ACER, ACKTR, DDPG, DQN, GAIL, HER, PPO, SAC, TD3 and TRPO. It is an improved and documented version of the OpenAI Baselines (Dhariwal et al., 2017).

Stable Baselines can be used in combination with RL Baselines Zoo (https://github.com/araffin/rl-baselines-zoo, https://stable-baselines.readthedocs.io/en/master/, Raffin, 2018), a companion library which includes Python scripts for training and testing agents, a collection of pre-trained agents, and a tool for optimizing hyperparameters. The hyperparameters optimized for different algorithms and environments can be inspected in the folder ./rl-baselines-zoo/hyperparams (see also Section 13.13.14).

Another valuable resource that you might use is Spinning Up (https://spinningup.openai.com/, Achiam, 2018), a well-documented library designed for educational purposes. Both tools require Tensorflow (https://github.com/tensorflow/tensorflow, Abadi et al., 2016). Stable Baselines3 (https://github.com/DLR-RM/stable-baselines3) is a version based on PyTorch (https://pytorch.org/) which, however, is not yet as complete as the original version.

13.16.1 Installing Stable Baselines and RL-Baselines-Zoo

To install Stable Baselines, follow the directions included in https://stable-baselines.readthedocs.io/en/master/. Then clone the rl-baselines-zoo repository with the following command:


git clone https://github.com/araffin/rl-baselines-zoo.git
cd rl-baselines-zoo

13.16.2 Training and testing with the rl-baselines-zoo scripts

To train an agent, you can use the train.py script from the ./rl-baselines-zoo folder by specifying the name of the algorithm and the name of the environment with the --algo and --env program parameters. For example, to train a policy network for the CartPole-v1 environment with the PPO algorithm you can use the following command from the rl-baselines-zoo folder:


python train.py --algo ppo2 --env CartPole-v1 -f logs -tb logs/ppo2/cartpole1

where -f indicates the folder in which the program will save the trained policy, and -tb indicates the folder in which the program will save the monitoring data that can be displayed with tensorboard (see below). The trained policy will be saved in the file ./logs/ppo2/CartPole-v1_1/best_model.zip where the _1 number corresponds to the ID number of the experiment. If you run a second experiment with the same algorithm and environment the ID number is automatically incremented.

To visualize the behavior of a trained agent you can use the enjoy.py script from the rl-baselines-zoo folder as shown below:


python enjoy.py --algo ppo2 --env CartPole-v1 -f logs --exp-id 1 --load-best

To visualize the behavior of a pre-trained agent you should only specify the algorithm and the environment, as shown in the example below. The program will automatically load a pre-trained agent from the ./rl-baselines-zoo/trained_agents folder.


python enjoy.py --algo ppo2 --env HumanoidBulletEnv-v0

To visualize the learning data with tensorboard you can run the command below from the rl-baselines-zoo folder and then follow the instructions to visualize the data in your browser. Tensorboard will update the graphic display automatically during the training process as soon as new data is generated.


tensorboard --logdir logs/ppo2/cartpole1

Finally, to create a video file with the behavior of the agent you can use the following command from the rl-baselines-zoo folder:


python -m utils.record_video --algo ppo2 --env CartPole-v1 -n 1000 -f logs --exp-id 2

where the -n program parameter indicates the length of the video in steps.

13.16.3 Monitoring the training process

Monitoring the course of the training process is particularly important in reinforcement learning, since the success of the learning process often depends on the setting of the hyperparameters and since the monitored indicators can provide hints on how to set them.

The indicators that can be visualized with tensorboard include: the cumulative reward obtained in the current episode (episode reward), the discounted reward, the advantage, the probability distribution of the actions, the policy loss, the value loss, the entropy loss, and the distribution divergence. The entropy measures how random, i.e. how exploratory, the action distribution produced by the policy is. The distribution divergence, which can be calculated with the Kullback-Leibler method, measures how much the distribution of the actions produced by the policy varies before and after the update of the parameters. A minimal numerical illustration of these two indicators is shown below.
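
The sketch below, which is purely illustrative and not part of any library, computes the entropy of a discrete action probability distribution and the Kullback-Leibler divergence between the action distributions produced by a policy before and after a parameter update. The probability values are invented for the sake of the example.

import numpy as np

def entropy(p):
    # entropy of a discrete action distribution: high for nearly uniform (exploratory) policies,
    # low for nearly deterministic policies
    return -np.sum(p * np.log(p))

def kl_divergence(p_old, p_new):
    # Kullback-Leibler divergence: how much the action distribution changed after the update
    return np.sum(p_old * np.log(p_old / p_new))

p_old = np.array([0.25, 0.25, 0.25, 0.25])  # action probabilities before the update (illustrative)
p_new = np.array([0.40, 0.30, 0.20, 0.10])  # action probabilities after the update (illustrative)

print("entropy before: %.3f  after: %.3f" % (entropy(p_old), entropy(p_new)))
print("KL divergence: %.3f" % kl_divergence(p_old, p_new))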

The text output produced by the program includes additional indicators: the average episode reward (ep_reward_mean), the average episode length (ep_len_mean), and the explained variance.

13.16.4 Setting hyperparameters

The ./rl-baselines-zoo/train.py script has two mandatory flags (--algo and --env) that should be used to specify the algorithm and the environment and an optional -params (--hyperparams) flag that can be used to set the other hyperparameters. The default hyperparameters are loaded automatically from the file ./hyperparams/algo.yml (where algo is the name of the chosen algorithm). The -params flag can be used to overwrite the default parameters. The two examples below show how we can overwrite the default values of the learning_rate and n_timesteps hyperparameters or the number of internal neurons of the policy network.


python train.py --algo ppo2 --env CartPole-v1 -params learning_rate:0.001 n_timesteps:30000
python train.py --algo ppo2 --env CartPole-v1 -params policy_kwargs:"dict(net_arch=[32, 32])"

As far as the choice of the algorithm is concerned, according to the benchmarking data included in rl-baselines-zoo, ppo2, sac, td3 and trpo produce the best results on problems with a continuous action space such as the PyBullet locomotion problems (AntBulletEnv-v0, HalfCheetahBulletEnv-v0, HopperBulletEnv-v0, HumanoidBulletEnv-v0, Walker2DBulletEnv-v0), the BipedalWalker-v2, and the BipedalWalkerHardcore-v2. The a2c, acer, acktr, ddpg and dqn algorithms produce the best results on problems with discrete action spaces such as the BeamRiderNoFrameskip-v4, the MsPacmanNoFrameskip-v4, and the QbertNoFrameskip-v4 Atari games. In some cases the algorithms of the first group (e.g. PPO) achieve the best performance also on problems with a discrete action space, such as the EnduroNoFrameskip-v4 and the SeaquestNoFrameskip-v4.

The policy hyperparameter can be used to specify the architecture of the policy. The multi-layer feed-forward policy (‘MlpPolicy’) can be considered the default option. By default this policy includes two layers with 64 neurons each. For problems that benefit from memory, you can use the multi-layer Long Short-Term Memory policy (‘MlpLstmPolicy’). For problems that require processing visual information, such as the Atari problems, you can use the convolutional neural network policy (‘CnnPolicy’) or, alternatively, the convolutional Long Short-Term Memory policy (‘CnnLstmPolicy’). All these policies support actor-critic networks.

The policy_kwargs hyperparameter can be used to customize the characteristics of the policy network. For example, the number of layers and the number of neurons per layer can be specified with the string "dict(net_arch=[N1, N2])", where the length of the vector corresponds to the number of layers and the elements of the vector to the number of neurons per layer. For an exhaustive description of the hyperparameters see: https://stable-baselines.readthedocs.io/en/master/modules/policies.html.

Alternatively, you can declare your own custom policy as a sub-class of an existing policy. An example of a custom policy is the CustomMlpPolicy shown below, which is defined in the file ./rl-baselines-zoo/utils/utils.py. This feed-forward network policy, which includes a single layer composed of 16 neurons, is the policy recommended by default for the AntBulletEnv-v0 problem trained with the PPO2 algorithm.


class CustomMlpPolicy(BasePolicy):
    def __init__(self, *args, **kwargs):
        super(CustomMlpPolicy, self).__init__(*args, **kwargs,
                                              layers=[16],
                                              feature_extraction="mlp")
register_policy('CustomMlpPolicy', CustomMlpPolicy)

Different algorithms have partially different hyperparameters. Here we refer in particular to the hyperparameters of the PPO2.

The number and duration of the evaluation episodes, and consequently the amount of data collected and used to optimize the policy, can be specified with the n_envs=1 and n_steps=128 hyperparameters, where the values indicated after the equal sign represent the default values for the PPO algorithm. The number of mini-batches in which the data is divided is specified with the nminibatches=4 hyperparameter. The number of epochs used to update the parameters of the policy on the basis of the current training data can be set with the noptepochs=4 hyperparameter. The learning rate is specified with the learning_rate=0.00025 hyperparameter. Finally, the total duration of the training process in steps can be set with the n_timesteps hyperparameter.

The n_steps hyperparameter should be sufficiently large to enable the agents to exhibit the desired behavior and to verify that the agent keeps behaving correctly over time. For example, a value of 1000 steps can be appropriate for the robot locomotion problems to evaluate whether the robot manages to start walking and keeps walking without falling down for a while. A much smaller value can be sufficient for problems that admit a quick solution, like the MountainCar-v0 environment. Please consider, however, that the time required to achieve the goal during the first stages of the learning process can be significantly longer than the time required by high-performing agents. The value of the n_envs hyperparameter should match the variation range of the environmental conditions among evaluation episodes. Consequently, it should be greater for problems in which the characteristics of the environment vary significantly among episodes, such as in the case of the BipedalWalkerHardcore-v3. Instead, it can be smaller in the case of the robot locomotion problems, in which the environmental variations are restricted to minor modifications of the initial posture of the robots. The n_envs hyperparameter can also be used to exploit parallel computation, since some algorithms are able to evaluate multiple episodes in parallel. The minimal size of the n_timesteps hyperparameter depends on the complexity of the problem. It can be in the order of 1e5 in the case of simple classic control problems such as the CartPole-v1, in the order of 2e6 in the case of the robot locomotion problems, in the order of 1e7 in the case of the Atari problems, and in the order of 1e8 in the case of particularly difficult problems such as the BipedalWalkerHardcore-v3. In general a suitable value for the noptepochs hyperparameter is 10. For the learning_rate hyperparameter the default value of 0.00025 is usually good. For some problems, however, linearly decreasing the learning rate during the training process leads to better performance.

The gamma=0.99 hyperparameter determines the importance of future rewards in the calculation of the expected reward. Consequently, one might want to set this value as large as possible. On the other hand, the utilization of smaller gamma values can speed up learning. A good practice is to use a value in the range [0.9, 0.99] and to choose a value closer to the minimum or to the maximum depending on whether one wants to prioritize learning speed or learning efficacy. The lam=0.95 (lambda) hyperparameter determines how much the algorithm bootstraps on previously learned value estimates versus new experience. This implies a trade-off between more bias (low lambda) and more variance (high lambda). The default value of this hyperparameter works well in most cases. The sketch shown below illustrates the role of these two hyperparameters.
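
The following sketch of generalized advantage estimation, which is only an illustration and not the Stable Baselines implementation, shows how gamma discounts future rewards and how lambda weights the bootstrapped value estimates against the rewards actually experienced. The reward and value numbers are invented, and for simplicity the sketch ignores episode termination flags.

import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    # generalized advantage estimation over one episode
    advantages = np.zeros(len(rewards))
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step temporal-difference error
        running = delta + gamma * lam * running              # exponentially weighted sum of the future errors
        advantages[t] = running
        next_value = values[t]
    return advantages

rewards = np.array([1.0, 1.0, 1.0, 0.0])  # illustrative rewards collected during an episode
values = np.array([2.5, 2.0, 1.2, 0.4])   # illustrative predictions of the critic network
print(gae_advantages(rewards, values, last_value=0.0))            # default gamma and lambda
print(gae_advantages(rewards, values, last_value=0.0, lam=0.0))   # lambda=0: pure one-step TD errors (more bias, less variance)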

The normalize=False hyperparameter can be used to normalize the observation vectors. Normalization is generally useful, in particular when the distribution of the observation values varies widely during the course of training. It however produces lower performance in the case of the Atari problems and in the case of simple classic control problems such as the CartPole-v1 and the Pendulum-v0. The ent_coef=0.01 (entropy coefficient) hyperparameter encourages the exploration of new strategies. It can be set to 0.1 in the case of the Atari problems, but should be reduced to 0.01 in the case of the BipedalWalker-v3 and BipedalWalkerHardcore-v3 problems and to 0.0 in the case of the robot locomotion problems.

The PPO algorithm includes a mechanism that reduces the amplitude of the variations of the parameters to avoid introducing changes that alter the current policy too much with respect to the previous policy. The reduction is realized by clipping the ratio between the probability of the actions under the new and the old policy to the range determined by the cliprange=0.2 hyperparameter (see the train() method illustrated in Section 13.15). You can also set a separate clipping threshold for the variation of the predictions of the value function network by assigning a value to the cliprange_vf=None hyperparameter. In general the default values of these hyperparameters work well. Smaller values can be beneficial in problems that lead to instabilities with the default values.

Exercise 14. Train robots on two problems of your choice with the PPO2 algorithm. Repeat the experiment by varying some of the parameters and describe the effects on the indicators illustrated in section 3.5.

13.17 Competitive Co-evolution

In this section we will see how to replicate and analyze the competitive co-evolutionary experiment involving predator and prey robots described in Section 12.4 (see also Simione & Nolfi, 2019).

Open the ./evorobotpy2/expredprey/ErPredprey.ini file that contains the default hyperparameters of this experiment. Notice how the file includes an nrobots hyperparameter set to 2. Indeed, in this case the environment contains two robots: a predator robot that is evolved for the ability to catch the prey robot, and a prey robot that is evolved for the ability to avoid being caught by the predator robot. The parameter heterogeneous=1 indicates that the connection weights of the two robots are different. The parameter algo=coevo2 indicates that the robots are evolved with a competitive coevolutionary algorithm which handles two populations. The algorithm is implemented in the file ./evorobotpy2/expredprey/coevo2.py. The evaluation of each robot is carried out in a series of episodes in which the robot is evaluated in interaction with different competitors selected from the opponent population.

Notice also how the popsize and selsize hyperparameters are set to 80 and 10, respectively. This means that the population of each species of robot includes 80 individuals. However, during each evolutionary phase, a subset of 10 individuals is evolved against a subset of 10 competitors for 20 generations. The utilization of only part of the individuals enables the algorithm to identify and discard opportunistic individuals by post-evaluating, at the end of each evolutionary phase, the 10 evolved individuals also against the 70 other opponents that they did not encounter during the evolutionary phase.

To run the experiment you should compile and install the predprey environment (predprey.cpp, predprey.h, utilities.cpp, utilities.h, robot-env.cpp, robot-env.h, ErPredprey.pxd, ErPredprey.pyx, and setupErPredprey.py) from the ./evorobotpy2/lib folder with the following commands:


python3 setupErPredprey.py build_ext --inplace
cp ErPredprey*.so ../bin # or cp ErPredprey*.dll ../bin

You can then evolve the robots from the /evorobotpy2/xpredprey/ folder with the command:


python3 ../bin/es.py -f ErPredprey.ini -s 1

As you will see from the text printed by the program during the evolutionary process, the algorithm starts by evaluating each predator of the initial generation against each prey of the initial generation. Then it performs 1000 generations during which it: (i) selects 10 differentiated competitors, (ii) selects 10 evolving individuals by choosing those that perform better against the selected competitors, (iii) evolves the 10 selected individuals against the 10 selected competitors, one at a time, for 20 sub-generations, (iv) post-evaluates the 10 newly evolved individuals against the remaining 70 competitors, and (v) discards the 10 worst individuals from the 90 individuals composed of the 80 original individuals and the 10 additional evolved individuals. This implies that only the individuals which perform well against all competitors are selected.

Notice that since 20 sub-generations are performed during each phase, 1000 generations correspond to 20,000 sub-generations. The two populations are evolved in an alternating fashion: during even generations 10 selected predators are evolved against 10 selected prey, while during odd generations 10 selected prey are evolved against 10 selected predators. A simplified sketch of a single evolutionary phase is shown below.
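
The following sketch summarizes the logic of a single evolutionary phase described above. It is not the actual implementation included in coevo2.py: the evaluation and the inner evolutionary loop are replaced by placeholder functions, and the scoring used in the final post-evaluation is simplified.

import numpy as np

POPSIZE, SELSIZE, SUBGENS = 80, 10, 20
rng = np.random.default_rng(1)

def evaluate(individual, competitor):
    # placeholder: returns the performance of an individual evaluated against one competitor
    return float(rng.random())

def evolve(individual, competitors):
    # placeholder for the inner loop that evolves the individual against the competitors for SUBGENS sub-generations
    return individual

def evolutionary_phase(population, opponents):
    # (i) select SELSIZE competitors from the opponent population
    comp_idx = rng.choice(len(opponents), SELSIZE, replace=False)
    competitors = [opponents[i] for i in comp_idx]
    # (ii) select the SELSIZE individuals that perform best against the selected competitors
    scores = np.array([np.mean([evaluate(ind, c) for c in competitors]) for ind in population])
    selected = np.argsort(scores)[-SELSIZE:]
    # (iii) evolve the selected individuals against the selected competitors
    evolved = [evolve(population[i], competitors) for i in selected]
    # (iv) post-evaluate the original and the evolved individuals against all the opponents,
    #      including those not encountered during the phase
    extended = list(population) + evolved
    total = np.array([np.mean([evaluate(ind, o) for o in opponents]) for ind in extended])
    # (v) discard the worst individuals so that the population returns to POPSIZE individuals
    keep = np.argsort(total)[-POPSIZE:]
    return [extended[i] for i in keep]

predators = [rng.normal(size=5) for _ in range(POPSIZE)]  # illustrative genotypes
prey = [rng.normal(size=5) for _ in range(POPSIZE)]
predators = evolutionary_phase(predators, prey)           # an even generation evolves the predators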

The ./evorobotpy2/xpredprey folder includes the results of a pre-run evolutionary experiment performed by using 12 as seed. You can observe the behavior of evolved predators evaluated against evolved prey with the command:


../bin/es.py -f ErPredprey.ini -s 12 -g P-1000-5 -p

Notice that the coevolutionary algorithm interprets the -g program parameter in a different manner from the other algorithms. When the string starts with the letter ‘P’, it interprets the following two numbers as the number of the generation and the number of individuals to be post-evaluated, respectively. This means that the command specified above will evaluate the 5 best predators of generation 1000 against the 5 best prey of generation 1000 by performing 5x5=25 evaluation episodes.

Finally, you can post-evaluate evolved robots against competitors of current and previous generations with the following command:


python ../bin/es.py -f ErPredprey.ini -s 12 -g m-1000-100 -p

A string starting with the letter ‘m’ followed by the numbers 1000 and 100 runs a master tournament in which the predators and then the prey of generation 1000 are post-evaluated against competitors of the same generation and of previous generations, every 100 generations.

Exercise 15. Perform a master tournament analysis to verify whether predators of successive generations become stronger and stronger, i.e. achieve higher and higher performance when evaluated against more and more ancient competitors. Observe the behavior of the robots at different generations and describe how it changes during the evolutionary process.

13.18 The net library

The usage of the net library is summarized in the python script /bin/testnet.py shown below. The script shows how to import the compiled library, create a recurrent neural network policy, and update the network for 10 steps by setting the value of the sensory neurons randomly.

To save the computation time required to pass data from Python to C++ and vice versa, Evorobotpy2 allocates the observation and action vectors in Python and passes the pointers of the vectors to the C++ library only once, when the policy is created. The same technique is used to speed up the communication between the training algorithms implemented in Python and the environments implemented in C++, e.g. the ErDiscrim, the ErDpole, and the ErPredprey environments. Overall this makes it possible to exploit the advantages of the Python programming language while maintaining an efficiency comparable to that of a pure C++ application.

The library also permits creating and updating multiple networks with identical topologies. This makes it possible to realize experiments with multiple robots, such as the predator and prey experiment illustrated in Section 13.17. A description of the parameters and of the associated default values can be obtained by executing the es.py script without parameters.


#!/usr/bin/python
import numpy as np
import net
# parameters to be set on the basis of the environment specifications
ninputs = 4            # number of input neurons
noutputs = 2           # number of output neurons
# set configuration parameters
nnetworks = 1          # number of networks
heterogeneous = 0      # whether multiple networks are heterogeneous or not
nlayers = 1            # number of internal layers
nhiddens = 10          # number of hidden units
nhiddens2 = 0          # number of hiddens of the second layer, if any
bias = 1               # whether we have biases
architecture = 2       # full recurrent architecture
afunction = 2          # activation function of internal neurons is tanh
out_type = 3           # activation function of output neurons is linear
winit = 0              # weight initialization is xavier
clip = 0               # whether the activation of output neuron is clipped in the [-5.0, 5.0] range
normalize = 0          # whether the activation of input is normalized
action_noise = 0       # we do not add noise to the state of output neurons
action_noise_range = 0 # the range of noise added to output units
wrange = 0.0           # range of the initial weights (when winit=2)
nbins = 1              # number of output neurons used to encode each output value
low = 0.0              # minimum value for output clipping, when clip=1
high = 0.0             # maximum value for output clipping, when clip=1
seed = 101             # random seed
# allocate the array for inputs, outputs, and for neuron activation
inp = np.arange(ninputs, dtype=np.float32)
out = np.arange(noutputs, dtype=np.float32)
nact = np.arange((ninputs + (nhiddens * nlayers) + noutputs), dtype=np.float64)
# create the network
nn = net.PyEvonet(nnetworks, heterogeneous, ninputs, nhiddens, noutputs, nlayers, nhiddens2, bias, architecture, afunction, out_type, winit, clip, normalize, action_noise, action_noise_range, wrange, nbins, low, high)
# allocate an array for network parameters
nparams = nn.computeParameters()
params = np.arange(nparams, dtype=np.float64)
# allocate the array for input normalization
if (normalize == 1):
    normvector = np.arange(ninputs*2, dtype=np.float64)
else:
    normvector = None
# pass the pointers of allocated vectors to the C++ library
nn.copyGenotype(params)
nn.copyInput(inp)
nn.copyOutput(out)
nn.copyNeuronact(nact)
if (normalize == 1):
    nn.copyNormalization(normvector)
# You can pass to the C++ library a vector of parameters (weights) with the command:
#nn.copyGenotype(params)  # the type of the vector should be np.float64
# set the seed of the library
nn.seed(seed)
#initialize the parameters randomly
nn.initWeights()
# Reset the activation of neurons to 0.0
nn.resetNet()
# For 10 steps set the input randomly, update the network, and print the output
for step in range(10):
    # generate the observation randomly
    inp = np.random.rand(ninputs).astype(np.float32)
    # pass the pointer to the input vector to the C++ library
    nn.copyInput(inp)
    # update the activation of neurons
    nn.updateNet()
    # print the activation of neurons
    print("step %d ,  print input, output, and input-hidden-output vectors" % (step))
    print(inp)
    print(out)
    print(nact)

References

Abadi M., Barham P., Chen J., Chen Z., Davis A., Dean J. et al. (2016). Tensorflow: A system for large-scale machine learning. In 12th Symposium on Operating Systems Design and Implementation (16): 265-283.

Achiam J. (2018). Openai spinning up. GitHub, https://github.com/openai/spinningup. See also https://spinningup.openai.com/

Andrychowicz M., Baker B., Chociej M. et. al. (2018). Learning dexterous in-hand manipulation. arXiv:1808.00177v5

Andrychowicz M., Wolski F., Ray A., Schneider J., Fong R., Welinder P. ... & Zaremba W. (2017). Hindsight experience replay. arXiv preprint arXiv:1707.01495.

Argall B.D., Chernova S., Veloso M. & Browning B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, (57) 5: 469-483.

Arkin R. (1998). Behavior-based Robotics. Cambridge, MA: MIT Press.

Asada M. & Cangelosi A. (in press). Cognitive Robotics Handbook. Cambridge, MA: MIT Press.

Auerbach J.E., Aydin D., Maesani A., Kornatowski P.M., Cieslewski T., Heitz G., Fernando P.R., Loshchilov I., Daler L & Floreano D. (2014). RoboGen: Robot generation through artificial evolution. In H. Sayama, J. Reiffel, S. Risi, R. Doursat, & H. Lipson (Eds.), Proceedings of the Fourteenth International Conference on the Synthesis and Simulation of Living Systems (ALIFE 14). New York: The MIT Press.

Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, Z. D., & Blundell, C. (2020, November). Agent57: Outperforming the atari human benchmark. In International Conference on Machine Learning (pp. 507-517). PMLR.

Baldassarre G., Parisi D. & Nolfi S. (2006). Distributed coordination of simulated robots based on self-organisation. Artificial Life, 12(3):289-311.

Baldassarre G., Trianni V., Bonani M., Mondada F., Dorigo M. & Nolfi S. (2007). Self-organised coordinated motion in groups of physically connected robots. IEEE Transactions on Systems, Man, and Cybernetics, 37(1):224-239.

Baldassarre, G., & Mirolli, M. (Eds.). (2013). Intrinsically motivated learning in natural and artificial systems. Berlin: Springer Verlag.

Bansal T., Pachocki J., Sidor S., Sutskever I. & Mordatch, I. (2017). Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748.

Beer R.D. (1995). On the dynamics of small continuous-time recurrent neural networks. Adaptive Behavior 3(4): 469-509.

Bellemare M.G., Naddaf Y., Veness J & Bowling M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.

Billard A. & Grollman D. (2013). Robot learning by demonstration. Scholarpedia, 8 (12): 3824.

Billard A.G., Calinon S. & Dillmann R. (2016). Learning from humans. In B. Siciliano & O. Khatib (Eds.), Handbook of Robotics, II Edition. Berlin: Springer Verlag.

Bonabeau E., Dorigo M. & Theraulaz G. (1999). Swarm Intelligence: From Natural to Artificial Systems. Oxford, U.K.: Oxford University Press.

Bonani M., Longchamp V., Magnenat S., Retornaz P., Burnier D., Roulet G., Vaussard F. & Mondada F. (2010). The marXbot, a miniature mobile robot opening new perspectives for the collective-robotic research. IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, 2010: 4187-4193, doi: 10.1109/IROS.2010.5649153.

Bourgine P. & Stewart J. (2006). Autopoiesis and cognition. Artificial Life (10) 3: 327-345.

Braitenberg V. (1986). Vehicles: Experiments in Synthetic Psychology. Cambridge, MA: MIT press.

Brockhoff D., Auger A., Hansen N., Arnold D.V. & Hohm T. (2010). Mirrored sampling and sequential selection for evolution strategies. In International Conference on Parallel Problem Solving from Nature. Berlin, Germany: Springer Verlag.

Brockman G., Cheung V., Pettersson L, Schneider J., Schulman J., Tang J. and Zaremba W. (2016). OpenAI Gym. arXiv:1606.01540

Brooks R.A. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation (2) 1: 14-23.

Brooks R.A. (1991). New approaches to robotics. Science, 253:1227-1232.

Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2018). Exploration by random network distillation. arXiv preprint arXiv:1810.12894.

Camazine S., Deneubourg J.-L., Franks N.R., Sneyd J., Theraulaz G. & Bonabeau E. (2001). Self-Organization in Biological Systems. Princeton University Press.

Cangelosi A. & Parisi D. (2012). Simulating the evolution of language. Springer Science & Business Media.

Cangelosi A. & Schlesinger M. (2015). Developmental Robotics: From Babies to Robots. Cambridge, MA: MIT Press.

Cangelosi, A., Metta, G., Sagerer, G., Nolfi, S., Nehaniv, C., Fischer, K., ... & Zeschel, A. (2010). Integration of action and language knowledge: A roadmap for developmental robotics. IEEE Transactions on Autonomous Mental Development, 2(3), 167-195.

Carvalho J.T. & Nolfi S. (2016). Behavioural plasticity in evolving robots. Theory in Biosciences: 135(4): 201–216

Carvalho J.T. & Nolfi S. (2016). Cognitive offloading does not prevent but rather promotes cognitive development. PLoS ONE. 11(8): e0160679.

Chemero A. (2011). Radical Embodied Cognitive Science. Cambridge, MA: MIT Press.

Cheney N., Clune J. & Lipson H. (2014). Evolved electrophysiological soft robots. In H. Sayama, J. Reiffel, S. Risi, R. Doursat, & H. Lipson (Eds.), Proceedings of the Fourteenth International Conference on the Synthesis and Simulation of Living Systems (ALIFE 14). New York: The MIT Press.

Collins S., Ruina A., Tedrake R & Wisse M. (2005). Efficient bipedal robots based on passive-dynamic walkers. Science 307 (5712): 1082–1085.

Collobert R., Bengio S., & Mariéthoz J. (2002). Torch: a modular machine learning software library (No. REP_WORK). Idiap.

Coumans E. & Bai Y. (2016). Pybullet, a python module for physics simulation for games, robotics and machine learning. https://pybullet.org, 2016–2019.

Cully A., Clune J., Tarapore D. & Mouret J-B. (2015). Robots that can adapt like animals. Nature 521: 503-507.

De Greef J. & Nolfi S. (2010). Evolution of implicit and explicit communication in a group of mobile robots. In S. Nolfi & M. Mirolli (Eds.), Evolution of Communication and Language in Embodied Agents. Berlin: Springer Verlag.

De Jong E.D. (2005). The maxsolve algorithm for coevolution. In Genetic and Evolutionary Computation (GECCO 2005), Lecture Notes in Computer Science, Vol 3120. Chicago, USA: Springer-Verlag. In H.G. Beyer et al. (Eds).  GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation. New York: ACM Press.

Dhariwal P., Hesse C., Plappert M., Radford A., Schulman J., Sidor S., and Wu Y. (2017). Openai baselines. https://github.com/openai/baselines.

Dillman R. (2004). Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems, (47) 2-3: 109-116.

Dorigo M., Birattari M. & Brambilla M. (2014). Swarm robotics. Scholarpedia 9 (1): 1463.

Duarte M., Costa V., Gomes J., Rodrigues T., Silva F., Oliveira S.M. et al. (2016). Evolution of collective behaviors for a real swarm of aquatic surface robots. PLoS ONE 11(3): e0151834. https://doi.org/10.1371/journal.pone.0151834

Ferrante E., Turgut A.E., Duenez-Guzman E., Dorigo M. & Wenseleers T. (2015). Evolution of self-organized task specialization in robot swarms. PLoS Computational Biology, 11(8): e1004273.

Floreano D. & Nolfi S. (1997). Adaptive behavior in competing co-evolving species. In P. Husband & I. Harvey (Eds), Proceedings of the Fourth Conference on Artificial Life, MIT Press, Cambridge, MA, 378-387

Floreano D. & Urzelai J. (2000). Evolutionary robots with online self-organization and behavioral fitness. Neural Networks, 13: 431-443.

Floreano D., Dürr P. & Mattiussi C. (2008). Neuroevolution: from architectures to learning. Evolutionary intelligence, 1(1): 47-62.

Floreano D., Mitri S., Magnenat A. & Keller L. (2007) Evolutionary conditions for the emergence of communication in robots. Current Biology 17:514-519.

Floreano D., Nolfi S. & Mondada. F. (1998). Competitive Co-Evolutionary Robotics: From Theory to Practice. In R. Pfeifer, B. Blumberg, J-A. Meyer, S.W. Wilson (Eds.), From Animals to Animats V, Cambridge, MA: MIT Press, pp 512-524.

Freeman D., Ha D. & Metz L. (2019). Learning to predict without looking ahead: World models without forward prediction. In Advances in Neural Information Processing Systems (pp. 5379-5390).

Fukushima K. (1979). Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. Trans. IECE, J62-A(10): 658–665.

Gauci J. & Stanley K.O. (2010). Autonomous evolution of topographic regularities in artificial neural networks. Neural Computation, 22: 1860–1898.

Gers F.A. & Schmidhuber, J. (2001). LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340.

Gibson J. (1979). The Ecological Approach to Visual Perception. Boston: Houghton-Mifflin.

Gilbert S.J. (2015). Strategic offloading of delayed intentions into the external environment. The Quarterly Journal of Experimental Psychology, 68 (5): 971-992.

Glorot X. & Bengio Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249-256).

Gomez F. & Miikkulainen R. (1997). Incremental evolution of complex general behavior. Adaptive Behavior, 5(3-4): 317-342.

Grey Walter W. (1953). The Living Brain. G. Duckworth London, W.W. Norton, New York.

Ha D. & Schmidhuber J. (2018). World models. arXiv:1803.10122.

Hansen N. & Ostermeier A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, (9) 2: 159–195.

Harnad S. (1990). The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3): 335-346.

Harvey I., Husbands P. & Cliff D. (1994). Seeing the light: Artificial evolution, real vision. From Animals to Animats, 3, 392-401.

Helbing D. (1991) A mathematical model for the behavior of pedestrians. Behavioral Science, 36: 298-310.

Helbing D. et al. (1995) Social force model for pedestrian dynamics. Phys. Rev. E 51, 4282

Helbing D., Buzna L., Johansson A. & Werner T. (2005). Self-organized pedestrian crowd dynamics: Experiments, simulations, and design solutions. Transportation Science, 39 (1):1-24.

Hill A., Raffin A., Ernestus M., Gleave A., Kanervisto A., Traore R., Dhariwal P., Hesse C.,  Klimov O., Nichol A., Plappert M., Radford A.,  Schulman J., Sidor S. and Wu Y. (2018). Stable baselines. https://github.com/hill-a/stable-baselines.

Hochreiter S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut fuer Informatik, Lehrstuhl Prof. Brauer, Tech. Univ. Munich.

Hochreiter S. and Schmidhuber J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735–1780. Based on TR FKI-207-95, TUM (1995).

Holland J.H. (1975). Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.

Holland J.H. (1992). Adaptation in Natural and Artificial Systems, 2nd edition. MIT Press, Cambridge, MA, 1992.

Hu W., Turk G & Liu C.K. (2018). Learning symmetric and low-energy locomotion. ACM Transactions on Graphics (37): 4, 144:156.

Husband P., Smith T., Jakobi N & O’Shea M. (1998). Better living through chemistry: Evolving GasNets for robot control. Connection Science 10(4): 185-210.

Iizuka H. & Ikegami T. (2004). Adaptability and diversity in simulated turn-taking behavior. Artificial Life, 10(4): 361-378.

Ishida T. (2004). Development of a small biped entertainment robot QRIO. In Micro-Nanomechatronics and Human Science, 2004 and The Fourth Symposium Micro-Nanomechatronics for Information-Based Society, (pp. 23-28). IEEE Press.

Jakobi N. (1997). Evolutionary robotics and the radical envelope-of-noise hypothesis. Adaptive behavior, 6 (2): 325-368.

Joachimczak M., Suzuki R. & Arita T. (2016). Artificial metamorphosis: evolutionary design of transforming, soft-bodied robots. Artificial life, 22 (3): 271-298.

Kamimura A., Kurokawa H., Yoshida E., Murata S., Tomita K. & Kokaji S. (2005). Automatic locomotion design and experiments for a modular robotic system. IEEE/ASME Transactions on mechatronics, 10(3), 314-325.

Kempka M., Wydmuch M., Runc G., Toczek, J. & Jaśkowski W. (2016). Vizdoom: A doom-based ai research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG) (pp. 1-8). IEEE Press.

Kingma D.P. & Ba J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Klimov O. (2016). Carracing-v0. URL https://gym. openai.com/envs/CarRacing-v0/.

Kober J. & Peters J. (2011). Policy search for motor primitives in robotics. Machine Learning, 84 (1-2): 171-203.

Kormushev P., Calinon S. & and Caldwell D. (2010). Robot motor skill coordination with EM-based reinforcement learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Krichmar J.L., Seth A.K., Nitz D.A., Fleischer J.G. & Edelman (2005). Spatial navigation and causal analysis in a brain-based device modeling cortical–hippocampal interactions. Neuroinformatics, 5: 197-222.

Kriegman S., Blackiston D., Levin M. & Bongard J. (2020). A scalable pipeline for designing reconfigurable organisms. PNAS: 1910837117.

Lange S., Riedmiller M. & Voigtländer A. (2012). Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 international joint conference on neural networks (IJCNN) (pp. 1-8). IEEE.

Lehman, J., & Stanley, K. O. (2011). Abandoning objectives: Evolution through the search for novelty alone. Evolutionary computation, 19(2), 189-223.

Levine S., Pastor P., Krizhevsky A., Ibarz J. & Quillen D. (2017). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37 (4-5): 421-436.

Lillicrap T.P., Hunt J.J., Pritzel A. et al. (2015). Continuous control with deep reinforcement learning. arXiv:1509.02971.

Lipson H. & Pollack J.B. (2000) Automatic design and manufacture of robotic lifeforms. Nature 406 (6799): 974-978

Lorenz E.N. (1963). Deterministic nonperiodic flow. Journal of Atmospheric Sciences, (20): 130-141.

Luck K.S., Amor H.B. & Calandra R. (2019). Data-efficient co-adaptation of morphology and behaviour with deep reinforcement learning. arXiv:1911.06832v1

Massera G., Ferrauto T., Gigliotta O., and Nolfi S. (2013). FARSA: An open software tool for embodied cognitive science. In P. Lio', O. Miglino, G. Nicosia, S. Nolfi and M. Pavone (Eds.), Proceeding of the 12th European Conference on Artificial Life. Cambridge, MA: MIT Press.

Massera G., Ferrauto T., Gigliotta O., Nolfi S. (2014). Designing adaptive humanoid robots through the FARSA open-source framework. Adaptive Behavior, 22 (3):255-265

Massera G., Tuci E., Ferrauto T. & Nolfi S. (2010). The facilitatory role of linguistic instructions on developing manipulation skills, IEEE Computational Intelligence Magazine, (5) 3: 33-42.

Mattner J., Lange S. & Riedmiller M. (2012). Learn to swing up and balance a real pole based on raw visual input data. In: Proceedings of the 19th International Conference on Neural Information Processing (5) (ICONIP 2012), pp. 126–133. Dohar, Qatar.

May R.M. (1976). Simple mathematical models with very complicated dynamics. Nature, 261 (5560): 459-467.

McCulloch W. & Pitts W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133.

McGeer T. (1990). Passive dynamic walking. International Journal of Robotics Research 9 (2): 62–82.

Metta G., Sandini G., Vernon D., Natale L. & Nori F. (2008). The iCub humanoid robot: an open platform for research in embodied cognition. In Proceedings of the 8th workshop on performance metrics for intelligent systems, pp. 50-56.

Miconi T. (2008). Evolution and complexity: The double-edged sword. Artificial Life (14) 3: 325-334.

Miconi T. (2016). Learning to learn with backpropagation of Hebbian plasticity. arXiv preprint arXiv:1609.02228.

Miglino O., Lund H.H. & Nolfi S. (1995). Evolving mobile robots in simulated and real environments. Artificial Life, 2 (4): 417-434.

Milano N. & Nolfi S. (2020). Autonomous Learning of Features for Control: Experiments with Embodied and Situated Agents. arXiv preprint arXiv:2009.07132.

Milano N. & Nolfi S. (2021). Automated curriculum learning for embodied agents: A neuroevolutionary approach. arXiv preprint arXiv:2102.08849.

Mnih V., Badia A.P., Mirza M. et al. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA.

Mirolli M. & Parisi D. (2008). How producer biases can favor the evolution of communication: An analysis of evolutionary dynamics. Adaptive Behavior, 16: 27-52.

Mitri S., Floreano D. & Keller L. (2009). The evolution of information suppression in communicating robots with conflicting interests. Proceedings of the National Academy of Sciences, 106: 15786–15790.

Mnih V., Kavukcuoglu K., Silver D., Rusu A. A., Veness J., Bellemare M. G., Graves A., Riedmiller M., Fidjeland A. K., Ostrovski G. et al.  (2015). Human-level control through deep reinforcement learning. Nature, 518: 529–533.

Mondada F., Bonani M., Raemy X., Pugh J., Cianci C., Klaptocz A., ... & Martinoli A. (2009). The e-puck, a robot designed for education in engineering. In Proceedings of the 9th Conference on Autonomous Robot Systems and Competitions (Vol. 1, pp. 59-65). IPCB: Instituto Politécnico de Castelo Branco.

Mondada F., Franzi E. & Ienne P. (1993). Mobile robot miniaturization: A tool for investigation in control algorithms. In Proceedings of the Third International Symposium on Experimental Robotics, eds. T. Yoshikawa & F. Miyazaki. Kyoto, Japan.

Mouret J-B. & Doncieux S. (2009). Overcoming the bootstrap problem in evolutionary robotics using behavioral diversity. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC-2009). Washington, DC: IEEE Press.

Nilsson N.J. (1984). Shakey the Robot. Technical Note No. 323. SRI International, Menlo Park, CA, USA. This is a collection of papers and technical notes, some previously unpublished, from the late 1960s and early 1970s.

Nishimoto R. & Tani J. (2009). Development of hierarchical structures for actions and motor imagery: a constructivist view from synthetic neuro-robotics study. Psychological Research, 73: 545-558.

Nolfi S. (1996). Adaptation as a more powerful tool than decomposition and integration. In T. Fogarty & G. Venturini (Eds.), Proceedings of the Workshop on Evolutionary Computing and Machine Learning, 13th International Conference on Machine Learning, University of Bari, Italy.

Nolfi S. (2000). Evorobot 1.1 user manual. Technical Report, Roma, Italy: Institute of Psychology, National Research Council. 

Nolfi S. (2005). Categories formation in self-organizing embodied agents. In H. Cohen & C. Lefebvre (Eds), Handbook of Categorization in Cognitive Science. Oxford, UK: Elsevier.

Nolfi S. (2009). Behavior and cognition as a complex adaptive system: Insights from robotic experiments. In C. Hooker (Ed.), Handbook of the Philosophy of Science. Volume 10: Philosophy of Complex Systems. General editors: Dov M. Gabbay, Paul Thagard & John Woods. Elsevier.

Nolfi S. & Floreano D. (1998). Co-evolving predator and prey robots: Do 'arm races' arise in artificial evolution? Artificial Life, 4(4): 311-335.

Nolfi S. & Floreano D. (1999). Learning and evolution. Autonomous robots, 7(1): 89-113.

Nolfi S. & Floreano D. (2000). Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. Cambridge, MA: MIT Press/Bradford Books.

Nolfi S. & Mirolli M. (2010). Evolution of Communication and Language in Embodied Agents. Berlin: Springer Verlag.

Nolfi S. & Gigliotta O. (2010). Evorobot*: A tool for running experiments on the evolution of communication. In S. Nolfi & M. Mirolli (Eds.), Evolution of Communication and Language in Embodied Agents. Berlin: Springer Verlag.

Nolfi S., Bongard J., Husbands P. & Floreano D. (2016). Evolutionary robotics. In B. Siciliano & O. Khatib (Eds.), Handbook of Robotics, II Edition. Berlin: Springer Verlag.

Omer A.M.M., Ghorbani R., Lim H. & Takanishi A. (2009). Semi-passive dynamic walking for biped walking robot using controllable joint stiffness based on dynamic simulation. In Proceedings of the IEEE/ASME International Conference on Advanced Intelligent Mechatronics. Singapore: IEEE Press.

Gigliotta O., Pezzulo G. & Nolfi S. (2011). Evolution of a predictive internal model in an embodied and situated agent. Theory in Biosciences, 130 (4): 259-276.

Oudeyer P-Y., Kaplan F. & Hafner V.F. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11:265–286.

Pagliuca P. & Nolfi S. (2019). Robust optimization through neuroevolution. PLoS ONE 14 (3): e0213193.

Pagliuca P. & Nolfi S. (2020). The dynamic of body and brain co-evolution. arXiv preprint arXiv:2011.11440.

Pagliuca P., Milano N. & Nolfi S. (2018). Maximizing adaptive power in neuroevolution. PLoS One, 13(7): e0198788.

Pagliuca P., Milano N. & Nolfi S. (2019). Efficacy of modern neuro-evolutionary strategies for continuous control optimization. arXiv:1912.05239.

Petrosino G., Parisi D. & Nolfi S. (2013). Selective attention enables action selection: evidence from evolutionary robotics experiments. Adaptive Behavior, 21 (5): 356-370.

Pfeifer R. & Bongard J. (2016). How The Body Shapes the Way We Think: A New View of Intelligence. Cambridge, MA: MIT Press.

Pfeifer R. & Scheier C. (1999). Understanding Intelligence. Cambridge, MA: MIT Press.

Pfeifer R., Iida F. & Gómez G. (2006). Morphological computation for adaptive behavior and cognition. International Congress Series, 1291: 22-29. Berlin, Germany: Springer Verlag.

Philippides A.O., Husbands P., Smith T. & O’Shea M. (2005). Flexible coupling: Diffusing neuromodulators and adaptive robotics. Artificial Life, 11 (1-2): 139-160.

Pugh J.K., Soros L.B. & Stanley K.O. (2016). Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3, 40.

Quigley M., Conley K., Gerkey B., Faust J., Foote T., Leibs J., ... & Ng A.Y. (2009). ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software (Vol. 3, No. 3.2, p. 5).

Raffin A. (2018). RL Baselines Zoo. Github repository. https://github.com/araffin/rl-baselines-zoo

Rawat W. & Wang Z. (2017). Deep convolutional neural networks for image classification: A comprehensive review. Neural computation, 29(9), 2352-2449.

Real E., Moore S., Selle A., Saxena S., Suematsu Y.L., Tan J., Le Q.V. & Kurakin A. (2017). Large-scale evolution of image classifiers. In D. Precup & Y.W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning, pp. 2902–2911.

Rechenberg I. & Eigen M. (1973). Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Stuttgart: Frommann-Holzboog.

Reid C.R., Lutz M.J., Powell S., Kao A.B., Couzin I. D. & Garnier S. (2015). Army ants dynamically adjust living bridges in response to a cost–benefit trade-off. Proceedings of the National Academy of Sciences, 112(49): 15113-15118.

Reynolds C.W. (1987). Flocks, herds and schools: A distributed behavioral model. In M.C. Stone (Ed.), Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques. New York: Association for Computing Machinery.

Rosin C.D. & Belew R.K. (1997). New methods for competitive coevolution. Evolutionary Computation, (5) 1:1-29.

Sadeghi F. & Levine S. (2017). CAD2RL: Real single-image flight without a single real image. arXiv:1611.04201v4

Salimans T., Ho J., Chen X., Sidor S. & Sutskever I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864v2.

Sandini G., Metta G. & Vernon D. (2004). RobotCub: An open framework for research in embodied cognition. International Journal of Humanoid Robotics, (8) 2: 18–31.

Schaff C., Yunis D., Chakrabarti A. & Walter M.R. (2019). Jointly learning to construct and control agents using deep reinforcement learning. In Proceedings of the International Conference on Robotics and Automation (ICRA), Palais des Congrès de Montréal, Montreal, Canada.

Schmidhuber J.  (1991).  Curious model-building control systems. In Proceedings of the International Joint Conference on Neural Networks. Singapore, vol. 2, pp. 1458–1463.

Schmidhuber J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117

Schmidhuber J. (2010). Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2 (3): 230-247.

Schulman J., Levine S., Abbeel P., Jordan M.I. & Moritz P. (2015). Trust region policy optimization. In ICML, pp. 1889–1897.

Schulman J., Wolski F., Dhariwal P., Radford A. & Klimov O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Schwefel H.P. (1981). Numerical Optimization of Computer Models. New York: Wiley.

Schwefel H.P. (1995). Evolution and Optimum Seeking. New York: Wiley.

Sehnke F., Osendorfer C., Rückstieß T., Graves A., Peters J. & Schmidhuber J. (2010). Parameter-exploring policy gradients. Neural Networks, 23 (4): 551–559.

ShadowRobot (2005). ShadowRobot Dexterous Hand. https://www.shadowrobot.com/products/dexterous-hand/.

Simione L. & Nolfi S. (2019). Long-term progress and behavior complexification in competitive co-evolution. arXiv:1909.08303.

Sims K. (1994). Evolving 3D morphology and behavior by competition. Artificial Life 1 (4): 353-372.

Skoglund A., Iliev B., Kadmiry B. & Palm R. (2007). Programming by demonstration of pick-and-place tasks for industrial manipulators using task primitives. International Symposium on Computational Intelligence in Robotics and Automation (CIRA 2007).

Skolicki Z. & De Jong K. (2004). Improving evolutionary algorithms with multi-representation island models. In Parallel Problem Solving from Nature – PPSN VIII, 8th International Conference. Springer-Verlag.

Sperati V., Trianni V. & Nolfi S. (2011). Self-organised path formation in a swarm of robots. Swarm Intelligence, 5:97-119.

Spivey M. (2007). The Continuity of Mind. New York: Oxford University Press.

Spröwitz A., Tuleu A., Vespignani M., Ajallooeian M., Badri E. & Ijspeert A. J. (2013). Towards dynamic trot gait locomotion: Design, control, and experiments with Cheetah-cub, a compliant quadruped robot. The International Journal of Robotics Research 32(8): 932–950.

Stanley K.O. & Miikkulainen R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, vol. 10 (2): 99–127.

Stanley K.O., D’Ambrosio D.B. & Gauci J. (2009). A hypercube-based indirect encoding for evolving large-scale neural networks. Artificial Life, vol. 15 (2): 185–212.

Strogatz S. (2001). Nonlinear Dynamics and Chaos. Westview Press.

Sugita Y. & Tani J. (2005). Learning semantic combinatoriality from the interaction between linguistic and behavioral processes. Adaptive Behavior, (13) 1: 33–52.

Sutton R.S. & Barto A.G. (2018). Reinforcement Learning: An Introduction. 2nd Edition. Cambridge, MA: MIT Press.

Taleb N.N. (2012). Antifragile: Things that Gain from Disorder. New York: Random House.

Tani J. (2016). Exploring Robotic Minds: Actions, Symbols, and Consciousness as Self-Organizing Dynamic Phenomena. New York: Oxford University Press.

Tedrake R. (2019). Underactuated Robotics: Algorithms for Walking, Running, Swimming, Flying, and Manipulation (Course Notes for MIT 6.832). Downloaded on November 2019 from underactuated.mit.edu

Thelen E. & Smith L.B. (1994).  A Dynamic Systems Approach to the Development of Cognition and Action. Cambridge, MA: MIT Press.

Todorov E., Erez T. & Tassa Y. (2012). MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033. IEEE.

Tuci E., Ferrauto T., Zeschel A., Massera G. & Nolfi S. (2011). An experiment on behaviour generalisation and the emergence of linguistic compositionality in evolving robots. IEEE Transactions on Autonomous Mental Development, (3) 2: 176-189.

Tuci E., Massera G. & Nolfi S. (2010). Active categorical perception of object shapes in a simulated anthropomorphic robotic arm. IEEE Transactions on Evolutionary Computation, (14) 6: 885-899.

Ude A., Atkeson C.G. & Riley M. (2004). Programming full-body movements for humanoid robots by observation. Robotics and Autonomous Systems, 47 (2-3): 93–108.

Varela F. J., Thompson E. T. & Rosch E. (1991). The Embodied Mind: Cognitive Science and Human Experience. Cambridge, MA: MIT Press.

von Hofsten C. & Ronnqvist L. (1988). Preparation for grasping an object: A developmental study. Journal of Experimental Psychology: Human Perception and Performance, 14 (4): 610-621.

Weng J., Ahuja N. & Huang T.S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. In Proceedings of the 4th International Conference on Computer Vision, Berlin, Germany, pp. 121-128.

Werbos P.J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550-1560.

West-Eberhard M.J. (2003). Developmental Plasticity and Evolution. New York: Oxford University Press.

Whitley D., Rana S., & Heckendorn R.B. (1997). Island model genetic algorithms and linearly separable problems. In Selected Papers from AISB Workshop on Evolutionary Computing, volume 1305 of Lecture Notes In Computer Science, pages 109-125. Springer-Verlag.

Wierstra D., Schaul T., Glasmachers T., Sun Y., Peters J. & Schmidhuber J. (2014). Natural evolution strategies. The Journal of Machine Learning Research, 15(1), 949-980.

Yamashita Y. & Tani J. (2008). Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment. PLoS Computational Biology, 4 (11): e1000220.