Behavioral and Cognitive Robotics
An adaptive perspective

Stefano Nolfi

© Stefano Nolfi, 2021


7. Robustness, Plasticity and Antifragility

7.1 Introduction

The physical world is highly uncertain. Almost no characteristic, dimension, or property remains constant. Variations affect both the external environment and the body of the robot. For example, the light conditions, the temperature, and the distribution of dust vary over time. The body parts of the robot break down or wear out, the quality of the materials changes, batteries run out, etc. Moreover, the relative position/orientation of the robot in the environment varies during different operation periods, the quantities measured by the sensors are noisy, the effect of the actuators is uncertain, and the measures used to calculate the fitness or the reward are noisy.

To operate effectively, robots should be robust to these forms of variation, i.e. they should be able to perform effectively in varying conditions. Possibly, robots should be more than robust: they can be antifragile (Taleb, 2012), i.e. capable of gaining from the exposure to variations. Finally, robots should adapt on the fly to variations that cannot be handled properly by means of a single robust strategy. We will use the term plasticity to indicate the ability to adapt on the fly to variations of the environmental conditions.

7.2 Robustness

The problem of dealing with environmental variations can be solved by developing solutions that are robust, i.e. that perform effectively in varied environmental conditions. The development of robust solutions can be promoted simply by exposing the robots to variable environmental conditions during the adaptive process. When the training is performed in simulation, this can be realized by: (i) varying the position, orientation and/or posture of the robot at the beginning of each evaluation episode, (ii) varying the characteristics of the external environment at the beginning of each evaluation episode, e.g. the position and the texture of the objects present in the environment, and (iii) perturbing with random values the state of the sensors and actuators of the robot, and possibly the state of the body parts forming the robot, during each step. Usually, varying only some properties of the environment is sufficient. On the other hand, a complete lack of variation usually leads to the development of brittle solutions that are overfitted to the conditions experienced and that do not generalize to different environmental conditions.
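The randomizations listed above can be sketched as a thin wrapper around a simulated environment. The sketch below is illustrative only: it assumes a hypothetical environment object exposing `reset(pose)` and `step(action)` methods, and the parameter names and noise model are not part of any specific simulator.

```python
import random

class RandomizedEnv:
    """Wraps a simulated environment to promote robust solutions:
    the initial pose of the robot is varied at the beginning of each
    episode and the observations are perturbed at every step."""

    def __init__(self, env, pose_range=0.1, sensor_noise=0.05, seed=None):
        self.env = env
        self.pose_range = pose_range      # max offset of the initial pose
        self.sensor_noise = sensor_noise  # std of per-step observation noise
        self.rng = random.Random(seed)

    def reset(self):
        # (i) vary the initial position/orientation at the episode start
        pose = [self.rng.uniform(-self.pose_range, self.pose_range)
                for _ in range(3)]        # x, y, heading offsets
        obs = self.env.reset(pose)
        return self._perturb(obs)

    def step(self, action):
        # (iii) perturb the observations with random values at every step
        obs, reward, done = self.env.step(action)
        return self._perturb(obs), reward, done

    def _perturb(self, obs):
        return [o + self.rng.gauss(0.0, self.sensor_noise) for o in obs]
```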

In the case of evolutionary methods this implies that the environmental conditions shall vary among evaluation episodes (in the case of robots evaluated for multiple episodes), among the robots forming the population, and across generations. In the case of robots trained with reinforcement learning, the environmental conditions should vary among evaluation episodes and across learning epochs. 

When we think of environmental variations we usually refer to dynamic environments, i.e. to environments embedding features that vary over time. However, static environments varying over space, i.e. including areas with diversified characteristics, are functionally equivalent to dynamic environments from the point of view of a situated robot. From the perspective of a situated robot, what matters most is the local portion of the environment, and the local portion of the environment changes as a result of the movements of the robot. Consequently, variations of the environment over space can be equivalent to variations of the environment over time. Overall this implies that the variability of the environmental conditions can be increased either by training the robot in a dynamic environment or by training the robot in a spatially diversified environment.

The adaptive process also exposes the robots to other forms of variation that are necessary to discover how to improve the robots’ skills: variations of the parameters introduced during the reproduction process, in the case of evolutionary methods, and variations of the actions introduced at every step, in the case of reinforcement learning methods. This naturally promotes the synthesis of solutions that are robust with respect to these forms of variation. Indeed, adapted robots are generally more robust than hand-designed robots for this reason. The development of solutions that are robust to these forms of variation generally also provides a certain level of robustness with respect to environmental variations.

The need to evaluate the adaptive robots in variable environmental conditions complicates the identification of the best solution. The best solution obtained during an adaptive process does not necessarily correspond to the solution obtained at the end of the evolutionary or training process. When the environmental conditions do not vary, the best solution can be identified simply by choosing the solution that obtained the highest fitness or the highest cumulative reward during the adaptive process. Instead, when the environmental conditions vary, this is not sufficient, since the performance obtained depends also on the specific environmental conditions experienced. The solution that achieved the highest performance can correspond to a solution that was lucky, i.e. that encountered easy environmental conditions, or that is brittle, i.e. that achieved a high performance in the specific conditions experienced but performs much more poorly in different conditions. The truly best solution should rather be identified by post-evaluating the most promising candidate solutions for multiple episodes in new environmental conditions and by choosing the solution that achieved the highest performance during such post-evaluation (Pagliuca & Nolfi, 2019).
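The post-evaluation procedure can be sketched as follows. Here `evaluate` is a hypothetical user-supplied function returning the fitness obtained by a solution in a single episode whose environmental conditions are determined by the seed passed to it; the episode count is illustrative.

```python
import random

def post_evaluate(candidates, evaluate, n_episodes=30, seed=0):
    """Select the truly best solution among the most promising candidates
    by post-evaluating each one for multiple episodes in new, randomly
    varied environmental conditions, filtering out 'lucky' evaluations."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for solution in candidates:
        # average over many episodes in fresh environmental conditions
        score = sum(evaluate(solution, rng.random())
                    for _ in range(n_episodes)) / n_episodes
        if score > best_score:
            best, best_score = solution, score
    return best, best_score
```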

7.3 Plasticity

The second way to deal with environmental variations is plasticity, i.e. the ability of a robot to adapt to changing conditions on the fly. As mentioned above, in situations where robust solutions are not available, plasticity represents the only possible way to handle variations. The adaptation of the robot’s behavior to the current environmental conditions can be realized through behavioral or neural plasticity.

Behavioral plasticity consists in the ability to display multiple behaviors and to select the behavior that best suits the current environmental circumstances. As we have seen in Chapter 5, behavioral plasticity does not necessarily require specialized machinery. It can arise simply as a consequence of a bifurcation of the robot/environment dynamics triggered by a variation of critical parameters correlated to observation states. For example, it may arise as a bifurcation triggered by the state of the time sensor that leads to the production of an explorative behavior, suited for cleaning open areas, or a wall-following behavior, suited for cleaning peripheral areas (see Section 5.7). As an alternative example, consider a predator robot that aims to catch prey. Some of these prey run away as soon as they perceive the predator. Others hide, stay still, and run away only if the predator approaches them. This problem does not admit a single strategy. To capture both types of prey the predator should display two different behaviors and select the behavior that suits the current prey: it should approach the hiding prey and anticipate the trajectory of the running-away prey without necessarily approaching it. The predator can extract from the observations an internal state encoding the category of the current prey and use this state to select the appropriate behavior.

Behavioral plasticity permits adapting to the current environmental conditions immediately. However, it only permits adapting to environmental conditions that already occurred repeatedly in previous phases of the adaptive process. Indeed, it is the exposure to those environmental conditions that creates the adaptive pressure for the development of the appropriate associated behaviors. Variations leading to new conditions never experienced before can only be handled through neural plasticity (as discussed below) or through behavior generalization (as discussed in the next chapter).

Neural plasticity can be realized simply by continuing the adaptive process. However, standard adaptive methods can be too slow to cope with fast environmental variations. Consequently, one might consider specialized adaptive methods, which can be faster.

A possible way to speed up adaptation to environmental variations consists in enabling the robots to vary their parameters on the basis of genetically encoded unsupervised learning rules (Floreano & Urzelai, 2000). This can be performed by encoding the properties of the robot’s brain in a vector of tuples determining for each connection: (i) whether it is fixed or plastic, (ii) the weight value in the case of fixed connections, and (iii) the Hebbian rule type and the learning rate in the case of plastic connections (Floreano & Urzelai, 2000; see also Figure 7.1). Hebbian learning is a form of unsupervised learning that varies the strength of a connection weight on the basis of the activation of the presynaptic and postsynaptic neurons. The utilization of this method in variable environmental conditions supports the evolution of robots capable of adapting on the fly to the current conditions. For example, in a co-evolutionary experiment involving the evolution of predator and prey robots adapted for the ability to catch prey and to avoid being caught by predators, respectively, it allowed predator robots to adapt on the fly to the characteristics of their current prey (Floreano & Nolfi, 1997). For a related approach that combines Hebbian and back-propagation learning see Miconi (2016).

Figure 7.1. The first two bits of each tuple specify whether the connection is fixed or plastic and the sign of the connection. In the case of fixed connections, the next four bits encode the weight. In the case of plastic connections, instead, the next four bits specify the learning rule and the learning rate. The initial weight of plastic connections is set randomly at the beginning of evaluation episodes. There are four possible types of Hebbian learning rules, which were modeled upon neurophysiological data and are complementary to each other (see Floreano & Urzelai, 2000).
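A minimal sketch of this encoding scheme, for a single neuron, is shown below. Each connection is reduced to a tuple (plastic flag, sign, value), where the value encodes the weight of fixed connections and the learning rate of plastic ones. Only the plain Hebb rule is implemented, with weights bounded in [0, 1]; Floreano & Urzelai used four complementary rules and a finer bit-level encoding.

```python
import math

def init_weights(genotype, rng):
    # plastic weights start from random values at the beginning of an
    # episode; fixed weights take the genetically specified value
    return [rng.random() if plastic else value
            for plastic, _, value in genotype]

def activate(genotype, weights, inputs):
    # signed weighted sum squashed by a logistic function
    net = sum(sign * w * x
              for (_, sign, _), w, x in zip(genotype, weights, inputs))
    return 1.0 / (1.0 + math.exp(-net))

def hebbian_update(genotype, weights, inputs, post):
    # plain Hebb rule applied to plastic connections only:
    # the weight grows with the product of pre- and post-synaptic
    # activity, scaled by the genetically encoded learning rate
    return [min(1.0, w + value * x * post) if plastic else w
            for (plastic, _, value), w, x in zip(genotype, weights, inputs)]
```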

Another approach has been proposed by Cully et al. (2015). In their case, the adaptation to environmental variations is realized by using a standard evolutionary algorithm. The speed of the training process is increased by applying the algorithm to a smaller set of parameters encoding the principal components of variation of the full set of parameters.
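The general idea of searching a reduced parameter space can be sketched as follows: a small coefficient vector is mapped back to the full parameter vector through a fixed basis, and a simple (1+1) evolutionary search operates on the coefficients only. The basis, fitness function, and search loop below are illustrative assumptions; the actual method of Cully et al. is considerably more sophisticated.

```python
import random

def expand(coeffs, basis, mean):
    """Map a small coefficient vector back to the full parameter vector
    through a fixed basis (e.g. one extracted from the principal
    components of previously discovered solutions)."""
    n = len(mean)
    return [mean[i] + sum(c * basis[j][i] for j, c in enumerate(coeffs))
            for i in range(n)]

def hill_climb(fitness, n_coeffs, basis, mean, steps=500, sigma=0.1, seed=0):
    """Minimal (1+1) evolutionary search over the reduced parameters."""
    rng = random.Random(seed)
    best = [0.0] * n_coeffs
    best_fit = fitness(expand(best, basis, mean))
    for _ in range(steps):
        # mutate the low-dimensional coefficients, not the full parameters
        cand = [c + rng.gauss(0.0, sigma) for c in best]
        fit = fitness(expand(cand, basis, mean))
        if fit >= best_fit:
            best, best_fit = cand, fit
    return best, best_fit
```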

7.4 Antifragility

Antifragility is the ability to benefit from variability, disorder and stressors. Consequently, antifragility is the opposite of fragility and goes beyond robustness or resilience. Robust or resilient systems continue performing well in varied conditions. Antifragile systems benefit from the exposure to variations (Taleb, 2012).

Evolutionary and reinforcement learning systems are examples of antifragile systems. Systems subjected to an evolutionary process improve as a result of the random variations of their parameters introduced during reproduction. Similarly, systems trained through reinforcement learning algorithms improve as a result of the random variations introduced in their actions. Both methods benefit from variations: they exploit variations to discover better solutions.
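A minimal illustration of this principle is the evolution strategy update popularized by Salimans et al. (2017): random Gaussian perturbations of the parameters are evaluated, and their fitness-weighted average is used as an estimate of the direction of improvement, so the perturbations are not merely tolerated but exploited. The hyperparameter values below are illustrative.

```python
def es_step(theta, fitness, rng, pop=50, sigma=0.1, lr=0.05):
    """One update of a simple evolution strategy: sample random
    perturbations, evaluate them, and move the parameters along the
    fitness-weighted average of the perturbations."""
    grad = [0.0] * len(theta)
    for _ in range(pop):
        # random variation of the parameters...
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        # ...whose evaluated fitness tells us which directions pay off
        f = fitness([t + sigma * e for t, e in zip(theta, eps)])
        for i, e in enumerate(eps):
            grad[i] += f * e / (pop * sigma)
    return [t + lr * g for t, g in zip(theta, grad)]
```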

Adaptive robots can also benefit from the exposure to environmental variations. Let’s consider, for example, the case of a robot adapted for the ability to walk on flat terrain from a given initial posture. The adapted robot will be fragile to uneven terrains and to variations of its initial posture. On the contrary, a robot adapted for the ability to walk on rugged terrains from a given initial posture will be robust to uneven terrain and also, to a certain extent, to variations of its initial posture, since the exposure to rugged terrains will expose it to a larger set of postures. Overall this implies that the exposure to environmental variability of one kind can lead to solutions that generalize to other forms of variability.

Often the advantages gained by variation at one level of organization are paid in terms of fragility at other organization levels. In other words, antifragility for one is fragility for someone else. Some parts of the system should be fragile to make the entire system antifragile (Taleb, 2012). For example, the gain obtained by an adaptive robot at the end of an adaptation phase, in which it has been intentionally exposed to strong environmental variations, is paid with a reduction of performance during the adaptation phase. Similarly, the increase in performance obtained at the level of a group by sharing information with other robots through communication can produce a reduction of performance at the level of the single cooperating individuals (see Chapter 8). However, the disadvantages affecting certain levels of organization can be worth paying when the associated advantages affect the level of organization we care about.

7.5 Crossing the reality gap

A form of robustness that is particularly important for adaptive robots concerns the ability to withstand the differences between the simulated and the real environment. This is necessary when the training process is performed in simulation and the trained solution is then ported to the physical robot situated in the physical environment. The trained robot is moved from the simulated to the physical environment and should consequently be robust with respect to the variations that it encounters when it is moved from the environment in which it has been trained to the environment in which it should operate. The ability to keep performing effectively in the real environment is usually referred to as the ability to cross the reality gap.

The problem of crossing the reality gap can be avoided by performing the adaptive process directly in hardware in the real environment. However, this is generally time consuming and costly. For example, Levine et al. (2017) trained a manipulator robot in hardware with a reinforcement learning algorithm for the ability to grasp objects with varying shapes on the basis of visual information. To collect the training data in hardware, the authors used 14 robotic manipulators that performed a total of 800,000 grasping attempts. Monitoring the operation of many robots for a prolonged period of time and solving the problems occurring during the process requires a significant amount of human work.

Carrying out the adaptive process in simulation, instead, presents several advantages. It enables simulating the behavior of robots at a rate that exceeds real time. It permits speeding up training by means of parallel computation (notably, evolutionary methods can achieve an almost linear speedup in the number of CPU cores [Salimans et al., 2017]). It permits accessing information that is not available in the physical environment and that can be used to drive the adaptive process (see below). Finally, it reduces the hardware cost and the cost of hardware maintenance.

The simulation, however, will never match the actual world perfectly. Consequently, the candidate solutions adapted in simulation should be sufficiently robust to cross the reality gap, i.e. robust to the differences between the simulated and the real environment.

A good illustration of the methods that can be used to promote the development of solutions able to cross the reality gap is the experiment of Andrychowicz et al. (2018), in which the shadow-robot hand (ShadowRobot, 2005) was trained in simulation for the ability to manipulate objects and in which the solutions trained in simulation were successfully ported to hardware.

The robotic hand has 24 DoF in the wrist and fingers, actuated by 20 pairs of agonist–antagonist tendons. The policy network of the robot was trained, through the PPO reinforcement learning algorithm (Schulman et al., 2017), for the ability to rotate and move a cubic object to a randomly generated target position and orientation (the goal). This problem is particularly hard to simulate accurately due to the need to simulate the effect of collisions between the object and a robot with an articulated morphology.

The reward function rewards the robot with 5.0 every time the goal is achieved and punishes it with -20.0 every time the object falls from its hand. The desired target position and orientation that the cubic object should assume, i.e. the goal, is shown in the bottom-right part of Video 7.1. As can be seen, the trained robot manages to manipulate and rotate the object as requested. Once the goal is achieved and the robot is rewarded, a new goal is selected randomly and the robot starts a new manipulation behavior.
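The sparse reward scheme described above can be sketched as follows. The 0.0 returned on ordinary steps and the way a new goal is drawn are simplifying assumptions; in the actual experiment the goal specifies both a target position and a target orientation.

```python
GOAL_BONUS, DROP_PENALTY = 5.0, -20.0

def step_reward(goal_achieved, object_dropped):
    # sparse reward described in the text: +5.0 when the goal is
    # achieved, -20.0 when the object falls from the hand
    if object_dropped:
        return DROP_PENALTY
    return GOAL_BONUS if goal_achieved else 0.0

def next_goal(rng):
    # once the goal is achieved a new target is drawn at random
    # (Euler angles here; a simplification of position + orientation)
    return [rng.uniform(-3.1416, 3.1416) for _ in range(3)]
```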

Video 7.1. The dexterous in-hand manipulation experiment of Andrychowicz et al. (2018). The goal, i.e. the target position and orientation that the cubic object should assume, is shown in the bottom-right of the video (Andrychowicz et al., 2018).

The actor network of the robot, which is used both during training and testing, receives as input the target and current position and orientation of the object and the positions of the palm and of the fingertips. The position and orientation information is extracted from three external cameras pointing at the robotic hand. The images perceived by the cameras are simulated during training and are extracted from the real cameras during testing in the real environment.

The critic network also receives as input the position and velocity of the hand joints and the velocity of the object. These additional data, available in simulation, could not be extracted in a reliable manner from real sensors. However, this does not constitute a problem, since the critic network is used only during training, which is performed in simulation.

To ensure that the simulator is as accurate as possible, the authors avoided using sensors that are too hard to simulate accurately, such as the tactile sensors included in the shadow-robot hand. These sensors measure the pressure of a fluid contained in elastic containers, which varies as a result of physical contacts.

Moreover, to promote the development of robust solutions, the authors perturbed the environmental conditions during training with correlated and uncorrelated noise, i.e., respectively, with variations that are selected randomly at the beginning of each evaluation episode and are then left constant for the length of the episode, and with variations that are selected randomly at every step. More specifically, they perturbed: (i) the state of the sensors and of the actuators with correlated and uncorrelated noise, (ii) the refresh rate of the sensors and actuators with uncorrelated noise, (iii) the position, orientation, and velocity of the cubic object and of the body parts of the robot with uncorrelated noise, (iv) the friction and other physical parameters with correlated noise, and (v) the position of the cameras, the light conditions, and the texture of all objects in the scene with correlated noise.
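The distinction between the two noise types can be sketched as follows: correlated noise is drawn once per episode and then held constant, while uncorrelated noise is drawn anew at every step. The class and parameter names are illustrative, not taken from any specific simulator.

```python
import random

class DomainRandomizer:
    """Perturbs a simulated quantity (e.g. a sensor reading or a
    physical parameter) with correlated and uncorrelated noise."""

    def __init__(self, corr_std=0.1, uncorr_std=0.02, seed=None):
        self.corr_std = corr_std      # std of the per-episode offset
        self.uncorr_std = uncorr_std  # std of the per-step noise
        self.rng = random.Random(seed)
        self.episode_offset = 0.0

    def new_episode(self):
        # correlated noise: selected at the start of the episode, then fixed
        self.episode_offset = self.rng.gauss(0.0, self.corr_std)

    def perturb(self, value):
        # uncorrelated noise: selected independently at every step
        return value + self.episode_offset + self.rng.gauss(0.0, self.uncorr_std)
```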

In summary, to obtain solutions capable of crossing the reality gap one should thus take the following actions: (i) avoid the usage of sensors that are too hard to simulate accurately, (ii) perturb the environmental conditions experienced during training with correlated and uncorrelated noise, and (iii) restrict the information that cannot be extracted reliably in the real environment to components, such as the critic network, that are used only during training.

These actions can permit crossing the reality gap without necessarily using high-fidelity and computationally expensive simulations. For example, they allowed the successful porting of vision-based flying robots trained with low-quality 3D rendered images to real environments (Sadeghi & Levine, 2017).

7.6 Environmental variations, adaptive complexity and local minima

Solving an adaptive problem in varying environmental conditions is usually significantly more complex than solving the problem in non-varying or less varying conditions. Moreover, the wider the range of variations, the greater the complexity of the adaptive problem. The range of environmental variations is thus a crucial part of the problem faced by an adaptive robot.

This aspect shows clearly in simulation experiments, where the variability of the environmental conditions can be arbitrarily reduced and eventually eliminated. For example, the time required to evolve or train a robot to solve a certain problem in simulation is generally much lower when the robot always starts from the same position and orientation than when the initial position and orientation of the robot vary. The first condition leads to brittle solutions that are simpler and consequently easier to discover. The latter condition leads to solutions that are robust with respect to the initial conditions and that are usually more complex.

Another domain where this aspect manifests itself is competitive games. Learning to defeat a single opponent is generally much easier than learning to defeat a varied set of opponents.

The range of environmental variations also influences the probability that the evolutionary or learning process remains stuck in local minima. The smaller the variations of the environmental conditions, the higher the risk of remaining stuck in local minima.

Overall this implies that introducing a sufficient amount of variations in the environmental conditions is essential to ensure the development of effective solutions.

7.7 Learn how

Familiarize yourself with the PyBullet 3D dynamics simulator and with the locomotor problems by reading Section 13.11 and by doing Exercise 9.


References

Andrychowicz M., Baker B., Chociej M. et al. (2018). Learning dexterous in-hand manipulation. arXiv:1808.00177v5.

Cully A., Clune J., Tarapore D. & Mouret J-B. (2015). Robots that can adapt like animals. Nature 521: 503-507.

Floreano D. & Nolfi S. (1997). Adaptive behavior in competing co-evolving species. In P. Husbands & I. Harvey (Eds.), Proceedings of the Fourth European Conference on Artificial Life. Cambridge, MA: MIT Press, 378-387.

Floreano D. & Urzelai J. (2000). Evolutionary robots with online self-organization and behavioral fitness. Neural Networks, 13: 431-443.

Jakobi N. (1997). Evolutionary robotics and the radical envelope-of-noise hypothesis. Adaptive behavior, 6 (2): 325-368.

Levine S., Pastor P., Krizhevsky A., Ibarz J. & Quillen D. (2017). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37 (4-5): 421-436.

Miconi, T. (2016). Learning to learn with backpropagation of Hebbian plasticity. arXiv:1609.02228.

Miglino O., Lund H.H. & Nolfi S. (1995). Evolving mobile robots in simulated and real environments. Artificial Life, 2 (4): 417-434.

Pagliuca P. & Nolfi S. (2019). Robust optimization through neuroevolution. PLoS ONE 14 (3): e0213193.

Sadeghi F. & Levine S. (2017). CAD2RL: Real single-image flight without a single real image. arXiv:1611.04201v4.

Salimans T., Ho J., Chen X., Sidor S. & Sutskever I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864v2.

Schulman J., Wolski F., Dhariwal P., Radford A. & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

ShadowRobot (2005). ShadowRobot Dexterous Hand. https://

Taleb N.N. (2012). Antifragile: Things that Gain from Disorder. New York: Random House.