Behavioral and Cognitive Robotics
An adaptive perspective

Stefano Nolfi





12. Long-term and Open-ended Learning

12.1 Introduction

In this chapter I will discuss long-term and open-ended learning. Long-term refers to adaptive processes that enable the robots to improve and expand their skills over a prolonged period of time. Open-ended refers to unbounded adaptive processes that keep producing progress and novelty indefinitely.

An important factor that determines whether progress continues in the long term is the adaptive pressure, i.e. the gap between the cumulative reward obtained by the robot at a given stage of the adaptive process and the reward that could be obtained by modifying the parameters of the robot appropriately. A second important factor is adaptability, i.e. the probability that variations of the adaptive parameters produce progress. Both the adaptive pressure and the adaptability should remain sufficiently high during the adaptive process to enable learning. Finally, a third crucial factor is the breadth of the robots' skills: the larger the behavioral and cognitive repertoire of the robots, the greater the chance that they will manage to further expand and improve their skills.

Clearly, for problems admitting optimal solutions, the adaptive process should not and cannot continue in the long term. It should continue only for the time necessary to discover an optimal solution. However, complex problems often admit the possibility of improving indefinitely, regardless of the skill level already attained.

12.2 Incremental learning

An aspect that makes long-term learning difficult is that both the level of adaptability and the level of adaptive pressure depend on the skills of the adaptive robots, which vary during the course of the process. At the beginning of the adaptive process, when the robots do not yet possess any skills, the adaptive pressure is high and the adaptability is low. The adaptive pressure is high because the gap between the cumulative reward currently achieved and the reward potentially achievable by the robot is large. The adaptability is low because the gap between the current behavior of the robot and a behavior that would enable it to achieve its goal (at least in part) is large, and the chances of skill re-use are low. During the successive phases of the adaptive process, the adaptive pressure decreases and the adaptability increases. The fact that both the adaptive pressure and the adaptability vary implies that attempting to solve a complex problem in stationary conditions can be problematic. It leads to situations in which the adaptive process fails to start producing progress, because the adaptability is too low, or stagnates at a certain point, because the adaptive pressure has become too low. The problem can be solved by varying the complexity of the task or the environmental conditions experienced by the adaptive agents so as to maintain the adaptability and the adaptive pressure sufficiently high during the entire process.

Incremental learning methods approach the problem by varying the complexity of the problem, i.e. by evaluating the robots for the ability to solve progressively more complex versions of the problem. For example, the evolution of the ability to visually track target objects can be realized by exposing the evolving robots first to large immobile targets, then to small immobile targets, and finally to small moving targets (Harvey, Husbands and Cliff, 1994). Similarly, the evolution of robots rewarded for the ability to capture prey robots can be organized in a series of subsequent phases in which the speed of the escaping prey and the pursuit delay are progressively increased (Gomez & Miikkulainen, 1997).

An example that illustrates the efficacy of incremental learning in combination with reinforcement learning is reported in the work of Yu, Turk & Liu (2018), which describes humanoid robots trained to walk. In this case the complexification of the problem is realized by initially exposing the learning robot to simplified environmental conditions and by progressively removing the simplification as the skills of the robot increase.

This is realized by using an automated assistant that helps the robot to maintain lateral balance and to move forward. The assistant consists of a proportional-derivative controller that applies a lateral force to the pelvis of the robot along the left-right axis, which helps the robot avoid falling sideways, and a propelling force along the front-rear axis, which helps the robot move forward. The intensity of the forces is modulated by progressively reducing, and finally zeroing, the damping (Kd) and stiffness (Kp) coefficients. To avoid overfitting, the learning agents are evaluated for multiple episodes by using Kd and Kp values extracted randomly from an interval. The center of the interval, which determines the average amount of assistance, is progressively reduced when the performance of the learning robot exceeds a given threshold.
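The following minimal sketch illustrates how such an assistant could be implemented. It is not the authors' code: the state variables, the fixed ratio between damping and stiffness, and the way the gains are sampled are illustrative assumptions.

```python
import numpy as np

def assist_forces(pelvis_y, pelvis_vel_y, forward_vel, target_forward_vel, kp, kd):
    """Assistive forces applied by the virtual helper (illustrative sketch)."""
    # PD term pushing the pelvis back toward the lateral midline (y = 0)
    lateral_force = -kp * pelvis_y - kd * pelvis_vel_y
    # propelling term pushing the robot toward the desired forward velocity
    propelling_force = kd * (target_forward_vel - forward_vel)
    return lateral_force, propelling_force

def sample_gains(center, half_width, kd_ratio=0.1, rng=np.random.default_rng()):
    """Sample the stiffness (Kp) and damping (Kd) used during one episode.

    The center of the interval is reduced, and eventually zeroed, every time
    the average performance of the learning robot exceeds a given threshold.
    """
    kp = rng.uniform(max(0.0, center - half_width), center + half_width)
    return kp, kd_ratio * kp
```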

The advantage provided by incremental learning can be appreciated by comparing typical behaviors obtained with and without it (Video 12.1, ECL versus PPO conditions). The video also illustrates the advantage obtained by encouraging the learning robots to produce symmetrical behavior (Video 12.1, MSL condition; see Yu, Turk & Liu, 2018 for details). The authors use the term curriculum learning to refer to their incremental condition. I instead restrict the usage of the term to methods in which the articulation of the learning phases is learned rather than handcrafted in advance by the experimenter (see the next section).

Video 12.1 Typical behaviors obtained by using a standard reinforcement learning algorithm (PPO), an algorithm empowered with curriculum learning (ECL), an algorithm that encourages the development of symmetrical behavior (MSL), and an algorithm that benefits from both curriculum learning and the symmetrical behavior bias (ECL + MSL). (https://www.youtube.com/watch?v=zkH90rU-uew, Yu, Turk & Liu, 2018)

Incremental approaches can be effective but have drawbacks. They require domain knowledge to define the learning phases and they introduce additional hyperparameters, e.g. the transition threshold/s, which can be hard to tune. Moreover, they presuppose that the problem can be divided into sub-problems ordered by difficulty along a single axis, when in reality the difficulty might vary along multiple axes.

12.3 Curriculum learning

Curriculum learning, instead, attempts to maintain both the adaptive pressure and the adaptability high by automatically manipulating the learning experiences of the robots.

In the context of evolutionary methods this can be done by extending the evolutionary algorithm with a curriculum learning component (Milano & Nolfi, 2021). This component estimates the difficulty level of the environmental conditions encountered by the evolving robots of recent generations and exploits this information to select environmental conditions with the appropriate level of difficulty. The difficulty level of the environmental conditions is estimated on the basis of the inverse of the relative fitness obtained by the robots recently evaluated in such conditions.

More specifically, the evolving robots are evaluated for η episodes in η corresponding environmental conditions. During the first 1/10 of the evolutionary process the environmental conditions are chosen randomly, as in standard evolutionary methods. Then, they are chosen randomly from among η subsets characterized by different levels of difficulty. The subsets are obtained by: (i) normalizing the performance ρ achieved on each environmental condition during the last five evaluations in the range [0.0, 1.0], (ii) calculating η performance intervals on the basis of a difficulty function (see below), and (iii) selecting one environmental condition at random from each of the η subsets, i.e. selecting η environmental conditions characterized by η different levels of difficulty. A sketch of the selection procedure is included below.
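The sketch below follows the three steps just described but is only illustrative: the handling of empty subsets and the exact form of the power difficulty function are assumptions (see Milano & Nolfi, 2021 for the actual algorithm).

```python
import numpy as np

def select_conditions(perf_history, n_episodes, exponent, rng):
    """Select n_episodes environmental conditions, one per difficulty subset.

    perf_history: (n_conditions,) average fitness obtained on each condition
                  during the last five evaluations (higher = easier condition).
    exponent:     1.0 gives a linear difficulty function; values > 1.0 enlarge
                  the easy subsets, so each hard condition is sampled more often.
    """
    # (i) normalize performance in the range [0.0, 1.0]
    p = perf_history - perf_history.min()
    p = p / (p.max() + 1e-8)

    # (ii) compute the boundaries of the n_episodes performance intervals
    #      on the basis of a power difficulty function
    boundaries = np.linspace(0.0, 1.0, n_episodes + 1) ** exponent

    # (iii) pick one condition at random from each interval (subset)
    chosen = []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        subset = np.where((p >= lo) & (p <= hi))[0]
        if len(subset) == 0:              # empty subset: fall back to the
            subset = np.arange(len(p))    # full set of conditions
        chosen.append(int(rng.choice(subset)))
    return chosen
```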

This method permits exposing the evolving robots of any given generation to environmental conditions that have, overall, a similar level of difficulty. In other words, it reduces the stochasticity of the fitness measure caused by over-estimating the fitness of individuals that encounter easy environmental conditions and under-estimating the fitness of individuals that encounter hard ones. Moreover, the algorithm permits preferentially selecting difficult environmental conditions through the usage of a power difficulty function. Adopting a linear difficulty function does not alter the overall difficulty of the selected environmental conditions with respect to a standard approach (in which the environmental conditions are selected randomly, with a uniform distribution, among all possible conditions). Using a power difficulty function, instead, implies that the subsets including easier environmental conditions are larger than the subsets including more difficult environmental conditions. Consequently, since one condition is drawn from each subset, harder environmental conditions are selected more often than easier conditions.

The algorithm also permits selecting the environmental conditions that are challenging for the current evolving agents. This is due to the fact that the difficulty level of the environmental conditions is estimated on the basis of the performance displayed by recent evolving agents on those conditions. Notice also that the algorithm does not require domain knowledge and does not introduce additional hyperparameters.

The results collected on the long double-pole and bipedal hardcore problems show that the curriculum learning algorithm speeds up learning and produces significantly better solutions than a standard algorithm (Milano & Nolfi, 2021).

12.4 Competitive learning and self-play

A third method that can promote long-term learning is competitive learning, or self-play. In this method, the learning robot is situated in an environment containing one or more other learning robots that compete with it. This setting can automatically generate a form of incremental and curriculum learning, since the complexity of the task faced by the adaptive agents increases over time as a result of the progress made by the competitors, and since the competitors tend to generate environmental conditions that are challenging for the learning robots. Moreover, it can produce a form of open-ended learning since the complexity of the task can increase in an unbounded manner.

A competitive scenario that has been widely studied through evolutionary methods concerns predator and prey robots, i.e. the concurrent evolution of two populations of robots rewarded for the ability to catch the competitor and to avoid being caught, respectively (Figure 12.1; Nolfi & Floreano, 1998). The prey is considered caught when it is touched by the predator. Each evolving robot is evaluated against multiple competitors selected among the best competitors of the current and, possibly, of previous generations.

Figure 12.1. Predator and prey robots (left and right, respectively) situated in an arena surrounded by walls. The cables have rotating contacts and provide the electric power to the two robots to enable long-term experiments (Floreano, Nolfi & Mondada, 1998)

Competitive evolutionary experiments, however, require special care to ensure that the co-evolutionary dynamics yields true progress and to verify whether such progress is actually occurring. To clarify this aspect we should distinguish among: local progress, i.e. progress against the current competitors; historical progress, i.e. progress against the competitors of previous generations; and global progress, i.e. progress against the competitors of previous and future generations (Miconi, 2008). Competitive evolution generally guarantees local progress but not historical and global progress. Indeed, local progress can be produced by re-adopting ancient solutions that are effective against the contemporary competitors but ineffective against other competitors. In this situation, the co-adaptation process enters a limit-cycle dynamic characterized by the repeated rediscovery and abandonment of qualitatively similar strategies. This is indeed the dynamic observed in the experiments described in Nolfi & Floreano (1998): after a first phase producing true progress, the co-evolving robots keep producing local progress, but they achieve it by re-discovering and then abandoning the same set of strategies over and over again.

A method that can be used to favor historical progress consists in maintaining an archive including the best robots of previous generations, which grows in size over the generations (Rosin & Belew, 1997), or an archive including only the best N non-identical robots of the previous generations (De Jong, 2005). The evolving individuals are evaluated against competitors selected randomly from the full archive, in the former case, or against all the competitors included in the restricted archive, in the latter case.
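The growing-archive variant can be sketched as follows. The class and method names are hypothetical; the restricted-archive variant of De Jong (2005) would additionally prune the archive to the best N non-identical individuals and evaluate against all of them.

```python
import random

class HallOfFame:
    """Growing archive of the best competitors of previous generations."""

    def __init__(self):
        self.archive = []

    def add(self, best_competitor):
        # called once per generation with the best individual of the
        # opposing population
        self.archive.append(best_competitor)

    def sample_opponents(self, n):
        # opponents are drawn at random from the whole archive
        n = min(n, len(self.archive))
        return random.sample(self.archive, n)
```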

An alternative approach is the generalist algorithm, which does not rely on archives but includes a procedure for discarding opportunistic individuals, i.e. robots performing well against the competitors against which they were evaluated but poorly against other competitors (Simione & Nolfi, 2021). This is realized by organizing the evolutionary process in phases lasting N generations, during which only the first or the second population evolves, in an alternated fashion. During each phase the algorithm operates by: (i) evolving a subset of one population against a subset of competitors for N generations, (ii) post-evaluating the evolved individuals at the end of the phase against all competitors, and (iii) discarding the individuals that achieve the worst performance against all opponents. The discarded set includes the opportunistic robots which progressed against the competitors they encountered but regressed against the other competitors.
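A single phase of this scheme can be sketched as follows. The `evolve` and `evaluate` functions and the fraction of individuals discarded are hypothetical placeholders; see Simione & Nolfi (2021) for the actual algorithm.

```python
def generalist_phase(population, opponent_subset, all_opponents,
                     n_generations, evolve, evaluate, discard_fraction=0.25):
    """One phase of the generalist scheme (illustrative sketch).

    evolve(pop, opponents, n): runs n generations of evolution against the
                               given opponents and returns the evolved pop.
    evaluate(ind, opponents):  average performance of an individual against
                               a list of opponents.
    """
    # (i) evolve one population against a subset of competitors
    population = evolve(population, opponent_subset, n_generations)

    # (ii) post-evaluate the evolved individuals against all competitors
    scores = [evaluate(ind, all_opponents) for ind in population]

    # (iii) discard the opportunistic individuals, i.e. those achieving the
    #       worst performance against the full set of opponents
    n_keep = int(len(population) * (1.0 - discard_fraction))
    ranked = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
    return [population[i] for i in ranked[:n_keep]]
```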

The algorithm was evaluated in a predator-prey setting (Simione & Nolfi, 2021). The observations of the robots include the state of the infrared sensors and the information extracted from an omnidirectional camera. The actions encode the desired speed of the two wheels. The maximum speed of the predator's motors is lower than that of the prey. The predators are rewarded proportionally to the fraction of the episode remaining after the capture of the prey (0 in episodes in which the prey is not caught). The prey are rewarded proportionally to the fraction of time during which they manage to escape the predator (1.0 in episodes in which they are not caught).
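One consistent reading of this reward scheme (and of the complementary relation mentioned in the caption of Figure 12.2) is sketched below; the exact formula used by the authors may differ.

```python
def episode_fitness(capture_step, episode_length):
    """Complementary predator and prey fitness for a single episode.

    capture_step is the time step at which the prey is caught,
    or None when the prey escapes for the whole episode.
    """
    if capture_step is None:
        prey_fitness = 1.0                        # escaped the entire episode
    else:
        prey_fitness = capture_step / episode_length
    predator_fitness = 1.0 - prey_fitness         # 0.0 when the prey escapes
    return predator_fitness, prey_fitness
```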

The analysis of the robots evolved over 150,000 generations demonstrates that the generalist algorithm produces historical and global progress (Figure 12.2). Indeed, the robots generally perform better against ancient competitors than against contemporary competitors, regardless of the specific generation considered. For example, predators of generation 150,000 achieve a fitness of about 0.6 against competitors of generation 10,000 and a fitness of about 0.25 against contemporary competitors. Prey of generation 150,000 achieve a fitness of about 0.9 against competitors of generation 10,000 and a fitness of about 0.75 against contemporary competitors. Moreover, the robots of any given generation generally perform better against opponents of future generations than the robots of previous generations do. This implies that the strategies acquired by predators and prey of succeeding generations are not only more and more effective against previous opponents but also more and more effective in general, i.e. more and more capable of defeating opponents.

Figure 12.2. Performance of predator and prey robots sampled every 10,000 generations and post-evaluated against competitors of previous and following generations. Performance varies in the range [0.0, 1.0]. For each comparison, the performance of the predators is indicated; the performance of the prey corresponds to 1.0 minus the performance of the predators. The left and right figures display the results of the first replication and the average results of 5 replications, respectively. (Simione & Nolfi, 2021)

The visual inspection of the behavior displayed by the co-evolved robots indicates the possession of rather complex skills, including: avoiding fixed and moving obstacles, optimizing the motion trajectory with respect to multiple constraints, integrating sensory information over time and using this information to appropriately determine future actions, disorienting the competitor by producing irregular behaviors that are hard to anticipate (in the case of the prey), and anticipating the behavior of the competitor and coping with the irregular behavior produced by the prey (in the case of the predators). The analysis of the course of the evolutionary process also indicates that the complexity of the behaviors increases across generations, although progress in performance does not always lead to a complexification of the robots' behavior (see Simione and Nolfi, 2021).

An interesting set of competitive experiments realized with reinforcement learning is described in Bansal et al. (2018). The authors considered both asymmetrical problems, like “kick and defend”, in which a humanoid robot learns to shoot a ball toward a goal defended by a goalkeeper robot, and symmetrical problems, like “sumo”, in which two robots learn to push the opponent out of the ring (see Video 12.2). Asymmetric problems are handled by training two network controllers which are embedded in the two corresponding robots. Symmetric problems, instead, are handled by training a single network controller which is embedded in both robots. The production of global progress is promoted by evaluating the learning robots against opponents selected randomly from archives containing older networks, i.e. copies of the control networks saved after each training epoch.

The observation vector of the robots includes the position and velocity of the agent’s own joints, the contact forces acting on the agent’s body, the relative position of the opponent, and the angular position of the opponent’s joints. The robots are rewarded and punished with 1000 and -1000 at the end of won and lost episodes, respectively. To facilitate the learning process, the robots are also rewarded for walking forward and for remaining standing. However, these additional reward components are gradually annealed to zero during the initial phase of the training process. The training continues for 5,000 to 10,000 epochs, depending on the problem. During each epoch the agent is evaluated against several competitors for a total of 409,600 steps.
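The annealing of the dense shaping terms can be sketched as follows; the linear schedule and the number of annealing epochs are illustrative assumptions.

```python
def shaped_reward(outcome_reward, walk_reward, stand_reward,
                  epoch, anneal_epochs=500):
    """Total reward with the exploration-shaping terms annealed to zero.

    outcome_reward is +1000 / -1000 at the end of won / lost episodes and 0
    otherwise; walk_reward and stand_reward are the dense terms used to
    facilitate the initial phase of learning.
    """
    alpha = max(0.0, 1.0 - epoch / anneal_epochs)   # 1.0 -> 0.0 during training
    return outcome_reward + alpha * (walk_reward + stand_reward)
```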

As shown in Video 12.2, the method permits synthesizing rather effective behaviors. For example, the humanoids playing sumo display a stable fighting stance and knock the opponent with their head. The humanoids playing kick-and-defend display good kicking skills, use their feet to kick the ball high, and try to avoid the defender. The kicker also tries to fool the defender by quickly moving left and right once close to the ball. The defender responds to the motion of the kicker and uses both its hands and legs to obstruct the ball.

Video 12.2. Behaviors of robots trained to play “sumo” (push the opponent outside the ring or topple it over), “kick and defend” (move the ball through the net / defend the goal), and “you shall not pass” (move forward / prevent the opponent from reaching the other side). The final part of the video shows the post-evaluation of a robot trained to play sumo, situated alone on the ring and subjected to strong perturbation forces. (Bansal et al., 2018, https://www.youtube.com/watch?v=OBcjhp4KSgQ)

12.5 Intrinsic motivation

Finally, a fourth mechanism that can promote long-term and open-ended learning is intrinsic motivation (Schmidhuber, 2010), namely the realization of a learning process that encourages the robots to perform new behaviors and/or to experience new observations, independently of their effect on the reward. This can be realized by using a reward function with two scalar real-valued components:

r(t) = g(rext(t), rint(t))

where rext(t) denotes the standard external reward linked to the ability to achieve a given goal, rint(t) denotes the intrinsic reward associated with new behaviors and/or new observations, and g maps pairs of real values to real values, e.g. g(a,b) = a + b.

The reason why intrinsic motivation can promote long-term and open-ended learning is that the novel behaviors can later be reused to produce extrinsically rewarded behaviors. Similarly, novel observations can promote the development of the new behaviors that they afford. In other words, the acquisition of novel abilities that are neutral with respect to the extrinsic reward increases the evolvability of the robots. For an analysis of the role of intrinsic motivation in natural systems and its implications for robotics, see Baldassarre & Mirolli (2013) and Cangelosi & Schlesinger (2015).

More specifically, the intrinsic reward should promote the production of novel behaviors and/or the access to novel observations which are predictable or compressible (i.e. which include regularities that can be used to predict or compress them). Accessing unpredictable observations, e.g. observations including irregular noise, and producing irregular actions is instead useless and should not be rewarded (Schmidhuber, 2010).

From the point of view of the learning robot, novel and potentially predictable observations and actions correspond to states including regularities that are not yet known but that can be discovered by a predictor or compressor network trained through self-supervision. This implies that the intrinsic reward can be set on the basis of the first derivative of the error of a predictor/compressor network trained on the sensory/motor states experienced by the learning robot (Schmidhuber, 1991; Oudeyer, Kaplan & Hafner, 2007). Indeed, the prediction or compression error decreases during the production of behaviors and the observation of states that are novel and include learnable regularities. The error, instead, does not decrease during the production of irregular behaviors and/or the observation of irregular states. Moreover, the error stops decreasing during the production of behaviors and/or the experience of observations that were novel but that lost their novelty after being produced or experienced several times.
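The following sketch illustrates this idea with a small forward model trained online. The architecture, and the use of consecutive errors as a proxy for the error derivative, are simplifying assumptions; a more faithful implementation would compare the error on the same states before and after training.

```python
import torch

class CuriosityModule:
    """Intrinsic reward based on the decrease of the prediction error of a
    forward model trained by self-supervision (illustrative sketch)."""

    def __init__(self, obs_size, act_size, hidden=64, lr=1e-3):
        self.model = torch.nn.Sequential(
            torch.nn.Linear(obs_size + act_size, hidden),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden, obs_size))
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        self.previous_error = None

    def intrinsic_reward(self, obs, action, next_obs):
        # predict the next observation and train the model on the fly
        prediction = self.model(torch.cat([obs, action]))
        error = torch.mean((prediction - next_obs) ** 2)
        self.optimizer.zero_grad()
        error.backward()
        self.optimizer.step()

        # reward the decrease of the prediction error: novel but learnable
        # situations yield a positive reward, while irregular (unlearnable)
        # and fully familiar situations do not
        if self.previous_error is None:
            reward = 0.0
        else:
            reward = max(0.0, self.previous_error - error.item())
        self.previous_error = error.item()
        return reward
```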

In addition, one can encourage the adapting robots to maximize both long-term and short-term novelties, i.e. observations and behaviors that are novel with respect to the previous phases of the learning process and with respect to the previous phases of the evaluation episode, respectively (Badia et al., 2020). Moreover, one can usefully reduce the weight of the two corresponding intrinsic rewards at the corresponding time scales, i.e. reduce the weight of the long-term and short-term intrinsic rewards during the course of the adaptive process and during the course of the evaluation episodes, respectively (Badia et al., 2020). This enables the learning process to focus on the exploration of novelties in the initial phase and on the maximization of the extrinsic reward in the final phase. Moreover, it enables the robot to focus on the achievement of the goal in the final phase of each evaluation episode, after an initial phase dedicated to exploring the environment.

Unfortunately, learning to predict/compress new sensory/motor states while preserving the ability to predict/compress the states experienced in previous phases of the learning process is challenging, especially when the adaptive process continues in the long term. Another challenge concerns the ability to separate predictable and unpredictable states/actions. Consequently, one might need to trade generality for simplicity and completeness for computational efficiency. An example of a simplified but computationally efficient method is random network distillation (Burda et al., 2018). The method relies on two feed-forward networks that receive the current observation as input (Figure 12.3). The target network is initialized with random weights and is not trained (Figure 12.3, left). The predictor network is trained to predict the output of the target network (Figure 12.3, right). The prediction error can be used as the intrinsic reward: indeed, the error is high for novel observations that have not yet become familiar to the predictor network. The model has a series of shortcomings: (i) it restricts itself to observations and ignores actions, (ii) it cannot detect regularities over time, and (iii) it is unable to discriminate between predictable and unpredictable observations. Despite that, it proves effective on problems with sparse rewards, e.g. the Atari games Freeway, Gravitar, Montezuma’s Revenge, Pitfall!, Private Eye, Solaris, and Venture, in which the probability of obtaining even a single positive reward at the beginning of the training process is minimal. Moreover, in combination with a method that automatically adapts the hyperparameters to the characteristics of the problem, it allowed achieving super-human performance on the full set of Atari games (Badia et al., 2020).
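A minimal sketch of the method is reported below; the network sizes and the learning rate are illustrative.

```python
import torch

class RandomNetworkDistillation:
    """Intrinsic reward computed as the error of a predictor network trained
    to match a fixed, randomly initialized target network (sketch in the
    spirit of Burda et al., 2018)."""

    def __init__(self, obs_size, embed_size=64, lr=1e-4):
        def mlp():
            return torch.nn.Sequential(
                torch.nn.Linear(obs_size, 128), torch.nn.ReLU(),
                torch.nn.Linear(128, embed_size))
        self.target = mlp()                   # random weights, never trained
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.predictor = mlp()                # trained to imitate the target
        self.optimizer = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def intrinsic_reward(self, obs_batch):
        with torch.no_grad():
            target_out = self.target(obs_batch)
        pred_out = self.predictor(obs_batch)
        error = ((pred_out - target_out) ** 2).mean(dim=-1)

        # train the predictor on the observations just experienced
        loss = error.mean()
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # the per-observation error is high for observations that are still
        # unfamiliar to the predictor and shrinks as they are revisited
        return error.detach()
```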

Figure 12.3. The distillation component includes a target network and a predictor network, shown on the left and right hand sides, respectively. The lower layer of the two networks encodes the observation. The number of output neurons of the two networks is arbitrary (adapted from Burda et al., 2018).

As proposed by Lehman & Stanley (2011), one can also consider a radical form of curiosity-driven learning that operates on the basis of an intrinsic reward only. This means that the adaptive robots are rewarded exclusively for the ability to produce new behaviors. The behaviors produced are also evaluated on the basis of a conventional extrinsic reward, in order to identify, among all the candidate solutions generated during the adaptation process, the solution that best suits a given desired function. The latter evaluation, however, does not affect the course of the adaptive process, which focuses exclusively on the production of behaviors that are novel with respect to those produced previously. As shown by the authors, this novelty search method outperforms conventional methods relying on extrinsic rewards in scenarios where the number of potential alternative solutions is limited. However, it is unlikely to scale to scenarios where the number of such solutions is huge or infinite.
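The novelty measure at the core of this method can be sketched as follows; the choice of behavior descriptor and of the number of neighbors k are problem-dependent.

```python
import numpy as np

def novelty_score(behavior, archive, k=15):
    """Novelty of a behavior descriptor, computed as the average distance to
    its k nearest neighbours in the archive of previously produced behaviors
    (sketch of the measure used in novelty search)."""
    if len(archive) == 0:
        return float("inf")                       # first behavior is maximally novel
    distances = np.linalg.norm(np.asarray(archive) - behavior, axis=1)
    k = min(k, len(distances))
    return float(np.mean(np.sort(distances)[:k]))
```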

12.6 Learn how

Read Section 13.17 and carry out Experiment 15 to familiarize yourself with competitive coevolution.

References

Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, Z. D., & Blundell, C. (2020, November). Agent57: Outperforming the atari human benchmark. In International Conference on Machine Learning (pp. 507-517). PMLR.

Baldassarre, G., & Mirolli, M. (Eds.). (2013). Intrinsically motivated learning in natural and artificial systems. Berlin: Springer.

Bansal T., Pachocki J., Sidor S., Sutskever I. & Mordatch I. (2018). Emergent complexity via multi-agent competition. In Proceedings of the Sixth International Conference on Learning Representations (ICLR). arXiv preprint arXiv:1710.03748.

Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2018). Exploration by random network distillation. arXiv preprint arXiv:1810.12894.

Cangelosi A. & Schlesinger M. (2015). Developmental robotics: From babies to robots. MIT press.

De Jong E.D. (2005). The MaxSolve algorithm for coevolution. In H.G. Beyer et al. (Eds.), GECCO 2005: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation. New York: ACM Press.

Floreano D., Nolfi S. & Mondada. F. (1998). Competitive Co-Evolutionary Robotics: From Theory to Practice. In R. Pfeifer, B. Blumberg, J-A. Meyer, S.W. Wilson (Eds.), From Animals to Animats V, Cambridge, MA: MIT Press, pp 512-524.

Gomez F. & Miikkulainen R. (1997). Incremental evolution of complex general behavior. Adaptive Behavior, 5(3-4): 317-342.

Harvey I., Husbands P. & Cliff D. (1994). Seeing the light: Artificial evolution, real vision. From Animals to Animats, 3, 392-401.

Yu W., Turk G. & Liu C.K. (2018). Learning symmetric and low-energy locomotion. ACM Transactions on Graphics, 37 (4), Article 144.

Lehman, J., & Stanley, K. O. (2011). Abandoning objectives: Evolution through the search for novelty alone. Evolutionary computation, 19(2), 189-223.

Miconi T. (2008). Evolution and complexity: The double-edged sword. Artificial Life (14) 3: 325-334.

Milano N. & Nolfi S. (2021). Automated curriculum learning for embodied agents: A neuroevolutionary approach. arXiv preprint arXiv:2102.08849.

Nolfi S. & Floreano D. (1998). Co-evolving predator and prey robots: Do 'arm races' arise in artificial evolution? Artificial Life, 4(4): 311-335.

Oudeyer P-Y., Kaplan F. & Hafner V.F. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11:265–286.

Rosin C.D. & Belew R.K. (1997). New methods for competitive coevolution. Evolutionary Computation, (5) 1:1-29.

Schmidhuber J.  (1991).  Curious model-building control systems. In Proceedings of the International Joint Conference on Neural Networks. Singapore, vol. 2, pp. 1458–1463.

Schmidhuber, J. (2010). Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3), 230-247.

Simione L. & Nolfi S. (2021). Long-term progress and behavior complexification in competitive coevolution. Artificial Life, 26: 409-430.