As we have seen in the previous Chapters, robots adapted through evolutionary and reinforcement learning methods can develop behavioral and cognitive skills by relying only on the input provided by their sensors and on a scalar value, computed automatically through a fitness or reward function, that rates how well they are doing. Robots adapted through learning by demonstration (Sections 6.5 and 9.4) require more information, i.e. a detailed description of the actions to be produced at each step, extracted from a demonstrated behavior.
In this Chapter, we will see how the availability of additional training feedback and/or of additional input information can enhance the adaptive process. By additional training feedback we mean information that can be used to determine how to change the parameters of the adapting robot/s. The next sections focus in particular on information that the robot can extract by itself from its own sensory, internal, and motor states and that can be generated without human intervention. By additional input information we mean additional input vectors provided by a human caretaker or produced by another robot. These additional input vectors, which complement the information contained in the observation vector, can consist of affordance and/or goal vectors. Affordance vectors represent the behavior afforded by the current robot/environmental context. Goal vectors represent the state that a robot should attain to achieve its goal.
For an analysis of the relation between the development of behavioral and cognitive skills in humans and robots see Cangelosi & Schlesinger (2015) and Asada & Cangelosi (in press).
Self-supervised learning refers to a form of supervised learning in which the desired output vectors are generated automatically on the basis of the input vectors, without human intervention.
A first type of self-supervised network is the auto-encoder, i.e. a feed-forward network trained to produce output vectors identical to its input vectors, where the input vectors consist of observations. When the number of neurons forming an internal layer is lower than the size of the input vector, the information contained in the input vector is compressed into a smaller internal vector. The compression permits extracting features that encode the way in which the elements forming the input vector co-vary. Moreover, the compression permits filtering out noise and reconstructing partially missing information.
A second type is constituted by predictor networks that are trained to produce as output the input vector that they will experience at time t+1. These networks can be used to extract features from input vectors that anticipate future states and to filter out noise.
A third type is constituted by forward-model networks, i.e. networks that are trained to predict the next observation on the basis of the current observation and of the action that the agent is going to perform. Like predictor networks, these networks extract in their internal neurons information about future states. More specifically, they permit predicting the effect that the actions of the robot have on the robot's perceived environment.
Finally, a fourth type is constituted by sequence-to-sequence networks, i.e. networks that are trained to produce as output the last n observations experienced during the current and the previous n-1 steps. The network first receives a sequence of n observations, one vector at a time, and then reproduces the same sequence in its output, one vector at a time. These networks extract information on how observations change over time.
Predictors, forward models, and sequence-to-sequence networks are implemented with recurrent neural networks.
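To make the idea of self-supervised training concrete, the following is a minimal sketch of an auto-encoder trained to reproduce observation vectors. The layer sizes, the optimizer, and the random stand-in data are illustrative assumptions, not taken from the studies discussed below.

```python
# Minimal sketch of an auto-encoder for observation vectors (illustrative sizes).
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, obs_size=64, z_size=8):
        super().__init__()
        # The bottleneck layer (z) is smaller than the observation, forcing compression.
        self.encoder = nn.Sequential(nn.Linear(obs_size, 32), nn.ReLU(), nn.Linear(32, z_size))
        self.decoder = nn.Sequential(nn.Linear(z_size, 32), nn.ReLU(), nn.Linear(32, obs_size))

    def forward(self, obs):
        z = self.encoder(obs)           # compressed feature vector
        return self.decoder(z), z       # reconstruction and extracted features

# Self-supervised training: the target is the input itself.
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
observations = torch.rand(1024, 64)     # stand-in for observations collected by the robot
for epoch in range(10):
    reconstruction, _ = model(observations)
    loss = nn.functional.mse_loss(reconstruction, observations)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```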
The efficacy of adaptive methods can be enhanced by combining a control network, trained with an evolutionary or reinforcement learning algorithm, with one or more feature-extracting networks trained through self-supervised learning (Lange, Riedmiller & Voigtländer, 2012; Mattner, Lange & Riedmiller, 2012; Ha & Schmidhuber, 2018; Milano & Nolfi, 2020). The feature-extracting networks are used to extract useful features from observations. The control network is used to map the features extracted by the former network/s to appropriate actions.
A notable example is constituted by the work of Ha & Schmidhuber (2018), which combines a control network with an auto-encoder and a forward-model network (Figure 11.1). The model has been applied to the CarRacing-v0 problem (Klimov, 2016), which consists in driving a car on a race track in simulation by receiving as input the images collected by a camera located above the track. The training is performed in four phases. The first phase is dedicated to the generation of the training set, which is formed by the observations o and the actions a experienced and performed by the car during 10,000 episodes in which the car moves by selecting random actions. The second phase is used to train the auto-encoder network, which receives in input and reproduces in output the observation vector. Such training enables the auto-encoder to extract an abstract compressed representation z of observations in its internal layer. During the third phase, the forward-model network is trained to predict zt+1 by receiving in input zt and at. The forward model extracts in its internal layer a vector of features h that permits predicting the next compressed representation of the observation. Both the auto-encoder and the forward-model networks are trained on the basis of the training set collected in the first phase. Finally, in the fourth phase, the control network is trained to drive the car so as to maximize the cumulative reward by receiving as input the z and h vectors extracted from the observations by the pre-trained auto-encoder and forward-model networks. The control network is trained through the Covariance Matrix Adaptation Evolution Strategy (CMA-ES, Hansen & Ostermeier, 2001).
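The four phases can be summarized with the following schematic sketch, which assumes an OpenAI Gym-style environment; the train_autoencoder, train_forward_model, and evolve_controller_cmaes helpers are hypothetical placeholders standing for the self-supervised and evolutionary training routines, not the authors' code.

```python
# Schematic sketch of the four training phases (placeholder helpers, illustrative only).

def train_world_model_pipeline(env, episodes=10000):
    # Phase 1: collect observations and random actions.
    dataset = []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = env.action_space.sample()            # random policy
            next_obs, reward, done, _ = env.step(action)
            dataset.append((obs, action))
            obs = next_obs

    # Phase 2: train the auto-encoder to reconstruct observations (self-supervised).
    autoencoder = train_autoencoder(dataset)              # hypothetical helper

    # Phase 3: train the forward model to predict z(t+1) from z(t) and a(t).
    forward_model = train_forward_model(autoencoder, dataset)   # hypothetical helper

    # Phase 4: evolve the controller with CMA-ES on the (z, h) features.
    controller = evolve_controller_cmaes(env, autoencoder, forward_model)  # hypothetical helper
    return controller
```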
The agents provided with the auto-encoder network outperform the agents lacking feature-extracting networks. Moreover, the agents provided with both an auto-encoder and a forward-model network outperform the agents provided with an auto-encoder network only (see Video 11.1). As shown in the videos, the agent that includes the two feature-extracting networks drives more smoothly and handles sharp corners more effectively than the other agents.
Video 11.2 illustrates the features extracted by the auto-encoder network. The left, central, and right portions of the video show examples of the 64x64 pixel images observed by the camera, the z vectors extracted by a trained auto-encoder network, and the 64x64 pixel images reconstructed by the auto-encoder, respectively. The fact that the images reconstructed by the auto-encoder include most of the information present in the original images demonstrates that the auto-encoder manages to compress most of the information contained in the 64x64 pixels into a vector composed of 15 values. The second portion of the video shows how the reconstructed image varies when the elements forming the z vector are altered manually. The effects of these variations indicate that some of the elements of the z vector encode abstract properties of the image, such as the presence of a straight path or of a curve ahead.
We can exploit the ability of the auto-encoder network to reconstruct the image from the z vector to look inside the brain of the agent and observe the visual perception of the agent while it is driving the car (Video 11.3).
In a follow-up work, Milano & Nolfi (2020) introduced a method for continuing the training of the feature-extracting network/s during the training of the control network. Moreover, the authors compared the advantage provided by different feature-extracting methods on four problems: the HalfCheetahBullet, Walker2dBullet, BipedalWalkerHardcore, and MIT race car problems. The obtained results demonstrate that all feature-extracting methods provide an advantage with respect to the standard method, in which the control network receives as input unprocessed observations, provided that the training of the feature-extracting networks is continued during the training of the control network. Moreover, the comparison indicates that the best results are obtained with the sequence-to-sequence method.
The necessity to continue the training of the feature-extracting network/s during the training of the control network can be explained by the fact that the problems considered by Milano & Nolfi (2020) involve agents operating on the basis of egocentric information, while the problems considered in the other studies cited above involve agents operating on the basis of allocentric information, i.e. on the basis of a camera detached from the agent that observes both the agent and the environment. Indeed, the difference between the observations experienced by producing random actions and the observations experienced by producing actions selected to maximize the expected reward is greater for agents that operate on the basis of egocentric information than for agents that operate on the basis of allocentric information.
The observations generated by a predictor network or by a forward model can be used as a proxy for the actual observation vector when the latter is missing or incomplete.
An example that illustrates how a predictor network can enable a robot to keep operating correctly during phases in which the agent is temporarily blind is described in Gigliotta, Pezzulo & Nolfi (2011). In this experiment, a pan-tilt camera robot located in front of a panel is evolved for the ability to foveate on the different portions of the image located around the center of the panel, moving in the clockwise direction (Figure 11.2). The panel is colored with red and blue colors whose intensities vary linearly along the vertical and horizontal axes.
The robot is provided with: (i) two sensors (R and B) that detect the intensity of the red and blue color at the center of the visual field of the camera, (ii) two sensory neurons (R1 and B1) encoding the state of the sensors or the state of two additional internal neurons (H21 and H22) during normal and blind phases, respectively, (iii) a layer of internal neurons with recurrent connections (H1), and (iv) two motor neurons controlling the pan and tilt motors of the camera. Therefore, a single network is responsible for setting the state of the motors and for generating the observation that is used as a proxy of the missing observation at time t+1 during blind phases.
The connection weights of the neural network are encoded in a vector of parameters and evolved by using a fitness function that rewards the agent for the ability to move the focus of the camera along the circular path in the clockwise direction (Figure 11.2, center). Unlike the models described in the previous section, therefore, the predictor network is not trained with self-supervised learning to minimize the offset between the predicted and the actual state of the sensors. The robot is provided with a single network that includes a prediction module. The connection weights of the entire network, including the weights of the prediction module, are evolved on the basis of the feedback provided by the fitness function, which rates the robot for the ability to foveate along the desired circular trajectory.
The evolved robots show the correct behavior also during blind phases, i.e. when the camera does not provide input data for several steps. As mentioned above, during these phases the state of the sensory neurons is set on the basis of the state of the H21 and H22 internal neurons at time t-1.
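The way predicted states substitute for missing observations during blind phases can be sketched as follows; the network, read_sensors, and is_blind callables are hypothetical placeholders, and the loop is a simplified rendering of the architecture described above.

```python
# Sketch of a control loop in which self-generated predictions replace the
# missing observations during blind phases (all components are placeholders).
import numpy as np

def run_episode(network, read_sensors, is_blind, steps=200, obs_size=2):
    hidden_state = None                    # recurrent internal state of the network
    predicted_obs = np.zeros(obs_size)     # prediction produced at the previous step
    for t in range(steps):
        if is_blind(t):
            # The camera provides no data: use the observation predicted at t-1
            # as a proxy of the missing sensory state.
            sensory_input = predicted_obs
        else:
            sensory_input = read_sensors()
        # The network outputs the motor commands, a prediction of the next
        # observation, and its updated recurrent state.
        motor_commands, predicted_obs, hidden_state = network(sensory_input, hidden_state)
        # ... motor_commands are sent to the pan and tilt actuators ...
```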
Interestingly, the state generated by the predictor network does not match the state that the sensors would assume if the camera were not blinded. The network predicts correctly whether the intensity of red and blue will increase or decrease during the current step as a result of motion, but it amplifies the rate of variation over time. These states enable the agent to produce a correct behavior even if the rate of variation is exaggerated. In other words, the predictor network generates states that are functionally equivalent to the missing observations but that differ significantly from them.
Similar results were reported by Freeman, Ha & Metz (2019), who evolved the neural network controller for a swing-up balancing problem. In this work, sensory deprivation was introduced by enabling the agent to access the observation at each step with a probability p and by replacing the state of the sensory neurons with the state of predictor neurons in the remaining cases.
By varying the probability, the authors observed that the agents manage to solve the problem even when p is set to 10%, i.e. when the agent is allowed to gather information from its sensors in only a minority of steps. The comparison of the observations generated by the neural network with the missing observations indicates that, much like in the previously discussed case, the two vectors differ significantly. The observations generated by the network account only for the features supporting the production of effective behavior and neglect other features. After all, this is the function for which the predictor component of the network is rewarded.
Overall, these experiments demonstrate that an agent can acquire the ability to generate states that can be used as a proxy of missing sensory information even without the usage of self-supervision, thanks to the fact that the generation of these states is instrumental to the achievement of the agent's goal. Moreover, these experiments demonstrate that the states generated by the agents need not account for all the information included in the missing observation. They should only incorporate the features that support the production of the correct behavior.
Forward models can also be used to replace completely the interaction between the robot and the environment. In other words, they can be used by the agent to stop acting in the real environment and to start acting in an imaginary world simulated through a forward-model network (world model). During these dream-like phases the actuators of the robot are blocked, the state of the sensors is ignored, and the action vector produced by the neural network controller of the robot is given as input to the forward model, which generates the next observation vector and, possibly, the reward (Figure 11.3, right).
World models can be used to plan, i.e. to choose the actions to be executed after having checked mentally the expected outcomes of possible alternative actions, or to adapt mentally in an imagined world, i.e. to develop mentally the skills required to achieve the robot's goal without interacting with the actual environment. Mental training can be realized by using a standard evolutionary or reinforcement learning algorithm combined with a forward-model network trained through self-supervision. During the training process, the parameters of the control network are varied based on the imagined outcome of the hypothetical behavior generated and modelled by the control and forward-model networks, without involving the physical environment and the body of the robot.
Mental adaptation presents similarities with adaptation in simulation. Indeed, in both cases the brain of the robot interacts with a virtual environment simulated in a computer instead of with the physical environment. However, adaptation in simulation is realized by using a manually designed computer program that simulates the robot, its sensors and actuators, the environment, and their interaction. Mental adaptation, instead, is realized by using a neural network trained through a self-supervised learning algorithm.
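A mental (imagined) episode can be sketched as follows; controller and forward_model are placeholders for the trained networks, and the cumulative reward returned by the imagined roll-out would serve as the fitness or return estimate during mental training.

```python
# Sketch of a dream-like roll-out performed entirely inside the world model.
def imagined_rollout(controller, forward_model, z_start, horizon=1000):
    z, total_reward = z_start, 0.0
    for t in range(horizon):
        action = controller(z)                 # the controller acts on latent states only
        z, reward = forward_model(z, action)   # predicted next latent state and reward
        total_reward += reward
    return total_reward                        # used as fitness/return during mental training
```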
An example of mental adaptation is reported in Ha & Schmidhuber (2018). The experiment involves an agent trained to play the VizDoom game (Kempka et al., 2016). As in the case of the experiment on the CarRacing-v0 problem reported in Section 11.2: (i) the agent includes a control, an auto-encoder and a forward-model network, (ii) the control network receives as input the internal state of the auto-encoder and of the forward-model network, (iii) the auto-encoder is trained to re-generate the observation, and (iv) the forward model is trained to predict the next internal state of the auto-encoder network (see Figure 11.1). Moreover, as in the case of the experiment on the CarRacing-v0 problem, the training is realized in a series of phases dedicated to: (i) collecting the training set during episodes in which the agent moves by performing random actions, (ii) training the auto-encoder network, (iii) training the forward-model network, and (iv) training the control network. However, in the case of this experiment: (i) the forward-model network is trained to generate also the reward (which is +1 when the agent survives until the end of the episode and -1 when the agent dies), (ii) the control network is trained mentally, without using the game software, and (iii) the trained agent is post-evaluated by using the game software to verify that the skills acquired mentally generalize to the actual game environment.
The post-evaluation of the trained agent in the actual game environment demonstrates that the skills acquired by the agent through mental training enable the agent to operate effectively also in the actual game environment.
Clearly, the possibility of successfully transferring mentally trained agents to the actual world depends on the accuracy of the world model and on the capacity of the world model to generate sufficiently varied experiences. In their model, Ha & Schmidhuber (2018) introduced two characteristics that are crucial in that respect. The first is that the forward model is used to predict an abstract representation of the observation instead of the observation itself, i.e. the z vector instead of the o vector. This increases the accuracy of the predicted state and permits filtering out features that are unpredictable. The second is that the forward-model network produces as output a density function p(z) instead of a deterministic prediction of z. A probabilistic output combined with a high temperature increases the variability of the conditions generated by the world model. This mechanism plays a role similar to the addition of noise in simulated environments, discussed in Section 7.5.
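The effect of the temperature can be illustrated with the following simplified sketch, which samples the next latent state from a single Gaussian per dimension rather than from the full mixture density used by the authors; the scaling of the standard deviation by the square root of the temperature is an illustrative choice.

```python
# Simplified sketch of sampling the predicted latent state with a temperature tau.
import numpy as np

def sample_next_z(predicted_mean, predicted_log_sigma, tau=1.15):
    # tau > 1 inflates the predicted uncertainty, producing more varied (and
    # harder) imagined conditions; tau = 1 reproduces the learned distribution.
    sigma = np.exp(predicted_log_sigma) * np.sqrt(tau)
    return predicted_mean + sigma * np.random.randn(*predicted_mean.shape)
```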
The possibility to develop behavioral and cognitive skills can be facilitated and/or enhanced by the availability of additional input patterns that complement the observation vector extracted from the sensors of the robot. One kind of such additional input patterns is constituted by affordance vectors, which encode the opportunity for the execution of behaviors.
Like any other feature, affordances can be extracted from observations within the internal layers of the robot's brain (see for example Section 5.7). Alternatively, affordance vectors can be generated by another cooperating robot (as illustrated in Section 9.3). In this section, instead, we will consider situations in which the affordance vectors are provided by a human caretaker.
Clearly, the possibility of receiving affordance information directly as input can facilitate the adaptive process, especially in cases in which the relation between affordances and observations is indirect.
An example of how the availability of affordance vectors can support the development of behavioral skills is the experiment of Nishimoto & Tani (2009), in which a QRIO robot (Ishida, 2004) is trained through learning by demonstration to produce three articulated behaviors (see also Yamashita & Tani, 2008). The behaviors consist of: (i) holding a cubic object with the two hands, moving the object up and down and then left and right four times, and finally releasing the object and returning to the initial posture (Video 11.4, first part), (ii) touching the object with both hands three times, holding and moving the object along a circular trajectory three times, and finally releasing the object and returning to the original posture (Video 11.4, second part), and (iii) holding the object, moving the object forward and back three times, touching the object with the left and right hand three times, and returning to the initial posture.
The robot is trained through a learning by demonstration method realized through kinesthetic teaching and interactive tutoring. More specifically, during the demonstration phase, a human caretaker located behind the robot holds and moves the arms of the robot so as to make it perform the three behaviors several times, while the actuators of the robot are set in a passive compliant mode. This phase is used to collect a training set formed by the observation and action vectors. During the training phase, the neural network controller of the robot is trained to respond to the sequence of observations stored in the training set by producing the associated sequence of action vectors through backpropagation through time (Werbos, 1990). During the post-evaluation phase, the robot is evaluated for the ability to produce the demonstrated behaviors without the assistance of the caretaker. During the demonstrations the object is initially located in three different positions on the left, center, and right of the task space. During the post-evaluation, the object is initially placed in random positions selected within the same range used during the demonstrations.
As reported by the authors, the post-evaluation phase indicates that the robot is not yet able to produce the requested behaviors. This is due to the fact that minor differences between the actions produced by the robot during post-evaluation and the actions demonstrated by the experimenter accumulate, producing significant differences over time. As a result of these differences, the trained robot fails to hold the object in its hands. This problem was solved by repeating the training process a second time, in which: (i) the robot was allowed to control its actuators during the demonstration, and (ii) the experimenter exerted an external force on the arms of the robot during the holding phases to correct the robot's behavior. In other words, the training was realized by iterating the learning by demonstration process multiple times and by using the successive demonstrations to correct only the aspects of the robot's behavior that required further adjustments.
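The supervised part of this training phase can be sketched as follows with a generic recurrent network trained through backpropagation through time; the network sizes, the optimizer, and the random stand-in sequences are illustrative assumptions and do not reproduce the multi-level architecture described below.

```python
# Minimal sketch of training a recurrent controller on demonstrated sequences
# with backpropagation through time (sizes and data are illustrative).
import torch
import torch.nn as nn

class RecurrentController(nn.Module):
    def __init__(self, obs_size=10, hidden_size=32, action_size=8):
        super().__init__()
        self.rnn = nn.RNN(obs_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, action_size)

    def forward(self, obs_sequence):
        hidden_states, _ = self.rnn(obs_sequence)    # unroll over the whole sequence
        return self.out(hidden_states)               # one action vector per step

# Training set collected through kinesthetic teaching: sequences of observations
# paired with the corresponding demonstrated actions (random stand-ins here).
observations = torch.rand(3, 200, 10)                # 3 demonstrations, 200 steps each
demonstrated_actions = torch.rand(3, 200, 8)

controller = RecurrentController()
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-3)
for epoch in range(100):
    predicted_actions = controller(observations)
    loss = nn.functional.mse_loss(predicted_actions, demonstrated_actions)
    optimizer.zero_grad()
    loss.backward()                                  # backpropagation through time
    optimizer.step()
```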
The neural network controller of the robot is constituted by a multi-level recurrent neural network with neurons varying at different time scales. More specifically, it comprises a lower-level sub-network, which includes the sensory neurons, the motor neurons, and fast-varying internal neurons, and a higher-level sub-network, which includes the neurons encoding the affordance vector and slow-varying internal neurons (Figure 11.4). The time constant of the neurons, which determines their rate of variation, is set manually (see Section 10.3). The two sub-networks, which are connected through a few shared neurons, assume functionally different roles during training. The higher-level sub-network becomes responsible for triggering the production of one of the three behaviors and for triggering the appropriate sequence of elementary behaviors required to produce it. The lower-level sub-network, instead, becomes responsible for the production of the elementary behaviors. This functional subdivision can also be appreciated by inspecting the way slow and fast neurons vary during the production of the behaviors (see Video 11.4).
The three affordance vectors that are used to trigger the execution of the three corresponding behaviors are identical at the beginning of the training process (i.e. they are formed by vectors with all elements set to 0.5). However, the values forming each affordance vector are varied during the course of the training in the direction of the error backpropagated from the motor neurons to the affordance neurons. In other words, the content of each affordance vector is varied in a way that facilitates the production of the corresponding demonstrated behavior.
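The role of the time constants can be illustrated with the following minimal sketch of a leaky-integration update, in which neurons with small time constants change quickly and neurons with large time constants change slowly; this is a simplified rendering of the multiple-timescale update, with illustrative variable names.

```python
# Simplified leaky-integration update producing fast and slow neurons.
import numpy as np

def update_neurons(u, synaptic_input, tau):
    """u: current internal states; synaptic_input: weighted sum of incoming
    activations; tau: per-neuron time constants (small = fast, large = slow)."""
    u_next = (1.0 - 1.0 / tau) * u + (1.0 / tau) * synaptic_input
    y = np.tanh(u_next)            # firing rates propagated to connected neurons
    return u_next, y
```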
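The way the affordance vectors are adapted together with the connection weights can be sketched as follows with a feed-forward stand-in for the recurrent architecture; the sizes, the optimizer, and the training_step helper are illustrative assumptions.

```python
# Sketch of affordance vectors learned together with the network: each behavior
# is associated with a trainable input vector adjusted by the error
# backpropagated from the motor neurons (illustrative sizes and network).
import torch
import torch.nn as nn

n_behaviors, affordance_size, obs_size, action_size = 3, 4, 10, 8

# One affordance vector per behavior, initialized to identical 0.5 values.
affordances = nn.Parameter(torch.full((n_behaviors, affordance_size), 0.5))
network = nn.Sequential(nn.Linear(obs_size + affordance_size, 32), nn.Tanh(),
                        nn.Linear(32, action_size))

# Both the connection weights and the affordance vectors receive gradients.
optimizer = torch.optim.Adam(list(network.parameters()) + [affordances], lr=1e-3)

def training_step(behavior_id, observation, demonstrated_action):
    net_input = torch.cat([observation, affordances[behavior_id]])
    loss = nn.functional.mse_loss(network(net_input), demonstrated_action)
    optimizer.zero_grad()
    loss.backward()      # the error also flows into the selected affordance vector
    optimizer.step()
```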
Another example demonstrating how the availability of affordance vectors facilitates the evolution of effective behaviors is reported in Massera et al. (2010). In this case, a humanoid iCub robot (Metta et al., 2008) was trained in simulation for the ability to reach, grasp, and move an object. The authors compared the results obtained in a standard experiment and in an extended experiment in which the robot received from the caretaker a sequence of three affordance vectors indicating whether the current context affords a reach, grasp, or move behavior. The observation vector includes information extracted from the camera located on the robot's head and from the position sensors located on the robot's joints. In the case of the extended experiment, the three affordance vectors are received from the caretaker at the beginning of the evaluation episode, when the hand of the robot approaches the object, and when the object has been successfully grasped, respectively. As reported by the authors, the availability of the affordance vectors facilitates the development of the required behaviors.
Another type of information that can usefully complement the observation vector is a representation of the robot’s goal, i.e. of the target state that should be attained by the robot. As we have seen, a robot can acquire the ability to achieve goals without possessing or receiving an explicit representation of them. However, the availability of a representation of the goal can facilitate the development of the goal achieving capability.
An example that illustrates the utility of goal representations is the algorithm introduced by Andrychowicz et al. (2017), which permits learning from errors.
To illustrate the idea, let's consider a robotic arm with 7 DOFs that should learn to move a target object to a given target position indicated by the experimenter. The sensors of the robot detect the current angular positions and velocities of the robot's joints and the position of the object. In addition, the robot receives as input from the experimenter a vector that encodes the target position that the object should assume, i.e. the representation of the goal. The robot is rewarded with 1 or -1 at the end of episodes in which it manages or fails to achieve the goal, respectively.
Since the probability that a robot with random parameters solves this problem is extremely low, it will receive a positive reward only occasionally during the initial phase of the training process. Consequently, the progress produced by learning will be slow.
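In this setting the goal is simply an extra input vector concatenated to the observation, and the reward is sparse; the following minimal sketch illustrates both, with an illustrative distance threshold.

```python
# Sketch of a goal-conditioned input and of the sparse terminal reward.
import numpy as np

def policy_input(joint_angles, joint_velocities, object_position, goal_position):
    # The goal representation is provided as an additional input vector.
    return np.concatenate([joint_angles, joint_velocities, object_position, goal_position])

def terminal_reward(object_position, goal_position, threshold=0.05):
    # 1 if the object ends up close enough to the target position, -1 otherwise.
    return 1.0 if np.linalg.norm(object_position - goal_position) < threshold else -1.0
```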
As we discussed in Section 6.4, this bootstrap problem can be solved by designing a reward function that scores the robot also for the ability to produce behaviors that are instrumental to the achievement of the goal, e.g. that rewards the robot also for the ability to reduce the distance between the position of the object and the goal. On the other hand, shaping the fitness function can be challenging and can have unexpected undesirable consequences.
A better solution consists in exploiting the availability of the goal representation to learn from failures. This can be realized by replaying each learning episode a second time, during which the original goal is replaced with the position that the object actually assumed at the end of the episode. This permits creating episodes in which the robot is positively rewarded, since the target goal corresponds to the position to which the object was actually moved. In other words, this replay-with-goal-modification technique exploits the knowledge that the actions performed by the agent are inadequate to achieve the intended goal but are adequate to achieve a different goal, i.e. the goal that corresponds to the actual outcome of the episode. For this reason the algorithm has been named hindsight experience replay (HER).
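The relabeling step at the core of this idea can be sketched as follows; the episode data structure and the compute_reward function (e.g. the sparse terminal_reward sketched above) are illustrative placeholders, and a real implementation stores both the original and the relabeled transitions in the replay buffer.

```python
# Sketch of hindsight relabeling: the episode is replayed with the original goal
# replaced by the position the object actually reached.
def hindsight_relabel(episode, compute_reward):
    """episode: list of (observation, action, next_observation, goal) tuples,
    where observations are dicts containing an 'object_position' entry."""
    new_goal = episode[-1][2]["object_position"]     # outcome actually achieved
    relabeled = []
    for observation, action, next_observation, goal in episode:
        reward = compute_reward(next_observation["object_position"], new_goal)
        relabeled.append((observation, action, next_observation, new_goal, reward))
    return relabeled
```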
Video 11.5 shows the behaviors of a robot trained with a Deep Deterministic Policy Gradient algorithm (Lillicrap et al., 2015) without and with Hindsight Experience Replay, left and right respectively. As can be seen in the video, the vanilla version of the algorithm keeps failing most of the time while the algorithm with hindsight experience replay quickly converges on effective solutions.
An additional advantage of the HER algorithm is that it implicitly generates an incremental learning process, i.e. a process in which the complexity of the environmental conditions increases as the skills of the learning robot improve. Indeed, the offset between the initial and the target position of the object is small at the beginning, when the robot produces simple actions, and increases later on, when the actions produced by the robot become more complex.
Read Section 13.15 to analyse in detail the Stable Baselines3 implementation of the PPO algorithm. Read Section 13.16 to learn to train robots with reinforcement learning algorithms through the Stable Baselines3 tool. Do Exercises 12 and 13.
Andrychowicz M., Wolski F., Ray A., Schneider J., Fong R., Welinder P. ... & Zaremba W. (2017). Hindsight experience replay. arXiv preprint arXiv:1707.01495.
Asada M. & Cangelosi A. (in press). Cognitive Robotics Handbook. Cambridge, MA: MIT Press.
Cangelosi A. & Schlesinger M. (2015). Developmental Robotics: From Babies to Robots. Cambridge, MA: MIT press.
Freeman D., Ha D. & Metz L. (2019). Learning to predict without looking ahead: World models without forward prediction. In Advances in Neural Information Processing Systems (pp. 5379-5390).
Ha D. & Schmidhuber J. (2018). World models. arXiv:1803.10122.
Hansen N. & Ostermeier A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195.
Ishida T. (2004). Development of a small biped entertainment robot QRIO. In Micro-Nanomechatronics and Human Science, 2004 and The Fourth Symposium Micro-Nanomechatronics for Information-Based Society, (pp. 23-28). IEEE Press.
Kempka M., Wydmuch M., Runc G., Toczek, J. & Jaśkowski W. (2016). Vizdoom: A doom-based ai research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG) (pp. 1-8). IEEE Press.
Klimov O. (2016). CarRacing-v0. URL: https://gym.openai.com/envs/CarRacing-v0/.
Lange S., Riedmiller M. & Voigtländer A. (2012). Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 international joint conference on neural networks (IJCNN) (pp. 1-8). IEEE.
Lillicrap T.P., Hunt J.J., Pritzel A. et al. (2015). Continuous control with deep reinforcement learning. arXiv:1509.02971.
Massera G., Tuci E., Ferrauto T. & Nolfi S. (2010). The facilitatory role of linguistic instructions on developing manipulation skills. IEEE Computational Intelligence Magazine, 5 (3): 33-42.
Mattner J., Lange S. & Riedmiller M. (2012). Learn to swing up and balance a real pole based on raw visual input data. In: Proceedings of the 19th International Conference on Neural Information Processing (5) (ICONIP 2012), pp. 126–133. Doha, Qatar.
Metta G., Sandini G., Vernon D., Natale L. & Nori F. (2008). The iCub humanoid robot: an open platform for research in embodied cognition. In Proceedings of the 8th workshop on performance metrics for intelligent systems, pp. 50-56.
Milano N. & Nolfi S. (2020). Autonomous Learning of Features for Control: Experiments with Embodied and Situated Agents. arXiv preprint arXiv:2009.07132.
Nishimoto R., & Tani J. (2009). Development of hierarchical structures for actions and motor imagery: a constructivist view from synthetic neuro-robotics study. Psychological Research, 73, 545-558.
Gigliotta O., Pezzulo G. & Nolfi S. (2011). Evolution of a predictive internal model in an embodied and situated agent. Theory in Biosciences, 130 (4): 259-276.
Werbos P.J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550-1560.
Yamashita Y. & Tani J. (2008). Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment. PLoS Comput Biol, 4(11), e1000220.