Hierarchical Reinforcement Learning Using Automatic Task Decomposition And Exploration Shaping
Reinforcement learning agents situated in real-world environments must address a number of challenges in order to accomplish a wide range of tasks over their lifetime. Among these, such systems must be able to extract control knowledge from previously learned tasks and apply it to subsequent ones, so that the agent can accomplish a new task faster and learn an optimal policy more quickly. To address skill reuse and skill transfer, a number of approaches using hierarchical state and action spaces have recently been introduced; they build on the idea of transferring previously learned policies and representations to model and control the new task. However, while such transfer of skills can significantly improve learning times, it also poses the risk of "behavior proliferation", in which the growing set of available reusable actions makes it increasingly difficult to determine a strategy for a new task. To address this issue, it is important for the agent to be able to analyze new tasks and to predict the utility of an action or skill in a new context before learning a policy for the task. The former implies an ability to decompose the new task into known subtasks, while the latter implies the availability of an informed exploration policy that is used to find the new goal and to learn a corresponding policy more efficiently. This thesis presents a novel approach for learning task decomposition by learning to predict the utility of subgoals and subgoal types in the context of the new task, as well as for exploration shaping by predicting the likelihood that each available action is useful in the given task context. To achieve this, the approach uses past learning experiences to acquire a set of utility functions that encode relevant knowledge about useful subgoals and skills, and applies them to shape the search for the optimal policy for the new task.
Acceleration is achieved by focusing the search on contextually identifiable subgoals and on actions and skills that were learned to be valuable in the optimal policies of previously encountered worlds. Performance improves both in the time required to reach the task's goal for the first time and in the time required to learn an optimal policy. This is demonstrated on navigation and manipulation tasks in a grid world domain.
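
The abstract does not state how the predicted action utilities enter action selection. As a rough illustration only, the sketch below assumes one plausible form of exploration shaping: a softmax over the current Q-value estimates, weighted by the predicted likelihood that each action is useful in the present context. The function names, the `floor` parameter, and the multiplicative combination are all assumptions made for this example and are not taken from the thesis.

```python
import math
import random


def shaped_action_probs(q_values, predicted_utility, temperature=1.0, floor=0.01):
    """Return an exploration distribution over actions (hypothetical sketch).

    q_values: dict mapping action -> current Q estimate in the present state.
    predicted_utility: dict mapping action -> predicted likelihood in [0, 1]
        that the action is useful in this task context (assumed to come
        from utility functions learned on earlier tasks).
    floor: minimum prior weight, so no action is excluded entirely.
    """
    max_q = max(q_values.values())  # subtract the max for numerical stability
    weights = {}
    for action, q in q_values.items():
        prior = max(predicted_utility.get(action, floor), floor)
        weights[action] = prior * math.exp((q - max_q) / temperature)
    total = sum(weights.values())
    return {action: w / total for action, w in weights.items()}


def sample_action(probs, rng=random):
    """Draw one action according to the shaped distribution."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Usage: a new task with no Q-value information yet. Exploration is driven
# by the predicted utilities alone, so "north" is tried most often.
q = {"north": 0.0, "south": 0.0, "open_door": 0.0}
utility = {"north": 0.8, "south": 0.1, "open_door": 0.1}
probs = shaped_action_probs(q, utility)
action = sample_action(probs, random.Random(0))
```

Under this assumption, the utility predictions dominate early exploration when all Q-values are still equal, and their influence shrinks as the Q-value differences grow. The `floor` keeps every action reachable, so a poor utility prediction slows learning but cannot prevent the agent from eventually finding the goal.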