Bachelor's Thesis BCLR-0158

Doerr, Andreas: Policy Search for Imitation Learning.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering and Information Technology, Bachelor's Thesis No. 158 (2015).
51 pages, English.
CR classification: G.1.6 (Numerical Analysis Optimization)
I.2.6 (Artificial Intelligence Learning)
I.2.9 (Robotics)

Efficient motion planning and ways for non-experts to teach new motion primitives are key components of a new generation of robotic systems. To be applicable beyond the well-defined context of laboratories and the fixed settings of industrial factories, these machines have to be easily programmable, adapt to dynamic environments, and learn and acquire new skills autonomously. Reinforcement learning in principle solves these learning problems but suffers from the curse of dimensionality. When dealing with complex environments and highly agile hardware platforms such as humanoid robots in large or possibly continuous state and action spaces, the reinforcement learning framework becomes computationally infeasible. In recent publications, parametrized policies have been employed to address this problem. One such approach, Policy Improvement with Path Integrals (PI^2), has been derived by transforming the Hamilton-Jacobi-Bellman (HJB) equation of stochastic optimal control into a path integral using the Feynman-Kac theorem. Applications of PI^2 have so far been limited to using Dynamic Movement Primitives (DMP) to parametrize the motion policy. Another policy parametrization, the formulation of motion primitives as the solution of an optimization-based planner, has been widely used in other fields (e.g. inverse optimal control) and offers compelling possibilities to formulate characteristic parts of a motion in an abstract sense without specifying too much problem-specific geometry. Imitation learning, or learning from demonstration, can be seen as a way to bootstrap the acquisition of new behavior and as an efficient way to guide the policy search in a desired direction. Nevertheless, due to imperfect demonstrations, which might be incomplete or contradictory, and also due to noise, the learned behavior might be insufficient.
As observed in the animal kingdom, a final trial-and-error phase guided by the cost and reward of a specific behavior is necessary to obtain a successful behavior. Interestingly, the reinforcement learning framework might offer the tools to govern both learning methods at the same time. Imitation learning can be reformulated as reinforcement learning under a specific reward function, allowing the combination of both learning methods. In this work, the concept of probability-weighted averaging of policy roll-outs as seen in PI^2 is combined with an optimization-based policy representation. The reinforcement learning toolbox and direct policy search are utilized in a way that allows both imitation learning based on arbitrary demonstration types and the imposition of additional objectives on the learned behavior. A black-box evolutionary algorithm, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which can be shown to be closely related to the approach in PI^2, is leveraged to explore the parameter space. This work experimentally evaluates the suitability of this algorithm for learning motion behavior on a humanoid upper-body robotic system. We focus on learning from different types of demonstrations. The formulation of the reward function for reinforcement learning is described, and multiple test scenarios in 2D and 3D are presented. Finally, the capability of this approach to learn and improve motion primitives is demonstrated on a real robotic system within an obstacle test scenario.
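
The probability-weighted averaging of roll-outs mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it perturbs policy parameters directly, evaluates a toy cost, and averages the samples with weights that decay exponentially in the cost, in the spirit of PI^2 and related to the rank-based recombination in CMA-ES. All names (`rollout_weights`, `weighted_average_update`, `toy_cost`, `optimize`) and all constants (`h`, `sigma`, the decay factor, the target) are assumptions chosen for illustration. In the setting of the abstract, the cost would instead come from executing the optimization-based planner and scoring the resulting motion against demonstrations and additional objectives.

```python
import math
import random


def rollout_weights(costs, h=10.0):
    """Turn roll-out costs into probabilities; lower cost gets higher weight."""
    c_min, c_max = min(costs), max(costs)
    spread = c_max - c_min
    if spread == 0.0:
        # All roll-outs are equally good: fall back to a plain average.
        return [1.0 / len(costs)] * len(costs)
    # Exponentiate the negative cost, normalized to [0, 1] (assumed scheme).
    raw = [math.exp(-h * (c - c_min) / spread) for c in costs]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_average_update(theta, cost_fn, rng, n_rollouts=20, sigma=0.5, h=10.0):
    """One update: sample perturbed parameters, then average them by weight."""
    samples = [[t + rng.gauss(0.0, sigma) for t in theta]
               for _ in range(n_rollouts)]
    costs = [cost_fn(s) for s in samples]
    weights = rollout_weights(costs, h)
    return [sum(w * s[i] for w, s in zip(weights, samples))
            for i in range(len(theta))]


def toy_cost(theta):
    """Hypothetical stand-in for a roll-out cost: squared distance to a target."""
    target = [1.0, -2.0]
    return sum((t - g) ** 2 for t, g in zip(theta, target))


def optimize(theta0, cost_fn, iterations=50, seed=0):
    """Repeat the weighted-average update while shrinking the exploration noise."""
    rng = random.Random(seed)
    theta = list(theta0)
    sigma = 0.5
    for _ in range(iterations):
        theta = weighted_average_update(theta, cost_fn, rng, sigma=sigma)
        sigma *= 0.95  # assumed fixed decay; CMA-ES adapts the covariance instead
    return theta


if __name__ == "__main__":
    start = [0.0, 0.0]
    result = optimize(start, toy_cost)
    print("initial cost:", toy_cost(start))
    print("final cost:  ", toy_cost(result))
```

The fixed noise decay is the main simplification here. CMA-ES would adapt a full covariance matrix from the weighted samples, and PI^2 would compute weights per time step along each trajectory rather than once per roll-out.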

Full text and other links: PDF (3479572 bytes)
Department(s): University of Stuttgart, Institute of Parallel and Distributed Systems, Machine Learning and Robotics
Supervisor: Ratliff, Nathan
Entry date: February 16, 2015