[Paper Notes: MGRL] HER: Hindsight Experience Replay

DarkDawn

July 27, 2022

Paper notes

NeurIPS 2017

Abstract

Dealing with sparse rewards is one of the biggest challenges in Reinforcement Learning (RL). We present a novel technique called Hindsight Experience Replay which allows sample-efficient learning from rewards which are sparse and binary and therefore avoid the need for complicated reward engineering. It can be combined with an arbitrary off-policy RL algorithm and may be seen as a form of implicit curriculum.
We demonstrate our approach on the task of manipulating objects with a robotic arm. In particular, we run experiments on three different tasks: pushing, sliding, and pick-and-place, in each case using only binary rewards indicating whether or not the task is completed. Our ablation studies show that Hindsight Experience Replay is a crucial ingredient which makes training possible in these challenging environments. We show that our policies trained on a physics simulation can be deployed on a physical robot and successfully complete the task.

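In short: Hindsight Experience Replay (which the Chinese idiom 事后诸葛亮, "wise after the event", captures nicely) enables sample-efficient learning from sparse binary rewards without complicated reward engineering. It can be combined with any off-policy RL algorithm and can be viewed as an implicit curriculum.
The method is demonstrated on robotic-arm manipulation with three tasks (pushing, sliding, and pick-and-place), each using only a binary reward that indicates whether the task is completed. The ablation studies show that HER is the key ingredient that makes training possible in these environments, and policies trained in simulation can be deployed on a physical robot and complete the task.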

Figure: hindsight

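Motivation

Reward shaping requires domain expertise and can limit what the agent is able to learn.
Sparse rewards (binary 0/1 rewards) are hard to learn from.
Learning from failed trajectories: for example, when a basketball shot misses the hoop, one conclusion is that the shooting motion was wrong; another is that the same shot would have succeeded had the hoop been somewhere else.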

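Main contributions

Proposes the HER technique: when replaying an episode, consider goals other than the one the agent was originally trying to achieve.
The technique can be combined with any off-policy method.
HER improves sample efficiency and makes learning possible in sparse-reward environments.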

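Method

Multi-goal RL
Goal definition: every goal $g \in \mathcal{G}$ corresponds to a predicate $f_g: \mathcal{S} \to \{0,1\}$, and the goal is achieved in state $s$ when $f_g(s) = 1$.
Goal space identical to the state space: $\mathcal{S} = \mathcal{G}$, $f_g(s) = [s=g]$.
Goal that specifies only part of the state: for example, with $\mathcal{S} = \mathbb{R}^2$ (points in the plane) and $\mathcal{G}=\mathbb{R}$ (a target x coordinate), $f_g((x,y))=[x=g]$.
Assumption: given a state $s$, it is easy to find a goal $g$ that is satisfied in that state, i.e. there is a mapping $m:\mathcal{S} \to \mathcal{G}$ such that $\forall_{s \in \mathcal{S}}\ f_{m(s)}(s)=1$.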
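
To make the definitions concrete, here is a minimal Python sketch of the two example goal spaces and of the mapping $m$. The names `f_full`, `f_x`, and `m_x` are made up for illustration and do not come from the paper.

```python
from typing import Tuple

State = Tuple[float, float]  # a point (x, y) in the plane, i.e. S = R^2


def f_full(goal: State, state: State) -> int:
    """Goal space equals state space: f_g(s) = [s = g]."""
    return int(state == goal)


def f_x(goal: float, state: State) -> int:
    """Goal fixes only the x coordinate: f_g((x, y)) = [x = g]."""
    return int(state[0] == goal)


def m_x(state: State) -> float:
    """Mapping m: S -> G for x-coordinate goals.

    By construction f_x(m_x(s), s) == 1 for every state s.
    """
    return state[0]


s = (1.5, -2.0)
print(f_x(m_x(s), s))  # the state satisfies the goal it maps to
print(f_x(3.0, s))     # a different x target is not satisfied
```

Exact equality on floats is used only to mirror the indicator notation; a practical environment would more likely test whether the state is within some tolerance of the goal.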

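Algorithm

The implementation is very simple (this is the shortest algorithm section of any paper I have read).

After an episode has been collected, sample additional goals, relabel the transitions with those goals (recomputing the reward), and add the relabeled transitions to the replay buffer alongside the originals.
Strategies for choosing the additional goals (covered in the ablation experiments):
- Final: use the goal corresponding to the final state $s_T$ of the episode.
- Future: use k random states from the same episode that were observed after the transition being replayed.
- Episode: use k random states from anywhere in the same episode.
- Random: use k random states from all states encountered so far during training.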
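
The relabeling step and the four strategies can be sketched as follows. This is an illustrative sketch, not the authors' code: the `Transition` layout, the function names, the `reward_fn` and `m` arguments, and sampling goals with replacement are all assumptions made for the example.

```python
import random
from typing import Callable, List, NamedTuple, Optional, Sequence, Tuple

State = Tuple[float, ...]
Goal = Tuple[float, ...]


class Transition(NamedTuple):  # hypothetical transition layout
    state: State
    action: int
    next_state: State
    goal: Goal
    reward: float


def sample_goals(episode: Sequence[Transition], t: int, strategy: str, k: int,
                 m: Callable[[State], Goal], seen_states: Sequence[State],
                 rng: random.Random) -> List[Goal]:
    """Pick additional goals for the transition at index t."""
    if strategy == "final":
        return [m(episode[-1].next_state)]
    if strategy == "future":
        # States observed after transition t: s_{t+1}, ..., s_T.
        pool = [tr.next_state for tr in episode[t:]]
    elif strategy == "episode":
        pool = [tr.next_state for tr in episode]
    elif strategy == "random":
        pool = list(seen_states)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Assumption: goals are drawn with replacement.
    return [m(rng.choice(pool)) for _ in range(k)]


def her_relabel(episode: Sequence[Transition],
                reward_fn: Callable[[State, Goal], float],
                m: Callable[[State], Goal],
                strategy: str = "future", k: int = 4,
                seen_states: Sequence[State] = (),
                rng: Optional[random.Random] = None) -> List[Transition]:
    """Return the original transitions plus their hindsight-relabeled copies."""
    rng = rng or random.Random()
    out: List[Transition] = []
    for t, tr in enumerate(episode):
        out.append(tr)  # keep the transition with its original goal
        for g in sample_goals(episode, t, strategy, k, m, seen_states, rng):
            out.append(tr._replace(goal=g, reward=reward_fn(tr.next_state, g)))
    return out


def sparse_reward(next_state: State, goal: Goal) -> float:
    """Binary reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if next_state == goal else -1.0


# A failed 1-D episode: the agent wanted to reach x = 5 but stopped at x = 3.
episode = [
    Transition((0.0,), 1, (1.0,), (5.0,), -1.0),
    Transition((1.0,), 1, (2.0,), (5.0,), -1.0),
    Transition((2.0,), 1, (3.0,), (5.0,), -1.0),
]
buffer = her_relabel(episode, sparse_reward, m=lambda s: s, strategy="final")
for tr in buffer:
    print(tr.goal, tr.reward)
```

In the toy episode every original transition has reward -1, but after relabeling with the final state as the goal, the last transition becomes a success, which is exactly the signal an off-policy learner can use.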

Theoretical analysis

None.

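Experiments

Environments: custom environments built on MuJoCo (OpenAI has since packaged them as Gym environments): https://openai.com/blog/ingredients-for-robotics-research/
- Pushing: push a box to a target location.
- Sliding: hit a puck so that it slides to a target location that is outside the robot arm's reach.
- Pick-and-place: grasp a box and move it to a target location in the air.
Analysis
Does HER improve performance?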

Figure: training curves

Does HER improve performance even if there is only one goal we care about?
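- HER still helps when there is only a single goal, but compared with the figure above, training with multiple goals is faster.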

Figure: single-goal training curves

How does HER interact with reward shaping?
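- Reward function: $r(s,a,g) = \lambda|g-s_{\text{object}}|^p-|g-s'_{\text{object}}|^p$, where $s'$ is the state after taking action $a$ in state $s$, $\lambda \in \{0,1\}$, and $p\in\{1,2\}$.
- Why reward shaping performs poorly: 1) there is a large gap between what the shaped reward optimizes and the actual success condition; 2) shaped rewards penalize some behavior, which may hinder exploration.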
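
For reference, the shaped reward above can be written out as a small function. This is a sketch under the assumption that $|\cdot|$ denotes the Euclidean norm of the position difference; the function and argument names are made up for illustration.

```python
import math
from typing import Sequence


def dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two positions (assumed meaning of |.|)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shaped_reward(goal: Sequence[float], obj_pos: Sequence[float],
                  next_obj_pos: Sequence[float], lam: int, p: int) -> float:
    """r(s, a, g) = lam * |g - s_object|^p - |g - s'_object|^p."""
    return lam * dist(goal, obj_pos) ** p - dist(goal, next_obj_pos) ** p


goal, before, after = (1.0, 0.0), (0.0, 0.0), (0.5, 0.0)
# lam = 1: rewards the decrease in distance to the goal over one step.
print(shaped_reward(goal, before, after, lam=1, p=1))
# lam = 0: the negative (squared, for p = 2) distance after the step.
print(shaped_reward(goal, before, after, lam=0, p=2))
```

With $\lambda = 1$ the reward measures progress toward the goal over a single step, while with $\lambda = 0$ it is simply the negative distance after the step.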

Figure: training curves with reward shaping

How many goals should we replay each trajectory with and how to choose them?
- k is the number of additional goals selected each time a transition is relabeled.

Figure: ablation over strategies for choosing additional goals

Deployment on a physical robot

Figure: deployment on a physical robot

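Further thoughts

This paper comes from the folks at OpenAI. It has no rigorous mathematical derivation; the idea is simple, effective, and easy to understand. The question-driven way the experiments section is written is also worth learning from.
One significant remaining issue is that the definition of a goal is too simple. The paper says that its assumption about goals (a goal satisfied by a given state is easy to find) is not restrictive and usually holds, but the goal definition is still open to debate. It seems to suit only navigation-like tasks, where the goal is a concrete coordinate, and I cannot yet think of other tasks whose goals can be defined this simply. In complex environments such as Atari games, it is also hard to derive a corresponding goal directly from the current state.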

Paper: https://proceedings.neurips.cc/paper/2017/file/453fadbd8a1a3af50a9df4df899537b5-Paper.pdf

References: https://zhuanlan.zhihu.com/p/34309324 & https://zhuanlan.zhihu.com/p/501043736
