[Paper Notes, MTRL] MGDT: Multi-Game Decision Transformers

Author: DarkDawn
Published: August 4, 2022
Categories: Paper Notes

Google Research

[Figure: MGDT]

Abstract

A longstanding goal of the field of AI is a strategy for compiling diverse experience into a highly capable, generalist agent. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model – with a single set of weights – trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction.

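Motivation

Large-scale generalist models have already emerged in language and vision; similar progress may be usable to train generalist agents in reinforcement learning.
Reinforcement learning already has methods for single-task settings and for multi-task settings within a single environment, but few methods that work across multiple environments.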

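Main Contributions

A high-performing generalist agent that works across multiple environments is trained from offline data (both expert and non-expert data).
The scaling trends observed in language and vision also hold in reinforcement learning: performance improves as the model grows, and the model adapts quickly to new games through fine-tuning.
Several approaches to multi-task RL are compared, and MGDT combined with guided generation performs best.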

[Figure: MGDT training and evaluation]

Method

Problem definition: find the optimal policy distribution $P_\theta^*(a^t \mid o^{\le t}, a^{<t}, r^{<t})$ that maximizes the future return $R^t = \sum_{k>t} r^k$.
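
To make the return definition concrete, here is a minimal Python sketch that computes $R^t$ for every step of one episode. It follows the strict inequality $k>t$ exactly as written in these notes; the function name `future_returns` and the toy reward list are made up for illustration.

```python
from typing import List


def future_returns(rewards: List[float]) -> List[float]:
    """R^t = sum of r^k for k > t (strict inequality, as in the note above)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum already accumulated after it.
    for t in range(len(rewards) - 1, -1, -1):
        returns[t] = running
        running += rewards[t]
    return returns


rewards = [1.0, 0.0, -1.0, 1.0]
print(future_returns(rewards))  # [0.0, 0.0, 1.0, 0.0]
```

The backward pass keeps the computation linear in the episode length instead of re-summing the tail at every step.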

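Sequence modeling: $x=\left\langle\ldots, \mathbf{o}_{1}^{t}, \ldots, \mathbf{o}_{M}^{t}, \hat{R}^{t}, a^{t}, r^{t}, \ldots\right\rangle$
- $M$: number of patches per image observation
- $\hat{R}^t$: the agent's target return for the rest of the sequence
- Only the target return $\hat{R}$, the action $a$, and the reward $r$ are predicted, using a cross-entropy loss
- Observations are not predicted, because they are non-discrete and generating images would require additional model capacity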
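
The sequence layout and the "predict only $\hat{R}$, $a$, $r$" rule can be sketched as a loss mask over the token sequence. This is only an illustration under assumptions: `NUM_PATCHES = 36` is taken from the $6\times 6$ patch grid in the tokenization notes, and the helper names are hypothetical, not from the released code.

```python
import math
from typing import List

NUM_PATCHES = 36  # assumed M: 6 x 6 patches per frame, per the tokenization notes


def step_token_types(num_patches: int = NUM_PATCHES) -> List[str]:
    """Token types for one time step: M observation patches, then R-hat, a, r."""
    return ["obs"] * num_patches + ["return", "action", "reward"]


def loss_mask(num_steps: int, num_patches: int = NUM_PATCHES) -> List[int]:
    """1 where the token is a prediction target, 0 for observation tokens."""
    one_step = [0 if kind == "obs" else 1 for kind in step_token_types(num_patches)]
    return one_step * num_steps


def masked_cross_entropy(target_probs: List[float], mask: List[int]) -> float:
    """Mean negative log-likelihood over the masked-in (target) tokens only."""
    losses = [-math.log(p) for p, m in zip(target_probs, mask) if m]
    return sum(losses) / max(1, len(losses))


mask = loss_mask(num_steps=4)
print(len(mask), sum(mask))  # 156 12
```

Here `target_probs` stands for the probability the model assigns to the correct token at each position; observation positions are ignored by the mask, which matches the note that observations are inputs only.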

[Figure: MGDT framework]

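Tokenization
- Actions: already discrete in the environment itself
- Rewards: normalized rewards are converted to the ternary values $\{-1,0,+1\}$
- Returns: discretized per environment; the paper uses the range $\{-20,\dots,100\}$ with a step of $1$, which covers most returns in the dataset
- Observations: split into $6\times 6$ patches of $14\times 14$ pixels each, with trainable position encodings added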
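
The tokenization rules above can be sketched in plain Python. Several details are assumptions made for illustration: taking the sign of the reward to obtain the ternary value, rounding and clipping returns into the quoted range, and an $84\times 84$ frame (which is what $6\times 6$ patches of $14\times 14$ pixels imply). All function names are hypothetical.

```python
from typing import List

RETURN_MIN, RETURN_MAX = -20, 100  # return range quoted in the notes
PATCH = 14                         # patch side length in pixels
GRID = 6                           # patches per image side


def reward_token(reward: float) -> int:
    """Map a reward to {-1, 0, +1} by its sign (an assumed conversion rule)."""
    return (reward > 0) - (reward < 0)


def return_token(ret: float) -> int:
    """Round to the nearest integer and clip into [RETURN_MIN, RETURN_MAX]."""
    return max(RETURN_MIN, min(RETURN_MAX, int(round(ret))))


def image_patches(image: List[List[int]]) -> List[List[List[int]]]:
    """Split an 84x84 image (list of rows) into 36 row-major 14x14 patches."""
    side = PATCH * GRID
    assert len(image) == side and all(len(row) == side for row in image)
    patches = []
    for pr in range(GRID):
        for pc in range(GRID):
            patch = [row[pc * PATCH:(pc + 1) * PATCH]
                     for row in image[pr * PATCH:(pr + 1) * PATCH]]
            patches.append(patch)
    return patches


frame = [[r * 84 + c for c in range(84)] for r in range(84)]
patches = image_patches(frame)
print(len(patches), len(patches[0]), len(patches[0][0]))  # 36 14 14
print(reward_token(3.5), return_token(250.3), return_token(-37))  # 1 100 -20
```

The trainable position encodings mentioned in the notes would be added by the model to each of the 36 patch embeddings; they are not part of this preprocessing sketch.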
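
Training dataset
- Training set: 41 Atari games, 2 training runs per game, 50 policy checkpoints taken from each run, and 1M environment steps generated per checkpoint. That gives 4.1B time steps in total, or roughly 160B tokens
  - Sub-optimal behavior is more diverse, and it helps the model learn environment representations and the consequences of bad decisions
  - It is hard to define a criterion for what counts as an optimal policy
- Generalization and fine-tuning test set: 5 Atari games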
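
The dataset figures can be checked with a few lines of arithmetic. The 39 tokens per step (36 observation patches plus one return, one action, and one reward token) is inferred from the sequence layout in these notes rather than stated directly, so treat it as an assumption.

```python
games = 41
runs_per_game = 2
checkpoints_per_run = 50
steps_per_checkpoint = 1_000_000

total_steps = games * runs_per_game * checkpoints_per_run * steps_per_checkpoint
print(total_steps)  # 4100000000, i.e. 4.1B time steps

tokens_per_step = 36 + 3  # assumed: 36 patches + return + action + reward
total_tokens = total_steps * tokens_per_step
print(total_tokens)  # 159900000000, i.e. roughly 160B tokens
```

Under that assumption the product lands at 159.9B, which agrees with the "about 160B tokens" figure quoted above.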
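
Expert Action Inference (==I did not fully understand this part==)
- Applied only at deployment/test time: actions that directly fit the dataset are not necessarily optimal, because the dataset contains non-expert data
- Bayes' rule for the expert-level future return: $P\left(R^{t} \mid \text {expert}^{t}, \ldots\right) \propto P_{\theta}\left(R^{t} \mid \ldots\right) P\left(\text {expert}^{t} \mid R^{t}, \ldots\right)$
- Expert discriminator with inverse temperature $\kappa$: $P\left(\text {expert}^{t} \mid R^{t}, \ldots\right) \propto \exp(\kappa R^t)$ (the paper uses $\kappa = 10$)
- Autoregressive generation: choose the target return $R^t$ according to the log-probability $\log P_\theta(R^t \mid \dots) + \kappa R^t$, then sample the action from $P_\theta(a^t \mid R^t, \dots)$
- (My understanding) an exponential factor is manually applied on top of the $R^t$ distribution fitted from the dataset, and the result is used as the distribution of expert $R^t$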
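
The return-tilting step may be easier to follow as code. The sketch below takes the model's log-probabilities over a few candidate returns, adds $\kappa R$, and renormalizes; it is a toy illustration of the formula in these notes, not the released implementation. The candidate returns, the probabilities, and `kappa=1.0` are made-up values, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def expert_return_distribution(log_probs: Sequence[float],
                               returns: Sequence[float],
                               kappa: float) -> List[float]:
    """Normalize exp(log P(R) + kappa * R) over the candidate returns."""
    scores = [lp + kappa * r for lp, r in zip(log_probs, returns)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_target_return(log_probs: Sequence[float],
                         returns: Sequence[float],
                         kappa: float,
                         rng: random.Random) -> float:
    """Draw one target return from the tilted distribution."""
    probs = expert_return_distribution(log_probs, returns, kappa)
    return rng.choices(list(returns), weights=probs, k=1)[0]


returns = [0, 1, 2, 3]
model_probs = [0.4, 0.3, 0.2, 0.1]  # toy stand-in for P_theta(R | ...)
log_probs = [math.log(p) for p in model_probs]

tilted = expert_return_distribution(log_probs, returns, kappa=1.0)
print([round(p, 3) for p in tilted])
print(sample_target_return(log_probs, returns, 1.0, random.Random(0)))
```

With `kappa=0` the tilted distribution equals the model's own distribution, and larger values of `kappa` shift more mass toward high returns. The sampled return would then be fed back into the sequence so the action can be drawn from $P_\theta(a^t \mid R^t, \dots)$, which requires the trained model and is not shown here.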

[Figure: action selection at deployment]

Theoretical Analysis

None.

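Experiments

Baselines
- BC: transformer-based behavioral cloning (the sequence with reward and target-return tokens removed)
- C51 DQN: a DQN that treats the cumulative return as a distribution and minimizes the TD error with a classification loss
- CQL: offline conservative Q-learning
- CPC, BERT, and ACL: representation-learning baselines used for comparison; all state-representation networks are implemented as additional MLP or transformer layers on top of the Impala CNN used in the C51 and CQL baselines
Comparison of different online/offline methods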

[Figure: aggregate score over the 41 Atari games]

Scaling model size for different methods

[Figure: model parameters]

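Pre-training plus fine-tuning on new environments
- Pre-training: 50M steps for each of the 41 environments; fine-tuning: 500k steps on the new environment (1%)

[Figure: pre-training and fine-tuning]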

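Final performance exceeds the training data
- The top-3 best rollouts at test time are compared against the best score in the training set (0% means no improvement)

[Figure: exceeding the dataset]

Effect of expert action inference (compared with behavioral cloning)

[Figure: expert action inference]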

Dataset comparison
- Expert dataset: the top 10% of all trajectories
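
A minimal sketch of how such an expert subset could be selected is shown below. Ranking trajectories by their episode return is an assumption made here for illustration; the notes only say "top 10% of all trajectories". The function name and the toy returns are hypothetical.

```python
from typing import List, Sequence


def top_fraction_indices(trajectory_returns: Sequence[float],
                         fraction: float = 0.10) -> List[int]:
    """Indices of the highest-return trajectories, keeping at least one."""
    keep = max(1, int(len(trajectory_returns) * fraction))
    ranked = sorted(range(len(trajectory_returns)),
                    key=lambda i: trajectory_returns[i], reverse=True)
    return sorted(ranked[:keep])


toy_returns = list(range(20))  # 20 trajectories with returns 0..19
print(top_fraction_indices(toy_returns))  # [18, 19]
```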

[Figure: dataset comparison]

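Further Thoughts

The paper includes a comparison with GATO, noting that GATO's offline data is near-optimal and that it needs expert trajectories as guidance, whereas MGDT uses mixed data (both expert and non-expert). (For GATO, see http://darkdawn.top/index.php/archives/20/)
Current approaches to modeling RL problems with transformers mainly recast sequential decision-making as sequence prediction, so they can borrow the paradigm of large language and vision models. The main question then becomes how to convert an MDP into a token sequence, and this paper handles that in much the same way as GATO. The main open question going forward is still how to build a transformer-based sequential decision-making framework directly on the classical RL paradigm.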

Original paper: https://sites.google.com/view/multi-game-transformers

Reference: https://ai.googleblog.com/2022/07/training-generalist-agents-with-multi.html

Last modified: August 4, 2022


