[5-Minute Paper] Deterministic Policy Gradient Algorithms

2023-08-05

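Paper title: Deterministic Policy Gradient Algorithms

What problem does it solve?

Stochastic policy methods have to sample actions, so their gradient estimates have high variance and they are sample-inefficient. A deterministic policy is more sample-efficient than a stochastic one, but it cannot explore the environment on its own, so it has to be learned off-policy, with a separate behaviour policy providing the exploration.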

Background

Previously, the policy was a distribution over actions, $\pi_\theta(a \mid s)$. The authors instead propose a policy that outputs a single deterministic action, $a = \mu_\theta(s)$.

In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space.
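
To make the comparison concrete, the two gradients can be written out in the paper's notation. This is a reference sketch: $\rho^\pi$ and $\rho^\mu$ denote the discounted state distributions under each policy, and $Q^\pi$, $Q^\mu$ the corresponding action-value functions, symbols this post does not otherwise introduce.

$$
\nabla_\theta J(\pi_\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)\, da\, ds
= \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right]
$$

$$
\nabla_\theta J(\mu_\theta) = \int_{\mathcal{S}} \rho^\mu(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)}\, ds
= \mathbb{E}_{s \sim \rho^\mu}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \right]
$$

The first expression has an inner integral over actions; the second does not, which is why the deterministic estimate needs fewer samples.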

Stochastic Policy Gradient

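Earlier work uses off-policy stochastic policy methods, in which the behaviour policy differs from the target policy: $\beta(a \mid s) \neq \pi_\theta(a \mid s)$.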

Differentiating the performance objective and applying an approximation gives the off-policy policy-gradient (Degris et al., 2012b)
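
In the paper's notation, with $\rho^\beta$ the state distribution of the behaviour policy, the objective and its approximate gradient are roughly:

$$
J_\beta(\pi_\theta) = \int_{\mathcal{S}} \rho^\beta(s)\, V^\pi(s)\, ds
= \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^\beta(s)\, \pi_\theta(a \mid s)\, Q^\pi(s, a)\, da\, ds
$$

$$
\nabla_\theta J_\beta(\pi_\theta) \approx \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^\beta(s)\, \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)\, da\, ds
= \mathbb{E}_{s \sim \rho^\beta,\, a \sim \beta}\left[ \frac{\pi_\theta(a \mid s)}{\beta(a \mid s)}\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right]
$$

The approximation drops a term that depends on $\nabla_\theta Q^\pi(s, a)$, and the ratio $\pi_\theta / \beta$ is the importance-sampling weight that corrects for actions being drawn from $\beta$ rather than $\pi_\theta$.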

What method is used?

On-Policy Deterministic Actor-Critic

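If the environment itself is noisy enough to help the agent explore, this on-policy algorithm can still work. The critic is updated with Sarsa, and $Q^w(s, a)$ is used to approximate the true action-value function $Q^\mu(s, a)$.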
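
As a reference sketch in the paper's notation, the on-policy updates take the following form. The step sizes $\alpha_w$, $\alpha_\theta$ and the TD error $\delta_t$ are symbols not otherwise defined in this post.

$$
\delta_t = r_t + \gamma\, Q^w(s_{t+1}, a_{t+1}) - Q^w(s_t, a_t)
$$

$$
w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^w(s_t, a_t)
$$

$$
\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^w(s_t, a_t)\big|_{a = \mu_\theta(s)}
$$

Here $a_{t+1}$ is the action actually taken at the next step, which is what makes the critic update a Sarsa update.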

Off-Policy Deterministic Actor-Critic

We modify the performance objective to be the value function of the target policy, averaged over the state distribution of the behaviour policy.
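
Written out in the paper's notation (a reference sketch, with $\rho^\beta$ the state distribution of the behaviour policy), the modified objective and its approximate gradient are:

$$
J_\beta(\mu_\theta) = \int_{\mathcal{S}} \rho^\beta(s)\, V^\mu(s)\, ds
= \int_{\mathcal{S}} \rho^\beta(s)\, Q^\mu(s, \mu_\theta(s))\, ds
$$

$$
\nabla_\theta J_\beta(\mu_\theta) \approx \int_{\mathcal{S}} \rho^\beta(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)}\, ds
= \mathbb{E}_{s \sim \rho^\beta}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \right]
$$

Only the state distribution changes relative to the on-policy deterministic gradient; no expectation over actions appears.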

This gives the off-policy deterministic actor-critic (OPDAC) algorithm.
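
As a reference sketch in the paper's notation, the OPDAC updates differ from the on-policy ones only in the bootstrap target of the critic. The step sizes $\alpha_w$, $\alpha_\theta$ and the TD error $\delta_t$ are symbols not otherwise defined in this post.

$$
\delta_t = r_t + \gamma\, Q^w(s_{t+1}, \mu_\theta(s_{t+1})) - Q^w(s_t, a_t)
$$

$$
w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^w(s_t, a_t)
$$

$$
\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^w(s_t, a_t)\big|_{a = \mu_\theta(s)}
$$

The action $a_t$ is drawn from the behaviour policy, while the target uses the deterministic action $\mu_\theta(s_{t+1})$, so the critic update is a Q-learning update.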

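Unlike the stochastic off-policy algorithm, no importance sampling is needed here. The deterministic policy gradient has no integral over actions, so the actor needs no importance weights, and the Q-learning update for the critic avoids them as well.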
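
A minimal Python sketch of one OPDAC update may help make this concrete. Everything in it is an illustrative assumption rather than the paper's implementation: the function names (`actor`, `critic_features`, `opdac_step`, `run_toy`), the linear actor, the quadratic critic features (chosen only so that $Q^w$ is differentiable in the action, and not the compatible function approximator analysed in the paper), the step sizes, and the toy task are all hypothetical.

```python
import numpy as np

def actor(theta, s):
    """Deterministic linear policy mu_theta(s) = theta[0] + theta[1] * s (assumed form)."""
    return theta[0] + theta[1] * s

def critic_features(s, a):
    """Quadratic features psi(s, a), so Q^w(s, a) = w . psi(s, a) is differentiable in a."""
    return np.array([1.0, s, a, s * s, s * a, a * a])

def critic(w, s, a):
    """Linear-in-features critic Q^w(s, a)."""
    return float(w @ critic_features(s, a))

def critic_grad_a(w, s, a):
    """dQ^w/da for the quadratic features above."""
    return w[2] + w[4] * s + 2.0 * w[5] * a

def opdac_step(theta, w, s, a, r, s_next, gamma=0.9, alpha_theta=0.01, alpha_w=0.05):
    """One OPDAC update; `a` comes from the behaviour policy, no importance weights."""
    # Q-learning style target: bootstrap with the deterministic action mu_theta(s_next).
    a_next = actor(theta, s_next)
    delta = r + gamma * critic(w, s_next, a_next) - critic(w, s, a)
    # Critic update: for a linear critic, grad_w Q^w(s, a) is the feature vector.
    w_new = w + alpha_w * delta * critic_features(s, a)
    # Actor update: grad_theta mu_theta(s) = [1, s], times dQ/da evaluated at a = mu_theta(s).
    grad_mu = np.array([1.0, s])
    theta_new = theta + alpha_theta * grad_mu * critic_grad_a(w, s, actor(theta, s))
    return theta_new, w_new, delta

def run_toy(steps=2000, seed=0, noise_std=0.3):
    """Hypothetical toy task: s ~ U(-1, 1), reward -(a - s)^2, next state drawn independently."""
    rng = np.random.default_rng(seed)
    theta, w = np.zeros(2), np.zeros(6)
    s = rng.uniform(-1.0, 1.0)
    for _ in range(steps):
        # Behaviour policy beta: the deterministic action plus Gaussian exploration noise.
        a = actor(theta, s) + noise_std * rng.standard_normal()
        r = -(a - s) ** 2
        s_next = rng.uniform(-1.0, 1.0)
        theta, w, _ = opdac_step(theta, w, s, a, r, s_next)
        s = s_next
    return theta, w
```

Calling `theta, w = run_toy()` runs the loop on the toy task. Exploration comes only from the noise added in the behaviour policy, while both updates use the deterministic target action, which is where the importance weights disappear. Q-learning with linear function approximation is not guaranteed to converge off-policy, so this should be read as an illustration of the update rules, not as a tuned learner.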

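What results were achieved?

Publication and author information

This paper was published at ICML 2014. The first author, David Silver, is a research scientist at Google DeepMind. He did his undergraduate and graduate studies at the University of Cambridge and his PhD at the University of Alberta in Canada. He joined DeepMind in 2013 and is one of the creators of AlphaGo and the project's lead.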

Reference links

Reference: Degris, T., White, M., and Sutton, R. S. (2012b). Linear off-policy actor-critic. In 29th International Conference on Machine Learning.

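Further reading

Finally, the paper gives a theorem for using DPG with linear (compatible) function approximation, together with the supporting theoretical proofs.

Reference: Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063.

I plan to reread this paper when I have time; some of the proofs deserve a closer look.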
