为你分享73篇论文解决深度强化学习的18个关键问题

简介: 这两天我阅读了两篇篇猛文 A Brief Survey of Deep Reinforcement Learning 和 Deep Reinforcement Learning: An Overview,作者排山倒海的引用了 200 多篇文献,阐述强化学习未来的方向。

这两天我阅读了两篇篇猛文 A Brief Survey of Deep Reinforcement Learning 和 Deep Reinforcement Learning: An Overview,作者排山倒海的引用了 200 多篇文献,阐述强化学习未来的方向。

论文:A Brief Survey of Deep Reinforcement Learning

作者:Nevertiree

链接:http://www.paperweekly.site/papers/922

论文:Deep Reinforcement Learning: An Overview

作者:Nevertiree

链接:http://www.paperw2eekly.site/papers/137

原文归纳出深度强化学习中的常见科学问题,并列出了目前解法与相关综述,我在这里做出整理,抽取了相关的论文。 这里精选 18 个关键问题,涵盖空间搜索、探索利用、策略评估、内存使用、网络设计、反馈激励等等话题。

本文精选了 73 篇论文(其中 2017 年的论文有 27 篇,2016 年的论文有 21 篇),为了方便阅读,原标题放在文章最后,可以根据索引找到。

问题一:预测与策略评估

prediction, policy evaluation

万变不离其宗,Temporal Difference 方法仍然是策略评估的核心哲学【Sutton 1988】。TD的拓展版本和她本身一样鼎鼎大名—1992 年的 Q-learning 与 2015 年的 DQN。

美中不足,TD Learning 中很容易出现 Over-Estimate(高估)问题,具体原因如下:

The max operator in standard Q-learning and DQN use the same values both to select and to evaluate an action. — van Hasselt

旷世猛将 van Hasselt 先生很喜欢处理 Over-Estimate 问题,他先搞出一个 Double Q-learning【van Hasselt 2010】大闹 NIPS,六年后搞出深度学习版本的 Double DQN【van Hasselt 2016a】。

问题二:控制与最佳策略选择

control, finding optimal policy

目前解法有三个流派,一图胜千言:

f5a04fdef801ba9117233fbcada0ef6f2c933aff

图1:台大李宏毅教授的 Slide

最传统的方法是 Value-Based,就是选择有最优 Value 的 Action。最经典方法有:Q-learning 【Watkins and Dayan 1992】、SARSA 【Sutton and Barto 2017】。

后来 Policy-Based 方法引起注意,最开始是 REINFORCE 算法【Williams 1992】,后来策略梯度 Policy Gradient【Sutton 2000】出现。

最时行的 Actor-Critic 【Barto et al 1983】把两者做了结合。楼上 Sutton 老爷子的好学生、AlphaGo 的总设计师 David Silver 同志提出了 Deterministic Policy Gradient,表面上是 PG,实际讲了一堆 AC,这个改进史称 DPG【Silver 2014】。

98a3bdb5f1ed9ad478a81f465a34cc1cf8ae5ab1

图2:Actor-Critic 的循环促进过程

问题三:不稳定与不收敛问题

Instability and Divergence when combining off-policy,function approximation,bootstrapping

早在 1997 年 Tsitsiklis 就证明了如果 Function Approximator 采用了神经网络这种非线性的黑箱,那么其收敛性和稳定性是无法保证的。

分水岭论文 Deep Q-learning Network【Mnih et al 2013】中提到:虽然我们的结果看上去很好,但是没有任何理论依据(原文很狡猾的反过来说一遍)。

This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in stable manner.

484567f3ef1511a8f9fec9c51238f068d5731551

图3:征服 Atari 游戏的 DQN


DQN 的改良主要依靠两个 Trick:

  • 经验回放【Lin 1993】

虽然做不到完美的独立同分布,但还是要尽力减少数据之间的关联性 。

  • Target Network【Mnih 2015】

Estimated Network 和 Target Network 不能同时更新参数,应该另设 Target Network 以保证稳定性。

Since the network Q being updated is also used in calculating the target value, the Q update is prone to divergence.(为什么我们要用 Target Network)

下面几篇论文都是 DQN 相关话题的:

  • 经验回放升级版:Prioritized Experience Replay 【Schaul 2016】
  • 更好探索策略 【Osband 2016】
  • DQN 加速 【He 2017a】
  • 通过平均减少方差与不稳定性 Averaged-DQN 【Anschel 2017】

下面跳出 DQN 的范畴:

  • Duel DQN【Wang 2016c】(ICML 2016 最佳论文)
  • Tips:阅读此文请掌握 DQN、Double DQN、Prioritized Experience Replay 这三个背景。
  • 异步算法 A3C 【Mnih 2016】
  • TRPO (Trust Region Policy Optimization)【Schulman 2015】
  • Distributed Proximal Policy Optimization 【Heess 2017】
  • Policy gradient 与 Q-learning 的结合【O'Donoghue 2017、Nachum 2017、 Gu 2017、Schulman 2017】
  • GTD 【Sutton 2009a、Sutton 2009b、Mahmood 2014】
  • Emphatic-TD 【Sutton 2016】

问题四:End-to-End 下的训练感知与控制

train perception and control jointly end-to-end

现有解法:

  • Guided Policy Search 【Levine et al 2016a】

问题五:数据利用效率

data/sample efficiency

现有解法有:

  • Q-learning 与 Actor-Critic
  • 经验回放下的actor-critic 【Wang et al 2017b】
  • PGQ,policy gradient and Q-learning 【O'Donoghue et al 2017】
  • Q-Prop, policy gradient with off-policy critic 【Gu et al 2017】
  • return-based off-policy control, Retrace 【Munos et al 2016】, Reactor 【Gruslyset al 2017】
  • learning to learn, 【Duan et al 2017、Wang et al 2016a、Lake et al 2015】

问题六:无法取得激励

reward function not available

现有解法基本上围绕模仿学习:

  • 吴恩达的逆强化学习【Ng and Russell 2000】
  • learn from demonstration 【Hester et al 2017】
  • imitation learning with GANs 【Ho and Ermon 2016、Stadie et al 2017】 (附TensorFlow 实现 [1])
  • train dialogue policy jointly with reward model 【Su et al 2016b】

问题七:探索-利用问题

exploration-exploitation tradeoff

现有解法有:

  • unify count-based exploration and intrinsic motivation 【Bellemare et al 2017】
  • under-appreciated reward exploration 【Nachum et al 2017】
  • deep exploration via bootstrapped DQN 【Osband et al 2016)】
  • variational information maximizing exploration 【Houthooft et al 2016】

问题八:基于模型的学习

model-based learning

现有解法:

  • Sutton 老爷子教科书里的经典安利:Dyna-Q 【Sutton 1990】
  • model-free 与 model-based 的结合使用【Chebotar et al 2017】

问题九:无模型规划

model-free planning

比较新的解法有两个:

  • Value Iteration Networks【Tamar et al 2016】是勇夺 NIPS2016 最佳论文头衔的猛文。

知乎上有专门的文章解说:Value iteration Network [2],还有作者的采访:NIPS 2016 最佳论文作者:如何打造新型强化学习观 [3]。VIN 的 TensorFlow 实现 [4]。

d42a31b6fb0b522607f36f3ec1894d2e756e8eea

图4:Value Iteration Network 的框架

  • DeepMind 的 Silver 大神发表的 Predictron 方法 【Silver et al 2016b】,附 TensorFlow 实现 [5]。

问题十:它山之石可以攻玉

focus on salient parts

@贾扬清 大神曾经说过:伯克利人工智能方向的博士生,入学一年以后资格考试要考这几个内容:强化学习和 Robotics、 统计和概率图模型、 计算机视觉和图像处理、 语音和自然语言处理、 核方法及其理论、 搜索,CSP,逻辑,Planning 等。

如果真的想做人工智能,建议都了解一下,不是说都要搞懂搞透,但是至少要达到开会的时候和人在 poster 前面谈笑风生不出错的程度吧。

因此,一个很好的思路是从计算机视觉与自然语言处理领域汲取灵感,例如下文中将会提到的 unsupervised auxiliary learning 方法借鉴了 RNN+LSTM 中的大量操作。

下面是 CV 和 NLP 方面的几个简介:物体检测 【Mnih 2014】、机器翻译 【Bahdanau 2015】、图像标注【Xu 2015】、用 Attention 代替 CNN 和 RNN【Vaswani 2017】等等。

问题十一:长时间数据储存

data storage over long time, separating from computation

最出名的解法是:

  • 在 Nature 上大秀一把的 Differentiable Neural Computer【Graves et al 2016】

问题十二:无回报训练

benefit from non-reward training signals in environments

现有解法围绕着无监督学习开展:

  • Horde 【Sutton et al 2011】
  • 极其优秀的工作:
  • unsupervised reinforcement and auxiliary learning 【Jaderberg et al 2017】
  • learn to navigate with unsupervised auxiliary learning 【Mirowski et al 2017】
  • 大名鼎鼎的 GANs 【Goodfellow et al 2014】

问题十三:跨领域学习

learn knowledge from different domains

现有解法全部围绕迁移学习走:

  • 【Taylor and Stone, 2009、Pan and Yang 2010、Weiss et al 2016】,learn invariant features to transfer skills 【Gupta et al 2017】

问题十四:有标签数据与无标签数据混合学习

benefit from both labelled and unlabelled data

现有解法全部围绕半监督学习:

  • 【Zhu and Goldberg 2009】
  • learn with MDPs both with and without reward functions 【Finn et al 2017)】
  • learn with expert's trajectories and those may not from experts 【Audiffren et al 2015】

问题十五:多层抽象差分空间的表示与推断

learn, plan, and represent knowledge with spatio-temporal abstraction at multiple levels

现有解法:

  • 多层强化学习 【Barto and Mahadevan 2003】
  • strategic attentive writer to learn macro-actions 【Vezhnevets et al 2016】
  • integrate temporal abstraction with intrinsic motivation 【Kulkarni et al 2016】
  • stochastic neural networks for hierarchical RL 【Florensa et al 2017】
  • lifelong learning with hierarchical RL 【Tessler et al 2017】

问题十六:不同任务环境快速适应

adapt rapidly to new tasks

现有解法基本上是 learn to learn learn:

  • a flexible RNN model to handle a family of RL tasks 【Duan et al 2017、Wang et al 2016a】
  • one/few/zero-shot learning 【Duan et al 2017、Johnson et al 2016、 Kaiser et al 2017b、Koch et al 2015、Lake et al 2015、Li and Malik 2017、Ravi and Larochelle, 2017、Vinyals et al 2016】

问题十七:巨型搜索空间

gigantic search space

现有解法:

  • 依然是蒙特卡洛搜索,详情可以参考初代 AlphaGo 的实现【Silver et al 2016a】

问题十八:神经网络架构设计

neural networks architecture design

现有的网络架构搜索方法:

  • 【Baker et al 2017、Zoph and Le 2017】,其中 Zoph 的工作分量非常重。

新的架构:

  • 有【Kaiser et al 2017a、Silver et al 2016b、Tamar et al 2016、Vaswani et al 2017、Wang et al 2016c】。

相关链接

[1] imitation learning with GANs 实现

https://github.com/openai/imitation

[2] Value iteration Network

https://zhuanlan.zhihu.com/p/24478944

[3] 如何打造新型强化学习观

http://www.sohu.com/a/121100017_465975

[4] Value Iteration Networks 实现

https://github.com/TheAbhiKumar/tensorflow-value-iteration-networks

[5] Predictron 实现

https://github.com/zhongwen/predictron

参考文献

[1] Anschel, O., Baram, N., and Shimkin, N. (2017). Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In the International Conference on Machine Learning (ICML).

[2] Audiffren, J., Valko, M., Lazaric, A., and Ghavamzadeh, M. (2015). Maximum entropy semisupervised inverse reinforcement learning. In the International Joint Conference on Artificial Intelligence (IJCAI).

[3] Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. (2017). An actor-critic algorithm for sequence prediction. In the International Conference on Learning Representations (ICLR).

[4] Baker, B., Gupta, O., Naik, N., and Raskar, R. (2017). Designing neural network architectures using reinforcement learning. In the International Conference on Learning Representations (ICLR).

[5] Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379.

[6] Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835–846

[7] Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S.,Lakshminarayanan, B., Hoyer, S., and Munos, R. (2017). The Cramer Distance as a Solution to Biased Wasserstein Gradients. ArXiv e-prints.

[8] Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., and Levine, S. (2017). Combining model-based and model-free updates for trajectory-centric reinforcement learning. In the International Conference on Machine Learning (ICML)

[9] Duan, Y., Andrychowicz, M., Stadie, B. C., Ho, J., Schneider, J.,Sutskever, I., Abbeel, P., and Zaremba, W. (2017). One-Shot Imitation Learning. ArXiv e-prints.

[10] Finn, C., Christiano, P., Abbeel, P., and Levine, S. (2016a). A connection between GANs, inverse reinforcement learning, and energy-based models. In NIPS 2016 Workshop on Adversarial Training.

[11] Florensa, C., Duan, Y., and Abbeel, P. (2017). Stochastic neural networks for hierarchical reinforcement learning. In the International Conference on Learning Representations (ICLR)

[12] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., , and Bengio, Y. (2014). Generative adversarial nets. In the Annual Conference on Neural Information Processing Systems (NIPS), page 2672?2680.

[13] Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwinska, A., Col- ´ menarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., nech Badia, A. P., Hermann, K. M., Zwols, Y., Ostrovski, G., Cain, A., King, H., Summerfield, C., Blunsom, P., Kavukcuoglu, K., and Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476

[14] Gruslys, A., Gheshlaghi Azar, M., Bellemare, M. G., and Munos, R. (2017). The Reactor: A Sample-Efficient Actor-Critic Architecture. ArXiv e-prints

[15] Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. (2017). Q-Prop: Sampleefficient policy gradient with an off-policy critic. In the International Conference on Learning Representations (ICLR).

[16] Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. (2017). Learning invariant feature spaces to transfer skills with reinforcement learning. In the International Conference on Learning Representations (ICLR).

[17] He, F. S., Liu, Y., Schwing, A. G., and Peng, J. (2017a). Learning to play in a day: Faster deep reinforcement learning by optimality tightening. In the International Conference on Learning Representations (ICLR)

[18] Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, A., Riedmiller, M., and Silver, D. (2017). Emergence of Locomotion Behaviours in Rich Environments. ArXiv e-prints

[19] Hester, T. and Stone, P. (2017). Intrinsically motivated model learning for developing curious robots. Artificial Intelligence, 247:170–86.

[20] Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

[21] Houthooft, R., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. (2016). Vime: Variational information maximizing exploration. In the Annual Conference on Neural Information Processing Systems (NIPS).

[22] Jaderberg, M., Mnih, V., Czarnecki, W., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. In the International Conference on Learning Representations (ICLR).

[23] Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viegas, F., Watten- ´berg, M., Corrado, G., Hughes, M., and Dean, J. (2016). Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. ArXive-prints.

[24] Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., and Uszkoreit, J. (2017a). One Model To Learn Them All. ArXiv e-prints.

[25] Kaiser, Ł., Nachum, O., Roy, A., and Bengio, S. (2017b). Learning to Remember Rare Events. In the International Conference on Learning Representations (ICLR).

[26] Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In the International Conference on Machine Learning (ICML).

[27] Kulkarni, T. D., Narasimhan, K. R., Saeedi, A., and Tenenbaum, J. B. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In the Annual Conference on Neural Information Processing Systems (NIPS)

[28] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

[29] Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016a). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17:1–40.

[30] Li, K. and Malik, J. (2017). Learning to optimize. In the International Conference on Learning Representations (ICLR).

[31] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., & Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. Computer Science, 8(6), A187.

[32] Lin, L. J. (1993). Reinforcement learning for robots using neural networks.

[33] Mahmood, A. R., van Hasselt, H., and Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).

[34] Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., Kumaran, D., and Hadsell, R. (2017). Learning to navigate in complex environments. In the International Conference on Learning Representations (ICLR).

[35] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wier- stra, Daan, and Riedmiller, Martin. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[36] Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. In the Annual Conference on Neural Information Processing Systems (NIPS).

[37] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

[38] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In the International Conference on Machine Learning (ICML)

[39] Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G.(2016). Safe and efficient offpolicy reinforcement learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

[40] Nachum, O., Norouzi, M., and Schuurmans, D. (2017). Improving policy gradient by exploring under-appreciated rewards. In the International Conference on Learning Representations (ICLR).

[41] Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. ArXive-prints.

[42] Ng, A. and Russell, S. (2000).Algorithms for inverse reinforcement learning. In the International Conference on Machine Learning (ICML).

[43] O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2017). PGQ: Combining policy gradient and q-learning. In the International Conference on Learning Representations (ICLR).

[44] Osband, I., Blundell, C., Pritzel, A., and Roy, B. V. (2016). Deep exploration via bootstrapped DQN. In the Annual Conference on Neural Information Processing Systems (NIPS).

[45] Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345 – 1359.

[46] Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In the International Conference on Learning Representations (ICLR).

[47] Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In the International Conference on Learning Representations (ICLR).

[48] Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). Trust region policy optimization. In the International Conference on Machine Learning (ICML).

[49] Schulman, J., Abbeel, P., and Chen, X. (2017). Equivalence Between Policy Gradients and Soft Q-Learning. ArXiv e-prints.

[50] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. International Conference on International Conference on Machine Learning (pp.387-395). JMLR.org.

[51] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016a). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489.

[52] Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. (2016b). The predictron: End-to-end learning and planning. In NIPS 2016 Deep Reinforcement Learning Workshop.

[53] Stadie, B. C., Abbeel, P., and Sutskever, I. (2017).Third person imitation learning. In the International Conference on Learning Representations (ICLR).

[54] Sutton, R. S. and Barto, A. G. (2017). Reinforcement Learning: An Introduction (2nd Edition, in preparation). MIT Press.

[55] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In the Annual Conference on Neural Information Processing Systems

(NIPS).

[56] Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., and Wiewiora, ´E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In the International Conference on Machine Learning (ICML).

[57] Sutton, R. S., Szepesvari, C., and Maei, H. R. (2009b). A convergent O( ´ n) algorithm for off-policy temporal-difference learning with linear function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).

[58] Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction, , proc. of 10th. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS).

[59] Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17:1–29

[60] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning,3(1):9–44.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In the International Conference on Machine Learning (ICML).

[61] Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In the Annual Conference on Neural Information Processing Systems (NIPS).

[62] Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685.

[63] Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., and Mannor, S. (2017). A deep hierarchical approach to lifelong learning in minecraft. In the AAAI Conference on Artificial Intelligence (AAAI).

[64] van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems 23:, Conference on Neural Information Processing Systems 2010.

[65] van Hasselt, H., Guez, A., , and Silver, D. (2016a). Deep reinforcement learning with double Qlearning. In the AAAI Conference on Artificial Intelligence (AAAI).

[66] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. ArXiv e-prints.

[67] Vezhnevets, A. S., Mnih, V., Agapiou, J., Osindero, S., Graves, A., Vinyals, O., and Kavukcuoglu, K. (2016). Strategic attentive writer for learning macro-actions. In the Annual Conference on Neural Information Processing Systems (NIPS).

[68] Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

[69] Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2016a). Learning to reinforcement learn. arXiv:1611.05763v1.

[70] Wang, S. I., Liang, P., and Manning, C. D. (2016b). Learning language games through interaction. In the Association for Computational Linguistics annual meeting (ACL)

[71] Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016c). Dueling network architectures for deep reinforcement learning. In the International Conference on Machine Learning (ICML).

[72] Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292

[73] Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(9)

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.

[74] Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A.,Salakhutdinov, R., Zemel, R. S., and Bengio,Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In the International Conference on Machine Learning (ICML).

[75] Zhu, X. and Goldberg, A. B. (2009). Introduction to semi-supervised learning. Morgan & Claypool

Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. In the International Conference on Learning Representations (ICLR)


原文发布时间为:2017-12-26
本文来自云栖社区合作伙伴“ 数据派THU”,了解相关信息可以关注“ 数据派THU”微信公众号
相关文章
|
7月前
|
人工智能 调度 vr&ar
探索生成模型的新篇章:扩散模型的理论与实践
【4月更文挑战第11天】扩散模型作为新兴的生成工具,基于变分自编码器(VAE)和去噪扩散概率模型(DDPM),通过逐步添加噪声生成样本,广泛应用于图像和视频生成,展示出在逆问题解决上的潜力。尽管训练复杂且计算需求高,研究者正通过新理论框架和SDE方法优化模型,以应对挑战并提升性能。
73 1
探索生成模型的新篇章:扩散模型的理论与实践
|
7月前
|
机器学习/深度学习 人工智能 自然语言处理
深度理解深度学习:从理论到实践的探索
【5月更文挑战第3天】 在人工智能的浪潮中,深度学习以其卓越的性能和广泛的应用成为了研究的热点。本文将深入探讨深度学习的核心理论,解析其背后的数学原理,并通过实际案例分析如何将这些理论应用于解决现实世界的问题。我们将从神经网络的基础结构出发,逐步过渡到复杂的模型架构,同时讨论优化算法和正则化技巧。通过本文,读者将对深度学习有一个全面而深刻的认识,并能够在实践中更加得心应手地应用这些技术。
|
7月前
|
人工智能 算法 测试技术
论文介绍:进化算法优化模型融合策略
【5月更文挑战第3天】《进化算法优化模型融合策略》论文提出使用进化算法自动化创建和优化大型语言模型,通过模型融合提升性能并减少资源消耗。实验显示,这种方法在多种基准测试中取得先进性能,尤其在无特定任务训练情况下仍能超越参数更多模型。同时,该技术成功应用于创建具有文化意识的日语视觉-语言模型。然而,模型融合可能产生逻辑不连贯响应和准确性问题,未来工作将聚焦于图像扩散模型、自动源模型选择及生成自我改进的模型群体。[论文链接: https://arxiv.org/pdf/2403.13187.pdf]
176 1
|
机器学习/深度学习 传感器 算法
【深度增强学习 初步学习总结】
【深度增强学习 初步学习总结】
|
机器学习/深度学习 算法 PyTorch
深度强化学习技术要求
深度强化学习技术要求
81 0
|
机器学习/深度学习 传感器 算法
深度强化学习
深度强化学习
183 0
|
机器学习/深度学习 自然语言处理 搜索推荐
「深度学习注意力机制 」TKDE 2022研究综述
「深度学习注意力机制 」TKDE 2022研究综述
203 0
|
机器学习/深度学习 算法 测试技术
【吴恩达深度学习笔记-改善深层神经网络】——第一周深度学习的实用层面(1.1)
【吴恩达深度学习笔记-改善深层神经网络】——第一周深度学习的实用层面(1.1)
【吴恩达深度学习笔记-改善深层神经网络】——第一周深度学习的实用层面(1.1)
|
机器学习/深度学习 算法
【吴恩达深度学习笔记-改善深层神经网络】——第一周深度学习的实用层面(1.2)
【吴恩达深度学习笔记-改善深层神经网络】——第一周深度学习的实用层面(1.2)
【吴恩达深度学习笔记-改善深层神经网络】——第一周深度学习的实用层面(1.2)
|
机器学习/深度学习 算法 网络架构
【吴恩达深度学习笔记-改善深层神经网络】——第一周深度学习的实用层面(1.3)
【吴恩达深度学习笔记-改善深层神经网络】——第一周深度学习的实用层面(1.3)
【吴恩达深度学习笔记-改善深层神经网络】——第一周深度学习的实用层面(1.3)