Continuous Multi-Step TD, Eligibility Traces and TD(λ): A brief note


Thanks to Richard S. Sutton and Andrew G. Barto for their great work in Reinforcement Learning: An Introduction.

We focus on the episodic case only and deal with continuous state and action spaces. We assume you already have basic knowledge of TD(0) methods in reinforcement learning.

Stochastic-gradient and Semi-gradient Methods

Stochastic-gradient methods are among the most widely used of all function approximation methods and are particularly well suited to online reinforcement learning.

Let $\hat{v}(s, \boldsymbol{\theta})$ be the approximation of the true value function $v_\pi(s)$, with real-valued weight vector $\boldsymbol{\theta} = (\theta_1, \theta_2, \dots, \theta_n)^\top$, for all $s \in \mathcal{S}$. We assume that states appear in examples with the same distribution $d$, over which we are trying to minimize the Mean Squared Value Error,

$$\mathrm{MSVE}(\boldsymbol{\theta}) = \sum_{s \in \mathcal{S}} d(s)\,\big[v_\pi(s) - \hat{v}(s, \boldsymbol{\theta})\big]^2.$$

A straightforward strategy is to try to minimize the error on the observed examples. Stochastic-gradient methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \tfrac{1}{2}\alpha\,\nabla_{\boldsymbol{\theta}}\big[v_\pi(S_t) - \hat{v}(S_t, \boldsymbol{\theta}_t)\big]^2 = \boldsymbol{\theta}_t + \alpha\big[v_\pi(S_t) - \hat{v}(S_t, \boldsymbol{\theta}_t)\big]\nabla_{\boldsymbol{\theta}}\hat{v}(S_t, \boldsymbol{\theta}_t)$$

SGD methods are gradient descent methods because the overall step in $\boldsymbol{\theta}_t$ is proportional to the negative gradient of the example's squared error. This is the direction in which the error falls most rapidly. Meanwhile, stochastic means that the update is done on only a single example, which might have been selected stochastically.

Why do we take only a small step in the direction of the gradient? We do not seek a value function that has zero error for all states, but only an approximation that balances the errors across different states. In fact, the convergence results for SGD methods assume that $\alpha$ decreases over time; under that condition, the SGD method is guaranteed to converge to a local optimum.

However, we do not know the true value function $v_\pi(s)$ in practice. Hence it is necessary to substitute an estimate $U_t$ in place of $v_\pi(S_t)$. If $U_t$ is an unbiased estimate, that is, if $\mathbb{E}[U_t] = v_\pi(S_t)$ for each $t$, then $\boldsymbol{\theta}_t$ is guaranteed to converge to a local optimum. For example, the Monte Carlo target $U_t = G_t$ is by definition an unbiased estimate of $v_\pi(S_t)$.
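
To make the update concrete, here is a minimal sketch of gradient Monte Carlo prediction with a linear value estimate, where the Monte Carlo return $G_t$ plays the role of $U_t$. The `features` function and the episode format are assumptions made for the example, not something defined in the text above.

```python
import numpy as np

def gradient_mc_prediction(episodes, features, n_features, alpha=0.01, gamma=1.0):
    """Gradient MC prediction: theta += alpha * [G_t - v_hat(S_t)] * grad v_hat(S_t).

    episodes: list of trajectories, each a list of (state, reward) pairs where
              reward is R_{t+1}, the reward received on leaving that state.
    features: maps a state to a length-n_features vector; with a linear estimate
              v_hat(s, theta) = theta @ features(s), the gradient is features(s).
    """
    theta = np.zeros(n_features)
    for episode in episodes:
        G = 0.0
        # Iterate backwards so G accumulates the return from each state onward.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            x = features(state)                    # gradient of the linear estimate
            theta += alpha * (G - theta @ x) * x   # SGD step toward the unbiased target
    return theta
```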

If a bootstrapping estimate of $v_\pi(S_t)$ is used as the target $U_t$ (such as $U_t = R_{t+1} + \gamma\hat{v}(S_{t+1}, \boldsymbol{\theta}_t)$), one does not obtain the same guarantees, because the target itself depends on $\boldsymbol{\theta}_t$ and is therefore not independent of the weights being updated. Bootstrapping methods take into account the effect of changing the weight vector $\boldsymbol{\theta}_t$ on the estimate, but ignore its effect on the target. They include only a part of the gradient and, accordingly, we call them semi-gradient methods.

Although semi-gradient methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case.
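
For contrast with the Monte Carlo sketch above, here is a minimal sketch of semi-gradient TD(0) with the same linear setup (again, `features` and the transition format are assumptions for the example). The bootstrapped target depends on $\boldsymbol{\theta}$, but the gradient is taken only through $\hat{v}(S_t, \boldsymbol{\theta})$.

```python
import numpy as np

def semi_gradient_td0(transitions, features, n_features, alpha=0.01, gamma=1.0):
    """Semi-gradient TD(0): theta += alpha * [R + gamma*v_hat(S') - v_hat(S)] * grad v_hat(S).

    transitions: iterable of (state, reward, next_state, done) tuples.
    """
    theta = np.zeros(n_features)
    for state, reward, next_state, done in transitions:
        x = features(state)
        v_next = 0.0 if done else theta @ features(next_state)
        target = reward + gamma * v_next           # bootstrapped target, treated as a constant
        theta += alpha * (target - theta @ x) * x  # gradient only through v_hat(S, theta)
    return theta
```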

Episodic Semi-gradient Control

For control rather than prediction, the action-value function $\hat{q}(S_t, A_t, \boldsymbol{\theta})$ is more directly useful than the state-value function $\hat{v}(S_t, \boldsymbol{\theta})$. The general gradient-descent update for action-value prediction is

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\big[U_t - \hat{q}(S_t, A_t, \boldsymbol{\theta}_t)\big]\nabla\hat{q}(S_t, A_t, \boldsymbol{\theta}_t)$$

For example, the update for the one-step SARSA method is
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\big[R_{t+1} + \gamma\hat{q}(S_{t+1}, A_{t+1}, \boldsymbol{\theta}_t) - \hat{q}(S_t, A_t, \boldsymbol{\theta}_t)\big]\nabla\hat{q}(S_t, A_t, \boldsymbol{\theta}_t)$$

We call this method episodic semi-gradient one-step SARSA. For a constant policy, this method converges in the same way that TD(0) does, with the same kind of error bound: $\mathrm{MSVE}(\boldsymbol{\theta}_{TD}) \le \frac{1}{1-\gamma}\min_{\boldsymbol{\theta}}\mathrm{MSVE}(\boldsymbol{\theta})$.
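
A minimal sketch of episodic semi-gradient one-step SARSA with a linear action-value estimate and an ε-greedy policy follows. The environment interface (`reset`/`step`), `q_features`, and the discrete action set are assumptions made for the example.

```python
import numpy as np

def semi_gradient_sarsa(env, q_features, n_features, n_actions,
                        n_episodes=500, alpha=0.01, gamma=1.0, epsilon=0.1):
    """Episodic semi-gradient one-step SARSA with q_hat(s, a) = theta @ q_features(s, a).

    env is assumed to provide: state = env.reset() and
    next_state, reward, done = env.step(action).
    """
    theta = np.zeros(n_features)

    def q_hat(s, a):
        return theta @ q_features(s, a)

    def policy(s):
        # epsilon-greedy with respect to the current action-value estimate
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q_hat(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        state = env.reset()
        action = policy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            x = q_features(state, action)
            if done:
                target = reward                          # no bootstrap from a terminal state
            else:
                next_action = policy(next_state)
                target = reward + gamma * q_hat(next_state, next_action)
            theta += alpha * (target - theta @ x) * x    # semi-gradient SARSA update
            if not done:
                state, action = next_state, next_action
    return theta
```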

n-Step Semi-gradient SARSA

We can obtain an n-step version of episodic semi-gradient SARSA by using an n-step return as the update target.

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n\hat{q}(S_{t+n}, A_{t+n}, \boldsymbol{\theta}_{t+n-1}), \qquad n \ge 1,\; 0 \le t < T - n$$

The n-step update equation is

$$\boldsymbol{\theta}_{t+n} = \boldsymbol{\theta}_{t+n-1} + \alpha\big[G_t^{(n)} - \hat{q}(S_t, A_t, \boldsymbol{\theta}_{t+n-1})\big]\nabla\hat{q}(S_t, A_t, \boldsymbol{\theta}_{t+n-1})$$

Performance is typically best with an intermediate level of bootstrapping, corresponding to an $n$ larger than 1.
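
Because the update for time $t$ can only be made once the next $n$ rewards are available, an implementation stores the recent states, actions, and rewards. Here is a minimal sketch of computing $G_t^{(n)}$ from such stored quantities; the names and storage format are assumptions made for the example.

```python
def n_step_return(rewards, states, actions, t, n, T, gamma, q_hat, theta):
    """Compute G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n q_hat(S_{t+n}, A_{t+n}).

    rewards[k] holds R_{k+1}; states[k], actions[k] hold S_k, A_k; T is the terminal time.
    If t + n >= T, the return simply truncates at the end of the episode (no bootstrap term).
    """
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:
        G += gamma ** n * q_hat(states[t + n], actions[t + n], theta)
    return G
```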

The λ-Return

To begin with, we note that a valid update can be done not just toward any $n$-step return, but toward any average of $n$-step returns. The composite return possesses an error reduction property similar to that of individual $n$-step returns and thus can be used to construct backups with guaranteed convergence properties. A backup that averages simpler component backups is called a compound backup.

The λ-return is defined by

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

which can be understood as one particular way of averaging $n$-step backups. This average contains all the $n$-step backups, each weighted proportionally to $\lambda^{n-1}$, where $\lambda \in [0, 1]$, and normalized by a factor of $1-\lambda$ to ensure that the weights sum to 1. After a terminal state (if one exists) has been reached, all subsequent $n$-step returns are equal to $G_t$. If we want, we can separate these post-termination terms from the main sum, yielding

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$$
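
As a check on this truncated form, here is a small sketch that computes the λ-return for an episodic task from a list of already-computed $n$-step returns; the list layout is an assumption made for the example.

```python
def lambda_return(n_step_returns, lam):
    """Compute G_t^lambda = (1-lam) * sum_{n=1}^{T-t-1} lam^(n-1) G_t^(n) + lam^(T-t-1) G_t.

    n_step_returns[n-1] holds G_t^(n) for n = 1, ..., T-t; the last entry equals the
    full return G_t, since every n-step return past termination reduces to G_t.
    """
    T_minus_t = len(n_step_returns)
    G_lam = 0.0
    for n in range(1, T_minus_t):                          # n = 1, ..., T-t-1
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_returns[n - 1]
    G_lam += lam ** (T_minus_t - 1) * n_step_returns[-1]   # all post-termination weight
    return G_lam
```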

Now it is time to define our first learning algorithm based on the λ-return: the off-line λ-return algorithm. As an off-line algorithm, it makes no changes to the weight vector during the episode. Then, at the end of the episode, a whole sequence of off-line updates is made according to the usual semi-gradient rule:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\big[G_t^{\lambda} - \hat{v}(S_t, \boldsymbol{\theta}_t)\big]\nabla\hat{v}(S_t, \boldsymbol{\theta}_t), \qquad t = 0, 1, \dots, T-1$$

This approach is what we call the theoretical, or forward, view of a learning algorithm: for each state visited, we look forward in time to all the future rewards and decide how best to combine them.
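
A minimal sketch of this end-of-episode sweep, using the linear setup from the earlier sketches (so $\nabla\hat{v}$ is just the feature vector). `lambda_returns[t]` is assumed to already hold $G_t^\lambda$, e.g. as computed by the `lambda_return` helper above.

```python
import numpy as np

def offline_lambda_return_update(theta, episode_states, lambda_returns, features, alpha):
    """Off-line lambda-return algorithm: after the episode terminates, sweep over the
    visited states and apply theta += alpha * [G_t^lambda - v_hat(S_t)] * grad v_hat(S_t)."""
    theta = theta.copy()
    for state, G_lam in zip(episode_states, lambda_returns):
        x = features(state)                       # gradient of the linear value estimate
        theta += alpha * (G_lam - theta @ x) * x
    return theta
```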

Eligibility Traces

Eligibility traces are one of the basic mechanisms of reinforcement learning. Almost any temporal-difference method, such as Q-Learning or SARSA, can be combined with eligibility traces to obtain a more general method that may learn more efficiently. What eligibility traces offer beyond n-step methods is an elegant algorithmic mechanism with significant computational advantages. Only a single trace vector is required rather than a store of the last n feature vectors. Learning also occurs continually and uniformly in time rather than being delayed and then catching up at the end of the episode.

The mechanism is a short-term memory vector, the eligibility trace $\mathbf{e}_t \in \mathbb{R}^n$, that parallels the long-term memory weight vector $\boldsymbol{\theta}_t \in \mathbb{R}^n$. The rough idea is that when a component of $\boldsymbol{\theta}_t$ participates in producing an estimated value, the corresponding component of $\mathbf{e}_t$ is bumped up and then begins to fade away. Learning will then occur in that component of $\boldsymbol{\theta}_t$ if a nonzero TD error occurs before the trace falls back to zero. The trace-decay parameter $\lambda \in [0, 1]$ determines the rate at which the trace falls.

The idea behind eligibility traces is very simple. Each time a state is visited it initializes a short-term memory process, a trace, which then decays gradually over time. This trace marks the state as eligible for learning. If an unexpectedly good or bad event occurs while the trace is non-zero, then the state is assigned credit accordingly. There are two kinds of traces: the accumulating trace and the replacing trace. In an accumulating trace, the trace builds up each time the state is entered. In a replacing trace, on the other hand, each time the state is visited the trace is reset to 1 regardless of the presence of a prior trace. Typically, eligibility traces decay exponentially according to the product of a decay parameter $\lambda$, $0 \le \lambda \le 1$, and a discount-rate parameter $\gamma$, $0 \le \gamma \le 1$. The accumulating trace is defined by:

$$e_{t+1}(s) = \begin{cases} \gamma\lambda\,e_t(s), & s \ne s_t \\ \gamma\lambda\,e_t(s) + 1, & s = s_t \end{cases}$$

where $e_t(s)$ represents the trace for state $s$ at time $t$, and $s_t$ is the actual state at time $t$. The corresponding replacing trace is defined by:
$$e_{t+1}(s) = \begin{cases} \gamma\lambda\,e_t(s), & s \ne s_t \\ 1, & s = s_t \end{cases}$$

In the gradient-descent case, the (accumulating) eligibility trace is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by $\gamma\lambda$:

$$\mathbf{e}_0 = \mathbf{0}, \qquad \mathbf{e}_t = \nabla\hat{v}(S_t, \boldsymbol{\theta}_t) + \gamma\lambda\,\mathbf{e}_{t-1}$$

The eligibility trace keeps track of which components of the weight vector have contributed, positively or negatively, to recent state valuations, where "recent" is defined in terms of $\gamma\lambda$.
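
A tiny illustration of this build-up-and-fade behaviour for a linear value estimate with three features; the feature vectors and the value of $\gamma\lambda$ are made up for the example.

```python
import numpy as np

gamma, lam = 1.0, 0.8
e = np.zeros(3)
# One gradient (feature vector) per time step: feature 0 active, then feature 1,
# then feature 0 again.
for grad_v in [np.array([1.0, 0.0, 0.0]),
               np.array([0.0, 1.0, 0.0]),
               np.array([1.0, 0.0, 0.0])]:
    e = grad_v + gamma * lam * e      # e_t = grad v_hat(S_t, theta_t) + gamma*lam*e_{t-1}
    print(e)
# Traces: [1, 0, 0] -> [0.8, 1, 0] -> [1.64, 0.8, 0].
# Feature 0's trace decays after step 0 and is bumped up again when it reappears.
```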

TD(λ) Methods

TD(λ) improves over the off-line λ-return algorithm in three ways:

  1. It updates the weight vector on every step of an episode rather than only at the end.
  2. Its computations are equally distributed in time rather than all at the end of the episode.
  3. It can be applied to continuing problems rather than just episodic problems.

To begin with, the TD error for state-value prediction is:

$$\delta_t = R_{t+1} + \gamma\hat{v}(S_{t+1}, \boldsymbol{\theta}_t) - \hat{v}(S_t, \boldsymbol{\theta}_t)$$

In TD(λ), the weight vector is updated on each step proportional to the scalar TD error and the vector eligibility trace:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\,\delta_t\,\mathbf{e}_t$$

TD(λ) is oriented backward in time. At each moment we look at the current TD error and assign it backward to each prior state according to that state's contribution to the current eligibility trace. The complete update rule of TD(λ) with so-called accumulating traces is as follows (a code sketch follows the list):
  1. At the start of the episode, $\mathbf{e} = \mathbf{0}$; initialize $S$
  2. Choose $A \sim \pi(\cdot \mid S)$
  3. Take action $A$; observe $R$, $S'$
  4. $\mathbf{e} = \gamma\lambda\,\mathbf{e} + \nabla\hat{v}(S, \boldsymbol{\theta})$
  5. $\delta = R + \gamma\hat{v}(S', \boldsymbol{\theta}) - \hat{v}(S, \boldsymbol{\theta})$
  6. $\boldsymbol{\theta} = \boldsymbol{\theta} + \alpha\,\delta\,\mathbf{e}$
  7. $S = S'$; repeat from step 2 until $S$ is terminal
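
A minimal sketch of this loop with accumulating traces and a linear value estimate; the environment interface (`reset`/`step`) and `policy` are assumptions made for the example.

```python
import numpy as np

def td_lambda(env, policy, features, n_features, n_episodes=500,
              alpha=0.01, gamma=1.0, lam=0.9):
    """TD(lambda), backward view, with accumulating traces and v_hat(s) = theta @ features(s).

    env is assumed to provide: state = env.reset() and
    next_state, reward, done = env.step(action); policy(state) returns an action.
    """
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        e = np.zeros(n_features)                         # step 1: reset the trace
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                       # step 2: choose A ~ pi(.|S)
            next_state, reward, done = env.step(action)  # step 3: observe R, S'
            x = features(state)
            e = gamma * lam * e + x                      # step 4: accumulate the trace
            v_next = 0.0 if done else theta @ features(next_state)
            delta = reward + gamma * v_next - theta @ x  # step 5: TD error
            theta += alpha * delta * e                   # step 6: update the weights
            state = next_state                           # step 7: S <- S'
    return theta
```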

In another variant of the algorithm, the eligibility traces are updated according to:

$$\mathbf{e} = \max\!\big(\gamma\lambda\,\mathbf{e},\ \nabla\hat{v}(S, \boldsymbol{\theta})\big)$$

This is called the replacing traces update.
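
In code, this variant changes only step 4 of the loop above; a componentwise max is the natural reading of the update for a trace vector (a sketch under that assumption):

```python
import numpy as np

def replacing_trace(e, grad_v, gamma, lam):
    """Replacing-trace update: e = max(gamma*lam*e, grad v_hat(S, theta)), taken componentwise."""
    return np.maximum(gamma * lam * e, grad_v)
```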

TD(0) is the case in which only the one state preceding the current one is changed by the TD error. For larger values of λ, but still λ<1, more of the preceding states are changed but each more temporally distant state is changed less because the corresponding eligibility trace is smaller. We say that the earlier states are given less credit for the TD error.

TD(1) is a way of implementing Monte Carlo algorithms. On-line TD(1) learns in an n-step TD way from the incomplete ongoing episode, where the n steps are all the way up to the current step. If something unusually good or bad happens during an episode, control methods based on TD(1) can learn immediately and alter their behavior on that same episode.
