Sarsa(λ) and Q(λ) in Tabular Case


Thanks to R. S. Sutton and A. G. Barto for their great work in Reinforcement Learning: An Introduction.

Eligibility Traces in Prediction Problems

In the backward view of TD(λ), there is a memory variable associated with each state, its eligibility trace. The eligibility trace for state s at time t is a random variable denoted $Z_t(s) \in \mathbb{R}^+$. On each step, the eligibility traces for all states decay by γλ, and the eligibility trace for the one state visited on the step is incremented by 1:

$$
Z_t(s) = \begin{cases} \gamma\lambda Z_{t-1}(s) & \text{if } s \neq S_t \\ \gamma\lambda Z_{t-1}(s) + 1 & \text{if } s = S_t \end{cases}
$$

for all nonterminal states s, where γ is the discount rate and λ is the λ-return¹ parameter or trace-decay parameter. This kind of eligibility trace is called an accumulating trace. The global TD error signal triggers proportional updates to all recently visited states, as signaled by their nonzero traces:
$$
\Delta V_t(s) = \alpha \delta_t Z_t(s), \quad \text{for all } s \in \mathcal{S}
$$

where
$$
\delta_t = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)
$$

A complete prediction algorithm for on-line TD(λ) is given as follows (a Python sketch follows the list):
  1. Initialize V(s) arbitrarily.
  2. Repeat for each episode:
    1. Initialize S, and set Z(s) = 0 for all s ∈ 𝒮.
    2. Repeat for each step of episode:
      1. A ← action given by π for S.
      2. Take action A; observe reward R and the next state S′.
      3. δ ← R + γV(S′) − V(S)
      4. Z(S) ← Z(S) + 1
      5. For all s ∈ 𝒮:
        1. V(s) ← V(s) + αδZ(s)
        2. Z(s) ← γλZ(s)
      6. S ← S′
    3. Until S is terminal.
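
The loop above maps directly onto code. Below is a minimal Python sketch, assuming a small tabular environment object exposing `num_states`, `reset()`, and `step(action)` (returning `next_state, reward, done`) and a `policy` function mapping states to actions; these names are illustrative, not from the text.

```python
import numpy as np

def td_lambda(env, policy, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    """On-line TD(lambda) prediction with accumulating traces."""
    V = np.zeros(env.num_states)               # initialize V(s) arbitrarily (zeros here)
    for _ in range(num_episodes):
        Z = np.zeros(env.num_states)            # Z(s) = 0 for all s
        s = env.reset()                         # initialize S
        done = False
        while not done:
            a = policy(s)                       # A <- action given by pi for S
            s_next, r, done = env.step(a)       # take A, observe R and S'
            v_next = 0.0 if done else V[s_next] # terminal state has value 0
            delta = r + gamma * v_next - V[s]   # TD error
            Z[s] += 1.0                         # accumulating trace for the visited state
            V += alpha * delta * Z              # update all states in proportion to traces
            Z *= gamma * lam                    # decay all traces
            s = s_next
    return V
```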

Sarsa(λ)

The main idea of control is simply to learn action values rather than state values. The idea in Sarsa(λ) is to apply the TD(λ) prediction method to state-action pairs. Let $Z_t(s,a)$ denote the trace for state-action pair $(s,a)$. Substituting state-action variables for state variables gives

$$
Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t Z_t(s,a), \quad \text{for all } s, a
$$

where
$$
\delta_t = R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_t(S_t, A_t)
$$

and
$$
Z_t(s,a) = \begin{cases} \gamma\lambda Z_{t-1}(s,a) + 1 & \text{if } s = S_t \text{ and } a = A_t \\ \gamma\lambda Z_{t-1}(s,a) & \text{otherwise} \end{cases} \quad \text{for all } s, a
$$

The complete Sarsa(λ) algorithm is given as follows (a Python sketch follows the list):
  1. Initialize Q(s,a) arbitrarily for all s ∈ 𝒮, a ∈ 𝒜(s).
  2. Repeat for each episode:
    1. Initialize S, A, and set Z(s,a) = 0 for all s ∈ 𝒮, a ∈ 𝒜(s).
    2. Repeat for each step of episode:
      1. Take action A; observe R, S′.
      2. Choose A′ from S′ using the policy derived from Q (e.g. ε-greedy).
      3. δ ← R + γQ(S′,A′) − Q(S,A)
      4. Z(S,A) ← Z(S,A) + 1
      5. For all s ∈ 𝒮, a ∈ 𝒜(s):
        1. Q(s,a) ← Q(s,a) + αδZ(s,a)
        2. Z(s,a) ← γλZ(s,a)
      6. S ← S′, A ← A′
    3. Until S is terminal.
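
A Python sketch of the same loop for Sarsa(λ), under the same assumed tabular interface as before (now also exposing `num_actions`) and with an ε-greedy behaviour policy:

```python
import numpy as np

def sarsa_lambda(env, num_episodes, alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1):
    """Sarsa(lambda) control with accumulating traces and an epsilon-greedy policy."""
    Q = np.zeros((env.num_states, env.num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(env.num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        Z = np.zeros_like(Q)                        # Z(s,a) = 0 for all s, a
        s = env.reset()                             # initialize S
        a = eps_greedy(s)                           # initialize A
        done = False
        while not done:
            s_next, r, done = env.step(a)           # take A, observe R, S'
            a_next = eps_greedy(s_next)             # choose A' from S' using Q
            q_next = 0.0 if done else Q[s_next, a_next]
            delta = r + gamma * q_next - Q[s, a]    # TD error
            Z[s, a] += 1.0                          # accumulating trace for (S, A)
            Q += alpha * delta * Z                  # update all state-action pairs
            Z *= gamma * lam                        # decay all traces
            s, a = s_next, a_next
    return Q
```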

Q(λ)

Two different methods have been proposed that combine eligibility traces and Q-learning: Watkins's Q(λ) and Peng's Q(λ). Here we focus on Watkins's Q(λ) and give only a brief description of Peng's Q(λ).

Unlike TD(λ) and Sarsa(λ), Watkins's Q(λ) does not look ahead all the way to the end of the episode in its backup. It only looks ahead as far as the next exploratory action. For example, suppose the first action, $A_{t+1}$, is exploratory. Watkins's Q(λ) would still do the one-step update of $Q_t(S_t, A_t)$ toward $R_{t+1} + \gamma \max_a Q_t(S_{t+1}, a)$. In general, if $A_{t+n}$ is the first exploratory action, then the longest backup is toward

$$
R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} \max_a Q_t(S_{t+n}, a)
$$

where we assume off-line updating.

The mechanistic or backward view of Watkins's Q(λ) is also very simple. Eligibility traces are used just as in Sarsa(λ), except that they are set to zero whenever an exploratory action is taken. That is, first, the traces for all state-action pairs are either decayed by γλ or, if an exploratory action was taken, set to 0; second, the trace corresponding to the current state and action is incremented by 1, i.e.

$$
Z_t(s,a) = \mathbb{I}_{sS_t} \cdot \mathbb{I}_{aA_t} + \begin{cases} \gamma\lambda Z_{t-1}(s,a) & \text{if } Q_{t-1}(S_t, A_t) = \max_a Q_{t-1}(S_t, a) \\ 0 & \text{otherwise} \end{cases}
$$

where $\mathbb{I}_{xy}$ is an identity indicator function, equal to 1 if $x = y$ and 0 otherwise. The rest of the algorithm is defined by
$$
Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t Z_t(s,a), \quad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}(s)
$$

where
$$
\delta_t = R_{t+1} + \gamma \max_a Q_t(S_{t+1}, a) - Q_t(S_t, A_t)
$$

The complete algorithm is given below (a Python sketch follows the list):
  1. Initialize Q(s,a) arbitrarily for all s ∈ 𝒮, a ∈ 𝒜(s).
  2. Repeat for each episode:
    1. Initialize S, A, and set Z(s,a) = 0 for all s ∈ 𝒮, a ∈ 𝒜(s).
    2. Repeat for each step of episode:
      1. Take action A; observe R, S′.
      2. Choose A′ from S′ using the policy derived from Q (e.g. ε-greedy).
      3. A* ← argmax_a Q(S′, a) (if A′ ties for the max, then A* ← A′).
      4. δ ← R + γQ(S′, A*) − Q(S, A)
      5. Z(S,A) ← Z(S,A) + 1
      6. For all s ∈ 𝒮, a ∈ 𝒜(s):
        1. Q(s,a) ← Q(s,a) + αδZ(s,a)
        2. If A′ = A*, then Z(s,a) ← γλZ(s,a); else Z(s,a) ← 0.
      7. S ← S′, A ← A′
    3. Until S is terminal.
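
A Python sketch of Watkins's Q(λ) under the same assumed interface as the earlier sketches. The only changes relative to the Sarsa(λ) sketch are the greedy target action A* in the TD error and the cutting of traces after an exploratory step:

```python
import numpy as np

def watkins_q_lambda(env, num_episodes, alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1):
    """Watkins's Q(lambda): traces are cut whenever a non-greedy action is chosen."""
    Q = np.zeros((env.num_states, env.num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(env.num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        Z = np.zeros_like(Q)                        # Z(s,a) = 0 for all s, a
        s = env.reset()                             # initialize S
        a = eps_greedy(s)                           # initialize A
        done = False
        while not done:
            s_next, r, done = env.step(a)           # take A, observe R, S'
            a_next = eps_greedy(s_next)             # behaviour action A'
            a_star = int(np.argmax(Q[s_next]))      # greedy action A*
            if Q[s_next, a_next] == Q[s_next, a_star]:
                a_star = a_next                     # if A' ties for the max, A* <- A'
            q_next = 0.0 if done else Q[s_next, a_star]
            delta = r + gamma * q_next - Q[s, a]    # Q-learning TD error
            Z[s, a] += 1.0                          # accumulating trace for (S, A)
            Q += alpha * delta * Z                  # update all state-action pairs
            if a_next == a_star:
                Z *= gamma * lam                    # greedy step: decay traces as usual
            else:
                Z[:] = 0.0                          # exploratory step: cut all traces
            s, a = s_next, a_next
    return Q
```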

Unfortunately, cutting off traces every time an exploratory action is taken loses much of the advantage of using eligibility traces. Peng's Q(λ) is an alternative version of Q(λ) meant to remedy this, and it can be thought of as a hybrid of Sarsa(λ) and Watkins's Q(λ). In Peng's Q(λ), there is no distinction between exploratory and greedy actions. Each component backup is over many steps of actual experience, and all but the last are capped by a final maximization over actions. For a fixed nongreedy policy, $Q_t$ converges to neither $q_\pi$ nor $q_*$, but to some hybrid of the two. However, if the policy is gradually made more greedy, then the method may still converge to $q_*$.

Replacing Traces

In some cases, significantly better performance can be obtained by using a slightly modified kind of trace known as a replacing trace:

$$
Z_t(s) = \begin{cases} \gamma\lambda Z_{t-1}(s) & \text{if } s \neq S_t \\ 1 & \text{if } s = S_t \end{cases}
$$

There are several possible ways to generalize replacing eligibility traces for use in control methods. In some cases, the state-action traces are updated by the following:
$$
Z_t(s,a) = \begin{cases} 1 & \text{if } s = S_t \text{ and } a = A_t \\ 0 & \text{if } s = S_t \text{ and } a \neq A_t \\ \gamma\lambda Z_{t-1}(s,a) & \text{if } s \neq S_t \end{cases}
$$
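
As a minimal sketch (the function name is hypothetical), this generalization amounts to replacing the accumulate-then-decay trace handling in the loops above with a single combined update; in the prediction case the analogous change is simply setting Z(S) to 1 instead of incrementing it.

```python
import numpy as np

def replacing_trace_update(Z, s, a, gamma, lam):
    """One-step replacing-trace update for state-action traces,
    implementing the three-case rule above in a single call."""
    Z *= gamma * lam   # s != S_t: decay as usual
    Z[s, :] = 0.0      # s == S_t, a != A_t: clear traces for the other actions
    Z[s, a] = 1.0      # s == S_t, a == A_t: replace the trace with 1
    return Z
```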

  1. λ-return: $G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$