Thanks to Hado van Hasselt for the great work.

Introduction

In sequential decision-making problems in continuous domains with delayed reward signals, the main goal of an algorithm is to learn how to choose actions from an infinitely large action space so as to optimize a noisy, delayed cumulative reward signal in an infinitely large state space, where even the outcome of a single action can be stochastic.

Here we assume that a model of the environment is not known. Because analytically computing a good policy from a continuous model can be infeasible, we focus on methods that explicitly update a representation of a value function, a policy, or both. In addition, we focus mainly on the problem of control, which means we want to find action-selection policies that yield high returns, as opposed to the problem of prediction, which aims to estimate the value of a given policy.

Some useful references are listed below (we assume that you are already familiar with Sutton):

Szepesvari, C.: Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 4(1), 1-103 (2010).
Busoniu, L., Babuska, R., De Schutter, B., Ernst, D.: Reinforcement learning and dynamic programming using function approximators. CRC Press, Boca Raton (2010).

Function Approximation

In this section, we mostly limit ourselves to the general functional form of the approximators and to general methods for updating their parameters. Non-parametric approaches, such as kernel-based methods, are not included. If you are interested in kernel-based methods, see the following reference for more details:

Ormoneit, D., Sen, S.: Kernel-based reinforcement learning. Machine Learning 49(2), 161-178 (2002).

Linear Function Approximation

A linear function is a simple parametric function that depends on the feature vector. For instance, consider a value-approximation algorithm where the value function is approximated by:

$V_{t}(s)=\theta_{t}^{T}\phi(s)$
A common way to find features for a linear function approximator is to divide the continuous state space into separate segments and attach one feature to each segment, e.g. tile coding. However, one potential problem with discretizing methods such as tile coding is that the resulting function that maps states into features is not injective, i.e. $\phi(s)=\phi(s')$ does not imply that $s=s'$. Another issue is related to the step-size parameter that many algorithms use. In general, it is often a good idea to make sure that $|\phi(s)|=\sum_{k=1}^{D_{\Phi}}\phi_{k}(s)\leq 1$ for all $s$. A final issue is that discretization introduces discontinuities in the function.

Some of the issues mentioned above can be tackled by using so-called fuzzy sets. A fuzzy set is a generalization of normal sets to fuzzy membership. A state may belong partially to the set defined by feature $\phi_{i}$ and partially to the set defined by $\phi_{j}$. An advantage of this view is that it is quite natural to assume that $\sum_{k}\phi_{k}\leq 1$, since each part of an element can belong to only one set. It is possible to define the sets such that each combination of feature activations corresponds precisely to one single state, thereby avoiding the partial-observability problem sketched earlier (distinct states that map to the same features). A common choice is to use triangular functions that are equal to one at the center of the corresponding feature and decay linearly to zero for states further from the center. A drawback of fuzzy sets is that these sets still need to be defined beforehand, which may be difficult.
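The triangular membership functions described above can be sketched in a few lines. This is a minimal illustration for a one-dimensional state, not code from the source: the function names, the evenly spaced `centers`, and the choice of `width` equal to the spacing are all assumptions made for the example.

```python
def triangular_features(s, centers, width):
    """Triangular membership of a scalar state s in each fuzzy set.

    Each feature equals 1 at its center and decays linearly to 0 at
    distance `width` from that center.
    """
    return [max(0.0, 1.0 - abs(s - c) / width) for c in centers]


def linear_value(theta, phi):
    """Linear value estimate V(s) = theta^T phi(s) for a feature vector phi."""
    return sum(th * f for th, f in zip(theta, phi))


# Assumed setup: evenly spaced centers with width equal to the spacing, so
# that inside the covered interval the memberships of any state sum to one.
centers = [0.0, 0.5, 1.0]
phi = triangular_features(0.25, centers, width=0.5)
value = linear_value([1.0, 3.0, 5.0], phi)
```

With this layout the state 0.25 lies halfway between the first two centers, so it belongs half to each of them and the value estimate interpolates linearly between the two corresponding parameters.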

Non-linear Function Approximation

In a parametric non-linear function approximator, the function that should be optimized is represented by some predetermined parameterized function. For instance, for value-based algorithms we may have:

$V_{t}(s)=V(\phi(s),\theta_{t})$
In general, a non-linear function approximator may approximate an unknown function with better accuracy than a linear function approximator that uses the same input features. A drawback of non-linear function approximation is that fewer convergence guarantees can be given.
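As one possible instance of such a parameterized function, the sketch below evaluates a network with a single hidden layer of tanh units. The architecture, the name `mlp_value`, and the layout of `theta` as a tuple `(W, b, v, c)` are assumptions chosen for illustration; the text does not prescribe any particular non-linear form.

```python
import math


def mlp_value(phi, theta):
    """V(phi, theta) for one hidden tanh layer.

    theta = (W, b, v, c): hidden weight rows, hidden biases,
    output weights, and output bias.
    """
    W, b, v, c = theta
    hidden = [math.tanh(sum(w * f for w, f in zip(row, phi)) + bias)
              for row, bias in zip(W, b)]
    return sum(vi * h for vi, h in zip(v, hidden)) + c


# A one-unit network: doubling the input does not double the output,
# which a linear approximator on the same feature could not reproduce.
theta = ([[1.0]], [0.0], [1.0], 0.0)
v_one = mlp_value([1.0], theta)
v_two = mlp_value([2.0], theta)
```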

Updating Parameters

Gradient Descent

A gradient descent update follows the direction of the negative gradient of some parameterized function that we want to minimize. The gradient of a parameterized function is a vector in parameter space that points in the direction in which the function increases most steeply, so its negative points in the direction in which the function decreases. Because the gradient only describes the local shape of the function, this algorithm can end up in a local minimum. For instance, we can use an error measure such as a temporal-difference error or a prediction error, e.g.

$E(s_{t},a_{t},\theta_{t})=(R(s_{t},a_{t},\theta_{t})-r_{t+1})^{2}$
The parameters are then updated with:

$\theta_{t+1}=\theta_{t}-\alpha_{t}\nabla_{\theta}E(x,\theta_{t})$
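A single gradient step on the squared prediction error can be sketched as follows. The sketch assumes a linear reward model $R(x,\theta)=\theta^{T}x$ on a hypothetical input vector `x` that encodes the state-action pair; the function names are illustrative and not from the source.

```python
def predict(theta, x):
    """Linear prediction R(x, theta) = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))


def squared_error(theta, x, target):
    """E(x, theta) = (R(x, theta) - target)^2."""
    return (predict(theta, x) - target) ** 2


def gradient_step(theta, x, target, alpha):
    """One update theta <- theta - alpha * grad E.

    For the linear model, grad E = 2 * (R(x, theta) - target) * x.
    """
    err = predict(theta, x) - target
    return [t - alpha * 2.0 * err * xi for t, xi in zip(theta, x)]


theta0 = [0.0, 0.0]
x, target = [1.0, 2.0], 1.0
theta1 = gradient_step(theta0, x, target, alpha=0.05)
```

For a sufficiently small step size, the error on the sampled input decreases after the update, which is all that a local method like this promises.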
If the parameter space is a curved space, it is more appropriate to measure the squared length of a parameter change $d\theta$ with $d\theta^{T}Gd\theta$, where $G$ is a $P\times P$ positive semi-definite matrix and $P$ is the number of parameters. With this weighted distance metric, the direction of steepest descent becomes

$\tilde{\nabla}_{\theta}E(x,\theta)=G^{-1}\nabla_\theta E(x,\theta)$
which is known as the natural gradient. In general, the best choice for matrix $G$ depends on the functional form of $E$.
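As a small worked example (the numbers are chosen purely for illustration and are not from the source), take $P=2$ with a diagonal, invertible $G$:

$G=\begin{pmatrix}4&0\\0&1\end{pmatrix},\quad\nabla_{\theta}E=\begin{pmatrix}4\\1\end{pmatrix}\quad\Rightarrow\quad\tilde{\nabla}_{\theta}E=G^{-1}\nabla_{\theta}E=\begin{pmatrix}1/4&0\\0&1\end{pmatrix}\begin{pmatrix}4\\1\end{pmatrix}=\begin{pmatrix}1\\1\end{pmatrix}$
The ordinary gradient is dominated by the first coordinate, but changes along that coordinate are also weighted four times as heavily by $G$. After rescaling with $G^{-1}$, both coordinates contribute equally to the update direction.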

Gradient-Free Optimization

Gradient-free methods are useful when the function that is optimized is not differentiable or when many local optima are expected. Many general global optimization methods exist, including evolutionary algorithms. Details about evolutionary algorithms are beyond the scope of this note.
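To give a flavor of gradient-free optimization, here is a minimal sketch of a random-perturbation search that keeps a candidate only if it is no worse than the current best, in the spirit of a (1+1) evolution strategy. It is a deliberately simple stand-in, not a full evolutionary algorithm, and it is a local search with no guarantee of finding a global optimum; the function names, the `objective`, and the default settings are assumptions for illustration.

```python
import random


def hill_climb(f, x0, sigma=0.5, iters=200, seed=0):
    """Minimize f by Gaussian perturbation, keeping non-worsening candidates.

    No gradient of f is ever evaluated, so f may be non-differentiable.
    """
    rng = random.Random(seed)
    best_x, best_f = list(x0), f(x0)
    for _ in range(iters):
        candidate = [xi + rng.gauss(0.0, sigma) for xi in best_x]
        f_candidate = f(candidate)
        if f_candidate <= best_f:
            best_x, best_f = candidate, f_candidate
    return best_x, best_f


def objective(x):
    """Sum of absolute deviations: not differentiable at its minimum (1, -2)."""
    return abs(x[0] - 1.0) + abs(x[1] + 2.0)


start = [0.0, 0.0]
best_x, best_f = hill_climb(objective, start)
```

Because only candidates that do not increase the objective are accepted, the returned value can never be worse than the starting point.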

Approximate Reinforcement Learning

Value Approximation

In value-approximation algorithms, experience samples are used to update a value function that approximates the value of the current or the optimal policy. Many reinforcement learning algorithms fall into this category. Important differences between algorithms within this category are whether they are on-policy or off-policy and whether they update online or offline. Online algorithms are sometimes more sample-efficient in control problems.

In order to update a value with gradient descent, we must choose some measure of error that we can minimize. This measure is often referred to as the objective function. We generalize standard temporal-difference learning to a gradient update on the parameters of a function approximator. The tabular TD-learning update is

$V_{t+1}(s_{t})=V_{t}(s_{t})+\alpha_{t}(s_{t})\delta_{t}$
with $\delta_{t}=r_{t+1}+\gamma V_{t}(s_{t+1})-V_{t}(s_{t})$, where $\alpha_{t}(s)\in[0,1]$ is a step-size parameter. When the state values are stored in a table, TD-learning can be interpreted as a stochastic gradient-descent update on the one-step temporal-difference error:

$E(s_{t})=\frac{1}{2}(r_{t+1}+\gamma V_{t}(s_{t+1})-V_{t}(s_{t}))^{2}=\frac{1}{2}\delta_{t}^{2}$
However, if $V_{t}$ is a parametrized function such that $V_{t}(s)=V(s,\theta_{t})$, the negative gradient with respect to the parameters is given by

$-\nabla_{\theta}E(s_{t},\theta)=-(r_{t+1}+\gamma V_{t}(s_{t+1})-V_{t}(s_{t}))\nabla_{\theta}(r_{t+1}+\gamma V_{t}(s_{t+1})-V_{t}(s_{t}))$
We can, however, interpret $r_{t+1}+\gamma V_{t}(s_{t+1})$ as a stochastic approximation of $V^{\pi}$ that does not depend on $\theta$. Its gradient then vanishes, only the $-V_{t}(s_{t})$ term contributes, and the two minus signs cancel:

$-\nabla_{\theta}E(s_{t},\theta)=(r_{t+1}+\gamma V_{t}(s_{t+1})-V_{t}(s_{t}))\nabla_{\theta}V_{t}(s_{t})$
This implies the parameters can be updated as

$\theta_{t+1}=\theta_{t}+\alpha_{t}\delta_{t}\nabla_{\theta}V_{t}(s_{t})$
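For a linear approximator $V_{t}(s)=\theta_{t}^{T}\phi(s)$ the gradient $\nabla_{\theta}V_{t}(s_{t})$ is simply $\phi(s_{t})$, so the update reduces to a few lines. The sketch below assumes this linear form and precomputed feature vectors; the function names are illustrative, and a terminal next state is represented here by an all-zero feature vector, which is a convention of this example rather than of the source.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def td0_update(theta, phi_s, phi_next, reward, gamma, alpha):
    """One TD(0) step for a linear value function V(s) = theta^T phi(s).

    Returns the updated parameters and the temporal-difference error.
    """
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi_s)
    # For a linear approximator, grad_theta V(s) equals phi(s).
    new_theta = [t + alpha * delta * f for t, f in zip(theta, phi_s)]
    return new_theta, delta


theta, delta = td0_update(theta=[0.0, 0.0], phi_s=[1.0, 0.0],
                          phi_next=[0.0, 1.0], reward=1.0,
                          gamma=0.9, alpha=0.5)
```

Only the parameters of features that are active in $s_{t}$ change, and they move in proportion to the temporal-difference error.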
Similarly, the temporal-difference errors for action-value algorithms are

$\delta_{t}=r_{t+1}+\gamma Q_{t}(s_{t+1},a_{t+1})-Q_{t}(s_{t},a_{t})$
or

$\delta_{t}=r_{t+1}+\gamma\max_{a} Q_{t}(s_{t+1},a)-Q_{t}(s_{t},a_{t})$
for SARSA and Q-learning respectively. We can also incorporate accumulating eligibility traces with trace parameter $\lambda$ using the following two equations:

$\begin{split}e_{t+1} &= \lambda\gamma e_{t}+\nabla_{\theta}V_{t}(s_{t})\\\theta_{t+1} &= \theta_{t}+\alpha_{t}\delta_{t}e_{t+1}\end{split}$
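The two trace equations can be sketched for the state-value case with a linear approximator, where $\nabla_{\theta}V_{t}(s_{t})=\phi(s_{t})$. The function name, argument order, and the toy two-step episode are assumptions made for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def td_lambda_update(theta, trace, phi_s, phi_next, reward, gamma, lam, alpha):
    """One TD(lambda) step with accumulating traces and linear V.

    Returns the updated parameters, the updated trace, and the TD error.
    """
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi_s)
    # e_{t+1} = lambda * gamma * e_t + grad_theta V_t(s_t), with grad = phi(s_t)
    new_trace = [lam * gamma * e + f for e, f in zip(trace, phi_s)]
    # theta_{t+1} = theta_t + alpha * delta_t * e_{t+1}
    new_theta = [t + alpha * delta * e for t, e in zip(theta, new_trace)]
    return new_theta, new_trace, delta


# Two-step episode: no reward on the first transition, reward 1 on the second.
theta, trace = [0.0, 0.0], [0.0, 0.0]
theta, trace, _ = td_lambda_update(theta, trace, [1.0, 0.0], [0.0, 1.0],
                                   reward=0.0, gamma=0.5, lam=0.5, alpha=1.0)
theta, trace, _ = td_lambda_update(theta, trace, [0.0, 1.0], [0.0, 0.0],
                                   reward=1.0, gamma=0.5, lam=0.5, alpha=1.0)
```

After the second step the parameter of the first state has also moved, even though that state was not visited on the rewarding transition: the decayed trace carries part of the credit back to it.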
Parameters updated with the equations above may diverge when off-policy updates are used. This holds for any temporal-difference method with $\lambda < 1$. In other words, if we sample transitions from a distribution that does not fully match the state-visit probabilities that would occur under the estimation policy, the parameters of the function may diverge. This is unfortunate, because in the control setting we ultimately want to learn about the unknown optimal policy. Recently, a class of algorithms has been proposed to deal with this issue:

Maei, H.R., Sutton, R.S.: GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In: Proceedings of the Third Conference On Artificial General Intelligence (AGI-2010), pp. 91–96. Atlantis Press, Lugano (2010).
Sutton, R.S., Maei, H.R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., Wiewiora, E.: Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pp. 993–1000. ACM (2009).

It is nontrivial to extend the standard online temporal difference algorithms to continuous action spaces. The algorithms in the next section are usually much better suited for use in problems with continuous actions.

Policy Approximation

If the action space is continuous, finding the greedy action in each state can be nontrivial and time-consuming.

Policy Gradient Algorithms

The idea of policy-gradient algorithms is to update the policy with gradient ascent on the cumulative expected value $V^{\pi}$ . If the gradient is known, we can update the policy parameters with

$\psi_{k+1}=\psi_{k}+\beta_{k}\nabla_{\psi}E(V^{\pi}(s_{t}))=\psi_{k}+\beta_{k}\nabla_{\psi}\int_{s\in S}P(s_{t}=s)V^{\pi}(s)ds$
As a practical alternative, we can use stochastic gradient ascent:

$\psi_{t+1}=\psi_{t}+\beta_{t}(s_{t})\nabla_{\psi}V^{\pi}(s_{t})$
Such procedures can at best hope to find a local optimum, because they use a gradient of a value function that is usually not convex with respect to the policy parameters. Define a trajectory $\mathcal{T}$ as a sequence of states and actions. Then

$\begin{split}\nabla_{\psi}V^{\pi}(s) &= \int_{\mathcal{T}}\nabla_{\psi}P(\mathcal{T}|s,\psi)E\left\{\left.\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\right|\mathcal{T}\right\}d\mathcal{T}\\ &=\int_{\mathcal{T}}P(\mathcal{T}|s,\psi)\nabla_{\psi}\log P(\mathcal{T}|s,\psi)E\left\{\left.\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\right|\mathcal{T}\right\}d\mathcal{T}\\ &= E\left\{\nabla_{\psi}\log P(\mathcal{T}|s,\psi)\left.E\left\{\left.\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\right|\mathcal{T}\right\}\right|s,\psi\right\} \end{split}$
Moreover, since only the policy terms depend on $\psi$, we have

$\nabla_{\psi}\log P(\mathcal{T}|s,\psi)=\sum_{t=0}^{\infty}\nabla_{\psi}\log\pi(s_{t},a_{t},\psi)$
This only holds if the policy is stochastic. In most cases, this is not a big problem, because stochastic policies are needed anyway to ensure sufficient exploration.
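The step from the trajectory probability to a sum over policy terms can be spelled out as follows. The transition-density notation $P(s_{t+1}|s_{t},a_{t})$ is introduced here for illustration and does not appear in the original text. With $s_{0}=s$, the probability of a trajectory factorizes into policy and transition terms, so that

$\log P(\mathcal{T}|s,\psi)=\sum_{t=0}^{\infty}\log\pi(s_{t},a_{t},\psi)+\sum_{t=0}^{\infty}\log P(s_{t+1}|s_{t},a_{t})$
The second sum describes the environment and does not depend on $\psi$, so its gradient with respect to $\psi$ is zero and only the policy terms remain. This is also why the gradient can be estimated without a model of the transition function.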

Two common examples of stochastic policies for policy-gradient algorithms are:

Boltzmann Exploration
Gaussian Exploration
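Both kinds of stochastic policy can be sketched briefly. The code below is an illustrative assumption about their usual form, not a definition from the source: Boltzmann exploration is taken to be a softmax over action preferences with a temperature, and Gaussian exploration is taken to be sampling a continuous action around a mean with fixed standard deviation `sigma`. All function names are hypothetical.

```python
import math
import random


def boltzmann_probs(preferences, temperature=1.0):
    """Softmax action probabilities for a finite set of action preferences."""
    m = max(preferences)  # subtract the maximum for numerical stability
    exps = [math.exp((p - m) / temperature) for p in preferences]
    z = sum(exps)
    return [e / z for e in exps]


def sample_discrete(probs, rng):
    """Sample an index according to the given probabilities."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against rounding at the upper end


def gaussian_action(mean, sigma, rng):
    """Sample a continuous action from N(mean, sigma^2)."""
    return rng.gauss(mean, sigma)


def gaussian_score_mean(action, mean, sigma):
    """Derivative of log N(action; mean, sigma^2) with respect to the mean."""
    return (action - mean) / sigma ** 2


rng = random.Random(0)
probs = boltzmann_probs([0.0, 1.0])
index = sample_discrete(probs, rng)
action = gaussian_action(mean=1.0, sigma=0.5, rng=rng)
```

The score function `gaussian_score_mean` is the quantity that enters $\nabla_{\psi}\log\pi$ via the chain rule when the mean of the Gaussian is itself a function of $\psi$.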

We need to sample the expected cumulative discounted reward to get the gradient. For instance, if the task is episodic we can take a Monte Carlo sample that gives the cumulative (possibly discounted) reward for each episode:

$\nabla_{\psi}V^{\pi}(s_{t})=E\left\{R_{k}(s_t)\left(\sum_{j=t}^{T_{k}-1}\nabla_{\psi}\log\pi(s_{j},a_{j},\psi)\right)\right\}$
where $R_{k}(s_{t})=\sum_{j=t}^{T_{k}-1}\gamma^{j-t}r_{j+1}$ is the total discounted return obtained after reaching state $s_{t}$ in episode $k$.
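A single-episode Monte Carlo estimate of this gradient can be sketched as follows for $t=0$. The sketch assumes a Gaussian policy whose mean is linear in the features, $\mu(s)=\psi^{T}\phi(s)$, with a fixed standard deviation `sigma`, so that $\nabla_{\psi}\log\pi(s,a,\psi)=\frac{a-\mu(s)}{\sigma^{2}}\phi(s)$. This policy form and all names are assumptions for illustration; `rewards[j]` holds $r_{j+1}$.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def discounted_return(rewards, gamma):
    """Sum of gamma^j * rewards[j], computed backwards for efficiency."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def reinforce_gradient(features, actions, rewards, psi, sigma, gamma):
    """One-episode estimate of grad_psi V(s_0): return times summed scores."""
    ret = discounted_return(rewards, gamma)
    score = [0.0] * len(psi)
    for phi, a in zip(features, actions):
        coeff = (a - dot(psi, phi)) / sigma ** 2  # d log pi / d mean
        for k, f in enumerate(phi):
            score[k] += coeff * f                 # chain rule through the mean
    return [ret * s for s in score]


grad = reinforce_gradient(features=[[1.0], [1.0]], actions=[1.0, 0.0],
                          rewards=[1.0, 1.0], psi=[0.0], sigma=1.0, gamma=0.5)
```

Averaging such estimates over many episodes approximates the expectation in the formula; a single sample like this one can be very noisy.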

Actor Critic Algorithms

The variance of the estimate of $\nabla_{\psi}V^{\pi}(s_{t})$ can be very high if Monte Carlo roll-outs are used, which can severely slow convergence. A potential solution to this problem is presented by using an explicit approximation of $V^{\pi}$ . Such an approximate value function is called a critic and the combined algorithm is called an actor-critic algorithm.

Actor-critic algorithms typically use a temporal-difference algorithm to update $V_{t}$, an estimate of $V^{\pi}$. Using $b(s_{t})=V^{\pi}(s_{t})$ as a baseline leads to $\delta_{t}\nabla_{\psi}\log\pi(s_{t},a_{t},\psi_{t})$ as an unbiased estimate of the gradient of the policy. A typical actor-critic update would update the policy parameters with

$\psi_{t+1}=\psi_{t}+\beta_{t}(s_{t})\delta_{t}\nabla_{\psi}\log\pi(s_{t},a_{t},\psi_{t})$
where $\delta_{t}=r_{t+1}+\gamma V_{t}(s_{t+1})-V_{t}(s_{t})$ is an unbiased estimate of $Q^{\pi}(s_{t},a_{t})-V^{\pi}(s_{t})$ when $V_{t}=V^{\pi}$.
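One step of such an actor-critic update can be sketched as follows. The sketch assumes a linear critic $V_{t}(s)=\theta^{T}\phi(s)$ and a Gaussian actor with linear mean $\psi^{T}\phi(s)$ and fixed standard deviation `sigma`; these choices, the step sizes, and the function names are illustrative assumptions. A terminal next state is represented by an all-zero feature vector.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def actor_critic_step(theta, psi, phi_s, action, reward, phi_next,
                      gamma, alpha, beta, sigma):
    """One online update of a linear critic and a Gaussian actor.

    Returns the updated critic parameters, actor parameters, and TD error.
    """
    # Critic: TD(0) error and gradient step on the value parameters.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi_s)
    new_theta = [t + alpha * delta * f for t, f in zip(theta, phi_s)]
    # Actor: delta * grad_psi log pi, with
    # grad_psi log pi = (a - mean) / sigma^2 * phi(s) for this policy.
    coeff = (action - dot(psi, phi_s)) / sigma ** 2
    new_psi = [p + beta * delta * coeff * f for p, f in zip(psi, phi_s)]
    return new_theta, new_psi, delta


theta, psi, delta = actor_critic_step(theta=[0.0], psi=[0.0], phi_s=[1.0],
                                      action=0.5, reward=1.0, phi_next=[0.0],
                                      gamma=0.9, alpha=0.5, beta=0.1, sigma=1.0)
```

Here the explored action 0.5 led to a positive temporal-difference error, so the actor's mean moves towards it by an amount proportional to that error.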

Cacla (continuous actor-critic learning automaton) is a special actor-critic algorithm that uses an error in action space rather than in parameter or policy space, and that uses the sign of the temporal-difference error rather than its size. During learning, it is assumed that there is exploration. As in many other actor-critic algorithms, if the temporal-difference error $\delta_{t}$ is positive, we judge $a_{t}$ to be a good choice and we reinforce it. In Cacla, this is done by updating the output of the actor towards $a_t$. This is why exploration is necessary: without exploration the actor output is already equal to the action, and the parameters cannot be updated.

An update to the actor only occurs when the temporal-difference error is positive. As an extreme case, consider an actor that already outputs the optimal action in each state for some deterministic MDP. For most exploring actions, the temporal-difference error is then negative. If the actor were updated away from such an action, its output would almost certainly no longer be optimal.

This is an important difference between Cacla and policy-gradient methods: Cacla only updates its actor when actual improvements have been observed. This avoids slow learning when there are plateaus in the value space and the temporal difference errors are small. Intuitively, it makes sense that the distance to a promising action $a_t$ is more important than the size of the improvement in value.

Here is a simple Cacla algorithm:
[Figure: Cacla algorithm (image not available)]
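Based only on the description in the text, one Cacla step can be sketched as below. This is a hedged reconstruction, not the algorithm from the missing figure: it assumes a linear critic $V_{t}(s)=\theta^{T}\phi(s)$ and a linear actor whose output is $\psi^{T}\phi(s)$, and the function name and step sizes are hypothetical. The explored action `action` is assumed to have been drawn around the actor output, for example with Gaussian exploration.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def cacla_step(theta, psi, phi_s, action, reward, phi_next,
               gamma, alpha, beta):
    """One Cacla-style update with a linear critic and a linear actor.

    The critic is always updated with TD(0). The actor is moved towards
    the explored action only when the TD error is positive, and the size
    of that move depends on the action error, not on the size of delta.
    """
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi_s)
    new_theta = [t + alpha * delta * f for t, f in zip(theta, phi_s)]
    if delta > 0:
        actor_out = dot(psi, phi_s)
        new_psi = [p + beta * (action - actor_out) * f
                   for p, f in zip(psi, phi_s)]
    else:
        new_psi = list(psi)  # no evidence of improvement: leave actor alone
    return new_theta, new_psi, delta


good = cacla_step([0.0], [0.0], [1.0], action=0.5, reward=1.0,
                  phi_next=[0.0], gamma=0.9, alpha=0.5, beta=0.1)
bad = cacla_step([0.0], [0.0], [1.0], action=0.5, reward=-1.0,
                 phi_next=[0.0], gamma=0.9, alpha=0.5, beta=0.1)
```

In the first call the temporal-difference error is positive and the actor output moves towards the explored action; in the second it is negative, so only the critic changes.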

For more details about actor-critic algorithms, refer to:

van Hasselt, H.P., Wiering, M.A.: Using continuous action spaces to solve discrete problems. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN 2009), pp. 1149–1156 (2009).
http://blog.csdn.net/philthinker/article/details/71104095

Reinforcement Learning in Continuous State and Action Spaces: A Brief Note
