Policy Gradient Methods

Policies can be represented by parameterized functions denoted as $\pi(a|s,\theta)$, where $\theta\in\mathbb{R}^{m}$ is a parameter vector.

When policies are represented as functions, optimal policies can be obtained by optimizing certain scalar metrics. Such methods are called policy gradient methods.

9.1 Policy representation: From table to function

  • How to define optimal policies? Tabular case: maximize every state value. Function case: maximize certain scalar metrics.
  • How to update a policy? Tabular case: change the entries in the table. Function case: change the parameter $\theta$.
  • How to retrieve the probability of an action? Tabular case: look up the corresponding entry in the table. Function case: evaluate $\pi(a|s,\theta)$ given the function structure and the parameter (see the sketch below).
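
To make the table-versus-function contrast concrete, here is a minimal sketch assuming a softmax policy over one-hot state features; the feature choice, dimensions, and random parameter values are illustrative, not from the original post.

```python
import numpy as np

n_states, n_actions = 5, 3
rng = np.random.default_rng(0)

# Tabular representation: one entry per (state, action) pair.
pi_table = np.full((n_states, n_actions), 1.0 / n_actions)

# Parameterized representation: a softmax policy whose parameter theta has one
# weight per (feature, action); the feature of state s is taken to be a one-hot
# vector here, which is an illustrative choice.
theta = rng.normal(size=(n_states, n_actions))

def pi(a, s, theta):
    """Retrieve pi(a|s, theta) by evaluating the function, not by table lookup."""
    logits = theta[s]                      # one-hot features pick out row s of theta
    z = np.exp(logits - logits.max())      # subtract the max for numerical stability
    return (z / z.sum())[a]

print(pi_table[2, 0])   # tabular: look up the entry for state 2, action 0
print(pi(0, 2, theta))  # parameterized: compute pi(0|2, theta) from the parameter
```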

Suppose that $J(\theta)$ is a scalar metric. Then optimal policies can be obtained by

$$\theta_{t+1}=\theta_t+\alpha\nabla_\theta J(\theta_t),$$

where $\nabla_\theta J$ is the gradient of $J$ with respect to $\theta$, $t$ is the time step, and $\alpha$ is the optimization rate.

9.2 Metrics for defining optimal policies

Metric 1: Average state value

$$\bar{v}_\pi=\sum_{s\in\mathcal{S}}d(s)v_\pi(s),$$

where $d(s)$ is the weight of state $s$. It satisfies $d(s)\geq 0$ for any $s\in\mathcal{S}$ and $\sum_{s\in\mathcal{S}}d(s)=1$. Therefore, we can interpret $d(s)$ as a probability distribution of $s$. Then, the metric can be written as

$$\bar{v}_\pi=\mathbb{E}_{S\sim d}[v_\pi(S)].$$

How to select the distribution $d$?

  • $d$ is independent of the policy $\pi$.
    • Treat all the states as equally important and select $d_0(s)=1/|\mathcal{S}|$.
    • We are only interested in a specific state $s_0$ (e.g., the agent always starts from $s_0$). In this case, we can design $d_0(s_0)=1$ and $d_0(s\neq s_0)=0$.
  • $d$ is dependent on the policy $\pi$. It is common to select $d$ as $d_\pi$, the stationary distribution under $\pi$. One basic property of $d_\pi$ is that it satisfies $d_\pi^T P_\pi=d_\pi^T$, where $P_\pi$ is the state transition probability matrix (see the sketch below).
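
As a quick illustration of the property $d_\pi^T P_\pi=d_\pi^T$, the sketch below (my own example, with an illustrative transition matrix) finds the stationary distribution of a small Markov chain by power iteration.

```python
import numpy as np

# Illustrative 3-state transition matrix P_pi under some fixed policy:
# row s gives the probability of moving from state s to each next state.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.4, 0.2],
                 [0.3, 0.3, 0.4]])

# Power iteration: repeatedly apply d^T <- d^T P_pi until it converges.
d = np.full(3, 1.0 / 3)              # start from the uniform distribution
for _ in range(1000):
    d = d @ P_pi

print(d)                              # stationary distribution d_pi
print(np.allclose(d @ P_pi, d))       # True: d_pi^T P_pi = d_pi^T
```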

Two important equivalent expressions of $\bar{v}_\pi$

First one

$$J(\theta)=\lim_{n\to\infty}\mathbb{E}\left[\sum_{t=0}^n\gamma^tR_{t+1}\right]=\mathbb{E}\left[\sum_{t=0}^\infty\gamma^tR_{t+1}\right].$$

This is easy to show:

$$\begin{aligned}\mathbb{E}\left[\sum_{t=0}^\infty\gamma^tR_{t+1}\right]&=\sum_{s\in\mathcal{S}}d(s)\mathbb{E}\left[\sum_{t=0}^\infty\gamma^tR_{t+1}\,\Big|\,S_0=s\right]\\&=\sum_{s\in\mathcal{S}}d(s)v_\pi(s)\\&=\bar{v}_\pi.\end{aligned}$$

Second one

$$\bar{v}_\pi=d^Tv_\pi,$$

where $d=[\ldots,d(s),\ldots]^T$ and $v_\pi=[\ldots,v_\pi(s),\ldots]^T$ are the vectors of weights and state values.
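
A minimal numerical sketch of this vector form (my own example; the transition matrix, rewards, and weights are illustrative): $v_\pi$ is obtained from the Bellman equation in matrix form, $v_\pi=(I-\gamma P_\pi)^{-1}r_\pi$, and the metric is the weighted average $d^Tv_\pi$.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.1, 0.6, 0.3],     # illustrative state transition matrix
                 [0.4, 0.4, 0.2],
                 [0.3, 0.3, 0.4]])
r_pi = np.array([1.0, 0.0, 2.0])       # illustrative expected immediate rewards
d    = np.array([0.2, 0.5, 0.3])       # a chosen weight distribution d

# State values from the Bellman equation: v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

v_bar = d @ v_pi                       # average state value  d^T v_pi
print(v_pi, v_bar)
```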

Metric 2: Average reward

$$\bar{r}_\pi\doteq\sum_{s\in\mathcal{S}}d_\pi(s)r_\pi(s)=\mathbb{E}_{S\sim d_\pi}[r_\pi(S)],$$

where $d_\pi$ is the stationary distribution and

$$r_\pi(s)\doteq\sum_{a\in\mathcal{A}}\pi(a|s,\theta)r(s,a)=\mathbb{E}_{A\sim\pi(\cdot|s,\theta)}[r(s,A)\,|\,s]$$

is the expectation of the immediate rewards. Here, $r(s,a)\doteq\mathbb{E}[R|s,a]=\sum_r r\,p(r|s,a)$.
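
As a small sketch with illustrative numbers (chosen so that the resulting $r_\pi$ matches the hypothetical 3-state example used above), $r_\pi$ is just a policy-weighted average of the immediate rewards $r(s,a)$:

```python
import numpy as np

# Illustrative tables: pi_probs[s, a] = pi(a|s, theta) and r_sa[s, a] = r(s, a).
pi_probs = np.array([[0.5, 0.5],
                     [0.8, 0.2],
                     [0.6, 0.4]])
r_sa = np.array([[2.0,  0.0],
                 [1.0, -4.0],
                 [3.0,  0.5]])

# r_pi(s) = sum_a pi(a|s) * r(s, a), computed for all states at once.
r_pi = (pi_probs * r_sa).sum(axis=1)
print(r_pi)                            # [1.0, 0.0, 2.0]
```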

Two important equivalent expressions of $\bar{r}_\pi$

First one

$$J(\theta)=\lim_{n\to\infty}\frac{1}{n}\mathbb{E}\left[\sum_{t=0}^{n-1}R_{t+1}\right].$$

It is easy to show that

$$\lim_{n\to\infty}\frac{1}{n}\mathbb{E}\left[\sum_{t=0}^{n-1}R_{t+1}\right]=\sum_{s\in\mathcal{S}}d_\pi(s)r_\pi(s)=\bar{r}_\pi.$$

Second one

$$\bar{r}_\pi=\sum_{s\in\mathcal{S}}d_\pi(s)r_\pi(s)=d_\pi^Tr_\pi.$$

Summary

[Figure RL9-1]

The two metrics $\bar{v}_\pi$ and $\bar{r}_\pi$ are equivalent in the discounted case where $\gamma<1$. In particular, it can be shown that

$$\bar{r}_\pi=(1-\gamma)\bar{v}_\pi.$$
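
A short sketch that checks this equivalence numerically, reusing the illustrative 3-state quantities from the earlier sketches and taking $d=d_\pi$ (the case in which the equivalence holds):

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.1, 0.6, 0.3],     # illustrative transition matrix under pi
                 [0.4, 0.4, 0.2],
                 [0.3, 0.3, 0.4]])
r_pi = np.array([1.0, 0.0, 2.0])       # illustrative expected immediate rewards

# Stationary distribution d_pi (power iteration, as in the sketch above).
d_pi = np.full(3, 1.0 / 3)
for _ in range(1000):
    d_pi = d_pi @ P_pi

# The two metrics, both weighted by d = d_pi.
v_pi  = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
v_bar = d_pi @ v_pi                    # average state value
r_bar = d_pi @ r_pi                    # average reward

print(np.isclose(r_bar, (1 - gamma) * v_bar))   # True
```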

9.3 Gradients of the metrics
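
The gradients of the metrics above share, exactly or approximately depending on the metric, the compact form

$$\nabla_\theta J(\theta)=\mathbb{E}\big[\nabla_\theta\ln\pi(A|S,\theta)\,q_\pi(S,A)\big],$$

which is the expression used directly in the next section.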

9.4 Monte Carlo policy gradient (REINFORCE)

The gradient-ascent algorithm for maximizing $J(\theta)$ is

$$\begin{aligned}\theta_{t+1}&=\theta_t+\alpha\nabla_\theta J(\theta_t)\\&=\theta_t+\alpha\mathbb{E}\Big[\nabla_\theta\ln\pi(A|S,\theta_t)\,q_\pi(S,A)\Big],\end{aligned}$$

where $\alpha>0$ is a constant learning rate. Since the true gradient is unknown, we can replace the true gradient with a stochastic gradient to obtain the following algorithm:

$$\theta_{t+1}=\theta_t+\alpha\nabla_\theta\ln\pi(a_t|s_t,\theta_t)\,q_t(s_t,a_t),$$

where $q_t(s_t,a_t)$ is an approximation of $q_\pi(s_t,a_t)$. If $q_t(s_t,a_t)$ is obtained by Monte Carlo estimation, the algorithm is called REINFORCE or Monte Carlo policy gradient, which is one of the earliest and simplest policy gradient algorithms.

Since $\nabla_\theta\ln\pi(a_t|s_t,\theta_t)=\dfrac{\nabla_\theta\pi(a_t|s_t,\theta_t)}{\pi(a_t|s_t,\theta_t)}$, we have

$$\theta_{t+1}=\theta_t+\alpha\underbrace{\left(\frac{q_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}\right)}_{\beta_t}\nabla_\theta\pi(a_t|s_t,\theta_t),$$

which can be written concisely as

$$\theta_{t+1}=\theta_t+\alpha\beta_t\nabla_\theta\pi(a_t|s_t,\theta_t).$$

It is clear that, when the step size is sufficiently small, $\pi(a_t|s_t,\theta_{t+1})\geq\pi(a_t|s_t,\theta_t)$ when $\beta_t\geq0$ and $\pi(a_t|s_t,\theta_{t+1})<\pi(a_t|s_t,\theta_t)$ when $\beta_t<0$.

Unfortunately, the ideal ways of sampling $S$ and $A$ are not strictly followed in practice due to their low sample efficiency. A more sample-efficient implementation is given in Figure RL9-1.
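
To make the update concrete, here is a minimal REINFORCE sketch under assumptions of my own: a softmax policy over one-hot state features and a generic `env` with `reset() -> s` and `step(a) -> (s_next, r, done)` (these interface details are illustrative, not from the post). It estimates $q_t(s_t,a_t)$ by the Monte Carlo return from time $t$ and then applies the stochastic gradient update above; it shows the basic update only, not the more sample-efficient implementation referenced in the figure.

```python
import numpy as np

def softmax_pi(theta, s):
    """pi(.|s, theta) for a softmax policy with one-hot state features."""
    logits = theta[s]
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_episode(env, theta, alpha=0.01, gamma=0.9):
    """Run one episode and apply the REINFORCE update for every visited (s_t, a_t)."""
    states, actions, rewards = [], [], []
    s = env.reset()                              # assumed environment interface
    done = False
    while not done:
        a = np.random.choice(len(theta[s]), p=softmax_pi(theta, s))
        s_next, r, done = env.step(a)            # assumed environment interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    # Monte Carlo estimate of q_pi(s_t, a_t): the discounted return from time t.
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g

    # Stochastic gradient ascent: theta <- theta + alpha * grad ln pi(a_t|s_t) * q_t.
    for s_t, a_t, q_t in zip(states, actions, returns):
        probs = softmax_pi(theta, s_t)
        grad_ln_pi = -probs                      # gradient of ln pi(a_t|s_t) w.r.t. theta[s_t]
        grad_ln_pi[a_t] += 1.0                   # ... equals one_hot(a_t) - probs
        theta[s_t] += alpha * grad_ln_pi * q_t
    return theta
```

In practice `reinforce_episode` would be called repeatedly over many episodes, starting from some initial `theta` such as all zeros, so the policy is updated after every episode.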
