Policies can be represented by parameterized functions, denoted as $\pi(a \mid s, \theta)$, where $\theta \in \mathbb{R}^m$ is a parameter vector.
When policies are represented as functions, optimal policies can be obtained by optimizing certain scalar metrics. Such methods are called policy gradient methods.
## 9.1 Policy representation: From table to function
Compared with the tabular representation, the function representation differs in the following aspects:

|  | Tabular representation | Parameterized function |
| --- | --- | --- |
| How to define optimal policies? | Maximize every state value. | Maximize certain scalar metrics. |
| How to update a policy? | Change the entries in the table. | Change the parameter $\theta$. |
| How to retrieve the probability of an action? | Look up the corresponding entry in the table. | Compute the function value $\pi(a \mid s, \theta)$. |
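The difference is easy to see in code. Below is a minimal sketch contrasting the two representations; the linear-softmax form and the feature map `phi` are illustrative assumptions, not the only way to parameterize $\pi(a \mid s, \theta)$.

```python
import numpy as np

n_states, n_actions, n_features = 3, 2, 4

# Tabular representation: one probability entry per (state, action) pair.
policy_table = np.full((n_states, n_actions), 1.0 / n_actions)
prob_tabular = policy_table[0, 1]          # retrieve by table lookup

# Parameterized representation: a softmax over linear scores phi(s, a)^T theta.
# The feature map phi is a made-up stand-in for whatever features one chooses.
rng = np.random.default_rng(0)
phi = rng.standard_normal((n_states, n_actions, n_features))  # phi(s, a)
theta = np.zeros(n_features)               # parameter vector in R^m

def pi(a, s, theta):
    scores = phi[s] @ theta                # one score per action in state s
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[a]

prob_function = pi(1, 0, theta)            # retrieve by evaluating the function
```

Updating the tabular policy means overwriting entries of `policy_table`; updating the parameterized policy means changing `theta`, which changes the probabilities of all actions in all states at once.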
Suppose that $J(\theta)$ is a scalar metric. Then optimal policies can be obtained by

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t),$$

where $\nabla_\theta J$ is the gradient of $J$ with respect to $\theta$, $t$ is the time step, and $\alpha$ is the optimization rate.
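To make the update concrete, here is a minimal sketch of gradient ascent on a toy scalar function; the quadratic $J$ and the finite-difference gradient are illustrative assumptions, not the actual RL metric or its gradient.

```python
import numpy as np

# Toy stand-in for a scalar metric J(theta), maximized at theta = [1, -2].
def J(theta):
    target = np.array([1.0, -2.0])
    return -np.sum((theta - target) ** 2)

def grad_J(theta, eps=1e-5):
    # Finite-difference gradient; in RL the gradient would instead be
    # estimated from samples via the policy gradient theorem.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return g

theta = np.zeros(2)
alpha = 0.1                                  # optimization rate
for t in range(100):
    theta = theta + alpha * grad_J(theta)    # ascent step along the gradient
print(theta)                                 # approaches [1, -2]
```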
A common metric is the weighted average of the state values,

$$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s),$$

where $d(s)$ is the weight of state $s$. It satisfies $d(s) \ge 0$ for any $s \in \mathcal{S}$ and $\sum_{s \in \mathcal{S}} d(s) = 1$. Therefore, we can interpret $d(s)$ as a probability distribution over the states. Then, the metric can be written as

$$\bar{v}_\pi = \mathbb{E}_{S \sim d}\big[v_\pi(S)\big].$$

How should the distribution $d$ be selected? In the first case, $d$ is independent of the policy $\pi$; to emphasize this, it is denoted as $d_0$. Two common choices are:

- Treat all the states as equally important and select $d_0(s) = 1/|\mathcal{S}|$.
- We are only interested in a specific state $s_0$ (e.g., the agent always starts from $s_0$). In this case, we can select $d_0(s_0) = 1$ and $d_0(s) = 0$ for $s \neq s_0$.

In the second case, $d$ depends on the policy $\pi$. It is then common to select $d$ as $d_\pi$, the stationary distribution under $\pi$. One basic property of $d_\pi$ is that it satisfies

$$d_\pi^{T} P_\pi = d_\pi^{T},$$

where $P_\pi$ is the state transition probability matrix.
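As a quick illustration, the sketch below estimates $d_\pi$ for a made-up 3-state transition matrix by repeatedly applying $d^{T} \leftarrow d^{T} P_\pi$, then checks the stationarity property and forms the weighted metric; `P_pi` and `v_pi` are placeholder values, not derived from any particular policy.

```python
import numpy as np

# Made-up state transition matrix under some policy pi; rows sum to 1.
P_pi = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

d = np.full(3, 1.0 / 3)            # start from any distribution
for _ in range(1000):
    d = d @ P_pi                   # iterate d^T <- d^T P_pi until convergence

print(d)                           # stationary distribution d_pi
print(np.allclose(d @ P_pi, d))    # True: d_pi^T P_pi = d_pi^T

# With d_pi in hand, the metric is the weighted sum of state values;
# v_pi here is just a placeholder vector of state values.
v_pi = np.array([1.0, 2.0, 3.0])
bar_v = d @ v_pi
```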
The gradient-ascent algorithm for maximizing $J(\theta)$ is

$$\theta_{t+1} = \theta_t + \alpha\, \mathbb{E}\big[\nabla_\theta \ln \pi(A \mid S, \theta_t)\, q_\pi(S, A)\big],$$

where $\alpha > 0$ is a constant learning rate. Since the true gradient is unknown, we can replace it with a stochastic gradient to obtain the following algorithm:
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t),$$
where $q_t(s_t, a_t)$ is an approximation of $q_\pi(s_t, a_t)$. If $q_t(s_t, a_t)$ is obtained by Monte Carlo estimation, the algorithm is called REINFORCE or Monte Carlo policy gradient, which is one of the earliest and simplest policy gradient algorithms.
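The sketch below is a minimal REINFORCE loop on a made-up one-state, two-action problem (episodes of length 1, so the Monte Carlo return is just the observed reward), intended only to show the shape of the update $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t)$; the environment, reward values, and softmax-over-preferences parameterization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 2
theta = np.zeros(n_actions)          # one preference per action
alpha = 0.1
reward_means = np.array([0.0, 1.0])  # action 1 is better on average (assumed)

def pi(theta):
    e = np.exp(theta - theta.max())  # softmax policy pi(a | theta)
    return e / e.sum()

for episode in range(2000):
    probs = pi(theta)
    a = rng.choice(n_actions, p=probs)                 # sample a_t ~ pi(.|s_t, theta_t)
    G = reward_means[a] + 0.1 * rng.standard_normal()  # Monte Carlo return q_t(s_t, a_t)
    grad_ln_pi = -probs                                # gradient of ln softmax:
    grad_ln_pi[a] += 1.0                               #   e_a - pi(.|theta)
    theta = theta + alpha * grad_ln_pi * G             # REINFORCE update

print(pi(theta))   # probability mass concentrates on the better action
```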
Since $\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) = \dfrac{\nabla_\theta \pi(a_t \mid s_t, \theta_t)}{\pi(a_t \mid s_t, \theta_t)}$, the update can be rewritten as

$$\theta_{t+1} = \theta_t + \alpha \underbrace{\left( \frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)} \right)}_{\beta_t} \nabla_\theta \pi(a_t \mid s_t, \theta_t).$$
It is clear that $\pi(a_t \mid s_t, \theta_{t+1}) \ge \pi(a_t \mid s_t, \theta_t)$ when $\beta_t \ge 0$ and $\pi(a_t \mid s_t, \theta_{t+1}) < \pi(a_t \mid s_t, \theta_t)$ when $\beta_t < 0$.
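To see why, a first-order Taylor expansion of $\pi$ in $\theta$ (a sketch, valid for a sufficiently small step size $\alpha$) gives

$$\pi(a_t \mid s_t, \theta_{t+1}) \approx \pi(a_t \mid s_t, \theta_t) + \big(\nabla_\theta \pi(a_t \mid s_t, \theta_t)\big)^{T}(\theta_{t+1} - \theta_t) = \pi(a_t \mid s_t, \theta_t) + \alpha \beta_t \big\| \nabla_\theta \pi(a_t \mid s_t, \theta_t) \big\|^{2},$$

so, to first order, the sign of the change in $\pi(a_t \mid s_t, \theta_t)$ matches the sign of $\beta_t$: sampling an action with a positive estimated value increases its probability, and vice versa.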
Unfortunately, the ideal way of sampling $S$ and $A$ is not strictly followed in practice because of its low sample efficiency. A more sample-efficient implementation is given below.