Multi-Arm Bandits

Reinforcement learning uses training information to

  • evaluate the actions taken
  • rather than instruct by giving correct actions

1 An n-Armed Bandit Problem

Consider the following learning problem:

  • you are faced repeatedly with a choice among n different actions.
  • After each choice, you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected.
  • Your objective is to maximize the expected total reward over some time period.

In this $n$-armed bandit problem, each action has an expected reward given that the action is selected; we call this the value of that action.

  • If you knew the value of each action, then it would be trivial to solve the n-armed bandit problem: you would always select the action with highest value.
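
To make the setup concrete, here is a minimal Python sketch of a stationary $n$-armed bandit testbed; the class name `GaussianBandit` and the Gaussian reward model are illustrative assumptions rather than part of the original problem statement.

```python
import numpy as np

class GaussianBandit:
    """Stationary n-armed bandit: each action's reward is drawn from a fixed
    normal distribution whose mean is that action's (unknown) value q(a)."""

    def __init__(self, n_actions=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_true = self.rng.normal(0.0, 1.0, n_actions)  # true action values

    def step(self, action):
        # reward from a stationary distribution that depends on the chosen action
        return self.rng.normal(self.q_true[action], 1.0)

    def optimal_action(self):
        # used only for evaluation; a learner never observes q_true directly
        return int(np.argmax(self.q_true))
```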

2 Action-Value Methods

Sample-average methods estimate the value of an action as the average of the rewards actually received when that action was selected:

$$Q_t(a)=\frac{R_1+R_2+\cdots+R_{N_t(a)}}{N_t(a)},$$

where $N_t(a)$ is the number of times action $a$ has been selected prior to time $t$.

This greedy action selection method can be written as

$$A_t=\operatorname*{\arg\max}_a Q_t(a).$$

$\varepsilon$-greedy methods: with small probability $\varepsilon$, select an action at random instead of the greedy one.

  • An advantage of these methods is that, in the limit as the number of plays increases, every action will be sampled an infinite number of times, guaranteeing that $N_t(a)\to\infty$ for all $a$, and thus ensuring that all the $Q_t(a)$ converge to $q(a)$.
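
To make this concrete, here is a minimal sketch of $\varepsilon$-greedy action selection with sample-average estimates on a simulated 10-armed bandit; the variable names and the Gaussian reward model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 10
q_true = rng.normal(0.0, 1.0, n_actions)   # unknown true action values q(a)
reward_sums = np.zeros(n_actions)          # sum of rewards received per action
action_counts = np.zeros(n_actions)        # N_t(a)
q_estimates = np.zeros(n_actions)          # Q_t(a), sample-average estimates
epsilon = 0.1

for t in range(1000):
    # epsilon-greedy: with small probability epsilon pick a random action,
    # otherwise pick the greedy action argmax_a Q_t(a)
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(q_estimates))

    reward = rng.normal(q_true[action], 1.0)   # stationary reward distribution

    # sample-average estimate: mean of all rewards received for this action
    reward_sums[action] += reward
    action_counts[action] += 1
    q_estimates[action] = reward_sums[action] / action_counts[action]
```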

3 Incremental Implementation

The action-value estimates can be updated online with a rule of the general form

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\big[\text{Target} - \text{OldEstimate}\big].$$

For sample averages, the step size is $1/k$, where $k$ counts the rewards received so far for the action being updated.
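
A minimal sketch of how the sample average can be maintained in this incremental form, so that no list of past rewards has to be stored (the function name is illustrative):

```python
def update_sample_average(old_estimate, reward, count):
    """Incremental sample average: StepSize = 1/k for the k-th reward.

    Equivalent to averaging all rewards received so far, but it needs only
    the previous estimate and the number of rewards observed for the action.
    """
    step_size = 1.0 / count
    return old_estimate + step_size * (reward - old_estimate)


# usage: after receiving the k-th reward for action a
# q_estimates[a] = update_sample_average(q_estimates[a], reward, k)
```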

4 Tracking a Nonstationary Problem

If the bandit is changing over time, we can use a constant step-size parameter.

$$Q_{k+1}=Q_k+\alpha\big[R_k-Q_k\big],$$

where the step-size parameter $\alpha\in(0,1]$ is constant. This results in $Q_{k+1}$ being a weighted average of past rewards and the initial estimate $Q_1$:

$$Q_{k+1}=(1-\alpha)^kQ_1+\sum_{i=1}^k\alpha(1-\alpha)^{k-i}R_i.$$
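
A sketch of the constant step-size update; compared with the incremental sample average, the only change is that the step size no longer shrinks with the number of rewards (the helper name and default $\alpha$ are illustrative):

```python
def update_constant_alpha(old_estimate, reward, alpha=0.1):
    """Exponential recency-weighted average: Q_{k+1} = Q_k + alpha * (R_k - Q_k).

    Past rewards are down-weighted geometrically by (1 - alpha)^(k - i),
    so the estimate keeps tracking a reward distribution that drifts over time.
    """
    return old_estimate + alpha * (reward - old_estimate)
```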

A well-known result in stochastic approximation theory gives us the conditions required to assure convergence with probability 1:

$$\sum_{k=1}^\infty\alpha_k(a)=\infty \qquad\text{and}\qquad \sum_{k=1}^\infty\alpha_k^2(a)<\infty.$$

  • The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations.
  • The second condition guarantees that eventually the steps become small enough to assure convergence.
  • A constant step size $\alpha_k(a)=\alpha$ satisfies the first condition but not the second: the estimates never completely converge and instead keep tracking the most recent rewards, which is exactly what we want in a nonstationary problem.

5 Optimistic Initial Values

Initial action-value estimates introduce a bias. For sample-average methods the bias disappears once every action has been selected at least once, but for methods with constant $\alpha$ the bias is permanent (though it decreases over time).

  • The downside is that the initial estimates become, in effect, a set of parameters that must be picked by the user, if only to set them all to zero.
  • The upside is that they provide an easy way to supply some prior knowledge about what level of rewards can be expected.
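
As an illustration, here is a minimal sketch of optimistic initial values (the initial estimate of 5.0 and the reward model are assumptions for the example): starting every estimate far above any plausible reward makes each greedy choice initially disappointing, so the learner is pushed to try all actions even without $\varepsilon$-exploration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 10
q_true = rng.normal(0.0, 1.0, n_actions)   # true values are roughly in [-3, 3]
q_estimates = np.full(n_actions, 5.0)      # optimistic initial estimates Q_1(a) = +5
alpha = 0.1                                # constant step size: the initial bias fades but never fully vanishes

for t in range(1000):
    action = int(np.argmax(q_estimates))   # purely greedy selection
    reward = rng.normal(q_true[action], 1.0)
    # each update pulls the optimistic estimate down toward realistic values,
    # so early on the greedy choice keeps switching between actions
    q_estimates[action] += alpha * (reward - q_estimates[action])
```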

6 Upper-Confidence-Bound Action Selection

When exploring, it would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One effective way of doing this is to select actions as

$$A_t=\arg\max_a\left[Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\,\right],$$

where the number $c>0$ controls the degree of exploration. If $N_t(a)=0$, then $a$ is considered to be a maximizing action.

  • The idea of UCB action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value.
  • Each time $a$ is selected, the uncertainty is presumably reduced.
  • The use of the natural logarithm means that the increase gets smaller over time, but is unbounded, so all actions will eventually be selected.
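
A minimal sketch of UCB action selection as described above; the function and argument names are illustrative.

```python
import numpy as np

def ucb_action(q_estimates, action_counts, t, c=2.0):
    """Select argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ] at time step t >= 1."""
    untried = np.flatnonzero(action_counts == 0)
    if untried.size > 0:
        # an action with N_t(a) = 0 is considered a maximizing action
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / action_counts)   # uncertainty term
    return int(np.argmax(q_estimates + bonus))
```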

7 Gradient Bandits

In this section we consider learning a numerical preference $H_t(a)$ for each action $a$. The action probabilities are then given by a soft-max distribution:

$$\Pr\{A_t=a\}=\frac{e^{H_t(a)}}{\sum_{b=1}^n e^{H_t(b)}}=\pi_t(a).$$

There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. On each step, after selecting the action AtA_t and receiving the reward RtR_t, the preferences are updated by:

$$\begin{aligned}
H_{t+1}(A_t)&=H_t(A_t)+\alpha\big(R_t-\bar{R}_t\big)\big(1-\pi_t(A_t)\big), \qquad &&\text{and}\\
H_{t+1}(a)&=H_t(a)-\alpha\big(R_t-\bar{R}_t\big)\pi_t(a), \qquad &&\text{for all } a\neq A_t,
\end{aligned}$$

where $\alpha>0$ is a step-size parameter, and $\bar{R}_t\in\mathbb{R}$ is the average of all the rewards up through and including time $t$, which can be computed incrementally. The $\bar{R}_t$ term serves as a baseline with which the reward is compared. If the reward is higher than the baseline, then the probability of taking $A_t$ in the future is increased; if the reward is below the baseline, then the probability is decreased. The non-selected actions move in the opposite direction.
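
A sketch of one step of this gradient bandit update (soft-max over the preferences plus the baseline); the function name and the `reward_fn` callback are illustrative assumptions.

```python
import numpy as np

def gradient_bandit_step(preferences, avg_reward, reward_fn, alpha, rng):
    """One step of the gradient bandit update with baseline avg_reward."""
    # pi_t(a): soft-max of the preferences (max subtracted for numerical stability)
    exp_h = np.exp(preferences - preferences.max())
    pi = exp_h / exp_h.sum()

    action = int(rng.choice(len(preferences), p=pi))
    reward = reward_fn(action)

    # H_{t+1}(A_t) += alpha * (R_t - baseline) * (1 - pi(A_t))
    # H_{t+1}(a)  -= alpha * (R_t - baseline) * pi(a)        for a != A_t
    one_hot = np.zeros_like(preferences)
    one_hot[action] = 1.0
    preferences = preferences + alpha * (reward - avg_reward) * (one_hot - pi)
    return preferences, action, reward
```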

8 Associative Search (Contextual Bandits)
