Reinforcement learning uses training information to
- evaluate the actions taken
- rather than instruct by giving correct actions
1 An n-Armed Bandit Problem
Consider the following learning problem:
- You are faced repeatedly with a choice among n different actions.
- After each choice, you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected.
- Your objective is to maximize the expected total reward over some time period.
In this n-armed bandit problem, each action has an expected reward given that it is selected; we call this the value of that action:

$$q_*(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]$$
- If you knew the value of each action, then it would be trivial to solve the n-armed bandit problem: you would always select the action with highest value.
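As a concrete reference point, here is a minimal Python sketch of a stationary n-armed bandit (the class name `NArmedBandit` and the Gaussian reward model are my own illustrative choices, not from the text):

```python
import numpy as np

class NArmedBandit:
    """A stationary n-armed bandit: each arm pays a noisy reward drawn
    around a fixed, unknown true value q*(a)."""

    def __init__(self, n=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=n)  # true action values

    def pull(self, a):
        # Reward comes from a stationary distribution that depends on the chosen action.
        return self.rng.normal(self.q_star[a], 1.0)

    def optimal_action(self):
        # If we knew q*(a), the problem would be trivial: always pick this action.
        return int(np.argmax(self.q_star))
```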
2 Action-Value Methods
Sample-average methods estimate action values as the average of the rewards actually received:

$$Q_t(a) = \frac{R_1 + R_2 + \cdots + R_{N_t(a)}}{N_t(a)}$$

where $N_t(a)$ is the number of times action $a$ has been selected prior to time $t$, and $R_1, \ldots, R_{N_t(a)}$ are the rewards received on those selections. The simplest selection rule is to pick the action with the highest estimated value; this greedy action selection method can be written as

$$A_t = \arg\max_a Q_t(a)$$
$\varepsilon$-greedy methods: with small probability $\varepsilon$, select an action at random (uniformly, independently of the value estimates); otherwise select greedily (a minimal sketch appears at the end of this section).
- An advantage of these methods is that, in the limit as the number of plays increases, every action will be sampled an infinite number of times, guaranteeing that $N_t(a) \to \infty$ for all $a$, and thus ensuring that all the $Q_t(a)$ converge to $q_*(a)$.
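A minimal sketch of $\varepsilon$-greedy action selection with sample-average estimates, using the hypothetical `NArmedBandit` environment above (function and variable names are illustrative):

```python
import numpy as np

def run_sample_average(bandit, n_actions, steps=1000, epsilon=0.1, seed=1):
    """Epsilon-greedy action selection; Q is the plain sample average of rewards."""
    rng = np.random.default_rng(seed)
    reward_sum = np.zeros(n_actions)   # total reward received per action
    count = np.zeros(n_actions)        # N_t(a): times each action was selected
    Q = np.zeros(n_actions)            # sample-average estimates Q_t(a)
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))   # explore: uniform random action
        else:
            a = int(np.argmax(Q))              # exploit: greedy action
        r = bandit.pull(a)
        reward_sum[a] += r
        count[a] += 1
        Q[a] = reward_sum[a] / count[a]        # sample average
    return Q
```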
3 Incremental Implementation
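The sample averages above can be computed without storing every reward. Using the standard incremental form $Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$, each new estimate is the old estimate plus a step size times the error $[R_n - Q_n]$, so only the current estimate and the selection count are needed. A minimal sketch (the helper name is my own):

```python
def update_incremental(Q, N, a, r):
    """Incremental sample-average update: same result as averaging all stored
    rewards, but with constant memory and constant per-step computation."""
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]   # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
```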
4 Tracking a Nonstationary Problem
If the bandit is changing over time (nonstationary), it makes sense to weight recent rewards more heavily than long-past ones. One way to do this is to use a constant step-size parameter, as in the sketch at the end of this section:

$$Q_{n+1} = Q_n + \alpha \left[ R_n - Q_n \right]$$

where the step-size parameter $\alpha \in (0, 1]$ is constant. This results in $Q_{n+1}$ being a weighted average of past rewards and the initial estimate $Q_1$:

$$Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i$$
A well-known result in stochastic approximation theory gives the conditions on the step sizes $\alpha_n(a)$ required to assure convergence with probability 1:

$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$$
- The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations.
- The second condition guarantees that eventually the steps become small enough to assure convergence. Both conditions are met for the sample-average case $\alpha_n(a) = \frac{1}{n}$, but not for a constant step size; with constant $\alpha$ the estimates never completely converge but continue to track the most recent rewards, which is desirable in a nonstationary problem.
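A minimal sketch of the constant step-size update (the helper name and default $\alpha = 0.1$ are illustrative):

```python
def update_constant_alpha(Q, a, r, alpha=0.1):
    """Exponential recency-weighted average: recent rewards count more than
    old ones, which is what we want when the true values q*(a) drift."""
    Q[a] += alpha * (r - Q[a])   # Q_{n+1} = Q_n + alpha * (R_n - Q_n)
```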
5 Optimistic Initial Values
The methods discussed so far all depend to some extent on the initial action-value estimates $Q_1(a)$; in other words, they are biased by their initial estimates. For sample-average methods the bias disappears once every action has been selected at least once, but for methods with constant $\alpha$, the bias is permanent (though it decreases over time).
- The downside is that the initial estimates become, in effect, a set of parameters that must be picked by the user, if only to set them all to zero.
- The upside is that they provide an easy way to supply some prior knowledge about what level of rewards can be expected.
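Initial values can also be used to encourage exploration: if the estimates are set optimistically high (e.g., $Q_1(a) = +5$ when rewards are roughly unit-variance around $q_*(a)$), whichever action is tried first returns a reward below its estimate, so even a purely greedy learner keeps switching to other, still-optimistic actions for a while. A minimal sketch (names and defaults are illustrative):

```python
import numpy as np

def run_optimistic_greedy(bandit, n_actions, steps=1000, q0=5.0, alpha=0.1):
    """Purely greedy selection; exploration comes only from the optimistic Q_1."""
    Q = np.full(n_actions, q0)          # optimistic initial estimates Q_1(a)
    for _ in range(steps):
        a = int(np.argmax(Q))           # greedy: no explicit exploration
        r = bandit.pull(a)
        Q[a] += alpha * (r - Q[a])      # constant-alpha update
    return Q
```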
6 Upper-Confidence-Bound Action Selection
When exploring, it would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One effective way of doing this is to select actions as

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where the number $c > 0$ controls the degree of exploration. If $N_t(a) = 0$, then $a$ is considered to be a maximizing action. (A code sketch of this selection rule appears at the end of this section.)
- The idea of UCB action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value.
- Each time $a$ is selected, the uncertainty is presumably reduced: $N_t(a)$ is incremented, and since it appears in the denominator, the uncertainty term shrinks.
- Each time an action other than $a$ is selected, $t$ increases but $N_t(a)$ does not; since $t$ appears in the numerator, the uncertainty estimate grows. The use of the natural logarithm means that the increase gets smaller over time, but it is unbounded, so all actions will eventually be selected.
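A minimal sketch of UCB action selection, where `t` is the current time step (starting at 1) and `N` holds the per-action counts maintained as in the incremental implementation above (the default `c` and the tie-breaking for untried actions are illustrative):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ].
    Any action with N(a) == 0 is treated as maximizing, so it is tried first."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```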
7 Gradient Bandits
In this section we consider learning a numerical preference $H_t(a)$ for each action $a$. The action probabilities are then determined by a soft-max distribution:

$$\pi_t(a) = \Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}}$$
There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. On each step, after selecting action $A_t$ and receiving reward $R_t$, the preferences are updated by:

$$H_{t+1}(A_t) = H_t(A_t) + \alpha \left( R_t - \bar{R}_t \right) \left( 1 - \pi_t(A_t) \right), \qquad H_{t+1}(a) = H_t(a) - \alpha \left( R_t - \bar{R}_t \right) \pi_t(a) \quad \text{for all } a \neq A_t$$
where $\alpha > 0$ is a step-size parameter, and $\bar{R}_t$ is the average of all the rewards up through and including time $t$, which can be computed incrementally. The $\bar{R}_t$ term serves as a baseline with which the reward is compared. If the reward is higher than the baseline, then the probability of taking $A_t$ in the future is increased; if the reward is below the baseline, the probability is decreased. The non-selected actions move in the opposite direction.
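A minimal sketch of the gradient bandit updates (names are illustrative; subtracting the maximum preference inside the softmax is only for numerical stability and does not change the probabilities):

```python
import numpy as np

def run_gradient_bandit(bandit, n_actions, steps=1000, alpha=0.1, seed=2):
    """Softmax over preferences H, updated by stochastic gradient ascent
    with the running average reward as baseline."""
    rng = np.random.default_rng(seed)
    H = np.zeros(n_actions)              # preferences H_t(a)
    baseline = 0.0                       # average of all rewards so far (R-bar)
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                   # softmax action probabilities pi_t(a)
        a = int(rng.choice(n_actions, p=pi))
        r = bandit.pull(a)
        baseline += (r - baseline) / t   # average reward through time t (incl. R_t)
        indicator = np.zeros(n_actions)
        indicator[a] = 1.0
        # Selected action's preference rises if r beats the baseline;
        # all other actions move in the opposite direction.
        H += alpha * (r - baseline) * (indicator - pi)
    return H
```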