Prelim notes: RL

General formulation of the reinforcement learning problem: I am an agent inside an environment. Minimally, this environment consists of a set of states $S$ and a set of actions $A$. I may additionally have a model of this environment, which takes (state, action) pairs to distributions over states. At every time step $t$, I observe a state $s_t$ and select an action $a_t$. I then transition to a new state $s_{t+1}$ according to some transition function $P$ (not necessarily the same as my model!) and receive a reward $r_{t+1}$.

My goal is to maximize expected total reward, either up to a fixed time horizon $T$, or a discounted future reward $\sum_t \gamma^t r_t$. My behavior will be defined by a policy $\pi$, which maps states to distributions over actions. Then the reinforcement learning problem is to define a policy that maximizes my reward in the sense just given.
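To keep the later sketches concrete, here is a hypothetical encoding of a tiny MDP as numpy arrays (the names and shapes are my own convention, not anything standard): P[s, a, s'] holds transition probabilities, R[s, a, s'] rewards, and a policy is an array pi[s, a] of action probabilities.

```python
import numpy as np

# A tiny two-state, two-action MDP, encoded as dense arrays.
# P[s, a, s'] = probability of landing in s' after taking a in s.
# R[s, a, s'] = reward received on that transition.
n_states, n_actions = 2, 2
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions out of state 0
    [[0.0, 1.0], [0.7, 0.3]],   # transitions out of state 1
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],   # rewards leaving state 0
    [[0.0, 0.0], [5.0, 0.0]],   # rewards leaving state 1
])
gamma = 0.9

# A policy maps each state to a distribution over actions: pi[s, a].
pi_uniform = np.full((n_states, n_actions), 0.5)
```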

Value iteration with a fixed policy (policy evaluation)

Suppose I already know the true transition function $P(s,a,s')$ and reward function $R(s,a,s')$. Basic question: given a fixed policy $\pi$, what is the value (expected total future reward) $V^\pi(s)$ of being in state $s$? Alternatively, what is the value $Q^\pi(s,a)$ of taking action $a$ in state $s$? (We'll ignore the finite-horizon case from now on; it's straightforward.) Certainly the following relations hold:

\[ V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P(s,a,s') [R(s,a,s') + \gamma V^\pi(s')] \]

\[ Q^\pi(s,a) = \sum_{s'} P(s,a,s') [R(s,a,s') + \gamma \sum_{a'} \pi(s',a') Q^\pi(s',a')] \]

These are called the Bellman equations. In each case, they define a system of $|S|$ (or $|S||A|$) linear equations in the same number of unknowns, and we can just throw the system at a solver to get the right values for $V^\pi$ or $Q^\pi$.
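A sketch of the direct approach under the array convention above (assembling the $|S| \times |S|$ system and handing it to a solver):

```python
import numpy as np

def policy_evaluation_exact(P, R, pi, gamma):
    """Solve the Bellman equations for V^pi as a linear system.

    P[s, a, s'], R[s, a, s'] define the MDP; pi[s, a] is the policy.
    Returns V with V[s] = expected discounted return from s under pi.
    """
    n_states = P.shape[0]
    # Policy-averaged transition matrix: P_pi[s, s'] = sum_a pi[s, a] P[s, a, s']
    P_pi = np.einsum('sa,sat->st', pi, P)
    # Expected one-step reward: r_pi[s] = sum_a pi[s, a] sum_s' P[s, a, s'] R[s, a, s']
    r_pi = np.einsum('sa,sat,sat->s', pi, P, R)
    # Solve (I - gamma * P_pi) V = r_pi.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```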

Alternatively, there is an iterative method that often finds the solution faster. In particular, treat the Bellman equation for $V^\pi$ given above as an operator rather than an equation: given a current estimate of the value function, the right-hand side produces a new estimate. First observe that the true value function is a fixed point of this operator. It can be shown (TODO show it; easy in this case) that it is a contraction mapping; it follows from Banach's fixed point theorem that the fixed point is unique, and repeated application of this backup will eventually converge to it.
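The iterative version, under the same assumed conventions, just applies the backup until the values stop changing:

```python
import numpy as np

def policy_evaluation_iterative(P, R, pi, gamma, tol=1e-8):
    """Repeatedly apply the Bellman backup for a fixed policy pi.

    Converges to the same V^pi as the linear solve, by the
    contraction-mapping argument.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # V_new[s] = sum_a pi[s,a] sum_s' P[s,a,s'] (R[s,a,s'] + gamma V[s'])
        V_new = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```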

Policy iteration

What I really want is the best policy $\pi^*$. We can write down the conditions for this just like before (these are the Bellman optimality equations):

\[ V^*(s) = \max_a \sum_{s'} P(s,a,s') [R(s,a,s') + \gamma V^*(s') ] \]

(assume, toward contradiction, that the optimal policy did something other than take the max over $a$ above; then we could improve its value by choosing the max instead).

Now suppose, given some $V^\pi$, that I define a new policy $\pi'$ to greedily select actions which maximize the value of the successor state. It's easy to see that the value of the new policy is at least as good as the old one: at every step, $\max_a Q^\pi(s,a) \ge V^\pi(s)$, so acting greedily for one step and then following $\pi$ is no worse than following $\pi$ throughout, and by induction following $\pi'$ everywhere is no worse either.

Clearly, for an optimal policy, this greedy selection leaves the overall value function unchanged; if, after applying greedy selection, the value function is unchanged, then the Bellman optimality equations are already satisfied. So iteration of this process:

  1. Compute the value $V^\pi$ of the current policy $\pi$ (policy evaluation)
  2. Compute the greedy policy with respect to $V^\pi$ (policy improvement)

will eventually take us to a (non-unique) optimal policy and the (unique) optimal value function.
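A sketch of the full loop under the same assumed conventions, alternating exact evaluation with greedy improvement:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation with greedy improvement."""
    n_states, n_actions = P.shape[0], P.shape[1]
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        # Policy evaluation: solve (I - gamma P_pi) V = r_pi.
        P_pi = np.einsum('sa,sat->st', pi, P)
        r_pi = np.einsum('sa,sat,sat->s', pi, P, R)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        Q = np.einsum('sat,sat->sa', P, R + gamma * V)
        pi_new = np.zeros_like(pi)
        pi_new[np.arange(n_states), Q.argmax(axis=1)] = 1.0
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```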

Value iteration

What if we just treat the Bellman optimality equations as an operator like we did the Bellman backup equations? The optimal value function is a fixed point of this operator, and again the operator is a contraction mapping (TODO show it), so the optimal value function is the unique fixed point, and repeated application will take us there. From the optimal value function it's straightforward to get the optimal policy.

We can think of this as a special case of policy iteration where we only run each internal policy evaluation for one step, rather than to convergence.
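A sketch, same conventions as before:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Apply the Bellman optimality backup until convergence."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = np.einsum('sat,sat->sa', P, R + gamma * V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)               # values, greedy policy
        V = V_new
```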

(First-visit) Monte Carlo with exploring starts

Now suppose we want to do policy evaluation, but don't know how rewards are assigned. An easy way to do this is just to run repeated trials in the environment. In each trial, every time we arrive at a state for the first time, record the return from that state until the end of the episode. (We only use the first visit to make sure this measurement is unbiased.) The law of large numbers says the average of these returns will eventually converge to the real value function.

If we do this on Q rather than V, we don't even need a model of the environment: we can directly record the value of taking actions in particular states. Now that we have a way to do policy evaluation, we can use it to find an optimal policy, just as we did before. Assuming we visit every (state, action) pair an infinite number of times, the guarantees are the same as before. (The only way to ensure this is to assume we can randomly start from any state-action pair; otherwise the current policy may avoid some of them altogether.) Even without an infinite number of trials, we can take noisy steps toward the optimal policy. This looks like SGD, and comes with the same guarantees.
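A sketch of first-visit Monte Carlo control with exploring starts; the sampler env_step(s, a) -> (s_next, reward, done) is a stand-in for the unknown environment, not a real API:

```python
import random
from collections import defaultdict

def mc_exploring_starts(env_step, n_states, n_actions, gamma,
                        n_episodes=10000, max_steps=100):
    """First-visit Monte Carlo control with exploring starts.

    env_step(s, a) -> (s_next, reward, done) is a stand-in for sampling
    from the (unknown) environment. Each episode starts from a random
    (state, action) pair so every pair keeps getting visited.
    """
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.randrange(n_actions) for s in range(n_states)}

    for _ in range(n_episodes):
        # Exploring start: random initial state and action.
        s, a = random.randrange(n_states), random.randrange(n_actions)
        episode = []
        for _ in range(max_steps):
            s_next, r, done = env_step(s, a)
            episode.append((s, a, r))
            if done:
                break
            s, a = s_next, policy[s_next]

        # Walk backwards accumulating the return; record first visits only.
        G = 0.0
        first_visit = {(s, a): i for i, (s, a, _) in reversed(list(enumerate(episode)))}
        for i in reversed(range(len(episode))):
            s, a, r = episode[i]
            G = r + gamma * G
            if first_visit[(s, a)] == i:
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(range(n_actions), key=lambda b: Q[(s, b)])
    return Q, policy
```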

On-policy Monte Carlo

How do we get good estimates without exploring starts? One way is to guarantee that the policy under consideration continues to select all actions with some nonzero probability (for at least long enough to guarantee that we see everything).

We can define an $\epsilon$-greedy policy which behaves greedily most of the time, but acts randomly an $\epsilon$ fraction of the time. First note that among $\epsilon$-soft strategies (those which assign at least $\epsilon / |A|$ weight to every action), the $\epsilon$-greedy one is best (moving mass to the highest-valued action can only increase value).

(Easy way of handling this analysis: an $\epsilon$-greedy policy is the same as a greedy policy in a noisy environment, one that ignores the chosen action and picks randomly an $\epsilon$ fraction of the time. The best we can do in that environment is a greedy policy, so the best $\epsilon$-soft policy is $\epsilon$-greedy.)
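A minimal sketch of $\epsilon$-greedy selection from a table of Q values:

```python
import random

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    """Pick the greedy action w.p. 1 - epsilon, a uniformly random one otherwise.

    Q is any mapping from (state, action) pairs to estimated values, so every
    action keeps probability at least epsilon / n_actions.
    """
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])
```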

Off-policy Monte Carlo

Now let's estimate the value of a policy $\pi$ from samples generated by a different policy $\pi'$. For this, we just need to re-weight the samples we get from $\pi'$ according to their probability under $\pi$. While we can't compute the probability of a sequence under $\pi$ directly (it involves the unknown transition function), we can compute the ratio of probabilities:

\[ \frac{p(\mathbf{s})}{p'(\mathbf{s})} = \prod \frac{\pi(s,a) P(s,a,s')}{\pi'(s,a) P(s,a,s')} = \prod\frac{\pi(s,a)}{\pi'(s,a)} \]

Then the reweighted value estimate is just the average of the observed returns, each scaled by this ratio.

As before, this also defines a control procedure.
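A sketch of the reweighting for whole episodes; pi_target and pi_behavior are tables of action probabilities, and the unknown transition terms have already cancelled:

```python
def importance_weight(episode, pi_target, pi_behavior):
    """Product of pi(s, a) / pi'(s, a) over an episode.

    episode is a list of (s, a, r) triples; pi_target[s][a] and
    pi_behavior[s][a] are action probabilities. The unknown transition
    probabilities appear in both numerator and denominator and cancel.
    """
    w = 1.0
    for s, a, _ in episode:
        w *= pi_target[s][a] / pi_behavior[s][a]
    return w

def off_policy_mc_value(episodes, pi_target, pi_behavior, gamma):
    """Ordinary importance-sampling estimate of V^pi at each episode's start state."""
    totals, counts = {}, {}
    for episode in episodes:
        G = sum(gamma**t * r for t, (_, _, r) in enumerate(episode))
        s0 = episode[0][0]
        w = importance_weight(episode, pi_target, pi_behavior)
        totals[s0] = totals.get(s0, 0.0) + w * G
        counts[s0] = counts.get(s0, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}
```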

Temporal-difference learning: TD(0)

Rather than updating only at the end of episodes, let's learn in place: assuming my value function for the next state is approximately correct, update the value of my last state toward the reward I just received plus the discounted value of the state I landed in.

Key observation: MC minimizes the value-function error on the training set; TD(0) finds the value function corresponding to the MLE of the underlying Markov model. The first claim is obvious. Intuition for the second: with a fixed policy, the environment is just a Markov chain over states. For a given set of observed traces, the maximum-likelihood Markov model has a transition function under which the TD(0) updates hold with equality (suppose it didn't; then we could get higher likelihood by moving transitions closer to the mean). So the usual Banach trick argues that TD(0) updates eventually take us there.
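A sketch of tabular TD(0) prediction; env_step and policy are stand-ins for the environment sampler and the fixed policy being evaluated, and the fixed start state is an assumption for simplicity:

```python
def td0(env_step, policy, n_states, gamma, alpha=0.1,
        n_episodes=1000, max_steps=100):
    """Tabular TD(0) policy evaluation.

    env_step(s, a) -> (s_next, reward, done) and policy(s) -> a are
    stand-ins for the environment and the fixed policy being evaluated.
    """
    V = [0.0] * n_states
    for _ in range(n_episodes):
        s = 0                                 # assume a fixed start state
        for _ in range(max_steps):
            a = policy(s)
            s_next, r, done = env_step(s, a)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # move toward the bootstrapped target
            if done:
                break
            s = s_next
    return V
```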

SARSA

Do the above in Q space rather than V space.
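A sketch, with an $\epsilon$-greedy behavior policy and the same stand-in env_step as above:

```python
import random

def sarsa(env_step, n_states, n_actions, gamma, alpha=0.1, epsilon=0.1,
          n_episodes=1000, max_steps=100):
    """Tabular SARSA with an epsilon-greedy behavior policy (env_step is a stand-in)."""
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def act(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(n_episodes):
        s, a = 0, act(0)
        for _ in range(max_steps):
            s_next, r, done = env_step(s, a)
            if done:
                Q[s][a] += alpha * (r - Q[s][a])
                break
            a_next = act(s_next)
            # Bootstrap from the action we will actually take next.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q
```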

Q-learning

Rather than taking steps towards the empirical Q, take steps towards Q under the greedy policy—think of this as a sort of early-updating SARSA. This is an off-policy control method, since the current value of the Qs reflects the “real” current policy, while the max implicitly defines an improved policy.
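A sketch; the only change from SARSA is bootstrapping from the max over next actions rather than from the action actually taken:

```python
import random

def q_learning(env_step, n_states, n_actions, gamma, alpha=0.1, epsilon=0.1,
               n_episodes=1000, max_steps=100):
    """Tabular Q-learning with epsilon-greedy exploration (env_step is a stand-in)."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_episodes):
        s = 0
        for _ in range(max_steps):
            # Behave epsilon-greedily...
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[s][b])
            s_next, r, done = env_step(s, a)
            # ...but bootstrap from the greedy (improved) policy.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            if done:
                break
            s = s_next
    return Q
```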

Summary of exact methods

Monte Carlo: \[ V(s_t) \leftarrow V(s_t) + \alpha [ \sum_{k \ge 0} \gamma^k r_{t+k+1} - V(s_t)] \]

TD(0): \[ V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)] \]

SARSA: \[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ] \]

Q-learning: \[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) ] \]

Actor-critic methods

Value estimates (a "critic") are used by some other model (an "actor"), which produces a distribution over output actions and is adjusted using the critic's estimates. Can either do the obvious thing (softmax on estimated values) or fancier.
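A minimal sketch of the obvious version: a TD(0) critic plus an actor that is a softmax over per-(state, action) preferences nudged by the critic's TD error. The specific preference update is one standard choice, not something pinned down above:

```python
import math
import random

def actor_critic(env_step, n_states, n_actions, gamma,
                 alpha_critic=0.1, alpha_actor=0.1,
                 n_episodes=1000, max_steps=100):
    """Tabular actor-critic sketch: TD(0) critic, softmax-over-preferences actor."""
    V = [0.0] * n_states                                   # critic
    H = [[0.0] * n_actions for _ in range(n_states)]       # actor preferences

    def sample_action(s):
        weights = [math.exp(h) for h in H[s]]              # softmax over preferences
        return random.choices(range(n_actions), weights=weights)[0]

    for _ in range(n_episodes):
        s = 0
        for _ in range(max_steps):
            a = sample_action(s)
            s_next, r, done = env_step(s, a)
            delta = (r if done else r + gamma * V[s_next]) - V[s]   # TD error
            V[s] += alpha_critic * delta           # critic moves toward the target
            H[s][a] += alpha_actor * delta         # actor: make good actions likelier
            if done:
                break
            s = s_next
    return V, H
```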

R-learning

No horizon, no discounting: keep a running estimate of the average reward and do Q-learning against the difference between the rewards received and that average.
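A sketch of my reconstruction of the standard R-learning updates (following Sutton and Barto), where rho is the running average-reward estimate; the condition for updating rho is from memory, so treat the details as approximate:

```python
import random

def r_learning(env_step, n_states, n_actions,
               alpha=0.1, beta=0.01, epsilon=0.1, n_steps=100000):
    """Average-reward (undiscounted) Q-learning variant; env_step is a stand-in."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    rho = 0.0                                    # running average-reward estimate
    s = 0
    for _ in range(n_steps):
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: Q[s][b])
        greedy = Q[s][a] == max(Q[s])            # was the chosen action greedy?
        s_next, r, _ = env_step(s, a)
        delta = r - rho + max(Q[s_next]) - Q[s][a]
        Q[s][a] += alpha * delta
        if greedy:
            rho += beta * delta                  # only adjust rho on greedy steps
        s = s_next
    return Q, rho
```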

Belief space planning

Note first that there's no point in doing this if we don't have a model; in that case we might as well learn a policy directly. So assume we have a model (maybe with some free parameters). Easy thing: form a new MDP whose states are distributions (beliefs) over states of the underlying MDP. Harder: do normal HMM inference to get transition probabilities and marginal state probabilities, then do updates on the value function with partial counts.
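A sketch of the belief-MDP transition for the easy case: it is just a Bayes filter step. The observation model O[s', o] is an assumption on my part (the note above doesn't commit to one):

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """One step of belief-state tracking.

    b[s]        current belief over underlying states
    P[s, a, s'] transition model, O[s', o] observation model (assumed known)
    Returns the new belief after taking action a and observing o.
    """
    # Predict: push the belief through the transition model.
    b_pred = b @ P[:, a, :]
    # Correct: weight by the likelihood of the observation, then renormalize.
    b_new = b_pred * O[:, o]
    return b_new / b_new.sum()
```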

Approximating everything

Just approximate the $V$ or $Q$ values with some arbitrary parametric model $Q_\theta$, with everything else exactly as before. One obvious thing is to minimize the error (in the Euclidean sense) between predicted values and observed targets. E.g. for a TD(0)-style update in Q space (treating the bootstrapped target $Q'$ as a constant):

\[ \frac{\partial}{\partial \theta} (r + \gamma Q' - Q)^2 \propto (r + \gamma Q' - Q) \frac{\partial}{\partial \theta} Q \]

Our previous models looked just like this, with a linear model and indicator features on states or (state, action) pairs, but we can choose an arbitrary form for Q (softmax, etc.). Can either take gradient steps directly, which looks like everything before, or do something fancier (BFGS, etc.).
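A sketch of the approximate case with a linear model; the feature function phi is a stand-in, and the bootstrapped target is treated as a constant (the usual semi-gradient simplification):

```python
import numpy as np

def semi_gradient_q_step(theta, phi, s, a, r, s_next, actions, gamma, alpha, done):
    """One semi-gradient Q-learning update for a linear model Q(s, a) = theta . phi(s, a).

    theta and the feature function phi(s, a) -> np.ndarray are stand-ins for
    whatever parameterization we choose; with indicator features on
    (state, action) pairs this reduces to the tabular update.
    """
    q = np.dot(theta, phi(s, a))
    if done:
        target = r
    else:
        target = r + gamma * max(np.dot(theta, phi(s_next, b)) for b in actions)
    # For a linear model, the gradient of Q w.r.t. theta is just phi(s, a);
    # the bootstrapped target is held fixed (semi-gradient).
    return theta + alpha * (target - q) * phi(s, a)
```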

TODO

— 17 July 2014