A Bit on Baselines

Reducing variance and taking names

This article discusses baselines in the context of the vanilla policy gradient post, more specifically the part on generalized advantage estimation; that post covers the intuition of baselines in more detail. The idea can of course be applied to many other algorithms.


Supposing you’ve gone through the intuition of baselines…

A more satisfying explanation for why subtracting a baseline to form the advantage works is usually swept under the rug or banished to a “see this reference” link. As it’s not perfectly intuitive, I’d like to at least take a whack at explaining it.

To begin with, we need to consider at a higher level what we’re trying to accomplish. Consider the policy gradient we’ve derived so far:

\[\hat{g}=\sum_{t=0}^{T}\nabla _{\theta}\log\pi_\theta (a_t\vert s_t)_{\theta _k} (r_{t+1} + \gamma V(s_{t+1}))\]

This term is the sum over time steps in a single trajectory \(\tau\).
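
To make that term concrete, here is a minimal PyTorch sketch of the single-trajectory quantity. Everything here is a made-up placeholder for illustration (`policy_net`, `value_net`, `obs_dim`, `gamma`, and the fake trajectory), and autograd stands in for writing out \(\nabla_\theta\log\pi_\theta\) by hand; it is a sketch of the idea, not any particular implementation.

```python
import torch

# Hypothetical stand-ins: any differentiable policy and any value estimate V(s) would do.
obs_dim, n_actions, gamma = 4, 2, 0.99
policy_net = torch.nn.Linear(obs_dim, n_actions)   # logits of pi_theta(a|s)
value_net = torch.nn.Linear(obs_dim, 1)            # V(s)

def single_trajectory_grad(states, actions, rewards):
    """g_hat = sum_t grad_theta log pi_theta(a_t|s_t) * (r_{t+1} + gamma * V(s_{t+1}))."""
    T = len(actions)
    logps = torch.distributions.Categorical(logits=policy_net(states[:T])).log_prob(actions)
    with torch.no_grad():                                    # no gradient through the bootstrapped target
        targets = rewards + gamma * value_net(states[1:T + 1]).squeeze(-1)
    weighted_logp = (logps * targets).sum()                  # differentiating this gives the weighted score sum
    return torch.autograd.grad(weighted_logp, list(policy_net.parameters()))

# Fake 5-step trajectory; states holds T+1 entries so we can bootstrap from s_{t+1}.
states = torch.randn(6, obs_dim)
actions = torch.randint(0, n_actions, (5,))
rewards = torch.randn(5)
g_hat = single_trajectory_grad(states, actions, rewards)     # one sample of the policy gradient
```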

Now for the new part. The term above, computed from a single trajectory, is really just one sample of the quantity we actually want: the expectation of the policy gradient over all possible trajectories:

\[\mathbb{E}\left[\hat{g}\right]=\mathbb{E}\left[\sum_{t=0}^{T}\nabla _{\theta}\log\pi_\theta (a_t\vert s_t)_{\theta _k} (r_{t+1} + \gamma V(s_{t+1})) \right]\] \[=\int_{\mathcal{T}} \sum_{t=0}^{T}\nabla _{\theta}\log\pi_\theta (a_t\vert s_t)_{\theta _k} (r_{t+1} + \gamma V(s_{t+1})) d\tau\]

where \(\mathcal{T}\) is the set of all possible trajectories \(\tau\), and \(d\tau\) is shorthand for integrating against the probability of each trajectory under the current policy and the environment dynamics; this is why the Monte Carlo estimate below is a simple average over sampled trajectories.

In order to calculate the true policy gradient, we would need to collect every possible trajectory, which of course is intractable for any environment of more than trivial size. Instead, we collect a set \(\mathcal{D}\) of trajectories \(\tau\) of some batch size (say, 128), and compute a sampled (Monte Carlo) estimate of the expectation:

\[\approx \frac{1}{\left|\mathcal{D}\right|}\sum_{\tau\in\mathcal{D}}\sum_{t=0}^{T}\nabla _{\theta}\log\pi_\theta (a_t\vert s_t)_{\theta _k} (r_{t+1} + \gamma V(s_{t+1}))\]
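
Continuing the same hypothetical sketch, the sampled expectation is nothing more than an average of the per-trajectory gradients over the batch \(\mathcal{D}\):

```python
def monte_carlo_policy_grad(trajectories):
    """Average single-trajectory gradient samples over a batch D of (states, actions, rewards) tuples."""
    per_traj = [single_trajectory_grad(*traj) for traj in trajectories]
    # zip regroups the gradients parameter-by-parameter across trajectories before averaging.
    return [torch.stack(grads).mean(dim=0) for grads in zip(*per_traj)]
```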

Now, our goal is to reduce the variance of this policy gradient estimate. Let’s revert to the integral form:

\[\int_{\mathcal{T}} \sum_{t=0}^{T}\nabla _{\theta}\log\pi_\theta (a_t\vert s_t)_{\theta _k} (r_{t+1} + \gamma V(s_{t+1})) d\tau\]

and abbreviate everything under the integral as a single function \(f\):

\[\int_{\mathcal{T}}f(\tau)\, d\tau\]

Now add and subtract the integral of a similar function \(\phi \approx f\) whose integral we know how to evaluate:

\[\int_{\mathcal{T}}f(\tau)\, d\tau = \int_{\mathcal{T}}\left(f(\tau)-\phi (\tau)\right) d\tau+\int_{\mathcal{T}}\phi(\tau)\, d\tau\]

Instead of estimating \(\int_{\mathcal{T}}f(\tau)\, d\tau\) (our original problem), we estimate \(\int_{\mathcal{T}}\left(f(\tau)-\phi (\tau)\right) d\tau\). Now

\[\text{Var}(f-\phi) = \text{Var}(f) - 2 \text{Cov}(f,\phi)+\text{Var}(\phi)\]

Since \(\phi\approx f\), \(\text{Cov}(f,\phi)\) is large and positive. Whenever \(2\,\text{Cov}(f,\phi) \gt \text{Var}(\phi)\), we get \(\text{Var}(f-\phi) \lt \text{Var}(f)\), so \(\int_{\mathcal{T}}\left(f(\tau)-\phi (\tau)\right) d\tau\), the new thing we need to estimate, has lower variance. This is exactly what we’re looking for!
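
None of this is specific to reinforcement learning, so here is a tiny, self-contained NumPy experiment (with a made-up \(f\) and \(\phi\)) that you can run to watch the identity play out:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

f = x**3 + x + rng.normal(scale=0.1, size=x.size)   # something we only know how to sample
phi = x                                             # similar to f, cheap, and its mean is known to be 0

print("Var(f)       =", f.var())                    # roughly 22
print("Var(f - phi) =", (f - phi).var())            # roughly 15: lower, as promised
print("2*Cov(f,phi) - Var(phi) =", 2 * np.cov(f, phi)[0, 1] - phi.var())   # positive, so variance drops
```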

Are you beginning to notice the similarity to our punchline above? Since it may not be completely obvious, I’ll make it painfully explicit. Starting with

\[\int_{\mathcal{T}} \sum_{t=0}^{T}\nabla _{\theta}\log\pi_\theta (a_t\vert s_t)_{\theta _k} (r_{t+1} + \gamma V(s_{t+1})) d\tau\]

we subtract \(V(s_t)\):

\[\int_{\mathcal{T}} \sum_{t=0}^{T}\nabla _{\theta}\log\pi_\theta (a_t\vert s_t)_{\theta _k} (r_{t+1} + \gamma V(s_{t+1})-V(s_t)) d\tau\] \[=\int_{\mathcal{T}} \sum_{t=0}^{T}\nabla _{\theta}\log\pi_\theta (a_t\vert s_t)_{\theta _k} (r_{t+1} + \gamma V(s_{t+1})) d\tau - \int_{\mathcal{T}} \sum_{t=0}^{T}\nabla _{\theta}\log\pi_\theta (a_t\vert s_t)_{\theta _k} V(s_t) d\tau\] \[=\int_{\mathcal{T}} f(\tau)\, d\tau - \int_{\mathcal{T}} \phi(\tau)\, d\tau=\int_{\mathcal{T}} \left(f(\tau) - \phi(\tau)\right) d\tau\]

Now just convert all that to the Monte Carlo, sampled-over-trajectories version, and Bob’s your uncle. This is why subtracting the value function from the n-step return reduces the variance of the policy gradient estimate we compute from sampled trajectories.
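
In code, the only change from the earlier hypothetical sketch is the weight on each log-probability: subtract \(V(s_t)\) from the bootstrapped target, then average over the batch exactly as before.

```python
def single_trajectory_grad_with_baseline(states, actions, rewards):
    """Same as single_trajectory_grad, but weight by r_{t+1} + gamma*V(s_{t+1}) - V(s_t)."""
    T = len(actions)
    logps = torch.distributions.Categorical(logits=policy_net(states[:T])).log_prob(actions)
    with torch.no_grad():
        values = value_net(states).squeeze(-1)                # V(s_0), ..., V(s_T)
        advantages = rewards + gamma * values[1:] - values[:-1]
    return torch.autograd.grad((logps * advantages).sum(), list(policy_net.parameters()))
```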

One last thing. We added \(\int_{\mathcal{T}}\phi(\tau)\, d\tau\) so that we were still calculating \(\int_{\mathcal{T}}f(\tau)\, d\tau\), so shouldn’t we add \(V(s_t)\) back in somehow? It turns out you don’t have to: subtracting \(\phi\) has no effect on the bias of the Monte Carlo estimate, i.e. the expectation over trajectories, because the subtracted term is zero in expectation. For any state \(s\), \(\mathbb{E}_{a\sim\pi_\theta}\left[\nabla_\theta\log\pi_\theta(a\vert s)\,V(s)\right]=V(s)\,\nabla_\theta\sum_a\pi_\theta(a\vert s)=V(s)\,\nabla_\theta 1=0\) (with an integral in place of the sum for continuous actions), so \(\int_{\mathcal{T}}\phi(\tau)\, d\tau=0\) and there is nothing to add back.
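
If you would rather see that than take it on faith, here is a quick check in the same hypothetical PyTorch setup: the expectation over actions of the subtracted term comes out to zero up to floating-point error.

```python
# E_{a~pi}[grad log pi(a|s) * V(s)] = V(s) * grad_theta sum_a pi(a|s) = V(s) * grad_theta 1 = 0
s = torch.randn(1, obs_dim)
logits = policy_net(s)
probs = torch.softmax(logits, dim=-1).detach()        # expectation weights, held constant
log_probs = torch.log_softmax(logits, dim=-1)
baseline = value_net(s).detach()

expected_term = (probs * log_probs * baseline).sum()  # sum_a pi(a|s) * V(s) * log pi(a|s)
grads = torch.autograd.grad(expected_term, list(policy_net.parameters()))  # = E_a[grad log pi * V(s)]
print([g.abs().max().item() for g in grads])          # all ~0
```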

I will banish some things to a “see this reference” link. If any of this seems suspect, take a look here. See equation 7 in that reference for why we can subtract a baseline without affecting the bias of the estimate. If you’re still not convinced this reduces variance, take a look here, and perhaps also at section 5.5 (p. 59) of this book on Monte Carlo methods. Pat yourself on the back: now you know “advantage” is just a clever rebranding of control variates.