Annealed Importance Sampling

Suppose you can’t find your keys. You know you left them in your apartment somewhere, but don’t remember where. Because this happens quite often you have a distribution over possible places you may have left your keys. You want to find out the probability that the keys are in a specific room of your apartment. Let’s call this room $R$ - mathematically speaking, you want to calculate: \(\begin{equation} P\left(\text{keys}\in R\right)=\intop \mathbf{1}\left[x\in R\right]\cdot p_\text{keys}(x)dx \end{equation}\)

where $x\in\mathbb{R}^2$ is a two-dimensional coordinate where the keys may have been forgotten and $p_\text{keys}(x)$ is the PDF for the keys to be in the location $x$. The function $\mathbf{1}\left[x\in R\right]$ equals 1 if $x\in R$, otherwise it is 0. We can rewrite the above as:

\[\begin{equation} P\left(\text{keys in }R\right) = \mathbb{E}_{x\sim p_\text{keys}}\left[\mathbf{1}\left[x\in R\right]\right] \end{equation}\]

How can we calculate (or approximate) this expectation?

While this example is a bit silly, the problem can be abstracted to fit many situations. We are going to try and solve it using Annealed Importance Sampling (AIS, ). Honestly, in 2D there are probably many better ways to do this, but 2D allows for simple (intuitive!) visualizations, so let’s stick with our somewhat wonky example.


Problem Statement

Let’s generalize the problem at hand a bit.

We have some distribution over the domain $\mathcal{X}$: \(\begin{equation} p(x)= \frac{1}{Z}\tilde{p}(x) \end{equation}\)

where for a given $x$ we know how to calculate $\tilde{p}(x)$. Unfortunately, we don’t know the normalizing constant $Z$.

Lastly, there’s some function $f:\mathcal{X}\rightarrow \mathbb{R}$ and we want (for whatever reason) to calculate:

\[\begin{equation}\label{eq:f-exp} \overline{f}=\mathbb{E}_{x\sim p(x)}\left[f\left(x\right)\right] \end{equation}\]

How can we do this? Is there a way to estimate $Z$ simultaneously?

Difficulties with Simpler Methods

AIS is not a simple method, and there are approaches that are easier to understand. One of them is importance sampling (see A.1 for more details), where a simple distribution $q(x)$ is chosen and the expectation in equation \eqref{eq:f-exp} is approximated according to:

\[\begin{equation} \tilde{w}(x)=\frac{\tilde{p}(x)}{q(x)}\qquad\overline{f}\approx \frac{\sum_{x_i\sim q}\tilde{w}(x_i)f(x_i)}{\sum_{x_i\sim q}\tilde{w}(x_i)} \end{equation}\]

Significantly, the normalization constant $Z$ can be approximated using importance sampling: $Z=\mathbb{E}_q\left[\tilde{w}(x)\right]$. However, choosing a well specified distribution $q(x)$ for the importance weights $\tilde{w}(x)$ is difficult, and bad choices can lead to very bad approximations unless an enormous number of samples is used.

If we are solely interested in estimating the expectation in equation \eqref{eq:f-exp}, then another alternative is available - as long as we have some method for producing samples from $p(x)$ using only the unnormalized function $\tilde{p}(x)$. If that is the case, then $M$ points $x_1,\cdots,x_M$ can be sampled and used to approximate the expectation:

\[\begin{equation} \overline{f}\approx\frac{1}{M}\sum_{i:\ x_i\sim p}^M f(x_i) \end{equation}\]

Typically, Markov chain Monte Carlo (MCMC) approaches are used, such as Langevin dynamics, in order to sample from the distribution. Many of these MCMC methods only require access to the gradient of the log of the distribution, $\nabla \log p(x)=\nabla \log \tilde{p}(x)$, so not knowing the normalization constant isn’t a problem. However, this doesn’t give us any estimate of $Z$ and many times it’s also difficult to tune an MCMC sampler.


Annealing the Importance Distribution

The main intuition behind annealed importance sampling (or AIS) is the same as that of regular importance sampling. As in importance sampling, we begin by choosing: \(\begin{equation} q(x)=\frac{1}{Z_0}\tilde{q}(x) \end{equation}\) which is easy to sample from and whose normalization constant, $Z_0$, is known. AIS is an iterative algorithm - for $T$ iterations, define:

\[\begin{aligned} \pi_t(x)&=\tilde{q}(x)^{1-\beta(t)}\cdot\tilde{p}(x)^{\beta(t)}\\ \beta(t)&=t/T \end{aligned}\]

where $p(x)=\tilde{p}(x)/Z_T$ is the distribution we’re actually interested in.

Before continuing, let’s parse what’s going on here. Notice that $\beta(0)=0$ and $\beta(T)=1$, so:

\[\begin{align} \pi_0(x)&=\tilde{q}(x)\\ \pi_T(x)&=\tilde{p}(x) \end{align}\]

Furthermore, the values of $\beta(t)$ gradually move from 0 to 1, so for each $t$ the function $\pi_t(x)$ is an unnormalized distribution seating somewhere in between the two distributions $\tilde{q}(x)$ and $\tilde{p}(x)$. These intermediate distributions will allow a smooth transition from the simple distribution to the complex.

Using MCMC approach for sampling from unnormalized distributions, the AIS algorithm proceeds as follows:

  1. sample $x_0\sim q(x)$
  2. set $w_0=Z_0$
  3. for $t=1,\cdots,T$:
  4. $\qquad$set $w_t=w_{t-1}\cdot\frac{\pi_t(x_{t-1})}{\pi_{t-1}(x_{t-1})}$
  5. $\qquad$sample $x_t\sim \pi_t(x)$ starting from $x_{t-1}$

That’s it.

Actually, in practice $\beta(t)$ is defined differently (not linearly), but let’s ignore that for now.

Examples and Visualizations

Implementation details

In all of the following examples, I’m Langevin dynamics or the Metropolis corrected version (called MALA) with a single step as the MCMC algorithm between intermediate distributions. Moreover, I always used $q(x)=\mathcal{N}(x;\ 0, I)$ as the proposal distribution.

To be honest, this would not work in any real application - a single Langevin step doesn’t sample from the distribution (you usually need many more steps). Luckily, for these visualizations a single step was enough and conveys the message equally well, so I’d rather keep the simpler approach for now.

The first example is really simple - the target and proposal distributions are both Gaussian:

AIS from one Gaussian to another, non-isotropic Gaussian.

Figure 1: a really simple example of using AIS to anneal between a standard normal to another (non-isotropic) Gaussian. Brighter values indicate regions with higher probability, and the two black dots are the samples across the intermediate distributions. Notice how the intermediate distributions "pull" the two samples after them, finally reaching the target distribution.

An important advantage of AIS is that it anneals between a simple distribution, slowly morphing into the more complicated distribution. If properly calibrated, this allows it to sample from all modes:

AIS from one Gaussian to another, non-isotropic Gaussian.

Figure 2: AIS from one Gaussian to a mixture of 3 Gaussians. When the proposal distribution is properly set, the annealing process ensures that all modes are properly "covered".

Of course, AIS can be used to model much more complex distributions:

AIS from one Gaussian to another, non-isotropic Gaussian.

Figure 3: Notice how AIS doesn't "waste" samples in regions with practically 0 density towards the end.


Importance Weights

The mathematical trick of AIS is the way we defined the weights, $w_T$ (see A.2 for more details regarding the definition). Like in regular importance sampling, the weights are calculated such that:

\[\begin{equation} \mathbb{E}_{x_0\sim q}\left[w_T\right]=Z_T \end{equation}\]

So, we can use $M$ samples $x_T^{(1)},\cdot,x_T^{(M)}$ and importance weights $w_T^{(1)},\cdots,w_T^{(M)}$ created using the AIS algorithm to estimate the expectation from equation \eqref{eq:f-exp}:

\[\begin{equation} \overline{f}\approx\hat{f}= \frac{\sum_i^M w_T^{(i)}f(x_T^{(i)})}{\sum_i^M w_T^{(i)}} \end{equation}\]

In fact, as defined $\hat{f}$ is an unbiased estimator for $\overline{f}$!

Calculations in Log-Space

If you’ve ever dealt with probabilistic machine learning, you probably already know that multiplying (possible very small) probabilities is a recipe for disaster. This is also true here.

Recall:

\[\begin{equation} w_T=Z_0\cdot\frac{\pi_1(x_0)}{\pi_0(x_0)}\cdot\frac{\pi_2(x_1)}{\pi_1(x_1)}\cdots\frac{\pi_T(x_{T-1})}{\pi_{T-1}(x_{T-1})} \end{equation}\]

In almost all practical use cases, the values $\pi_i(x)$ are going to be very small numbers. So, $w_T$ is the product of many small numbers. If $T$ is very large, it is almost guaranteed that the precision of our computers won’t be able to handle the small numbers and eventually we’ll end up with $w_T=0/0$.

Instead, the importance weights are usually calculated in log-space, which modifies the update for the importance weight into:

\[\begin{equation} \log w_t=\log w_{t-1}+\log \pi_t(x_{t-1})-\log\pi_{t-1}(x_{t-1}) \end{equation}\]

The log-weights can then be averaged to get an estimate of $\log Z_t$… well, almost.

Averaging out the log-weights gives us $\mathbb{E}_{x_0\sim q(x)}[\log w_T]$ , however by Jensen’s inequality :

\[\begin{equation} \mathbb{E}_{x_0\sim q}[\log w_T] \le \log \mathbb{E}_{x_0\sim q}[w_T]=\log Z_T \end{equation}\]

So, when we use the log of importance weights, it’s important to remember that they only provide us with a stochastic lower boundThe lower bound is stochastic because we only get an estimate of $Z_T$ when the number of samples is finite. This makes things a bit hard sometimes: the variance of the estimator can sometimes push the estimate to be larger than the true value, even though it's a lower bound! of the normalization constant. When $T$ is very large, it can be shown that the variance of the estimator tends to 0, meaning the lower bound becomes tight.

Bottom line is: the number of intermediate distributions $T$ should be quite large and carefully calibrated.

Reversing the Annealing

There is a silver lining to the above. If we reverse the AIS procedure, that is start at $\pi_T(x)$ and anneal to $\pi_0(x)$, then we can generate a stochastic upper bound of $Z_T$.

Keeping the same notation as above, let $w_T$ be the importance weights of the regular AIS and $m_0$ be the importance weights of the reverse annealing. Then:

\[\begin{align} \mathbb{E}_{x_T\sim p}[\log m_0]&\le \log \mathbb{E}_{x_T\sim p}[m_0]=\log\frac{1}{Z_T}\\ \Leftrightarrow \log Z_T&\ge - \mathbb{E}_{x_T\sim p}[\log m_0] \end{align}\]

The only problem, which you may have noticed, is that the reverse procedure needs to start from samples out of $p(x)$, our target distribution. Fortunately, such samples were produced by the forward procedure of AISThis method for finding both the stochastic lower and upper bounds is called bidirectional Monte Carlo .!


Finding Your Keys

Back to our somewhat contrived problem.

Here’s your apartment and the PDF for $p_\text{key}(x)$ representing the distribution of probable key placements:

AIS from one Gaussian to another, non-isotropic Gaussian.

Figure 4: The floor plan with the density of finding the keys at each point in space (brighter is higher density). It's impossible to find the keys outside the house or in the walls, so the darkest blue in this image should be treated as essentially 0 density.

You’re place is really bigYeah, the floor design isn't that good... but I'm not an architect or anything, so it's fine.!

As you can see, there are rooms more and less likely to contain the keys and there are regions where it would be almost impossible to find the keys (all the places with the darkest shade of blue). Such places are, for instance, outside the house, in the walls or in the middle of a hallway.

Conveniently, the rooms are numbered. We want to estimate, given this (unnormalized) PDF the probability that the keys are in a room, say room 7:

\[\begin{equation} P(\text{keys}\in R_7)=? \end{equation}\]

Well, let’s use AIS to calculate the importance weights. Here’s the compulsory animation:

AIS from one Gaussian to another, non-isotropic Gaussian.

Figure 5: Running AIS on the floor plan. The points towards the end really look as if they were sampled from the correct distribution, even though it's such a weird one. Also, note that I ran this algorithm for many more iterations than the previous ones - this helped the sampling procedure, but could probably be done with less iterations.
More implementation details

Unlike the previous animations, for these trajectories I actually used 100 samples and am only showing 30 (otherwise everything would be full of moving black dots). Also, notice that towards the end of the AIS procedure the particles get “stuck”; this is because I used Metropolis-Hastings acceptance stepsIf you are unfamiliar with this term, don't sweat it. Basically, I used a method that rejects sampled points that aren't from the distribution I'm trying to sample from. and most of the sampling steps towards the end were rejected, because of the really small densities at the edges of the rooms.

Also, the annealing for this animation was a bit tricky to set. Because the density outside the house is basically constant (and equal to 0), if the annealing isn’t carefully adjusted points have a tendency of getting stuck there. My solution was to also anneal the impossibility of being in those regions, just in a much slower pace than the other parts of the distributionIf you've ever heard of log-barriers in optimization, then I think it's basically the same concept..

Using the importance weights accumulated during this sampling procedure, we can now calculate the probability of the keys being in any one of the rooms, for instance room 7:

\[\begin{align} P(\text{keys}\in R_7)&=\mathbb{E}_x\left[\textbf{1}[x\in R_7]\right]\\ &\approx\frac{\sum_i w_T^{(i)}\cdot \textbf{1}[x\in R_7]}{\sum_i w^{(i)}_T} \end{align}\]

Using this formula to calculate the probabilities of the keys being in each of the rooms, we get:

AIS from one Gaussian to another, non-isotropic Gaussian.

Figure 6: The same floor plan, only with the probabilities of the keys being in any of the rooms overlayed on top. Brighter rooms have higher probability.

And there you have it! You should probably check in either room 9 or 6 and only then search in the other rooms.


Practical Applications of AIS

While I do believe the example I gave in this post was good for visualization and intuition, it’s pretty silly (as I already mentioned). In 2D, rejection sampling probably achieves the same results with much less fuss.

The more common use for AIS that I’ve seen around is as a method for Bayesian inference (e.g. ).

Suppose we have some prior distribution $p(\theta;\ \varphi)$ parametrized by $\varphi$ and a likelihood $p(x\vert\theta)$. Bayesian inference is, at it’s core, all about calculating the posterior distribution and the evidence function:

\[\overbrace{p(\theta\vert x;\varphi)}^\text{posterior}=\frac{p(\theta)\cdot p(x\vert \theta)}{\underbrace{p(x;\varphi)}_\text{evidence}}\]

For most distributions in the real world this is really really hard. As a consequence, using MCMC methods for sampling from the posterior (or posterior sampling) is very common. However, such methods don’t allow for calculation of the evidence, which is one of the primary ways models are selected in Bayesian statistics.

AIS offers an elegant solution both to posterior sampling and evidence estimation. Let’s define our proposal and target distributions once more, adjusted for Bayesian inference:

\[\begin{equation} \pi_0(\theta)=p(\theta;\ \varphi)\qquad\ \ \ \ \ \ \ \ \pi_T(\theta)=p(\theta;\varphi)\cdot p(x\vert\ \theta) \end{equation}\]

As you have probably already noticed, $\pi_T(\theta)$ is the unnormalized version of the posterior. The normalization constant of $\pi_T(\theta)$ is exactly the evidence. We only need to choose an annealing schedule between the proposal and target distributions. Taking inspiration from our earlier annealing schedule, we can use (for example):

\[\begin{equation} \pi_t(\theta)=p(\theta;\varphi)\cdot p(x\vert\theta)^{\beta(t)} \end{equation}\]

where $\beta(0)=0$ and $\beta(T)=1$.

That’s it. If $T$ is large enough, then we can be sure that the samples procured from the AIS algorithm will be i.i.d. from the posterior. Moreover, the weights $w_T^{(i)}$ can be used to estimate the evidence:

\[\begin{equation} p(x;\varphi)\approx \frac{1}{M}\sum_i w_T^{(i)} \end{equation}\]

And there you have it! Instead of simply sampling from the posterior, you can get an estimate for the evidence at the same timeAs long as you don't use a batched method on many data points $x$ like they do in Bayesian neural networks, I don't think this will work there (although variants do exist).

Conclusion

You now (maybe) know what annealed importance sampling is and how to use it. My main hope was to give some intuition into what happens in the background when you use AIS. I find the concept of sampling by starting at a simple distribution and moving to a more complex one really cool, especially when it is treated in such a clear and direct manner.

Appendix

A.1 Importance Sampling

We know how to calculate $\tilde{p}(x)$, but don’t know how to sample from it. The simplest solution for calculating $\overline{f}$ and $Z$ is through what is called importance sampling.

Start by choosing a simpler distribution $q(x)$ whose normalization is completely known and is easy to sample fromAlso, you have to make sure that the support of $p(x)$ is contained in the support for $q(x)$!. Then:

\(\begin{align} \mathbb{E}_{x\sim p}[f(x)]&=\intop p(x)f(x)dx\\ &=\intop \frac{p(x)}{q(x)}f(x)q(x)dx\\ &=\mathbb{E}_{x\sim q}\left[\frac{p(x)}{q(x)}\cdot f(x)\right] \end{align}\)

Using $q(x)$, we somehow magically moved the difficulty of sampling $x$ from $p(x)$ to the much simpler operation of sampling $x$ from $q(x)$! The expectation can now be approximated using a finite number of samples. Let $w(x)=p(x)/q(x)$ and generate $M$ samples from the distribution $q(x)$ such that:

\(\begin{equation} \mathbb{E}_{x\sim p}\left[f(x)\right]\approx \frac{1}{M}\sum_{i:\ x_i\sim q}^M w(x_i)\cdot f(x_i) \end{equation}\)

But there’s a problem: we don’t really know how to calculate $p(x)$ (since we don’t know $Z$), only $\tilde{p}(x)$. Fortunately, we can also estimate $Z$ for the same price! Denote $\tilde{w}(x)=\tilde{p}(x)/q(x)$, then:

\(\begin{align} Z&=\intop \tilde{p}(x)dx=\intop\frac{\tilde{p}(x)}{q(x)}q(x)dx\\ &=\intop \tilde{w}(x)q(x)dx\\ &=\mathbb{E}_{x\sim q}\left[\tilde{w}(x)\right]\\ &\approx \frac{1}{M}\sum_{i:\ x_i\sim q}^M\tilde{w}(x_i) \end{align}\)

So, our estimate of $\overline{f}$ is given by:

\(\begin{equation} \mathbb{E}_{x\sim p}\left[f(x)\right]\approx \frac{1}{\sum_i\tilde{w}(x_i)}\cdot\sum_{i:\ x_i\sim q}^M \tilde{w}(x_i)\cdot f(x_i) \end{equation}\)

The $w(x)$ (and their unnormalized versions) are called importance weights as for each $x_i$ they capture the relative importance between $p(x_i)$ and $q(x_i)$.

At the limit $M\rightarrow\infty$, the above approximation becomes accurate. Unfortunately, when $M$ is finite, this estimation is biased and in many cases can be very misspecified.



A.2 AIS Importance Weights

Prerequisites:

To properly understand the construction of the importance weights in AIS, we are going to need to be more precise than my explanation in the main body of text.

So, as usual, we have a target distribution $p(x)=\pi_T(x)/Z_T$ and a proposal distribution $q(x)=\pi_0(x)/Z_0$. In between these two distributions, we have $T-1$ intermediate distributions unnormalized distributions, $\pi_1(x),\cdots,\pi_{T-1}(x)$. The missing piece in the original body of text is the fact that we have $T$ different transition operators that are invariant to the different distributions, which we will call $\mathcal{T}_t(x\rightarrow x’)$ for an operation that starts at $x$ and ends at $x’$. In practice, we can think of these as the transition probabilities in a Markov chain.

What do I mean by “invariant transition operators”? Well, these will be our sampling algorithms, so Langevin dynamics on the $t$-th distribution, $\pi_t(x)$. The “invariant” part just means that this transition operator maintains detailed balance with respect to the distribution $\pi_t(x)$:

\(\begin{equation} \mathcal{T}_t(x\rightarrow x')\frac{\pi_t(x)}{Z_t}=\mathcal{T}_t(x'\rightarrow x)\frac{\pi_t(x')}{Z_t} \end{equation}\)

As long as $\mathcal{T}_t(x\rightarrow x')$ has this property for every possible pair of $x$ and $x'$, it can be used in AIS.

Now, recall that the sampling procedure in AIS was carried out as follows:

sample $x_0\sim \pi_0$
generate $x_1$ using $\mathcal{T}_1(x_0\rightarrow x_1)$
$\vdots$
generate $x_T$ using $\mathcal{T}_T(x_{T-1}\rightarrow x_T)$


This procedure describes a (non-homogeneous) Markov chain, with transition probabilities determined according to $\mathcal{T}_t$.

In the scope of this Markov chain, we can talk about the forward joint probability (starting at $x_0$ and moving to $x_T$) and the reverse joint probability (starting at $x_T$ and going back). At it’s root, AIS is just importance sampling with the reverse joint as the target and the forward as the proposal. Mathematically, define:

\(\begin{align} \pi(x_0,\cdots,x_T)&=\pi_T(x_T)\cdot\mathcal{T}_T(x_T\rightarrow x_{T-1})\cdots \mathcal{T}_1(x_1\rightarrow x_0)\\ q(x_0,\cdots,x_T)&=q(x_0)\cdot\mathcal{T}_1(x_0\rightarrow x_1)\cdots \mathcal{T}_T(x_{T-1}\rightarrow x_T) \end{align}\)

Of course, we never actually observe $T_t(x_t\rightarrow x_{t-1})$, only the opposite direction. How can we fix this? Well, using detailed balance:

\(\begin{equation} \mathcal{T}_t(x_t\rightarrow x_{t-1})=\frac{\pi_t(x_{t-1})}{\pi_t(x_t)}\cdot\mathcal{T}_t(x_{t-1}\rightarrow x_t) \end{equation}\)

This neat property allows us to write the full form of the importance weightsGetting to the last line requires rearranging the terms and using the equation above for the reverse transition, but this post is already pretty long and I don't think adding that math particularly helps here... Everything cancels out nicely and we get the form for the importance weights as in the main text!:

\(\begin{align} w=&\frac{\pi(x_0,\cdots,x_T)}{q(x_0,\cdots,x_T)}\\ &=\frac{\pi_T(x_T)}{q(x_0)}\cdot\frac{\mathcal{T}_T(x_T\rightarrow x_{T-1})\cdots \mathcal{T}_1(x_1\rightarrow x_0)}{\mathcal{T}_1(x_0\rightarrow x_1)\cdots \mathcal{T}_T(x_{T-1}\rightarrow x_T)}\\ &=Z_0\cdot \frac{\pi_1(x_0)}{\pi_0(x_0)}\cdot\frac{\pi_2(x_1)}{\pi_1(x_1)}\cdots\frac{\pi_T(x_{T-1})}{\pi_{T-1}(x_{T-1})} \end{align}\)

These importance weights are exactly the same as those defined in the main body of text, but their motivation is maybe clear now?

The important point is that the proposal distribution creates a path from $x_0$ to $x_T$ while the “true target distribution” is the path from $x_T$ to $x_0$. So the importance weighting is now the forward path $\stackrel{\rightarrow}{\mathcal{T}}(x_0\rightarrow x_T)$ as a simpler alternative to the reverse path $\stackrel{\leftarrow}{\mathcal{T}}(x_T\rightarrow x_0)$.

To hammer this point home, the normalization constant for $\pi_T(x)$ can be found by taking the expectation with regards to the forward paths: \(\begin{equation} Z_T=\mathbb{E}_{\stackrel{\rightarrow}{\mathcal{T}}(x_0\rightarrow x_T)}\left[w\right]=\mathbb{E}_{\stackrel{\rightarrow}{\mathcal{T}}(x_0\rightarrow x_T)}\left[\frac{\pi_T(x_T)\stackrel{\leftarrow}{\mathcal{T}}(x_T\rightarrow x_0)}{q(x_0)\stackrel{\rightarrow}{\mathcal{T}}(x_0\rightarrow x_T)}\right] \end{equation}\)

That was… probably hard to follow. Hopefully I got some of the message across - there is a Markov chain that goes from $q(x)$ to $\pi_T(x)$ and the reverse of it. If you understood that, and are comfortable with importance sampling, then you’re fine. It’ll sink in if you think about it a bit more.

This is a neat mathematical trick, though. Theoretically, it is no different than standard importance sampling, we just defined weird proposal and target distributions. Transforming the a simple distribution to something close to the target, though, that’s the core of it.

If you read this far, well, I commend you. Good luck using AIS!