Generative Models 4 - Denoising Diffusion Probabilistic Models

← Normalizing Flows | Score-Based Models →

Disclaimer

The math in the previous posts pretty much followed what you’d see in the literature, with only small changes to make it easier to understand. In the next two sections, however, the math and explanations are very different from what you would usually see.

Why? Well, the explanations for both DDPM and score-based models are very involved, have a lot of disgusting math and are super hard to get, intuitively. My main goal was to help you understand, not just to copy down the equations you’d see in literature. So I followed the natural progression of the previous posts, which means that notations, steps and sometimes results are a bit different. Hopefully this enables you to go and learn on your own the exact details.

Much of the following is a reframing of , which deserves a lot of credit. There are particular parts I decided to expand into or cut out, but I generally followed their setup for DDPM. The parts regarding score-based models, though, are completely different and more fleshed out.

We can take what we saw before with VAEs and extend it. Instead of assuming that $z \mapsto x$ happens in one step, we will now assume that there are many steps from $z$ to $x$. We will define the steps intelligently, so that we know exactly what happens between consecutive points, and then train something using the ELBO to take us back from one point to the previous one.


The Setting

Let’s suppose we have the following chain:

\[\begin{equation} x=x_{0}\rightarrow x_{1}\rightarrow\cdots\rightarrow x_{T-1}\rightarrow x_{T}=z \end{equation}\]

where $x\sim p_{\text{data}}$ and $z\sim p_{T}\left(z\right)$, where $p_{T}\left(z\right)$ is often just a Gaussian. Up until now we assumed that we have access to the mapping $z\mapsto x$ and had to “guess” or optimize the reverse mapping. This time, we’re going to assume that we know how to turn $x$ into $z$, so we know $x\mapsto z$, and we want to find the opposite direction. This class of models is called denoising diffusion probabilistic models (DDPMs) for reasons that will become clear later on.

Let’s call the noising direction the forward process, whose transition probabilities we will assume to be:

\[\begin{equation} q\left(x_{t}\vert x_{t-1}\right)=\mathcal{N}\left(x_{t}\vert \;\gamma_{t}x_{t-1},I\sigma_{t}^{2}\right) \end{equation}\]

where $\gamma_{t}$ is some constant and $\sigma_{t}^{2}$ is the variance of the noise added at step $t$. Both values are assumed to change as a function of the iteration. We want to learn a model of the reverse process, that is we want to learn $p_{\theta}\left(x_{t-1}\vert x_{t}\right)$. We’re going to guess that:

\[\begin{equation} q\left(x_{t-1}\vert x_{t}\right)\approx p_{\theta}\left(x_{t-1}\vert x_{t}\right)=\mathcal{N}\left(x_{t-1}\vert \;\mu_{\theta}\left(x_{t},t\right),I\sigma_{t-1}^{2}\right) \end{equation}\]

If this is how we model the “reverse process”, then sampling for this model will be quite easy: simply sample $x_{T}\sim p_{T}\left(z\right)$ and then use $p_{\theta}\left(x_{t-1}\vert x_{t}\right)$ to sample until you reach $x_{0}$. This is basically the following algorithm:

\[\begin{equation} x_{T}\sim p_{T}\left(z\right)\qquad\forall\,1\le t\le T:\;x_{t-1}=\mu_{\theta}\left(x_{t},t\right)+\sigma_{t-1}\epsilon\label{eq:DDPM-sampling} \end{equation}\]

where $\epsilon\sim\mathcal{N}\left(0,I\right)$.
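To make this concrete, here is a minimal sketch of the sampling loop, assuming a trained mean-prediction network `mu_theta` and a precomputed array of noise scales `sigma` (both names are hypothetical; this is a sketch, not anyone's reference implementation):

```python
import torch

@torch.no_grad()
def ddpm_sample(mu_theta, sigma, shape, T):
    # mu_theta: trained network (x_t, t) -> mean of x_{t-1}  (assumed, per the equation above)
    # sigma:    noise scales, indexed so that sigma[t] plays the role of sigma_t
    x = torch.randn(shape)                        # x_T ~ p_T(z) = N(0, I)
    for t in range(T, 0, -1):                     # t = T, T-1, ..., 1
        eps = torch.randn(shape)
        x = mu_theta(x, t) + sigma[t - 1] * eps   # x_{t-1} = mu_theta(x_t, t) + sigma_{t-1} * eps
    return x                                      # x_0
```

A common implementation detail (not derived above) is to skip the noise on the very last step, so that the returned sample is just the predicted mean.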

To train $p_{\theta}\left(x_{t-1}\vert x_{t}\right)$, what we want is high likelihood under “our guess”. The log-likelihood of an image, according to our model, is given by:

\[\begin{align} \log p_{\theta}\left(x_{0}\right) & =\log\int p_{\theta}\left(x_{0},x_{1},\cdots,x_{T}\right)dx_{1}\cdots dx_{T} \end{align}\]

this will of course be quite hard to calculate, never mind optimize. Instead, we will use the same variational bound as we used for VAEs:

\[\begin{align} \log p_{\theta}\left(x_{0}\right) & \ge\mathbb{E}_{q\left(x_{1},\cdots,x_{T}\vert x_{0}\right)}\left[\log\frac{p_{\theta}\left(x_{0},\cdots,x_{T}\right)}{q\left(x_{1},\cdots,x_{T}\vert x_{0}\right)}\right]\label{eq:DDPM-ELBO} \end{align}\]

Scheduling choices

Because we kept everything as abstract as possible so far, there is one thing that isn’t clear. How can we ensure the following:

\[\begin{equation} q\left(x_{T}\vert x_{0}\right)\overset{?}{=}p_{T}\left(z\right) \end{equation}\]

Notice that, as we defined it, the transition probability of the Markov chain $q\left(\cdot\vert \cdot\right)$ gives us the following:

\[\begin{equation} x_{t}=\gamma_{t}x_{t-1}+\sigma_{t}\epsilon_{t}\qquad\epsilon_{t}\sim\mathcal{N}\left(0,I\right) \end{equation}\]

and this can be opened up recursively:

\[\begin{align} x_{t} & =\gamma_{t}\gamma_{t-1}x_{t-2}+\gamma_{t}\sigma_{t-1}\epsilon_{t-1}+\sigma_{t}\epsilon_{t}\\ \Rightarrow x_{t} & =\gamma_{t}\gamma_{t-1}\gamma_{t-2}x_{t-3}+\gamma_{t}\gamma_{t-1}\sigma_{t-2}\epsilon_{t-2}+\gamma_{t}\sigma_{t-1}\epsilon_{t-1}+\sigma_{t}\epsilon_{t}\\ & \vdots\\ \Rightarrow x_{t} & =\left(\prod_{i=1}^{t}\gamma_{i}\right)x_{0}+\sum_{i=1}^{t}\left(\prod_{j=i+1}^{t}\gamma_{j}\right)\sigma_{i}\epsilon_{i}\label{eq:unrolled-DDPM-q} \end{align}\]

Let’s make our life slightly easier and define $\bar{\gamma}_{t}=\prod_{i=1}^{t}\gamma_{i}$ and $\bar{\sigma}_{t}^{2}=\sum_{i=1}^{t}\left(\prod_{j=i+1}^{t}\gamma_{j}^{2}\right)\sigma_{i}^{2}$. At any rate, $x_{t}$ is some vector dependent on $x_{0}$ plus a sum of Gaussians, so the distribution $q\left(x_{t}\vert x_{0}\right)$ will also be a Gaussian:

\[\begin{equation} q\left(x_{t}\vert x_{0}\right)=\mathcal{N}\left(x_{t}\vert \;\bar{\gamma}_{t}x_{0},\;I\bar{\sigma}_{t}^{2}\right) \end{equation}\]

So if we want $q\left(x_{T}\vert x_{0}\right)=p_{T}\left(z\right)$ then we have to make sure that $p_{T}\left(z\right)$ is Gaussian and that for any $x_{0}$ we have:

\[\begin{equation} \mathbb{E}_{p_{T}}\left[z\right]=\bar{\gamma}_{T}x_{0} \end{equation}\]

and:

\[\begin{equation} \text{cov}_{p_{T}}\left[z\right]=I\bar{\sigma}_{T}^{2} \end{equation}\]

The most obvious choice is to use $p_{T}\left(z\right)=\mathcal{N}\left(z\vert 0,I\right)$, in which case we just need to ensure that $\bar{\gamma}_{T}\stackrel{T\rightarrow\infty}{\longrightarrow}0$ and $\bar{\sigma}_{T}\stackrel{T\rightarrow\infty}{\longrightarrow}1$.
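To make the schedule concrete, here is a small numerical sanity check. It assumes the common variance-preserving choice $\gamma_{t}=\sqrt{1-\beta_{t}}$, $\sigma_{t}^{2}=\beta_{t}$ with a linear $\beta$ schedule (this specific schedule is an assumption, not something derived above), computes $\bar{\gamma}_{t}$ and $\bar{\sigma}_{t}^{2}$ via the recursions implied by the definitions, and checks that they end up close to $0$ and $1$:

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)     # assumed linear schedule (a common choice)
gamma = np.sqrt(1.0 - beta)           # gamma_t
sigma2 = beta                         # sigma_t^2

gamma_bar = np.cumprod(gamma)         # gamma_bar_t = prod_{i<=t} gamma_i
sigma2_bar = np.zeros(T)              # sigma_bar_t^2 = gamma_t^2 * sigma_bar_{t-1}^2 + sigma_t^2
for t in range(T):
    prev = sigma2_bar[t - 1] if t > 0 else 0.0
    sigma2_bar[t] = gamma[t] ** 2 * prev + sigma2[t]

print(gamma_bar[-1], sigma2_bar[-1])  # close to 0 and close to 1, as required

# drawing x_t ~ q(x_t | x_0) in one shot is then just:
# x_t = gamma_bar[t] * x_0 + np.sqrt(sigma2_bar[t]) * np.random.randn(*x_0.shape)
```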


A Simple Variant of DDPM

There are two ways to proceed, simple or confusing. The confusing method is (of course) the one that’s actually used, but we’ll start with the simple method.

We’ll start by rewriting both $p_{\theta}\left(x_{0},\cdots,x_{T}\right)$ and $q\left(x_{1},\cdots,x_{T}\vert x_{0}\right)$ into smaller factorizations. We can do this because both are Markov chains according to their construction, enabling us to rewrite them as:

\[\begin{align} p_{\theta}\left(x_{0},\cdots,x_{T}\right) & =p_{\theta}\left(x_{0}\vert x_{1}\right)\cdot p_{\theta}\left(x_{1}\vert x_{2}\right)\cdots p_{\theta}\left(x_{T-1}\vert x_{T}\right)\cdot p_{\theta}\left(x_{T}\right)\\ q\left(x_{1},\cdots,x_{T}\vert x_{0}\right) & =q\left(x_{1}\vert x_{0}\right)\cdot q\left(x_{2}\vert x_{1}\right)\cdots q\left(x_{T}\vert x_{T-1}\right) \end{align}\]

See how those line up nicely? Let’s put them back into equation \eqref{eq:DDPM-ELBO}:

\[\begin{align} \log p_{\theta}\left(x_{0}\right) & \ge\mathbb{E}_{q\left(x_{1},\cdots,x_{T}\vert x_{0}\right)}\left[\log p_{\theta}\left(x_{T}\right)+\sum_{t=0}^{T-1}\log p_{\theta}\left(x_{t}\vert x_{t+1}\right)-\sum_{t=1}^{T}\log q\left(x_{t}\vert x_{t-1}\right)\right]\\ & =\mathbb{E}_{q\left(x_{1},\cdots,x_{T}\vert x_{0}\right)}\left[\log p_{\theta}\left(x_{T}\right)+\log p_{\theta}\left(x_{0}\vert x_{1}\right)-\log q\left(x_{T}\vert x_{T-1}\right)\right.\\&\qquad\qquad\qquad\qquad\left.+\sum_{t=1}^{T-1}\log\frac{p_{\theta}\left(x_{t}\vert x_{t+1}\right)}{q\left(x_{t}\vert x_{t-1}\right)}\right]\\ & =\overbrace{\mathbb{E}_{q\left(x_{1}\vert x_{0}\right)}\left[\log p_{\theta}\left(x_{0}\vert x_{1}\right)\right]}^{\text{reconstruction term}}-\sum_{t=1}^{T-1}\mathbb{E}_{q\left(x_{t+1},x_{t-1}\vert x_{0}\right)}\overbrace{\left[D_{\text{KL}}\left(q\left(x_{t}\vert x_{t-1}\right)\,\vert \vert \,p_{\theta}\left(x_{t}\vert x_{t+1}\right)\right)\right]}^{\text{Markov chain term}}\\ & \qquad\qquad\qquad\quad-\mathbb{E}_{q\left(x_{T-1}\vert x_{0}\right)}\underbrace{\left[D_{\text{KL}}\left(q\left(x_{T}\vert x_{T-1}\right)\,\vert \vert \,p_{\theta}\left(x_{T}\right)\right)\right]}_{\text{prior term}} \end{align}\]

We now have a lower bound on the log-likelihood. There are three terms of interest:

  1. The reconstruction term, which is basically the same as the reconstruction error we saw in the standard VAEs
  2. The KL-divergences inside the Markov chain term, which make up the bulk of the training in diffusion models. These KL terms try to ensure that if we go forward in the Markov chain, using $q\left(\cdot\vert \cdot\right)$ starting at $x_{0}$, or backward, using $p_{\theta}\left(\cdot\vert \cdot\right)$ starting at $x_{T}$, we’ll get the same distribution
  3. The prior term, which tries to make sure that our prior distribution in the latent space is correct. Notice that we assumed that $p_{\theta}\left(x_{T}\right)$ has a fixed distribution, so there’s nothing to optimize here during training

To optimize the lower bound, what we essentially need to do is make sure that for each $t$ the KL-divergence is low. Ignoring the reconstruction term for a moment, maximizing the above lower bound amounts to minimizing the following loss:

\[\begin{equation} L\left(\theta\right)=\mathbb{E}_{t,x_{0},q\left(x_{t+1},x_{t-1}\vert x_{0}\right)}\left[D_{\text{KL}}\left(q\left(x_{t}\vert x_{t-1}\right)\,\vert \vert \,p_{\theta}\left(x_{t}\vert x_{t+1}\right)\right)\right] \end{equation}\]

Of course, we chose both $q\left(x_{t}\vert x_{t-1}\right)$ and $p_{\theta}\left(x_{t}\vert x_{t+1}\right)$ to be Gaussian distributions (with the same covariance even!), so we know what the KL-divergence equals:

\[\begin{equation} \Rightarrow L\left(\theta\right)=\mathbb{E}_{t,x_{0},q\left(x_{t+1},x_{t-1}\vert x_{0}\right)}\left[\frac{1}{2\sigma_{t}^{2}}\| \mu_{\theta}\left(x_{t+1},t\right)-\gamma_{t}x_{t-1}\| ^{2}\right] \end{equation}\]

What a simple loss!

So, why is this loss not usually used? The reason is that to calculate the loss at any point we need to draw two Monte Carlo (MC) samples:

\[\begin{align} x_{t-1} & \sim q\left(x_{t-1}\vert x_{0}\right)\\ x_{t+1} & \sim q\left(x_{t+1}\vert x_{t-1}\right) \end{align}\]

The usual justification is that this estimator has higher variance than the (soon to be seen) complex version.

At any rate, the intuition for diffusion models is the same even in the complex setting.
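For concreteness, here is what a single stochastic estimate of this simple loss might look like. The schedule arrays (`gamma`, `sigma`, `gamma_bar`, `sigma_bar`) are assumed to be laid out so that index $t$ corresponds to step $t$, with $\bar{\gamma}_{0}=1$ and $\bar{\sigma}_{0}=0$; `mu_theta` is the mean-prediction network. This is a sketch under those assumptions, not the training code from the paper:

```python
import torch

def simple_ddpm_loss(mu_theta, x0, gamma, sigma, gamma_bar, sigma_bar, T):
    t = torch.randint(1, T - 1, (1,)).item()        # pick a step with both neighbours inside the chain
    # x_{t-1} ~ q(x_{t-1} | x_0), using the closed form from the previous section
    x_prev = gamma_bar[t - 1] * x0 + sigma_bar[t - 1] * torch.randn_like(x0)
    # x_{t+1} ~ q(x_{t+1} | x_{t-1}): two forward steps through the chain
    x_mid = gamma[t] * x_prev + sigma[t] * torch.randn_like(x0)
    x_next = gamma[t + 1] * x_mid + sigma[t + 1] * torch.randn_like(x0)
    # KL between two Gaussians with the same covariance reduces to a weighted squared error
    return ((mu_theta(x_next, t) - gamma[t] * x_prev) ** 2).sum() / (2 * sigma[t] ** 2)
```

The two nested samples are exactly the source of the extra variance mentioned above.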


DDPM Breakdown

We are now going to break down $q\left(x_{1},\cdots,x_{T}\vert x_{0}\right)$ in a strange way:

\[\begin{align} q\left(x_{1},\cdots,x_{T}\vert x_{0}\right) & =q\left(x_{T}\vert x_{0}\right)q\left(x_{T-1}\vert x_{T},x_{0}\right)q\left(x_{T-2}\vert x_{T},x_{T-1},x_{0}\right)\cdots q\left(x_{1}\vert x_{T},\cdots,x_{2},x_{0}\right) \end{align}\]

Because $q\left(x_{1},\cdots,x_{T}\vert x_{0}\right)$ is a Markov chain, conditioning on both $x_{0}$ and $x_{t}$ is enough to completely describe $x_{t-1}$, so the above can be rewritten as:

\[\begin{equation} q\left(x_{1},\cdots,x_{T}\vert x_{0}\right)=q\left(x_{T}\vert x_{0}\right)q\left(x_{T-1}\vert x_{T},x_{0}\right)\cdots q\left(x_{1}\vert x_{2},x_{0}\right) \end{equation}\]

Using this factorization, we now have:

\[\begin{align} \log p_{\theta}\left(x_{0}\right) & \ge\mathbb{E}_{q\left(x_{1},\cdots,x_{T}\vert x_{0}\right)}\left[\log p_{\theta}\left(x_{0}\vert x_{1}\right)+\sum_{t=2}^{T}\log\frac{p_{\theta}\left(x_{t-1}\vert x_{t}\right)}{q\left(x_{t-1}\vert x_{t},x_{0}\right)}+\log\frac{p_{\theta}\left(x_{T}\right)}{q\left(x_{T}\vert x_{0}\right)}\right]\\ & =\mathbb{E}_{q\left(x_{1}\vert x_{0}\right)}\left[\log p_{\theta}\left(x_{0}\vert x_{1}\right)\right]-\sum_{t=2}^{T}\mathbb{E}_{q\left(x_{t}\vert x_{0}\right)}\left[D_{\text{KL}}\left(q\left(x_{t-1}\vert x_{t},x_{0}\right)\,\vert \vert \,p_{\theta}\left(x_{t-1}\vert x_{t}\right)\right)\right]-D_{\text{KL}}\left(q\left(x_{T}\vert x_{0}\right)\,\vert \vert \,p_{\theta}\left(x_{T}\right)\right) \end{align}\]

For now, let’s ignore that first term. Notice that in the last term we have nothing to optimize, so even if we train a model we won’t have to take it into account. We’ll now divert our attention to the middle expression which is the bulk of the lower bound.

The Really Nasty Part

As we saw from the breakdown that led to equation \eqref{eq:unrolled-DDPM-q}, because of our choice for the forward process, $q\left(x_{t}\vert x_{0}\right)$ is a Gaussian whose mean and covariance we know explicitly. This is, first of all, good to know… but we want $q\left(x_{t-1}\vert x_{t},x_{0}\right)$. Using Bayes’ law:

\[\begin{equation} q\left(x_{t-1}\vert x_{t},x_{0}\right)\propto q\left(x_{t}\vert x_{t-1}\right)q\left(x_{t-1}\vert x_{0}\right) \end{equation}\]

Basically, $q\left(x_{t-1}\vert x_{t},x_{0}\right)$ is proportional to the product of two Gaussian densities whose exponents are quadratic in $x_{t-1}$, which means that it will also be Gaussian.

After massaging this expression for a while, you’ll find out that:

\[\begin{equation} q\left(x_{t-1}\vert x_{t},x_{0}\right)=\mathcal{N}\left(x_{t-1}\vert \;a_{t}x_{0}+b_{t}x_{t},\;I\kappa_{t}^{2}\right) \end{equation}\]

where $a_{t}$, $b_{t}$ and $\kappa_{t}$ are some constants that depend on $t$ which don’t really matter for now. You can look below for their exact values, but the bottom line is that $q\left(x_{t-1}\vert x_{t},x_{0}\right)$ is also Gaussian and we can find an analytical expression for it.

Finding the Gaussian

Given $x_{0}$, $x_{t}$ and $x_{t-1}$ are jointly Gaussian:

\[\begin{equation} p\left(x_{t},x_{t-1}\vert x_{0}\right)=\mathcal{N}\left(\left(\begin{matrix}x_{t}\\ x_{t-1} \end{matrix}\right)\vert \quad\left(\begin{matrix}\bar{\gamma}_{t}x_{0}\\ \bar{\gamma}_{t-1}x_{0} \end{matrix}\right),\Sigma\right) \end{equation}\]

with (the cross-covariance follows from $x_{t}=\gamma_{t}x_{t-1}+\sigma_{t}\epsilon_{t}$, which gives $\text{cov}\left[x_{t},x_{t-1}\vert x_{0}\right]=I\gamma_{t}\bar{\sigma}_{t-1}^{2}$):

\[\begin{equation} \Sigma=\left(\begin{matrix}I\bar{\sigma}_{t}^{2} & I\gamma_{t}\bar{\sigma}_{t-1}^{2}\\ I\gamma_{t}\bar{\sigma}_{t-1}^{2} & I\bar{\sigma}_{t-1}^{2} \end{matrix}\right) \end{equation}\]

Using identities for conditionals of jointly Gaussian distributions (see my BML notes for this), we have:

\[\begin{align} \mathbb{E}\left[x_{t-1}\vert x_{t},x_{0}\right] & =\bar{\gamma}_{t-1}x_{0}+\frac{\gamma_{t}\bar{\sigma}_{t-1}^{2}}{\bar{\sigma}_{t}^{2}}\left(x_{t}-\bar{\gamma}_{t}x_{0}\right)\\ & =\left(\bar{\gamma}_{t-1}-\frac{\gamma_{t}\bar{\sigma}_{t-1}^{2}}{\bar{\sigma}_{t}^{2}}\bar{\gamma}_{t}\right)x_{0}+\frac{\gamma_{t}\bar{\sigma}_{t-1}^{2}}{\bar{\sigma}_{t}^{2}}x_{t}\\ \text{cov}\left[x_{t-1}\vert x_{t},x_{0}\right] & =I\left(\bar{\sigma}_{t-1}^{2}-\frac{\gamma_{t}^{2}\bar{\sigma}_{t-1}^{4}}{\bar{\sigma}_{t}^{2}}\right)=I\,\frac{\bar{\sigma}_{t-1}^{2}\sigma_{t}^{2}}{\bar{\sigma}_{t}^{2}} \end{align}\]

Define:

\[\begin{align} a_{t} & =\bar{\gamma}_{t-1}-\frac{\gamma_{t}\bar{\sigma}_{t-1}^{2}}{\bar{\sigma}_{t}^{2}}\bar{\gamma}_{t}\\ b_{t} & =\frac{\gamma_{t}\bar{\sigma}_{t-1}^{2}}{\bar{\sigma}_{t}^{2}}\\ \kappa_{t}^{2} & =\frac{\bar{\sigma}_{t-1}^{2}\sigma_{t}^{2}}{\bar{\sigma}_{t}^{2}} \end{align}\]

Then: \(\begin{equation} q\left(x_{t-1}\vert x_{t},x_{0}\right)=\mathcal{N}\left(x_{t-1}\vert \;a_{t}x_{0}+b_{t}x_{t},\;I\kappa_{t}^{2}\right) \end{equation}\)
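As a quick sketch of how these coefficients would be computed in practice from the schedule arrays used earlier (array names and layout are assumptions, with index $t$ corresponding to step $t$ and the convention $\bar{\gamma}_{0}=1$, $\bar{\sigma}_{0}^{2}=0$):

```python
import numpy as np

def posterior_coefficients(gamma, sigma2, gamma_bar, sigma2_bar, t):
    # Coefficients of q(x_{t-1} | x_t, x_0) = N(a_t x_0 + b_t x_t, I kappa_t^2), mainly useful for t >= 2
    b_t = gamma[t] * sigma2_bar[t - 1] / sigma2_bar[t]
    a_t = gamma_bar[t - 1] - b_t * gamma_bar[t]
    kappa2_t = sigma2_bar[t - 1] * sigma2[t] / sigma2_bar[t]
    return a_t, b_t, kappa2_t
```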

The Training Loss, Finally

Remember, what we wanted was to find the terms $D_{\text{KL}}\left(q\left(x_{t-1}\vert x_{t},x_{0}\right)\vert \vert p_{\theta}\left(x_{t-1}\vert x_{t}\right)\right)$. We now have everything we need to state what the loss at step $t$ actually equals, since the KL-divergence between two Gaussians has a closed form. In general, it is given by:

\[\begin{align} D_{\text{KL}}\left(\mathcal{N}\left(x\vert \mu_{x},\Sigma_{x}\right)\vert \vert \mathcal{N}\left(y\vert \mu_{y},\Sigma_{y}\right)\right)&=\frac{1}{2}\left[\log\left\vert \Sigma_{y}\Sigma_{x}^{-1}\right\vert +\text{trace}\left(\Sigma_{y}^{-1}\Sigma_{x}\right)+\right. \\ &\left.\left(\mu_{y}-\mu_{x}\right)^{T}\Sigma_{y}^{-1}\left(\mu_{y}-\mu_{x}\right)-\text{dim}\left(x\right)\right] \end{align}\]
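As a small sanity check, here is this formula specialized to two isotropic Gaussians $\mathcal{N}\left(\mu_{x},\sigma_{x}^{2}I\right)$ and $\mathcal{N}\left(\mu_{y},\sigma_{y}^{2}I\right)$, which is the only case we need here (a sketch with hypothetical argument names):

```python
import numpy as np

def kl_isotropic_gaussians(mu_x, sigma_x, mu_y, sigma_y):
    # KL( N(mu_x, sigma_x^2 I) || N(mu_y, sigma_y^2 I) ) for mean vectors in R^d
    d = mu_x.size
    ratio = (sigma_x / sigma_y) ** 2
    return 0.5 * (d * (np.log(1.0 / ratio) + ratio - 1.0)
                  + np.sum((mu_y - mu_x) ** 2) / sigma_y ** 2)
```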

In our setting, the covariances aren’t learnable parameters, and the dimension is constant, so we can just ignore everything that doesn’t correspond to the means. Our training loss is then:

\[\begin{equation} L\left(\theta\right)=\mathbb{E}_{x_{0},\,t,\,q\left(x_{t}\vert x_{0}\right)}\left[\frac{1}{\kappa_{t}^{2}}\| \mu_{\theta}\left(x_{t},t\right)-a_{t}x_{0}-b_{t}x_{t}\| ^{2}\right] \end{equation}\]

Notice that, as we defined it, our model $\mu_{\theta}\left(x_{t},t\right)$ already receives $x_{t}$ as an input. So instead of asking it to re-predict the part of the mean that comes from $x_{t}$, we can bake that part in and only learn the rest. If we choose:

\[\begin{equation} \mu_{\theta}\left(x_{t},t\right)=a_{t}s_{\theta}\left(x_{t},t\right)+b_{t}x_{t} \end{equation}\]

then the $b_{t}x_{t}$ terms cancel and the loss becomes:

\[\begin{equation} L\left(\theta\right)=\mathbb{E}_{x_{0},\,t,\,q\left(x_{t}\vert x_{0}\right)}\left[\frac{a_{t}^{2}}{\kappa_{t}^{2}}\| s_{\theta}\left(x_{t},t\right)-x_{0}\| ^{2}\right] \end{equation}\]

This, finally, explains the name “denoising diffusion” - all we want from our model is to denoise its inputs! Given a point $x_{t}$, our model basically tries to guess what $x_{0}$ was, and the loss at each step is weighted according to how much noise we expect to see at that step.
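Putting it together, one stochastic estimate of this final loss is very short. The sketch below assumes precomputed arrays `gamma_bar` and `sigma_bar` (for sampling $x_{t}\sim q\left(x_{t}\vert x_{0}\right)$ in one shot) and `weight`, where `weight[t]` holds $a_{t}^{2}/\kappa_{t}^{2}$; `s_theta` is the denoising network. All names are hypothetical:

```python
import torch

def ddpm_denoising_loss(s_theta, x0, gamma_bar, sigma_bar, weight, T):
    t = torch.randint(1, T + 1, (1,)).item()                        # random step
    x_t = gamma_bar[t] * x0 + sigma_bar[t] * torch.randn_like(x0)   # x_t ~ q(x_t | x_0)
    return weight[t] * ((s_theta(x_t, t) - x0) ** 2).sum()          # weighted denoising error
```

Note that only a single forward sample is needed per training step, which is exactly the variance advantage over the simple variant above.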


When is the Variational Bound Tight?

When we talked about VAEs, we said that the ELBO is tight when:

\[\begin{equation} q_{\phi}\left(z\vert x\right)=p_{\theta}\left(z\vert x\right) \end{equation}\]

Exactly the same considerations are true for diffusion models, where the ELBO is tight when:

\[\begin{equation} \forall t\quad p_{\theta}\left(x_{t-1}\vert x_{t}\right)=q\left(x_{t-1}\vert x_{t}\right) \end{equation}\]

But, unlike VAEs, choosing $p_{\theta}\left(x_{t-1}\vert x_{t}\right)$ as a Gaussian distribution is much more principled. It can be shown (I’ll give a semblance of a proof in the next post) that if we have enough steps with small enough differences between them (i.e. $\sigma_{t}\rightarrow0$) in the forward process, then the reverse process will also be a Markov chain with Gaussian transitions. In other words, if we design the forward process to have enough steps with small enough variances, then our modeling for the reverse process is correct.


Conclusion

The original DDPM paper made quite a splash, since training it was much simpler than training GANs (more on them later on) while the sample quality was very good. These results, paired with the simple denoising loss we found, were a big surprise, and consequently there was (and still is, at the time of writing) a lot of research into diffusion models.

Still, the naive DDPM (as described here) is a bit lacking as a generative model. This is mainly due to the fact that every operation (whether generation or inference) requires iterating over the whole Markov chain, which can be quite slow. As described here, getting good results requires Markov chains with around 1000 steps, which is like denoising 1000 images sequentially. The next post will focus on a way to generalize DDPMs into score-based generative models, which will ultimately enable faster sampling and easier control over design choices.

← Normalizing Flows | Score-Based Models →