Generative Models 5 - Score-Based Generative Models



While informative, the DDPM (or variational) view of diffusion models is quite limiting, in the sense that it is hard to get any theoretical intuition regarding ways to improve the generative model. For instance, the variational bound is only tight when $T\rightarrow\infty$ and $\sigma_{t}\rightarrow0$. Because of this, $T\approx1000$ steps need to be used to get meaningful results, which is a heavy computational load. How can we improve on this?

By slightly changing our perspective, we might be able to get a better understanding leading to better results.

Score-based generative models (SBGMs) slightly predate DDPM, but the two are actually pretty much equivalent. Still, I’m gonna start with SBGMs as if they’re completely different models and later we’ll see exactly how/if the two are related to each other.


Sampling with Langevin Dynamics

The backbone of SBGMs is sampling according to Langevin dynamics, a fairly intuitive algorithm that’s used very often. Given some distribution $p\left(x\right)$, the following iterative scheme is called the Langevin dynamics (LD) algorithm (I’m assuming that the sampling procedure goes backwards in time, even though this isn’t standard, just to remain consistent with what we saw before with DDPM):

\[\begin{equation} x_{t-\Delta t}=x_{t}+\frac{\Delta t}{2}\cdot\nabla_{x_{t}}\log p\left(x_{t}\right)+\sqrt{\Delta t}\eta_{t} \end{equation}\]

where $\eta_{t}\sim\mathcal{N}\left(0,I\right)$. It can be shown (we won’t do that here) that this algorithm samples from $p\left(x_{t}\right)$ as long as $\Delta t\rightarrow0$ and it is run for many iterations.
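
To make this concrete, here is a minimal NumPy sketch of the iteration above (the target distribution, step size and number of steps are toy choices of mine, nothing canonical):

```python
import numpy as np

def langevin_sample(score, x_init, n_steps=5_000, dt=1e-2, rng=None):
    """Crude (unadjusted) Langevin dynamics:
    x <- x + (dt / 2) * score(x) + sqrt(dt) * standard-normal noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        x = x + 0.5 * dt * score(x) + np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Toy check: the score of a standard Gaussian is -x, so 500 chains started
# far away at x = 5 should end up distributed roughly as N(0, I).
x0 = np.full((500, 2), 5.0)
samples = langevin_sample(lambda x: -x, x0)
print(samples.mean(axis=0), samples.var(axis=0))  # roughly [0, 0] and [1, 1]
```

Note that with a finite $\Delta t$ the chain only approximately targets $p\left(x\right)$, which is exactly the issue we’ll run into later.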

This algorithm is very very general and applicable in many situations, which is great. We’ll want to use this algorithm to sample new data points, which means that we’ll have to learn $s_{\theta}\left(x\right)\approx\nabla_{x}\log p\left(x\right)$, i.e. the score function of the distribution $p\left(x\right)$. Of course, if we don’t have access to $\log p\left(x\right)$, this will be rather difficult - let’s make our life a bit easier.


Tweedie Formula

How are we going to learn $\nabla_{x}\log p\left(x\right)$? Well, let’s start by thinking about a seemingly completely unrelated problem.

Suppose we observe a noisy image:

\[\begin{equation} x\sim p\left(x\right)\qquad y=x+\eta\quad\eta\sim\mathcal{N}\left(0,I\sigma^{2}\right) \end{equation}\]

This noisy image $y$ is sampled from a different distribution than $x$, basically a “noisy” version of $p\left(x\right)$. We can define this distribution exactly in the following way:

\[\begin{equation} p\left(y;\sigma^{2}\right)=\intop p\left(y\vert x;\sigma^{2}\right)p\left(x\right)dx=\intop\mathcal{N}\left(y\vert \,x,I\sigma^{2}\right)p\left(x\right)dx \end{equation}\]

Obviously the image $y$ is less probable than $x$ under $p\left(\cdot\right)$, right? But we’ll also expect it to be a little less probable under the noisy distribution, $p\left(\cdot;\sigma^{2}\right)$. That is, it is intuitive to believe that $p\left(y;\sigma^{2}\right)<p\left(x;\sigma^{2}\right)$. Then, if we have a _really really_ good denoiser $D\left(y;\sigma^{2}\right)$ that is able to denoise $y$ and return something that is basically $x$, then this’ll give us some indication as to the direction $\delta$ that we need to move $y$ in order for $p\left(y+\delta;\sigma^{2}\right)$ to increase. So, maybe, a really good denoiser will give us some information regarding the gradient:

\[\begin{equation} D\left(y;\sigma^{2}\right)\stackrel{?}{\approx}y+c\nabla_{y}\log p\left(y;\sigma^{2}\right) \end{equation}\]

where $c$ is some constant.

This simplistic intuition turns out to be correct, and is called Tweedie’s formula.

Theorem (Tweedie’s Formula):

Let $p\left(y;\sigma^{2}\right)$ be as defined above. Then:

\[\begin{equation} \nabla_{y}\log p\left(y;\sigma^{2}\right)=\frac{\mathbb{E}\left[x\vert y\right]-y}{\sigma^{2}} \end{equation}\]
Proof:

For the proof, we’ll just go at it directly:

\[\begin{align} \nabla_{y}\log p\left(y;\sigma^{2}\right) & =\frac{1}{p\left(y;\sigma^{2}\right)}\frac{\partial}{\partial y}p\left(y;\sigma^{2}\right)\\ & =\frac{1}{p\left(y;\sigma^{2}\right)}\frac{\partial}{\partial y}\intop\mathcal{N}\left(y\vert \,x,I\sigma^{2}\right)p\left(x\right)dx\\ & =\frac{1}{p\left(y;\sigma^{2}\right)}\intop p\left(x\right)\frac{\partial}{\partial y}\mathcal{N}\left(y\vert \,x,I\sigma^{2}\right)dx \end{align}\]

where the last step is possible because the integration variable is not $y$. We will now use the fact that the gradient of a Gaussian density is given by:

\[\begin{equation} \frac{\partial}{\partial y}\mathcal{N}\left(y\vert \,x,I\sigma^{2}\right)=\mathcal{N}\left(y\vert \,x,I\sigma^{2}\right)\cdot\left[-\frac{1}{\sigma^{2}}\left(y-x\right)\right] \end{equation}\]

Plugging this back in:

\[\begin{align} \nabla_{y}\log p\left(y;\sigma^{2}\right) & =\frac{1}{\sigma^{2}}\intop\frac{p\left(x\right)\mathcal{N}\left(y\vert \,x,I\sigma^{2}\right)}{p\left(y;\sigma^{2}\right)}\left[x-y\right]dx\\ & =\frac{1}{\sigma^{2}}\intop p\left(x\vert y;\sigma^{2}\right)\left[x-y\right]dx\\ & =\frac{1}{\sigma^{2}}\left[\mathbb{E}\left[x\vert y\right]-y\right] \end{align}\]

where the first step above was due to Bayes’ law. $\square$

Amazing! We now have a way to access the score of a distribution… well, almost. We can now access the score of the noisy distribution $p\left(y;\sigma^{2}\right)$, not the score of the original distribution $p\left(x\right)$. Of course, we’re not going to let that stop us.
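
As a quick sanity check (a toy example of my own, with made-up numbers), for a 1-D Gaussian prior both sides of Tweedie’s formula have closed forms and they agree:

```python
import numpy as np

# Prior x ~ N(m, s2), observation y = x + noise with variance sigma2.
m, s2, sigma2 = 2.0, 3.0, 0.5
y = np.linspace(-4.0, 6.0, 7)

# Left-hand side: score of the noisy marginal p(y; sigma2) = N(m, s2 + sigma2).
score = -(y - m) / (s2 + sigma2)

# Right-hand side: (E[x|y] - y) / sigma2, using the Gaussian posterior mean.
posterior_mean = (sigma2 * m + s2 * y) / (s2 + sigma2)
tweedie = (posterior_mean - y) / sigma2

print(np.allclose(score, tweedie))  # True
```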

Score Matching

In a moment we’ll talk about how to actually use the noisy scores, but let’s focus on the fact that we now have a roundabout way to access the score. What we showed above is that we basically only need to learn a really good denoiser in order to know the score of the noisy distribution. After all (for a more thorough explanation of this equality, see my summary of decision theory):

\[\begin{equation} \mathbb{E}\left[x\vert y\right]=\arg\min_{\hat{x}\left(y\right)}\mathbb{E}_{x\sim p\left(x\right),y\sim p\left(y\vert x;\sigma^{2}\right)}\left[\| \hat{x}\left(y\right)-x\| ^{2}\right] \end{equation}\]

So, if all we have is a dataset $\mathcal{D}=\left\{ x_{i}\right\} _{i=1}^{N}$ of samples from $p\left(x\right)$, what we want to do is train a denoiser with the following loss:

\[\begin{equation} L\left(\theta\right)=\mathbb{E}_{x\sim\mathcal{D},\eta\sim\mathcal{N}\left(0,I\sigma^{2}\right)}\left[\| D_{\theta}\left(x+\eta;\sigma^{2}\right)-x\| ^{2}\right] \end{equation}\]

and then our model for the score will be:

\[\begin{equation} s\left(y;\sigma^{2}\right)\approx s_{\theta}\left(y;\sigma^{2}\right)=\frac{D\left(y;\sigma^{2}\right)-y}{\sigma^{2}} \end{equation}\]

Training a model in this way is called score matching because we’re matching our model’s score to that of the data. In this case, we’re not exactly learning the score of the distribution itself, but rather of the “noisy” distribution; as we’ll see, though, it’s enough to use this noisy score to sample from the distribution. Specifically, if we do this when $\sigma$ is really small, then maybe the noisy distribution is close enough to the true distribution and:

\[\begin{equation} s\left(y;\sigma^{2}\right)\approx s\left(x\right) \end{equation}\]
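
Here is a minimal PyTorch sketch of this recipe on made-up 2-D toy data (the dataset, the tiny MLP and $\sigma_{0}$ are all placeholder choices of mine; a real image model would use a U-Net and a proper dataset):

```python
import torch
import torch.nn as nn

sigma0 = 0.1
# Toy "dataset": correlated 2-D Gaussian samples standing in for images.
data = torch.randn(10_000, 2) @ torch.tensor([[2.0, 0.0], [0.6, 0.3]])

denoiser = nn.Sequential(nn.Linear(2, 128), nn.SiLU(),
                         nn.Linear(128, 128), nn.SiLU(),
                         nn.Linear(128, 2))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

for step in range(2_000):
    x = data[torch.randint(len(data), (256,))]
    y = x + sigma0 * torch.randn_like(x)        # y = x + eta
    loss = ((denoiser(y) - x) ** 2).mean()      # the L(theta) above
    opt.zero_grad(); loss.backward(); opt.step()

def score_model(y):
    """s_theta(y; sigma0^2) = (D_theta(y) - y) / sigma0^2."""
    with torch.no_grad():
        return (denoiser(y) - y) / sigma0 ** 2
```

In principle, plugging `score_model` into a Langevin sampler like the one sketched earlier (modulo NumPy/torch conversion) already gives samples from a noisy version of the toy distribution.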


Sampling with Noisy Scores

We learned our $s\left(y;\sigma_{0}^{2}\right)$ for some small value of $\sigma_{0}\approx0$, but there’s a problem. LD only truly samples from the distribution when $\Delta t\rightarrow0$, but in real life we’ll use values of $\Delta t$ larger than 0. Since $\Delta t$ is larger than 0, noise is effectively added to our distribution. This is easiest to see if we look at the process without the score term, going forwards in time:

\[\begin{equation} x_{t+\Delta t}=x_{t}+\sqrt{\Delta t}\eta_{t} \end{equation}\]

It’ll be best if we take this added noise into account during sampling. Using the same tricks we had in DDPM, given $x_{0}$ we know how much noise was added up to time $t$:

\[\begin{equation} x_{t}=x_{0}+\sqrt{\sum_{n=1}^{t/\Delta t}\Delta t}\cdot\eta=x_{0}+\sqrt{\frac{t}{\Delta t}\cdot\Delta t}\cdot\eta=x_{0}+\sqrt{t}\cdot\eta\qquad\eta\sim\mathcal{N}\left(0,I\right) \end{equation}\]

So if we want to take into account the fact that at different stages of the Markov chain we’ll effectively see different amounts of noise, we should actually train our score model for every possible noise level (I’m going to assume $0\le t\le T$ and that $\Delta t\ll T$):

\[\begin{equation} L\left(\theta\right)=\mathbb{E}_{t\in[0,T],x\in\mathcal{D},\eta}\left[\| D_{\theta}\left(x+\sqrt{t}\eta;\;\max\left(\sigma_{0}^{2},t\right)\right)-x\| ^{2}\right] \end{equation}\]
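
In code, the only change from the single-level loss is that each sample gets its own noise level; a small sketch, assuming a noise-conditional network `denoiser(y, var)` that takes the noise variance as an extra input (that signature is my assumption, not anything standard):

```python
import torch

def multilevel_dsm_loss(denoiser, x, T=1.0, sigma0=1e-3):
    """Denoising loss averaged over noise levels, as in L(theta) above:
    t ~ U[0, T], corrupt x with variance t, condition on max(sigma0^2, t)."""
    t = torch.rand(x.shape[0], 1) * T               # one noise level per sample
    y = x + torch.sqrt(t) * torch.randn_like(x)     # x + sqrt(t) * eta
    var = torch.clamp(t, min=sigma0 ** 2)           # max(sigma0^2, t)
    return ((denoiser(y, var) - x) ** 2).mean()
```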

Having trained $D_{\theta}\left(y;\sigma^{2}\right)$ in this manner, our sampling algorithm might look something like the following:

  1. Sample $x_{T}\sim\mathcal{N}\left(0,I\cdot T\right)$
  2. for $t=T,\left(T-\Delta t\right),\left(T-2\Delta t\right),\cdots,\Delta t$:

    (a) $\eta\sim\mathcal{N}\left(0,I\right)$

    (b) $x_{t-\Delta t}=x_{t}+\Delta t\cdot\frac{D_{\theta}\left(x_{t};\;t\right)-x_{t}}{t}+\sqrt{\Delta t}\cdot\eta$

  3. return $x_0$
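
A direct translation of these three steps into code might look like this (again reusing the hypothetical noise-conditional `denoiser(x, var)` from the training sketch, with $\sigma_{t}^{2}=t$):

```python
import torch

@torch.no_grad()
def sample(denoiser, shape, T=1.0, dt=1e-2):
    x = torch.randn(shape) * T ** 0.5                # x_T ~ N(0, T * I)
    t = T
    while t > dt / 2:                                # t = T, T - dt, ..., dt
        eta = torch.randn_like(x)
        var = torch.full((shape[0], 1), t)
        score = (denoiser(x, var) - x) / t           # Tweedie with sigma^2 = t
        x = x + dt * score + dt ** 0.5 * eta
        t -= dt
    return x
```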

We now have a very simple algorithm for generating images. Also, notice that we trained a generative model without maximizing the log-likelihood! How strange.

The above is just the intuition behind the reverse sampling procedure, but it turns out that we can show mathematically that this is the true reverse process. I left an overview of the steps to show this below, but they’re a bit involved.

Reverse and forward processes

What we did above was, basically, to define the forward and reverse processes just like in DDPM. By saying that, given a clean image $x_{0}$, the distribution at time $t$ is given by:

\[\begin{equation} x_{t}=x_{0}+\sqrt{t}\cdot\eta_{t} \end{equation}\]

we defined the forward chain given by $p\left(x_{t}\vert x_{0}\right)$.

Having defined the forward process, we now want to know what $p\left(x_{t-\Delta t}\vert x_{t}\right)$ is equal to. Actually finding the reverse process is fairly difficult. Instead, I’ll try to give some intuition as to why this might be correct (this is an adaptation of a derivation from the literature). If $\Delta t\rightarrow0$, then the difference between $p\left(x_{t};t\right)$ and $p\left(x_{t-\Delta t};t-\Delta t\right)$ should be very small. At the same time, $x_{t}$ and $x_{t-\Delta t}$ are also going to be fairly similar in most cases, with:

\[\begin{equation} \mathbb{E}_{x_{t-\Delta t}}\left[\| x_{t}-x_{t-\Delta t}\| ^{2}\right]=\Delta t \end{equation}\]

Because we are looking at the limit of $\Delta t\rightarrow0$, we’re eventually going to ignore all terms that carry a factor of $\Delta t$. To indicate this, we’ll write, for instance, that $\| x_{t}-x_{t-\Delta t}\| ^{2}=O\left(\Delta t\right)$. We are now going to do a couple of consecutive Taylor approximations that are accurate at the limit $\Delta t\rightarrow0$:

\[\begin{equation} p\left(x_{t-\Delta t};t-\Delta t\right)\stackrel{\text{Taylor}}{=}p\left(x_{t-\Delta t};t\right)+\Delta t\cdot\frac{\partial}{\partial t}p\left(x_{t-\Delta t};t\right)+O\left(\Delta t\right) \end{equation}\]

I’m going to want to write down the logarithm of this distribution, so we are going to use another Taylor series:

\[\begin{align} & \log p \left(x_{t-\Delta t};t-\Delta t\right) =\log\left[p\left(x_{t-\Delta t};t\right)+\Delta t\cdot\frac{\partial}{\partial t}p\left(x_{t-\Delta t};t\right)+O\left(\Delta t\right)\right]\\ & =\log p\left(x_{t-\Delta t};t\right)+\log\left[1+\Delta t\frac{\partial_{t}p\left(x_{t-\Delta t};t\right)+O\left(1\right)}{p\left(x_{t-\Delta t};t\right)}\right]\\ & =\log p\left(x_{t-\Delta t};t\right)+\log\left[1+\Delta t\left(\frac{\partial}{\partial t}\log p\left(x_{t-\Delta t};t\right)+O\left(1\right)\right)\right]\\ & \stackrel{\text{Taylor}}{=}\log p\left(x_{t-\Delta t};t\right)+\Delta t\left(\frac{\partial}{\partial t}\log p\left(x_{t-\Delta t};t\right)\right)+O\left(\Delta t\right) \end{align}\]

Here we’re going to make an assumption: $\log p\left(\cdot;t\right)$ changes smoothly as a function of time $t$. Concretely, we will say that there exists some constant $k$ such that $\log p\left(\cdot;t\right)$ is $k$-Lipschitz in $t$. If this assumption holds then we can say that:

\[\begin{equation} \log p\left(x_{t-\Delta t};t-\Delta t\right)\stackrel{\text{Taylor}}{=}\log p\left(x_{t-\Delta t};t\right)+O\left(\Delta t\right) \end{equation}\]

That was (in essence) one Taylor approximation in the time domain. We will now do the same in the spatial domain:

\[\begin{align} \log p\left(x_{t-\Delta t};t\right) & \stackrel{\text{Taylor}}{=}\log p\left(x_{t};t\right) \\ & +\left(x_{t-\Delta t}-x_{t}\right)^{T}\nabla_{x_{t}}\log p\left(x_{t};t\right) \\ & +\frac{1}{2}\left(x_{t-\Delta t}-x_{t}\right)^{T}\frac{\partial^{2}\log p\left(x_{t};t\right)}{\partial x_{t}^{2}}\left(x_{t-\Delta t}-x_{t}\right) \\ & +\cdots \end{align}\]

Once again we’re going to have to make an assumption: $\log p\left(x_{t};t\right)$ changes slowly enough for all of its second order derivatives to be bounded. If this is true, we can write:

\[\begin{equation} \log p\left(x_{t-\Delta t};t\right)\stackrel{\text{Taylor}}{=}\log p\left(x_{t};t\right)+\left(x_{t-\Delta t}-x_{t}\right)^{T}s\left(x_{t};t\right)+O\left(\| x_{t-\Delta t}-x_{t}\| ^{2}\right) \end{equation}\]

Following from this, we can write:

\[\begin{equation} \log p\left(x_{t-\Delta t};t-\Delta t\right)\stackrel{\text{Taylor}\times3}{=}\log p\left(x_{t};\,t\right)+\left(x_{t-\Delta t}-x_{t}\right)^{T}s\left(x_{t};t\right)+O\left(\Delta t\right) \end{equation}\]

Using Bayes’ law:

\[\begin{align} \log p\left(x_{t-\Delta t}\vert x_{t}\right) & \propto\log p\left(x_{t}\vert x_{t-\Delta t}\right)+\log p\left(x_{t-\Delta t};t-\Delta t\right)\\ & =-\frac{1}{2\Delta t}\| x_{t-\Delta t}-x_{t}\| ^{2}+x_{t-\Delta t}^{T}s\left(x_{t};t\right)+O\left(\Delta t\right)+\text{const} \end{align}\]

At $\Delta t\rightarrow0$, this is a log-quadratic function with respect to $x_{t-\Delta t}$, so we know that $p\left(x_{t-\Delta t}\vert x_{t}\right)$ is a Gaussian distribution. To find the mean and variance, we can either complete the squares or differentiate the log-density. We’ll take the second option:

\[\begin{align} \frac{\partial}{\partial x_{t-\Delta t}}-\log p\left(x_{t-\Delta t}\vert x_{t}\right)&=\frac{1}{\Delta t}\left(x_{t-\Delta t}-x_{t}\right)-s\left(x_{t};t\right)\\&=\underbrace{\frac{1}{\Delta t}}_{\text{cov}\left[x_{t-\Delta t}\vert x_{t}\right]^{-1}}\left(x_{t-\Delta t}-\underbrace{\left(x_{t}+\Delta t\cdot s\left(x_{t};t\right)\right)}_{\mathbb{E}\left[x_{t-\Delta t}\vert x_{t}\right]}\right) \end{align}\]

So. What we found is that:

\[\begin{equation} p\left(x_{t-\Delta t}\vert x_{t}\right)\stackrel{\Delta t\rightarrow0}{\longrightarrow}\mathcal{N}\left(x_{t-\Delta t}\vert \quad x_{t}+\Delta t\cdot s\left(x_{t};t\right),\quad I\Delta t\right) \end{equation}\]


Changing the Scheduling of the Noise

As defined so far, the amount of noise at time $t$ that we “expect” to see in our samples is:

\[\begin{equation} \sigma_{t}^{2}=t \end{equation}\]

But we can generalize this and choose a different noising-schedule.

Suppose that at each iteration we add $g^{2}\left(t\right)\cdot\Delta t$ noise instead of just $\Delta t$ noise, giving the following forward process:

\[\begin{equation} x_{t+\Delta t}=x_{t}+\sqrt{\Delta t}\cdot g\left(t\right)\cdot\eta_{t} \end{equation}\]

Using the same arguments as above, the total amount of noise we’ll add (that is, we’re not taking into account the score) is:

\[\begin{equation} x_{t}=x_{0}+\sqrt{\sum_{n=1}^{t/\Delta t}g^{2}\left(n\cdot\Delta t\right)\Delta t}\cdot\eta\qquad\eta\sim\mathcal{N}\left(0,I\right) \end{equation}\]

The term $\sum_{n=1}^{t/\Delta t}g^{2}\left(n\cdot\Delta t\right)\Delta t$ is hard to understand in its current form. But actually, if we take $\Delta t\rightarrow0$, this is basically a Riemann sum of $g^{2}\left(t\right)$. So at time $t$ we actually have:

\[\begin{equation} \lim_{\Delta t\rightarrow0}\sum_{n=1}^{t/\Delta t}g^{2}\left(n\cdot\Delta t\right)\Delta t=\intop_{0}^{t}g^{2}\left(\tau\right)d\tau=\sigma^{2}\left(t\right) \end{equation}\]
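
A quick numeric check of this Riemann-sum argument, for a schedule I just made up, $g\left(t\right)=t$ (so the integral is $t^{3}/3$):

```python
import numpy as np

t = 2.0
for dt in [1e-1, 1e-2, 1e-3]:
    n = np.arange(1, int(t / dt) + 1)
    total_var = np.sum((n * dt) ** 2 * dt)   # sum_n g^2(n * dt) * dt with g(t) = t
    print(dt, total_var)                     # 2.87, 2.687, 2.669 -> t^3 / 3 = 2.667
```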

Doing this is like “dilating” time a bit. Instead of moving $\Delta t$ at each iteration, the process moves $g^{2}\left(t\right)\Delta t$ at each iteration, which we have to somehow take into account in the reverse process as well. This means that our sampling algorithm now becomes:

\[\begin{equation} x_{t-\Delta t}=x_{t}+\Delta t\cdot g^{2}\left(t\right)\cdot s_{\theta}\left(x_{t};\;\sigma^{2}\left(t\right)\right)+\sqrt{\Delta t}\cdot g\left(t\right)\cdot\eta_{t}\label{eq:scheduled-SGBM} \end{equation}\]

We are now completely free to choose the schedule of added noise $g\left(t\right)$ in any way we wish. Or we can choose to specify the total added noise $\sigma\left(t\right)$ instead, by noticing that:

\[\begin{equation} \sigma^{2}\left(t\right)=\intop_{0}^{t}g^{2}\left(\tau\right)d\tau\Rightarrow g^{2}\left(t\right)=\frac{d}{dt}\sigma^{2}\left(t\right)=2\sigma\left(t\right)\frac{d}{dt}\sigma\left(t\right) \end{equation}\]

which means we can rewrite equation \eqref{eq:scheduled-SGBM} as:

\[\begin{equation} x_{t-\Delta t}=x_{t}+\Delta t\cdot2\sigma\left(t\right)\frac{d}{dt}\sigma\left(t\right)\cdot s_{\theta}\left(x_{t};\;\sigma^{2}\left(t\right)\right)+\sqrt{\Delta t\cdot2\sigma\left(t\right)\frac{d}{dt}\sigma\left(t\right)}\cdot\eta_{t} \end{equation}\]
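
As a small sketch, one reverse update under a user-chosen schedule might look like this (assuming a trained score model `score_fn(x, var)`; the schedule functions `sigma` and `dsigma_dt` are whatever you pick):

```python
import torch

def reverse_step(x, t, dt, score_fn, sigma, dsigma_dt):
    """One step of the scheduled sampler above, with g^2(t) = 2 * sigma(t) * sigma'(t)."""
    g2 = 2.0 * sigma(t) * dsigma_dt(t)
    eta = torch.randn_like(x)
    return x + dt * g2 * score_fn(x, sigma(t) ** 2) + (dt * g2) ** 0.5 * eta

# Choosing sigma(t) = sqrt(t) gives g^2(t) = 1 and recovers the sampler we had before, e.g.:
# reverse_step(x, t, dt, score_fn, lambda t: t ** 0.5, lambda t: 0.5 / t ** 0.5)
```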


Continuous Time

For the last few sections of this summary, we have assumed that $\Delta t\rightarrow0$. This means that, more or less, for any time point $t$ the random variable $x\left(t\right)$ has some well-defined distribution. Instead of writing update rules, as we’ve done so far, we can recast everything into differential equations.

For a moment, let’s ignore the noise we’re adding at each step of the sampling procedure. In such a case, we have the following update step:

\[\begin{equation} x_{t-\Delta t}=x_{t}+\Delta t\cdot g^{2}\left(t\right)\cdot s_{\theta}\left(x_{t};\;\sigma^{2}\left(t\right)\right) \end{equation}\]

We can move $x_{t}$ to the left-hand side of the equation and divide by $\Delta t$ to get something that looks an awful lot like a derivative (the minus sign in the following is because we so consistently insisted that the sampling procedure should happen in the “reverse process” - in other words, we had this weird insistence that going backwards in time is the direction we need to go in order to “reverse the noising procedure”. This was just a choice; we could’ve defined everything in the more intuitive direction and be rid of this minus sign, but it’s here to stay now):

\[\begin{equation} \frac{x_{t}-x_{t-\Delta t}}{\Delta t}=-g^{2}\left(t\right)\cdot s_{\theta}\left(x_{t};\;\sigma^{2}\left(t\right)\right)\label{eq:continuous-SGBM} \end{equation}\]

Taking the limit of $\Delta t\rightarrow0$, this is exactly the definition of an ordinary differential equation (ODE) given by:

\[\begin{equation} \frac{dx}{dt}=-g^{2}\left(t\right)\cdot s_{\theta}\left(x\left(t\right);\;\sigma^{2}\left(t\right)\right) \end{equation}\]

Of course, when we sampled we also added noise. The typical notation for the stochastic differential equation (SDE) defined in equation \eqref{eq:scheduled-SGBM} is (using something called Itô calculus notation; this reverse process has been formally derived in the literature):

\[\begin{equation} dx=-\underbrace{g^{2}\left(t\right)\cdot s_{\theta}\left(x\left(t\right);\;\sigma^{2}\left(t\right)\right)\cdot dt}_{\text{deterministic (drift)}}+\underbrace{g\left(t\right)dW\left(t\right)}_{\text{stochastic (diffusion)}} \end{equation}\]

where $dW\left(t\right)$ is the increment of a Wiener process, which for our purposes is defined by the following property:

\[\begin{equation} dx=g\left(t\right)dW\left(t\right)\Leftrightarrow p\left(x\left(t\right)\vert x\left(0\right)\right)=\mathcal{N}\left(x\left(t\right)\vert \;0,\quad I\intop_{0}^{t}g^{2}\left(\tau\right)d\tau\right) \end{equation}\]

All of this is just notation for what we talked about above at the limit of $\Delta t\rightarrow0$. All SDEs of this form can be separated into a deterministic drift term, multiplied by $dt$ to remind us that only time matters, and a stochastic diffusion term, multiplied by $dW\left(t\right)$ to remind us that a random walk is taking place on top of the deterministic drift.

In general, solving SDEs - in other words, finding an equation $x\left(t\right)$ for all $t$ - is very very difficult. Instead, numerical solutions are often employed that simply try to follow the path defined by the SDE. This process is sometimes called integrating the SDE or discretizing it, which is what we’ve been doing all along. Our discretization is the most naive one and is very bad when $\Delta t$ is not sufficiently small, but it makes everything so intuitive. This discretization is called a first-order Euler-Maruyama discretization - first order because we only use first derivatives (and not higher-order ones) and Euler-Maruyama because Euler invented pretty much everything, including the first solvers for ODEs. Maruyama extended the ODE solver to the SDE case as well, I think.


A Deterministic Alternative

All of our derivations so far assumed that we’re okay with a stochastic algorithm for generating new samples. But using stochastic algorithms is actually kind of a pain. Many applications of (decoder-based) generative models assume that the mapping from the latent codes $z$ to the image space $x$ is deterministic and easy to manipulate. Can we make score-based models (or diffusion models) deterministic?

Well, consider the following SDE for the sampling procedure:

\[\begin{equation} dx=-s_{\theta}\left(x\left(t\right);\;\sigma^{2}\left(t\right)\right)\cdot dt+dW\left(t\right) \end{equation}\]

We already know how samples from this distribution look at any time $t$, right? As we already showed before:

\[\begin{equation} x_{t}=x_{0}+\sqrt{t}\eta\qquad\eta\sim\mathcal{N}\left(0,I\right) \end{equation}\]

So maybe all we need is a deterministic algorithm that receives $x_{t}$ and somehow reduces the noise?

Let’s try to build such an algorithm.

When $x_0$ is Known

What we’ll try to do now is build a mapping $v\left(\cdot;t\right)$ such that:

\[\begin{equation} x_{t}\sim p\left(x;t\right)\quad\Rightarrow x_{t-\Delta t}=v\left(x_{t};t\right)\sim p\left(x;t-\Delta t\right) \end{equation}\]

Notice that if someone gives us $x_{0}$, this becomes simple since:

\[\begin{align} x_{t-\Delta t} & =x_{0}+\sqrt{t-\Delta t}\cdot\eta\\ x_{t} & =x_{0}+\sqrt{t}\cdot\eta \end{align}\]

So if we have $x_{0}$, we can just do the following:

\[\begin{equation} x_{t-\Delta t}=x_{0}+\frac{\sqrt{t-\Delta t}}{\sqrt{t}}\overbrace{\left(x_{t}-x_{0}\right)}^{\sqrt{t}\eta}\label{eq:cond-map} \end{equation}\]

In essence, this is like a conditional mapping: conditioned on the fact that we know $x_{0}$, the mapping function from $x_{t}$ to $x_{t-\Delta t}$ is as given in equation \eqref{eq:cond-map}.

You can kind of consider this case as if:

\[\begin{equation} x_{t}\sim p\left(x\vert x_{0};t\right) \end{equation}\]

in which case:

\[\begin{equation} v\left(x_{t}\vert x_{0};t\right)\sim p\left(x\vert x_{0};t-\Delta t\right) \end{equation}\]
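
A quick numeric check of the conditional mapping in equation \eqref{eq:cond-map} (toy numbers of my choosing): the mapped points should have exactly the time-$\left(t-\Delta t\right)$ marginal around $x_{0}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x0, t, dt = 3.0, 1.0, 0.4
xt = x0 + np.sqrt(t) * rng.standard_normal(100_000)   # samples of x_t given x_0
mapped = x0 + np.sqrt((t - dt) / t) * (xt - x0)        # the conditional map
print(mapped.mean(), mapped.var())                     # ~3.0 and ~0.6 = t - dt
```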

We now want to find a mapping that works when we don’t know $x_{0}$.

When $x_0$ is Unknown

Instead of using $x_{0}$, we’re basically going to use our _best guess_ for $x_{0}$. That’s right, we’ll use the conditional mean $\mathbb{E}\left[x_{0}\vert x_{t}\right]$ as a stand-in for $x_{0}$.

This means that our mapping function will be:

\[\begin{equation} x_{t-\Delta t}=v\left(x_{t};t\right)=\mathbb{E}\left[x_{0}\vert x_{t}\right]+\frac{\sqrt{t-\Delta t}}{\sqrt{t}}\left(x_{t}-\mathbb{E}\left[x_{0}\vert x_{t}\right]\right)\label{eq:marginal-map} \end{equation}\]

While this seems naive, it turns out it’s actually the correct thing to do.

Basically, we thought of $v\left(x_{t}\vert x_{0};t\right)$ as a function over random variables. Using the change of variable formula, we have:

\[\begin{equation} p\left(v\left(x_{t}\vert x_{0};t\right)\vert x_{0};t-\Delta t\right)=\frac{p\left(x_{t}\vert x_{0};t\right)}{\left\vert \det\frac{\partial v\left(x_{t}\vert x_{0};t\right)}{\partial x_{t}}\right\vert } \end{equation}\]

If we want to define a mapping that is correct for any $x_{0}$ we need to somehow take the expectation over $x_{0}$.

It turns out that taking the expectation over the mapping $v\left(\cdot\vert x_{0};t\right)$ is enough. That is, if we define:

\[\begin{equation} v\left(x_{t};t\right)=\intop v\left(x_{t}\vert x_{0};t\right)p\left(x_{0}\vert x_{t}\right)dx_{0}=\mathbb{E}_{x_{0}}\left[v\left(x_{t}\vert x_{0};t\right)\vert x_{t}\right]\label{eq:expectation-over-flow} \end{equation}\]

then:

\[\begin{equation} p\left(v\left(x_{t};t\right);t-\Delta t\right)=p\left(x_{t};t\right)/\left\vert \text{det}\left[\frac{\partial v\left(x_{t};t\right)}{\partial x_{t}}\right]\right\vert \end{equation}\]

If we take the expectation over the mapping $v\left(x_{t}\vert x_{0};t\right)$ as we defined in equation \eqref{eq:cond-map}, then we get equation \eqref{eq:marginal-map}.

Expectations over vector fields

While equation \eqref{eq:expectation-over-flow} can be proven formally, it also makes some intuitive sense. Again, we can think of $v\left(x_{t}\vert x_{0};t\right)$ as the “correct” mapping function (or flow) if we know that we need to end up at $x_{0}$. Then, if we don’t know the end point, the expectation $\mathbb{E}_{x_{0}}\left[v\left(x_{t}\vert x_{0};t\right)\vert x_{t}\right]$ weighs all of the possible end points $x_{0}$, giving the largest weight to the one where the path from $x_{t}$ to $x_{0}$ is most likely.

As a constructed example, suppose that:

\[\begin{equation} p\left(x_{0}\right)=\frac{1}{2}\mathcal{N}\left(-1,\varphi^{2}\right)+\frac{1}{2}\mathcal{N}\left(1,\varphi^{2}\right) \end{equation}\]

with $\varphi^{2}\ll1$. Then if $x_{t}\gg0$, its path most likely ends at the cluster $\mathcal{N}\left(1,\varphi^{2}\right)$. If $x_{t}\ll0$, then it should probably end up near $-1$. On the other hand, if $t$ is really large or $x_{t}\approx0$, then there is no real preference as to where $x_{t}$ should end up, and the flow should push it to be somewhere in between the two. That’s the kind of intuition, anyway.
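
This intuition is easy to check numerically for the mixture above; a small sketch ($\varphi^{2}$, $t$ and the probe points are my own choices):

```python
import numpy as np

phi2, t = 0.01, 1.0
mus = np.array([-1.0, 1.0])                       # the two mixture means

def gauss_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def posterior_mean_x0(xt):
    """E[x0 | xt]: per-component Gaussian posterior means, weighted by how
    responsible each component is for xt (whose variance is phi2 + t)."""
    resp = gauss_pdf(xt, mus, phi2 + t)
    resp = resp / resp.sum()
    comp_means = (t * mus + phi2 * xt) / (phi2 + t)
    return float((resp * comp_means).sum())

for xt in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(xt, round(posterior_mean_x0(xt), 3))
# Far from 0 the best guess snaps to -1 or +1; near 0 it sits in between.
```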

The Deterministic Mapping

From Tweedie’s formula, we can rewrite equation \eqref{eq:marginal-map} as:

\[\begin{align} x_{t-\Delta t} & =x_{t}+t\cdot s\left(x_{t};t\right)-\frac{\sqrt{t-\Delta t}}{\sqrt{t}}t\cdot s\left(x_{t};t\right)\\ & =x_{t}+\left(\sqrt{t}-\sqrt{t-\Delta t}\right)\cdot\sqrt{t}\cdot s\left(x_{t};t\right) \end{align}\]

This formula is kinda awkward. When $\Delta t\rightarrow0$, we can actually simplify things a bit:

\[\begin{equation} \lim_{\Delta t\rightarrow0}\frac{\sqrt{t}-\sqrt{t-\Delta t}}{\Delta t}=\frac{d\sqrt{t}}{dt}=\frac{1}{2\sqrt{t}} \end{equation}\]

So:

\[\begin{align} \lim_{\Delta t\rightarrow0}\left(\sqrt{t}-\sqrt{t-\Delta t}\right)\cdot\sqrt{t} & =\lim_{\Delta t\rightarrow0}\frac{\sqrt{t}-\sqrt{t-\Delta t}}{\Delta t}\cdot\Delta t\sqrt{t}\\ & =\frac{\Delta t\cdot\sqrt{t}}{2\cdot\sqrt{t}}=\frac{\Delta t}{2} \end{align}\]

Finally, we can write:

\[\begin{equation} x_{t-\Delta t}\stackrel{\Delta t\rightarrow0}{=}x_{t}+\frac{1}{2}\Delta t\cdot s\left(x_{t};t\right) \end{equation}\]

We can now move to continuous time, as we did in equation \eqref{eq:continuous-SGBM}, to find that:

\[\begin{equation} \frac{dx}{dt}=-\frac{1}{2}s\left(x\left(t\right);t\right) \end{equation}\]

Let the last few steps sink in. We basically started by saying that we have an SDE of the following form:

\[\begin{equation} dx=-s\left(x;t\right)dt+dW\left(t\right) \end{equation}\]

and we found that this SDE is equivalent to an ODE. If you think it’s weird that we can do this, you’re not alone. And why that $\frac{1}{2}$? I’m just as surprised as you.
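
In code, the deterministic sampler is just the stochastic one from before with the noise term dropped and the drift halved; a sketch, again assuming the hypothetical noise-conditional `denoiser(x, var)`:

```python
import torch

@torch.no_grad()
def sample_ode(denoiser, shape, T=1.0, dt=1e-2):
    x = torch.randn(shape) * T ** 0.5                 # start from the same prior
    t = T
    while t > dt / 2:
        var = torch.full((shape[0], 1), t)
        score = (denoiser(x, var) - x) / t
        x = x + 0.5 * dt * score                      # no noise, half the drift
        t -= dt
    return x
```

Running this twice from the same $x_{T}$ gives the same $x_{0}$, which is exactly the determinism we wanted.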

Generalizing to other SDEs

What we showed above is only true when $g\left(t\right)=1$. Given an SDE of the form:

\[\begin{equation} dx=\left(f\left(t\right)x-g^{2}\left(t\right)\cdot s\left(x;t\right)\right)dt+g\left(t\right)dW\left(t\right) \end{equation}\]

then there is an equivalent ODE given by:

\[\begin{equation} \frac{dx}{dt}=f\left(t\right)x-\frac{1}{2}g^{2}\left(t\right)\cdot s\left(x;t\right) \end{equation}\]

What do I mean by “equivalent”? Basically, if we start with:

\[\begin{equation} x\left(T\right)\sim p\left(x;T\right) \end{equation}\]

and iterate backwards in time according to the SDE or the ODE, then for every time point $t$, we’ll get:

\[\begin{equation} x\left(t\right)\vert x\left(T\right)\sim p\left(x;t\right) \end{equation}\]

That is, for each time point $t$, the marginal distributions $p\left(x;t\right)$ of the SDE and the ODE will be the same. But this is only true if we started the chains from $p\left(x;T\right)$! This ODE is sometimes called the probability flow ODE, as it has some relation to the continuous normalizing flows we talked about before.

Actually, because of this relation to continuous normalizing flows, the likelihood of a sample $x\left(0\right)$ can be calculated under the ODE formulation, with:

\[\begin{equation} \log p\left(x\left(0\right)\right)=\log p\left(x\left(T\right);T\right)-\frac{1}{2}\intop_{0}^{T}g^{2}\left(\tau\right)\text{trace}\left[\frac{\partial s_{\theta}\left(x\left(\tau\right);\tau\right)}{\partial x\left(\tau\right)}\right]d\tau \end{equation}\]


Conclusion

Diffusion models, as we have seen in the last two sections, have become incredibly popular because they are able to generate extremely realistic samples. However, if we want to use these models for anything beyond generating samples, they are quite cumbersome. For instance, we might want to use the likelihood of the model to solve some task. While we can get a lower bound for the likelihood with DDPM or an exact value with SBGM, many many steps are needed to get an accurate estimate, making diffusion models difficult to use in some practical applications.

