So, VAEs were kind of a headache. If we look back and try to pin down why, the main reasons were: the latent space was smaller than the data space, adding an observation model meant we needed an integral to get back to the latent space, and we had no direct way to “invert” the generator. Other than that, VAEs are very general, and the only constraints are those imposed implicitly by the encoder/decoder.
Instead, we can try to add stronger constraints back into the mix, but in a way that ensures the generator is invertible. If we manage to do this, we can move between the data space and the latent space at will. This is exactly the idea behind normalizing flows.
Again, we assume $z\sim p_{z}\left(z\right)$, but now we take $x=G_{\theta}\left(z\right)$ with $\text{dim}\left(z\right)=\text{dim}\left(x\right)$. That is, no observation noise. If $G_{\theta}\left(z\right)$ is bijective (one-to-one and onto), then the probability of $x$ can be written in terms of the probability of $z$:
\[\begin{equation} p_{\theta}\left(x\right)=p_{z}\left(G_{\theta}^{-1}\left(x\right)\right)\left\vert \text{det}\left[\frac{\partial G_{\theta}^{-1}\left(x\right)}{\partial x}\right]\right\vert \end{equation}\]where $\partial G_{\theta}^{-1}\left(x\right)/\partial x$ is the Jacobian of the inverse mapping $z=G_{\theta}^{-1}\left(x\right)$. So we can just directly use the MLE criterion to train $G_{\theta}\left(z\right)$:
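As a quick sanity check of this formula, take the simplest one-dimensional case: $z\sim\mathcal{N}\left(0,1\right)$ and $x=G_{\theta}\left(z\right)=\sigma z+\mu$ with $\sigma>0$. Then $G_{\theta}^{-1}\left(x\right)=\left(x-\mu\right)/\sigma$, the Jacobian of the inverse is just $1/\sigma$, and we get:
\[\begin{equation} p_{\theta}\left(x\right)=p_{z}\left(\frac{x-\mu}{\sigma}\right)\cdot\frac{1}{\sigma}=\frac{1}{\sqrt{2\pi}\sigma}e^{-\left(x-\mu\right)^{2}/2\sigma^{2}} \end{equation}\]which is exactly the $\mathcal{N}\left(\mu,\sigma^{2}\right)$ density, as we would hope.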
\[\begin{equation} L\left(\theta\right)=-\frac{1}{N}\sum_{i=1}^{N}\log p_{\theta}\left(x_{i}\right)=-\frac{1}{N}\sum_{i=1}^{N}\left(\log p_{z}\left(G_{\theta}^{-1}\left(x_{i}\right)\right)+\log\left\vert \text{det}\left[\frac{\partial G_{\theta}^{-1}\left(x_{i}\right)}{\partial x_{i}}\right]\right\vert \right)\label{eq:change-MLE} \end{equation}\]We are, of course, going to use multiple layers to define our mapping, so that:
\[\begin{equation} x=G_{\theta}\left(z\right)=\ell_{L}\circ\ell_{L-1}\circ\cdots\circ\ell_{1}\left(z\right)\Leftrightarrow z=G_{\theta}^{-1}\left(x\right)=\ell_{1}^{-1}\circ\ell_{2}^{-1}\circ\cdots\circ\ell_{L}^{-1}\left(x\right) \end{equation}\]where $L$ is the number of layers we use and $\ell_{i}\left(\cdot\right)$ is the $i$-th layer. Let’s call $y_{i}=\ell_{i}^{-1}\left(y_{i+1}\right)$ the output of the inverse of $i$-th layer with $y_{L+1}=x$, then we can rewrite equation \eqref{eq:change-MLE} as:
\[\begin{equation} L\left(\theta\right)=-\frac{1}{N}\sum_{i=1}^{N}\left(\log p_{z}\left(G_{\theta}^{-1}\left(x_{i}\right)\right)+\sum_{j=1}^{L}\log\left\vert \text{det}\left[\frac{\partial\ell_{j}^{-1}\left(y_{j+1}\right)}{\partial y_{j+1}}\right]\right\vert \right) \end{equation}\]through the chain rule. In other words, if we know the log-absolute-determinant of each layer individually, we just have to calculate them on the fly.
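To make this concrete, here is a minimal sketch (in PyTorch) of how this loss could be computed. It assumes each layer exposes a hypothetical `inverse(y)` method returning both the previous activation and the log-absolute-determinant of the inverse's Jacobian; the layer sketches later in this post follow the same made-up interface.

```python
import torch

def flow_nll(x, layers, base_dist):
    """Negative log-likelihood of a batch x under a flow of invertible layers.

    Assumes each layer has an `inverse(y) -> (y_prev, log_det)` method, where
    `log_det` is the log-absolute-determinant of the Jacobian of the inverse.
    """
    y = x
    total_log_det = torch.zeros(x.shape[0])
    for layer in reversed(layers):  # walk backwards from x towards z
        y, log_det = layer.inverse(y)
        total_log_det = total_log_det + log_det
    # base density of z, e.g. a standard Normal applied per coordinate
    log_pz = base_dist.log_prob(y).sum(dim=1)
    return -(log_pz + total_log_det).mean()
```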
The question now becomes: how can we define $\ell_{i}\left(\cdot\right)$ so that it is invertible and it’s easy to calculate the determinant of the Jacobian?
To ensure that the determinant is easily calculable, we essentially have three types of transformations we can use: element-wise transformations, coupling layers, and autoregressive layers.
There exist more types of transformations and ways to make this even more confusing, but these are the basics so we’ll stay with them. At any rate, in each of these cases calculating the determinant is feasible and not much slower than just a usual forward pass through the network.
We can use any element-wise transformation that we want, as long as it is invertible in each coordinate and the derivative for each coordinate is simple to calculate.
The easiest invertible transformations we can start thinking about are of the following sort:
\[\begin{equation} \ell\left(x\right)=e^{s}\odot x+b\quad x,s,b\in\mathbb{R}^{d}\quad\Leftrightarrow\ell^{-1}\left(y\right)=e^{-s}\odot\left(y-b\right) \end{equation}\]In this case, the inversion is very easy to describe. Also, we ensure that the layer is always invertible by multiplying $x$ element-wise with $e^{s}$ instead of $s$ directly - this ensures that $\left[e^{s}\right]_{i}>0$ is always true. The Jacobian of the inverse of this transformation is simply $\text{diag}\left(e^{-s}\right)$, which is really easy to calculate on the fly. Sometimes this transformation is called an ActNorm layer and is initialized so that $e^{-s}=1/\sigma$, where $\sigma$ is the standard deviation of the incoming signal, and $b$ is equal to its mean, essentially standardizing the incoming signals.
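A minimal sketch of such a layer, following the same `inverse(y)` interface as before (the data-dependent ActNorm initialization is left out for brevity):

```python
import torch
import torch.nn as nn

class ElementwiseAffine(nn.Module):
    """y = exp(s) * x + b, applied coordinate-wise (an ActNorm-style layer)."""
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return torch.exp(self.s) * x + self.b

    def inverse(self, y):
        x = torch.exp(-self.s) * (y - self.b)
        # the Jacobian of the inverse is diag(exp(-s)), so log|det| = -sum(s)
        log_det = -self.s.sum() * torch.ones(y.shape[0])
        return x, log_det
```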
Another useful transformation is one that moves everything from the range $\left(-\infty,\infty\right)$ to $\left(0,1\right)$, i.e. the sigmoid:
\[\begin{equation} \text{sig}\left(x_{i}\right)=\frac{1}{1+e^{-x_{i}}}\Leftrightarrow\text{sig}^{-1}\left(y_{i}\right)=\log\left(\frac{y_{i}}{1-y_{i}}\right) \end{equation}\]The derivative of this transformation is given by:
\[\begin{equation} \frac{d}{dx_{i}}\text{sig}\left(x_{i}\right)=\text{sig}\left(x_{i}\right)\cdot\left(1-\text{sig}\left(x_{i}\right)\right) \end{equation}\]which is also pretty simple to calculate on the fly.
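In the same spirit, a sketch of the sigmoid as an invertible layer (assuming the inputs to `inverse` lie strictly inside $\left(0,1\right)$):

```python
import torch

class SigmoidLayer:
    """Element-wise sigmoid with the log-determinant of its inverse Jacobian."""
    def forward(self, x):
        return torch.sigmoid(x)

    def inverse(self, y):
        x = torch.log(y) - torch.log1p(-y)  # the logit function
        # d sig(x)/dx = sig(x)(1 - sig(x)) = y(1 - y), so the inverse Jacobian is
        # diag(1 / (y(1 - y))) and its log-abs-determinant is -sum(log y + log(1 - y))
        log_det = -(torch.log(y) + torch.log1p(-y)).sum(dim=1)
        return x, log_det
```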
While element-wise transformations are great, they don’t allow the model to learn anything regarding the correlations (or higher order statistics) between coordinates. Really, we need transformations that mix up the information. The most commonly used types of layers that act in this manner are called affine coupling layers. These layers are defined as follows:
\[\begin{equation} \ell\left(\left[\begin{matrix}x_{a}\\ x_{b} \end{matrix}\right]\right)=\left[\begin{matrix}x_{a}\\ x_{b}\odot\exp\left[f_{s}\left(x_{a}\right)\right]+f_{t}\left(x_{a}\right) \end{matrix}\right]\qquad x_{a},x_{b}\in\mathbb{R}^{d/2} \end{equation}\]The inverse of this transformation is also easily defined:
\[\begin{equation} x=\left[\begin{matrix}x_{a}\\ x_{b} \end{matrix}\right]=\ell^{-1}\left(\left[\begin{matrix}y_{a}\\ y_{b} \end{matrix}\right]\right)=\left[\begin{matrix}y_{a}\\ \left(y_{b}-f_{t}\left(y_{a}\right)\right)\odot\exp\left[-f_{s}\left(y_{a}\right)\right] \end{matrix}\right] \end{equation}\]This splitting of the inputs into two makes the transformation invertible but also makes sure that the Jacobian is triangular:
\[\begin{equation} \frac{\partial\ell^{-1}\left(y\right)}{\partial y}=\left[\begin{matrix}I & 0\\ \frac{\partial g\left(y_{b}\vert y_{a}\right)}{\partial y_{a}} & \qquad\text{diag}\left(\exp\left[-f_{s}\left(y_{a}\right)\right]\right) \end{matrix}\right] \end{equation}\]where $g\left(y_{b}\vert y_{a}\right)=\left(y_{b}-f_{t}\left(y_{a}\right)\right)\odot\exp\left[-f_{s}\left(y_{a}\right)\right]$. Because of the zero in the top right corner, the determinant of this Jacobian is simply the product of the terms on the diagonal, so we don’t need to bother with actually calculating $\frac{\partial g\left(y_{b}\vert y_{a}\right)}{\partial y_{a}}$, fortunately.
Because the determinant never touches the derivatives in the bottom left corner, we can use functions of any complexity for $f_{s}\left(x_{a}\right)$ and $f_{t}\left(x_{a}\right)$ without making the likelihood any harder to compute. This fact is essential for the flexibility of many of these models.
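As a rough sketch of how this might look in code, with $f_{s}$ and $f_{t}$ as small MLPs (the even split and the hidden width are arbitrary choices here):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling: the first half conditions the scale/shift of the second half."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        # f_s and f_t can be networks of any complexity; small MLPs here
        self.f_s = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim - self.half))
        self.f_t = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim - self.half))

    def forward(self, x):
        xa, xb = x[:, :self.half], x[:, self.half:]
        yb = xb * torch.exp(self.f_s(xa)) + self.f_t(xa)
        return torch.cat([xa, yb], dim=1)

    def inverse(self, y):
        ya, yb = y[:, :self.half], y[:, self.half:]
        s = self.f_s(ya)
        xb = (yb - self.f_t(ya)) * torch.exp(-s)
        # triangular Jacobian: log|det| is just the sum of the diagonal exponents
        log_det = -s.sum(dim=1)
        return torch.cat([ya, xb], dim=1), log_det
```

Note that in practice the roles of the two halves are swapped (or the coordinates permuted) between successive coupling layers, otherwise the first half would never be transformed at all.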
The above coupling layer was a single split of the input into two parts. A more flexible variant of the above are autoregressive flow layers, which are basically defined the same way, only in an autoregressive manner:
\[\begin{equation} \ell\left(x\right)=\ell\left(x_{1},\cdots,x_{d}\right)=\left[\begin{matrix}x_{1}\cdot\exp\left[f\right]+g\\ x_{2}\cdot\exp\left[f\left(x_{1}\right)\right]+g\left(x_{1}\right)\\ x_{3}\cdot\exp\left[f\left(x_{1},x_{2}\right)\right]+g\left(x_{1},x_{2}\right)\\ \vdots\\ x_{d}\cdot\exp\left[f\left(x_{1},\cdots,x_{d-1}\right)\right]+g\left(x_{1},\cdots,x_{d-1}\right) \end{matrix}\right] \end{equation}\]In other words, each coordinate is now a function of all of the coordinates that came before it (the first coordinate just gets a constant scale $f$ and shift $g$, since there is nothing for it to condition on). This is, again, invertible, but the inverse is kind of a pain to write down, so I’ll leave it as it is. The Jacobian is once again triangular, in pretty much the same manner as for the coupling layer.
The downside of autoregressive layers is that they’re usually quite computationally expensive, mostly because one of the two directions (applying the layer or inverting it) has to be computed one coordinate at a time. Also, their implementation is not straightforward: how do you define the functions $f\left(x_{1},\cdots\right)$ and $g\left(x_{1},\cdots\right)$ in a way that they can take any number of inputs?
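For illustration only, here is a deliberately naive sketch that dodges the variable-input problem by zero-padding the prefix to a fixed length and giving each coordinate its own small conditioner. Real implementations (MADE/MAF/IAF-style layers) share a single masked network instead, but the sequential, one-coordinate-at-a-time inverse is the same:

```python
import torch
import torch.nn as nn

class NaiveAutoregressiveAffine(nn.Module):
    """A naive autoregressive affine layer: coordinate i is scaled/shifted
    based on the previous coordinates, zero-padded to a fixed length."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.dim = dim
        # one conditioner per coordinate, mapping the padded prefix to (log-scale, shift)
        self.cond = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))
            for _ in range(dim)
        ])

    def forward(self, x):
        outs = []
        for i in range(self.dim):
            prefix = torch.cat([x[:, :i], torch.zeros(x.shape[0], self.dim - i)], dim=1)
            s, t = self.cond[i](prefix).chunk(2, dim=1)
            outs.append(x[:, i:i + 1] * torch.exp(s) + t)
        return torch.cat(outs, dim=1)

    def inverse(self, y):
        xs, log_det = [], torch.zeros(y.shape[0])
        for i in range(self.dim):  # coordinates must be recovered sequentially
            prefix = torch.cat(xs + [torch.zeros(y.shape[0], self.dim - i)], dim=1)
            s, t = self.cond[i](prefix).chunk(2, dim=1)
            xs.append((y[:, i:i + 1] - t) * torch.exp(-s))
            log_det = log_det - s.squeeze(1)
        return torch.cat(xs, dim=1), log_det
```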
The main shortcoming of normalizing flows is that custom-made layers need to be defined for the whole transformation to be (easily) invertible, and sometimes these layers are also computationally expensive.
One way we can get around the problem of custom layers is by, ironically, taking the number of layers in the transformation to be infinite. In this regime, instead of defining the whole transformation as a composition of a finite number of layers, we need to define a _time-based_ function that smoothly maps from the source distribution to the target. Basically, what we want is something of the following sort:
\[\begin{equation} x\left(t+\Delta t\right)=x\left(t\right)+\Delta t\cdot f_{\theta}\left(x\left(t\right),t\right)\label{eq:cont-flow-euler} \end{equation}\]where $\Delta t$ is a short change in time. So instead of defining the transformation itself, what we want to define is actually how $x$ changes over a short amount of time.
How does this help us? Well, now we can use any function we want to define $f_{\theta}\left(x,t\right)$, as long as it is smooth in $t$ and $x$. We are no longer constrained to very specific constructions of invertible layers.
Equation \eqref{eq:cont-flow-euler}, while being intuitive, does not convey the full meaning of what we want. The continuous flows are defined for infinitesimal values of $\Delta t$, so actually we need to look at:
\[\begin{equation} \lim_{\Delta t\rightarrow0}\frac{x\left(t+\Delta t\right)-x\left(t\right)}{\Delta t}=\frac{dx}{dt}=f_{\theta}\left(x,t\right) \end{equation}\]That is, $f_{\theta}\left(x,t\right)$ is the time derivative of $x$ at time $t$: the instantaneous velocity of the point as it moves from the source distribution towards the target.
To generate data, we start by sampling $x\left(0\right)$ from the source distribution:
\[\begin{equation} x\left(0\right)\sim p\left(z\right) \end{equation}\]After this initialization, we have to basically propagate $x\left(0\right)$ along time to see where we end up. That means we also have to define the amount of time we want to wait, $T$, until we assume that the target distribution is reached. Once we have done that, data is generated by solving:
\[\begin{equation} x\left(T\right)=x\left(0\right)+\intop_{0}^{T}f_{\theta}\left(x\left(\tau\right),\tau\right)d\tau \end{equation}\]The inverse transformation is basically the same, just starting at $x\left(T\right)$ and going backwards:
\[\begin{equation} x\left(0\right)=x\left(T\right)-\intop_{0}^{T}f_{\theta}\left(x\left(\tau\right),\tau\right)d\tau \end{equation}\]Honestly, the above is still pretty abstract. How do you actually compute the integral? Well, one of the simplest ways is to just apply the update from equation \eqref{eq:cont-flow-euler} over and over with small time steps. This is called Euler’s method, and it is a pretty crude, but simple, way of approximating the integral.
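A sketch of this sampling procedure, where `f_theta(x, t)` is assumed to be any network that takes the current state and a time value and returns $dx/dt$:

```python
import torch

def euler_sample(f_theta, x0, T=1.0, n_steps=100):
    """Push base samples x0 = x(0) forward to x(T) with Euler steps,
    approximating x(T) = x(0) + integral of f_theta(x(tau), tau) dtau."""
    dt = T / n_steps
    x = x0
    for k in range(n_steps):
        t = torch.full((x.shape[0], 1), k * dt)
        x = x + dt * f_theta(x, t)
    return x

# The inverse map is the same loop run backwards in time:
# x = x - dt * f_theta(x, t), starting from x(T) with t stepping from T down to 0.
```

In practice one would usually use a fancier ODE solver (an adaptive Runge-Kutta method, for example) instead of plain Euler steps, but the idea is the same.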
The above, terrible notation aside, basically generalizes normalizing flows to a continuum of layers. In the limit of infinitesimally small time steps, the log-likelihood of the continuous flow is also quite similar:
\[\begin{equation} \log p_{\theta}\left(x\left(T\right)\right)=\log p_{z}\left(x\left(0\right)\right)-\intop_{0}^{T}\text{trace}\left[\frac{\partial f_{\theta}\left(x\left(\tau\right),\tau\right)}{\partial x\left(\tau\right)}\right]d\tau \end{equation}\]The trace of the Jacobian plays the same role as the log-determinants in regular normalizing flows: for a single small Euler step, $\log\left\vert \text{det}\left[I+\Delta t\cdot\frac{\partial f_{\theta}}{\partial x}\right]\right\vert \approx\Delta t\cdot\text{trace}\left[\frac{\partial f_{\theta}}{\partial x}\right]$, and summing these contributions over all steps gives the integral above. Of course, this is still going to be quite computationally intensive to calculate in practice.
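For completeness, here is a sketch of evaluating this log-likelihood with Euler steps and an exact trace. This is only meant to show the mechanics: the per-coordinate gradient loop is only reasonable in low dimensions, and actually training through this procedure is where stochastic trace estimators and adjoint methods come into play.

```python
import torch

def euler_log_likelihood(f_theta, x_T, base_dist, T=1.0, n_steps=100):
    """Euler-discretized continuous change of variables: integrate backwards
    from x(T) to x(0) while accumulating the trace of the Jacobian of f_theta."""
    dt = T / n_steps
    x = x_T
    trace_integral = torch.zeros(x.shape[0])
    for k in reversed(range(n_steps)):
        t = torch.full((x.shape[0], 1), (k + 1) * dt)
        x = x.detach().requires_grad_(True)
        f = f_theta(x, t)
        # exact trace of df/dx, one coordinate at a time
        trace = torch.zeros(x.shape[0])
        for i in range(x.shape[1]):
            grad_i = torch.autograd.grad(f[:, i].sum(), x, retain_graph=True)[0]
            trace = trace + grad_i[:, i]
        trace_integral = trace_integral + dt * trace
        x = (x - dt * f).detach()  # one Euler step backwards in time, towards x(0)
    log_pz = base_dist.log_prob(x).sum(dim=1)  # base density at x(0)
    return log_pz - trace_integral
```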
Normalizing flows are a popular class of explicit likelihood generative models. Because the likelihood is baked into the very definition of normalizing flows, you don’t need to approximate it during inference, unlike VAEs or the models in the next few posts. This fact has made normalizing flows very popular for scientific applications, where the likelihood is explicitly needed during inference.
Still, normalizing flows are quite limited. The transformations are either very rigidly defined or computationally expensive to calculate, and both cases lead to difficulties in high dimensions. Beyond the computational price, normalizing flows also tend to suffer from numerical stability issues. These downsides mean that normalizing flows are a bad fit for applications such as computer vision, where high sample quality is king.