Generative Models 0 - What is a Generative Model?

You are given some data (let’s say it isn’t labeled). This data, you assume, has some significance. For instance, it might be data collected from a manufacturing plant that contains information about inefficient production lines. Or it might record the number and species of animals in certain areas, where you suspect that the animal counts depend on some human structure placed next to those areas. Or the data is a set of CT scans of cancer patients, and you think that certain places in the body have higher or lower chances of growing tumors.

Whatever the case, you have data that (again, you assume) contains significant information for a scientific question which you want to answer. To aid you in the process of answering these important questions, you want to somehow simulate the process that created the data. In other words, you want to model the generative process of the creation of the data. After all, if you do so well then maybe, just maybe, you’ll be able to meaningfully solve the problem of interest.


Definition of a Generative Model

To be a bit more precise, given the set of data $\mathcal{D}=\left\{x_{i}\right\}_{i=1}^{N}$ where each point $x_{i}$ is assumed to have been sampled from a distribution $p_{\text{data}}\left(x\right)$, we want to find a parametric distribution such that:

\[\begin{equation} p_{\theta}\left(x\right)\equiv p\left(x;\theta\right)\approx p_{\text{data}}\left(x\right) \end{equation}\]

where $\theta$ are the parameters of the distribution. All of the assumptions and modeling decisions we make regarding the distribution of the data are hidden inside the notation $\theta$.

A Simple Example: Kernel Density Estimation (KDE)

The simplest example of a generative model that I can think of is to just fit a Gaussian to the training data. But usually this won’t give us all that much information (and sometimes it will be hard or even impossible to do).

Beyond the Gaussian, one of the simplest examples of a generative model is kernel density estimation (KDE, also sometimes called Parzen windows). We’ll start by looking at a very basic instance of KDE and then slightly abstract it. This model assumes very little. In fact, the only things we’ll assume are that points close to the observed data should have high density and that the density should decay quickly the further we move from the training points.

So, again, let’s assume we have a dataset $\mathcal{D}=\left\{o_{i}\right\}_{i=1}^{N}$ (I’m calling them $o_{i}$ now so it will be less confusing in the next part). We want something that has high density near training points and decays fast. A base distribution that does this is an isotropic Gaussian distribution! So we can just put a lot of Gaussians centered around the training data:

\[\begin{equation} p_{\beta}\left(x\right)=\frac{1}{N}\sum_{i=1}^{N}\mathcal{N}\left(x\vert \ o_{i},I\beta\right)\label{eq:RBF-KDE} \end{equation}\]

where $\beta>0$ is called the bandwidth; it is shared by all of the points and has to be chosen somehow. This is a valid distribution - you can check to make sure that $p_{\beta}\left(x\right)$ integrates to one.
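To make this concrete, here is a minimal numpy sketch of this density, treating $\beta$ as the variance of each Gaussian component as in the equation above (the function name and the vectorized distance computation are my own choices, not a standard API):

```python
import numpy as np
from scipy.special import logsumexp

def kde_log_density(x, data, beta):
    """Log-density log p_beta(x) of the Gaussian KDE defined above.

    x:    query points, shape (M, d)
    data: observed points o_i, shape (N, d)
    beta: the bandwidth (variance of each Gaussian component)
    """
    M, d = x.shape
    N = data.shape[0]
    # Squared Euclidean distance between every query point and every o_i.
    sq_dists = ((x[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)  # (M, N)
    # log N(x | o_i, beta * I), computed in log-space for numerical stability.
    log_gauss = -0.5 * sq_dists / beta - 0.5 * d * np.log(2 * np.pi * beta)
    # log p_beta(x) = logsumexp_i [log N(x | o_i, beta * I)] - log N
    return logsumexp(log_gauss, axis=1) - np.log(N)
```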

This very basic model “smoothes out” the empirical distribution, defined as:

\[\begin{equation} p_{\text{emp}}\left(x\right)=\frac{1}{N}\sum_{i=1}^{N}\delta\left(x-o_{i}\right) \end{equation}\]

where $\delta\left(\cdot\right)$ is Dirac’s delta, defined as:

\[\begin{equation} \delta\left(x\right)=\begin{cases} \infty & x=0\\ 0 & \text{otherwise} \end{cases}\qquad\quad\intop\delta\left(x\right)dx=1 \end{equation}\]

Anyway, in equation \eqref{eq:RBF-KDE} the smoothing function was a Gaussian, but there are many other possible choices that we could have made. The smoothing function is usually called the kernel, and it has to be non-negative and normalized if we want the KDE approximation to be a valid distribution. Given a kernel $K\left(x,o_{i}\right)$, the general KDE distribution is defined as:

\[\begin{equation} p_{\text{KDE}}\left(x\right)=\frac{1}{N}\sum_{i=1}^{N}K\left(x,o_{i}\right) \end{equation}\]

which is only a valid distribution if, for every $o_{i}$:

\[\begin{equation} K\left(x,o_{i}\right)\geq0\qquad\text{and}\qquad\intop K\left(x,o_{i}\right)dx=1 \end{equation}\]

That said, everyone just uses the RBF kernel, which is the one from equation \eqref{eq:RBF-KDE}. While this seems like a very simple definition, it’s a very good baseline for comparison (which is why I even put it here). This form of KDE is sort of the nearest neighbors equivalent in the world of generative models, since the nearest neighbor of $x$ will typically have the largest impact on its density $p_{\text{KDE}}\left(x\right)$. In this case, the kernel defines the distance function for the nearest neighbors (and that distance is Euclidean when using the RBF kernel).
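To show how little changes when we swap kernels, here is a short continuation of the numpy sketch from above that plugs an arbitrary normalized kernel into the same mixture; the Laplacian kernel below is just one example choice (a product of 1D Laplace densities, so it integrates to one):

```python
def laplacian_kernel(x, o, b=1.0):
    """A product of 1D Laplace densities with scale b; integrates to 1 over x."""
    d = x.shape[-1]
    return np.exp(-np.abs(x - o).sum(axis=-1) / b) / (2.0 * b) ** d

def kde_density(x, data, kernel):
    """General KDE: average a normalized kernel over the data points.

    x: (M, d) query points, data: (N, d) observed points -> densities (M,)
    """
    return kernel(x[:, None, :], data[None, :, :]).mean(axis=1)
```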


Training Generative Models

In general, we will want to train our generative models so that they explain the observed data as well as possible. In practice, training a generative model amounts to defining a way to compare two models and then choosing the parameters that give the better model under said comparison.

Actually, it’s good that I’ve already presented the KDE - that way we have a concrete example of this. Given two possible bandwidths $\beta_{1}$ and $\beta_{2}$, how are we supposed to choose which bandwidth is better suited to use in a KDE for our purposes?

Training, or choosing between models, requires a way to compare generative models, as I’ve already said. How do we decide in practice? This depends on what we’re trying to achieve. It might be that we only want samples $x\sim p_{\theta}$ to be similar to samples from the real data $x\sim p_{\text{data}}$, in which case our criterion might be how hard it is to tell the generated samples apart from the real ones. Or we could say that there is a set of statistics (say the mean, variance, etc.) gleaned from the observed data which we have to match, and that otherwise the distribution should be as general as possible. Or, the most popular option: that $p_{\theta}\left(x\right)$ should be as close as possible to $p_{\text{data}}\left(x\right)$ under some divergence.

Let’s talk about the last one.

Maximum Likelihood Estimation (MLE)

A natural way to evaluate the quality of a model is through the use of some divergence between distributions:

\[\begin{equation} \text{error}\left(\theta\right)=D\left(p_{\text{data}}\left(x\right)\vert \vert p\left(x;\theta\right)\right) \end{equation}\]

Here “divergence” just means some non-negative function $D\left(p\vert \vert q\right)$ which is 0 if and only if $\forall x\quad p(x)=q(x)$, and which grows monotonically with some measure of the difference between $p\left(x\right)$ and $q\left(x\right)$. Basically, a divergence is a sort of distance between distributions (although it need not be symmetric or satisfy the triangle inequality).

The most commonly used divergence is the Kullback-Leibler divergence (KL-divergence) defined as:

\[\begin{align} D_{\text{KL}}\left(p\vert \vert q\right) & =\intop p\left(x\right)\log\frac{p\left(x\right)}{q\left(x\right)}dx=\mathbb{E}_{x\sim p}\left[\log p\left(x\right)-\log q\left(x\right)\right]\\ & =-\mathbb{E}_{p}\left[\log q\left(x\right)\right]-H\left(p\right) \end{align}\]

where $H\left(p\right)$ is called the (differential) entropy of the distribution $p\left(x\right)$.
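For concreteness, the KL-divergence between two univariate Gaussians has a standard closed form, which is handy for building intuition:

\[\begin{equation} D_{\text{KL}}\left(\mathcal{N}\left(\mu_{1},\sigma_{1}^{2}\right)\vert \vert \mathcal{N}\left(\mu_{2},\sigma_{2}^{2}\right)\right)=\log\frac{\sigma_{2}}{\sigma_{1}}+\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2\sigma_{2}^{2}}-\frac{1}{2} \end{equation}\]

Notice that this is 0 exactly when the two Gaussians coincide, and that it is not symmetric in the two distributions.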

We want to compare the data distribution to our model’s distribution:

\[\begin{equation} D_{\text{KL}}\left(p_{\text{data}}\left(x\right)\vert \vert p\left(x;\theta\right)\right)=-\mathbb{E}_{p_{\text{data}}}\left[\log p\left(x;\theta\right)\right]-H\left(p_{\text{data}}\right) \end{equation}\]

Now, $H\left(p_{\text{data}}\right)$ will be impossible to calculate if we just observe some data, since we don’t actually know the function $p_{\text{data}}\left(\cdot\right)$. However, if all we want is to compare two parameterizations $\theta_{1},\theta_{2}\in\Theta$ and determine which is better, we only have to ask:

\[\begin{align} D_{\text{KL}}\left(p_{\text{data}}\left(x\right)\vert \vert p\left(x;\theta_{1}\right)\right) & \stackrel{?}{>}D_{\text{KL}}\left(p_{\text{data}}\left(x\right)\vert \vert p\left(x;\theta_{2}\right)\right)\\ D_{\text{KL}}\left(p_{\text{data}}\left(x\right)\vert \vert p\left(x;\theta_{1}\right)\right)-D_{\text{KL}}\left(p_{\text{data}}\left(x\right)\vert \vert p\left(x;\theta_{2}\right)\right) & \stackrel{?}{>}0\\ -\mathbb{E}_{p_{\text{data}}}\left[\log p\left(x;\theta_{1}\right)\right]+\mathbb{E}_{p_{\text{data}}}\left[\log p\left(x;\theta_{2}\right)\right] & \stackrel{?}{>}0\\ \mathbb{E}_{p_{\text{data}}}\left[\log p\left(x;\theta_{1}\right)\right] & \stackrel{?}{<}\mathbb{E}_{p_{\text{data}}}\left[\log p\left(x;\theta_{2}\right)\right] \end{align}\]

That is, to decide if $p\left(x;\theta_{1}\right)$ is worse than $p\left(x;\theta_{2}\right)$ you don’t even need access to $H\left(p_{\text{data}}\right)$!

The term $\log p\left(x;\theta\right)$ is called the log-likelihood of the model $\theta$. What we saw above is that it is enough to find the $\theta\in\Theta$ that maximizes the expected log-likelihood in order to say that we found the best parameters to describe the distribution (in terms of the KL-divergence):

\[\begin{equation} \theta^{\star}=\arg\min_{\theta\in\Theta}D_{\text{KL}}\left(p_{\text{data}}\left(x\right)\vert \vert p\left(x;\theta\right)\right)=\arg\max_{\theta\in\Theta}\mathbb{E}_{p_{\text{data}}}\left[\log p\left(x;\theta\right)\right] \end{equation}\]

This concept is aptly called maximum likelihood estimation (MLE), because we estimate $p_{\text{data}}\left(x\right)$ with the model $p\left(x;\theta^{\star}\right)$ that maximizes the likelihood.
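As a quick sanity check (my example, not part of the original derivation), suppose we model the observed points with a single univariate Gaussian, $p\left(x;\theta\right)=\mathcal{N}\left(x\vert \mu,\sigma^{2}\right)$. Setting the derivatives of the log-likelihood with respect to $\mu$ and $\sigma^{2}$ to zero gives the familiar estimates:

\[\begin{equation} \mu^{\star}=\frac{1}{N}\sum_{i=1}^{N}x_{i}\qquad\quad\sigma^{2\star}=\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\mu^{\star}\right)^{2} \end{equation}\]

So MLE recovers the sample mean and the (biased) sample variance - which is exactly the “fit a Gaussian” model from the beginning of this post.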

Of course, in the real world calculating the expectation is impossible and we only have access to samples from the data distribution, $\mathcal{D}$. Also, researchers in machine learning are used to talking about losses (or errors). So usually the negative log-likelihood (NLL) loss is used to train generative models:

\[\begin{equation} L\left(\theta\right)=-\frac{1}{N}\sum_{x_{i}\in\mathcal{D}}\log p\left(x_{i};\theta\right)\approx-\mathbb{E}_{p_{\text{data}}}\left[\log p\left(x;\theta\right)\right] \end{equation}\]
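This also answers the earlier question about choosing between $\beta_{1}$ and $\beta_{2}$ for KDE: pick the one with the lower NLL. One caveat worth flagging is that the NLL has to be measured on held-out points, since on the training points themselves the KDE’s NLL is trivially minimized as $\beta\rightarrow0$. A minimal sketch, reusing the kde_log_density function from above (the split sizes and bandwidth values here are arbitrary):

```python
def nll(train, held_out, beta):
    """Average negative log-likelihood of held-out points under the KDE."""
    return -kde_log_density(held_out, train, beta).mean()

rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 2))    # stand-in for real observations
train, val = samples[:400], samples[400:]
beta_1, beta_2 = 0.1, 1.0
best_beta = min([beta_1, beta_2], key=lambda b: nll(train, val, b))
```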

Difficulties with MLE

MLE is the most popular way to train generative models, since it seems like such a natural criterion. After all, even without the whole story about the KL-divergence, maximizing the log-likelihood of the training examples already seems like a good idea on its own. So the criterion itself is simple to understand; the difficulty in training generative models is that, in many cases, exactly calculating the log-likelihood $\log p\left(x;\theta\right)$ is hard.

For instance, the most popular generative models in computer vision work in the following way:

\[\begin{equation} z\sim p\left(z\right)\stackrel{G_{\theta}}{\mapsto}x\in\mathcal{X}\label{eq:latent-space-model} \end{equation}\]

In words, there is a latent code $z\in\mathcal{Z}$ that is sampled according to a distribution $p\left(z\right)$, which is then mapped to the image space $\mathcal{X}$ using an (always complicated) function $G_{\theta}\left(z\right)$. If we write this out as a probability function, we would say that:

\[\begin{equation} p_{\theta}\left(z,x\right)=p\left(z\right)p_{\theta}\left(x\vert z\right) \end{equation}\]

In this case, to calculate the log-likelihood of the model we need to solve the following integral:

\[\begin{equation} p_{\theta}\left(x\right)=\intop p\left(z\right)p_{\theta}\left(x\vert z\right)dz \end{equation}\]

If $G_{\theta}\left(z\right)$ is anything beyond linear, then calculating the above integral will be pretty much impossible.
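To get a feeling for why, here is a naive Monte Carlo estimate of the integral, $p_{\theta}\left(x\right)\approx\frac{1}{S}\sum_{s}p_{\theta}\left(x\vert z_{s}\right)$ with $z_{s}\sim p\left(z\right)$, using a hypothetical toy decoder and an assumed Gaussian observation model $x=G_{\theta}\left(z\right)+\text{noise}$. The estimate is unbiased for $p_{\theta}\left(x\right)$, but in high dimensions almost no sampled $z_{s}$ lands near $x$, so the variance blows up; this is the practical face of the intractability:

```python
import numpy as np
from scipy.special import logsumexp

def G(z, W1, W2):
    """A toy nonlinear decoder: one hidden tanh layer (hypothetical weights)."""
    return np.tanh(z @ W1) @ W2

def naive_log_likelihood(x, W1, W2, sigma=0.1, S=10_000, seed=0):
    """Monte Carlo estimate of log p(x) = log E_{z ~ N(0, I)}[p(x | z)],
    assuming p(x | z) = N(x | G(z), sigma^2 * I)."""
    rng = np.random.default_rng(seed)
    d_z, d_x = W1.shape[0], W2.shape[1]
    z = rng.normal(size=(S, d_z))                  # z_s ~ p(z)
    diff = x[None, :] - G(z, W1, W2)               # (S, d_x)
    log_p_x_given_z = (-0.5 * (diff ** 2).sum(axis=-1) / sigma ** 2
                       - 0.5 * d_x * np.log(2 * np.pi * sigma ** 2))
    # log[(1/S) * sum_s p(x | z_s)], computed stably in log-space
    return logsumexp(log_p_x_given_z) - np.log(S)
```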

The bottom line is that training generative models almost always involves maximizing the log-likelihood, and in almost every case we will need tricks and approximations just to calculate it.

A note about naming

Generative models that have a latent space that is mapped to the data space are sometimes called decoder-based generative models because they use a decoder to describe how data is generated. As I mentioned above, they are very common in computer vision, so now is a good time to get acquainted with some of the terminology.

The naming convention of these generative models is a bit weird. Many times, the distribution over the latent codes $p\left(z\right)$ is called the prior distribution. Seemingly, these models follow from Bayesian statistics. If that were the case, $p_{\theta}\left(x\vert z\right)$ should be called the likelihood, but it isn’t; instead, $p_{\theta}\left(x\vert z\right)$ is usually called the observation model or observation probability or something like that. The distribution $p_{\theta}\left(z\vert x\right)$ is the posterior distribution, and finally $p_{\theta}\left(x\right)$ is either the likelihood, the marginal likelihood, or the evidence. This is a weird mishmash of terms from Bayesian statistics that isn’t fully faithful to the Bayesian interpretation.

The way to think about everything so it will make sense is this: given a specific decoder $G_{\theta}:\mathcal{Z}\rightarrow\mathcal{X}$, which we have no influence over, the distribution over $z$ basically defines a prior distribution over the possible $x$s. However, in the real world we don’t directly see $x$; there’s usually some noise involved in the observation. Think about photos (especially when it’s dark): they tend to have some grain or noise. So, what we observe in the real world is more like $x=G_{\theta}\left(z\right)+\text{noise}$, which explains the distribution $p_{\theta}\left(x\vert z\right)$ and why it’s called the observation model. The name of the probability $p_{\theta}\left(x\right)$ has to match both Bayesian statistics and generative models in general, so it’s called either the evidence or the likelihood, interchangeably.


Conclusion

Generative models, the main focus of this series of posts, are very popular. This popularity comes from their unprecedented, and mostly very recent, performance in generating data in fields such as computer vision and natural language processing. Because of this sudden acceleration in the quality of samples generated from such models, there has been a surge of research into generative models.

The purpose of this primer is to give a good foundation for learning about more complex methods, which I hope will make them accessible. In the next post in this series we’ll explore a really simple generative model, which should give us a starting point for exploring the more complex, state-of-the-art models.


A Linear Model →