Generative Models 1 - A Linear Model


Last post we saw a very general introduction and definition of generative models. In this post we’ll take a look at a specific class of very simple generative models.

A Linear Model (Probabilistic Principal Component Analysis)

For now, I’m staying with the latent space models, or decoder-based generative models, as very very loosely defined in the previous post. We’ll start off with the most basic decoder: a linear transformation. We will also assume that the latent codes are standard normal, so that everything’s nice and easy. Observed data points are now modeled through:

\[\begin{equation} x=Wz+\mu\qquad z\sim\mathcal{N}\left(0,I\right) \end{equation}\]

I’ll also assume that $z\in\mathbb{R}^{m}$ and $x\in\mathbb{R}^{d}$, with $m\ll d$. But if $m<d$ we face a problem: the probability of ever sampling a point that doesn’t lie exactly on the image of this linear transformation is zero, while real-world examples will practically never lie neatly on a hyperplane.

Instead we’re going to have to assume that there was some noise added during the observation of $x$, i.e.:

\[\begin{equation} x=Wz+\mu+\text{noise} \end{equation}\]

The simplest noise we can assume is an isotropic Gaussian (the noise is the same in all directions), which will give us:

\[\begin{equation} p_{\theta}\left(x|z\right)=\mathcal{N}\left(x|\ Wz+\mu,I\varphi^{2}\right) \end{equation}\]

where $\varphi$ is the standard deviation of the noise we assume was added. Conveniently, this can be written in the following form:

\[\begin{equation} x=\mu+Wz+\varphi\cdot\eta\qquad z\sim\mathcal{N}\left(0,I_{m}\right),\:\eta\sim\mathcal{N}\left(0,I_{d}\right) \end{equation}\]

The joint distribution for our model is now very simply:

\[\begin{align} p_{\theta}\left(z,x\right) & =p\left(z\right)p_{\theta}\left(x|z\right)\\ & =\mathcal{N}\left(z|\,0,I_{m}\right)\times\mathcal{N}\left(x|\,\mu+Wz,\:I_{d}\varphi^{2}\right) \end{align}\]

with the set of parameters $\theta=\left\{ \mu\in\mathbb{R}^{d},\,W\in\mathbb{R}^{d\times m},\,\varphi\in\mathbb{R}_{+}\right\}$. That’s $d\cdot\left(m+1\right)+1$ parameters.
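
To make the generative process concrete, here’s a minimal NumPy sketch of drawing a single sample from this model; the dimensions and parameter values below are arbitrary placeholders, not anything the model prescribes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 5                        # observed and latent dimensions (arbitrary)
mu = rng.normal(size=d)             # placeholder parameters theta = (mu, W, phi)
W = rng.normal(size=(d, m))
phi = 0.1

# x = mu + W z + phi * eta,   z ~ N(0, I_m),   eta ~ N(0, I_d)
z = rng.standard_normal(m)
eta = rng.standard_normal(d)
x = mu + W @ z + phi * eta
```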

As defined, the above is the generative-model equivalent of linear regression. This model is also called the probabilistic principal component analysis (pPCA; Tipping & Bishop, 1999) model, for reasons that will become clear in a bit.


Likelihood

We are now dealing with multiplications and additions of Gaussian distributions. This makes calculating the likelihood extremely simple, as it means that the likelihood will also be Gaussian (see my notes on the Gaussian distribution for more information).

Our goal is to maximize the log-likelihood. To do so, we first need to be able to calculate the log-likelihood. Luckily, $x$ is a linear transformation of a Gaussian plus another Gaussian. That is, the marginal $p_{\theta}\left(x\right)$ will also be a Gaussian distribution, which is completely described by the mean and covariance. So let’s calculate the mean and covariance of $x$:

\[\begin{align} \mathbb{E}\left[x\right] & =W\mathbb{E}\left[z\right]+\mu+\varphi\cdot\mathbb{E}\left[\eta\right]=\mu\\ \text{cov}\left[x\right] & =W\text{cov}\left[z\right]W^{T}+\varphi^{2}\text{cov}\left[\eta\right]=WW^{T}+I\varphi^{2} \end{align}\]

So, the likelihood of $x$ under our model is:

\[\begin{equation} p\left(x;\theta\right)=\mathcal{N}\left(x|\;\mu,\:WW^{T}+I\varphi^{2}\right) \end{equation}\]
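
Just as a sanity check, this marginal can be evaluated directly, for example with SciPy. Here is a naive sketch that builds the full $d\times d$ covariance (fine for small $d$; the efficient route is discussed further below):

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_logpdf_naive(x, mu, W, phi):
    """Log-likelihood of x under N(mu, W W^T + phi^2 I); builds the full d x d covariance."""
    d = mu.shape[0]
    cov = W @ W.T + (phi ** 2) * np.eye(d)
    return multivariate_normal(mean=mu, cov=cov).logpdf(x)
```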


Maximizing the Likelihood

The model we described is basically a Gaussian, and the maximum-likelihood (ML) solution for a Gaussian is simple. Given our dataset $\mathcal{D}=\left\{ x_{i}\right\} _{i=1}^{N}$, the best mean is:

\[\begin{equation} \hat{\mu}=\frac{1}{N}\sum_{i=1}^{N}x_{i} \end{equation}\]

The best covariance we can hope for is:

\[\begin{equation} S=\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\hat{\mu}\right)\left(x_{i}-\hat{\mu}\right)^{T} \end{equation}\]

However, the covariance in our model has a specific parameterization. Basically, we want to find $W$ and $\varphi$ such that:

\[\begin{equation} \Sigma_{W}\stackrel{\Delta}{=}WW^{T}+I\varphi^{2}\approx S \end{equation}\]

The best $W$ and $\varphi$ are explicitly related to $S$, and we’ll need the eigenvalue decomposition (EVD) of the data’s covariance to define them:

\[\begin{equation} S=U\Lambda U^{T} \end{equation}\]

where $\Lambda_{ij}=0$ whenever $i\neq j$ and the eigenvalues $\lambda_{i}=\Lambda_{ii}$ are sorted in descending order. The MLE solution for $W$ and $\varphi$ turns out to be (see the original pPCA paper for the derivation):

\[\begin{align} \hat{\varphi}^{2} & =\frac{1}{d-m}\sum_{i=m+1}^{d}\lambda_{i}\\ \hat{W} & =U_{1:m}\left(\Lambda_{1:m}-\hat{\varphi}^{2}I_{m}\right)^{1/2} \end{align}\]

where $U_{1:m}\in\mathbb{R}^{d\times m}$ holds the first $m$ columns of $U$ (the first $m$ eigenvectors) and $\Lambda_{1:m}\in\mathbb{R}^{m\times m}$ is the diagonal matrix of the first $m$ eigenvalues in $\Lambda$.

Essentially, all we need in order to fit this linear model is the mean of the dataset and an SVD of the centered data points: stacking the centered points as the columns of a matrix $X\in\mathbb{R}^{d\times N}$, we have

\[\begin{equation} \frac{1}{\sqrt{N}}X=U\Lambda^{1/2}V^{T} \end{equation}\]

This spares us from calculating the full data covariance $S$, which may sometimes be difficult/impossible to calculate when the dimensions of the data are very very large.
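
Putting the closed-form solution together, here is a sketch of the whole fitting procedure, assuming the data is stored as an $N\times d$ array (rows are data points; the function name is mine, not from any particular library):

```python
import numpy as np

def fit_ppca(X, m):
    """Closed-form MLE for pPCA. X: (N, d) data array, m: latent dimension."""
    N, d = X.shape
    mu = X.mean(axis=0)
    # SVD of the scaled, centered data; s**2 are then the eigenvalues of S
    _, s, Vt = np.linalg.svd((X - mu) / np.sqrt(N), full_matrices=False)
    lam = s ** 2                                # eigenvalues of S, in descending order
    U = Vt.T                                    # corresponding eigenvectors, as columns
    phi2 = lam[m:].sum() / (d - m)              # average variance of the discarded directions
    W = U[:, :m] * np.sqrt(lam[:m] - phi2)      # U_{1:m} (Lambda_{1:m} - phi^2 I_m)^{1/2}
    return mu, W, np.sqrt(phi2)
```

Note that the full covariance $S$ is never formed; only the (scaled) centered data matrix is decomposed.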


Relation to PCA

This linear model, as I mentioned, is called the probabilistic principal component analysis (pPCA) model. The name comes from the fact that we model the distribution as sitting mostly on the hyperplane spanned by the first $m<d$ principal components of the dataset. Of course, data points don’t actually lie exactly on this hyperplane, and this discrepancy is modeled as additional isotropic noise. Simply put: this model is a probabilistic version of PCA. When $\varphi\rightarrow0$, it recovers standard PCA exactly.

The main advantage of this model is its simplicity and its obvious modeling assumptions. The model is simple because it is basically a Gaussian with a low-rank covariance, and what it means for the data to be modeled as a linear transformation of a standard normal distribution is pretty clear. In this sense, pPCA is a very good baseline for comparing other generative models against. Also, it is sometimes surprising just how much can be described by this simple linear function.

This is a pretty good place to discuss models similar to pPCA. Remember, we defined a pPCA as:

\[\begin{equation} \text{pPCA:}\qquad x=\mu+Wz+\eta\qquad\begin{matrix}\eta\sim\mathcal{N}\left(0,I\varphi^{2}\right)\\ z\sim\mathcal{N}\left(0,I\right) \end{matrix} \end{equation}\]

A closely related model is the factor analysis (FA) model, defined as basically the same thing:

\[\begin{equation} \text{FA}:\qquad x=\mu+Wz+\tilde{\eta}\qquad\begin{matrix}\tilde{\eta}\sim\mathcal{N}\left(0,\text{diag}\left(\varphi_{1}^{2},\cdots,\varphi_{d}^{2}\right)\right)\\ z\sim\mathcal{N}\left(0,I\right) \end{matrix} \end{equation}\]

The difference is that FA allows different dimensions to have different amounts of noise, basically taking their scales into account. Funnily enough, this rather small change already means that there is no closed-form solution for the MLE; instead, an expectation-maximization (EM) algorithm has to be used to fit the model.

Another model that should be considered as a baseline is the pPCA mixture model (pPCAMM or MoPPCA), which basically gathers a bunch of pPCAs into one model:

\[\begin{equation} p_{\theta}\left(x\right)=\sum_{k=1}^{K}\pi_{k}\text{pPCA}\left(x|\;\mu_{k},W_{k},\varphi_{k}\right)\qquad\begin{matrix}\forall k\quad\pi_{k}\ge0\\ \sum_{k}\pi_{k}=1 \end{matrix} \end{equation}\]

This is like the difference between a single Gaussian and a Gaussian mixture model (GMM). FAs (like any other distribution) can, of course, also be gathered this way.
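
To illustrate the mixture construction, sampling amounts to first picking a component according to the weights $\pi_k$ and then sampling from that component’s pPCA; a small sketch with placeholder parameter lists:

```python
import numpy as np

def sample_moppca(pis, mus, Ws, phis, rng):
    """Draw one sample from a pPCA mixture with weights pis and per-component (mu_k, W_k, phi_k)."""
    k = rng.choice(len(pis), p=pis)       # pick a component
    d, m = Ws[k].shape
    z = rng.standard_normal(m)
    eta = rng.standard_normal(d)
    return mus[k] + Ws[k] @ z + phis[k] * eta
```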


Practical Considerations

We basically covered everything I wanted to cover with regards to pPCAs. I would be remiss, however, not to give a brief overview of how these models can actually be used in practice.

As stated, $W\in\mathbb{R}^{d\times m}$ with $m\ll d$. If you’re using a pPCA and not a regular Gaussian, then the dimension of the data is probably large. Very large. However, in order to get the likelihood of data points, you’ll need to calculate, store in memory, and even invert the full covariance:

\[\begin{equation} \Sigma_{W}=WW^{T}+I\varphi^{2} \end{equation}\]

This matrix is a $d\times d$ matrix, which is a huge pain when $x$ is… huge. Intuitively, it doesn’t make sense that we need to calculate something in $d$ dimensions if we already know that it is a mostly $m<d$ dimensional object. This intuition is correct.

Almost all operations with Gaussians require inverting the covariance matrix. It turns out that there’s a mathematical equivalence, known as Woodbury’s matrix identity (among other names), that simplifies the inversion of low-rank matrices such as $\Sigma_{W}$ considerably:

\[\begin{align} \underbrace{\left(WW^{T}+I_{d}\varphi^{2}\right)^{-1}}_{d\times d\text{ inversion}} & =\left(I_{d}-W\underbrace{\left(W^{T}W+I_{m}\varphi^{2}\right)^{-1}}_{m\times m\text{ inversion}}W^{T}\right)/\varphi^{2}\\ & =\left(I_{d}-WM^{-1}W^{T}\right)/\varphi^{2} \end{align}\]

Instead of inverting the full $d\times d$ matrix, it turns out that it’s enough to invert the $m\times m$ matrix:

\[\begin{equation} M=W^{T}W+I_{m}\varphi^{2} \end{equation}\]

However, the resulting precision matrix $\Sigma_{W}^{-1}$ is still a $d\times d$ matrix, so on the face of it we haven’t gained all that much by this change (although, believe me, inverting a small matrix instead of a big one is already an improvement).

In fact, most of the time we don’t actually need the inverse of the covariance (also called the precision matrix, by the way) on its own. What we need is the precision matrix times some vector:

\[\begin{equation} \Sigma_{W}^{-1}v=\frac{1}{\varphi^{2}}v-\frac{1}{\varphi^{2}}\underbrace{WM^{-1}}_{\in\mathbb{R}^{d\times m}}\underbrace{W^{T}v}_{\in\mathbb{R}^{m}} \end{equation}\]

Beyond the fact that this takes much less time, the numerical stability of this operation is much better than if we were to invert a $d\times d$ matrix.
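
In code, this trick is a one-liner once $M$ is formed; a minimal sketch that never builds the $d\times d$ covariance (the function name is mine):

```python
import numpy as np

def precision_times(v, W, phi):
    """Compute (W W^T + phi^2 I_d)^{-1} v via the Woodbury identity; only an m x m solve."""
    m = W.shape[1]
    M = W.T @ W + (phi ** 2) * np.eye(m)              # m x m
    return (v - W @ np.linalg.solve(M, W.T @ v)) / phi ** 2
```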

The last thing to note is that the posterior is also completely tractable and equal to:

\[\begin{equation} p_{\theta}\left(z|x\right)=\mathcal{N}\left(z|\;M^{-1}W^{T}\left(x-\mu\right),\:\varphi^{2}M^{-1}\right) \end{equation}\]

The fact that this is easy to calculate means that, given an observed $x$, we can carry out all the operations we need in the latent space with ease. This is not always the case, as we will see in the next parts of this primer.
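
Computing the posterior parameters reuses the same small matrix $M$; a minimal sketch, again with names of my own choosing:

```python
import numpy as np

def posterior_z(x, mu, W, phi):
    """Parameters of p(z|x) = N(M^{-1} W^T (x - mu), phi^2 M^{-1}) for a fitted pPCA."""
    m = W.shape[1]
    M = W.T @ W + (phi ** 2) * np.eye(m)
    mean = np.linalg.solve(M, W.T @ (x - mu))
    cov = (phi ** 2) * np.linalg.inv(M)
    return mean, cov
```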


Conclusion

The pPCA model is one of the simplest generative models one can use for continuous data. It’s basically a Gaussian with most of the probability concentrated around a lower-dimensional projection. Because this model is so simple, it serves as a good baseline. Also, it allows us to solve many tasks in closed form, which is really rare for generative models.

So, where do we proceed? The obvious next step after a linear model is one that isn’t linear. This turns out to complicate matters considerably. In the next post, we’ll take a look at variational methods (such as variational auto-encoders), which give us a framework for training and using non-linear models in inference tasks.

