Deep Bayes: Variational inference
Introduction
These notes give a fairly systematic walkthrough of several variational inference methods; enjoy. If you repost, please credit:
https://blog.nowcoder.net/n/c0e560ac1acb42f9a88b525da39b9189
Reference: Deep Bayes
Full Bayesian inference
Training Stage
Testing Stage
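In the standard discriminative setup, with training data $(X_{tr}, Y_{tr})$ and model weights $\theta$ (notation assumed here), the two stages read:

$$p(\theta \mid X_{tr}, Y_{tr}) = \frac{p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)}{\int p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)\, d\theta}$$

$$p(y \mid x, X_{tr}, Y_{tr}) = \int p(y \mid x, \theta)\, p(\theta \mid X_{tr}, Y_{tr})\, d\theta$$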
Comment: The denominator in the training stage may be intractable. Posterior distributions can be computed analytically only for simple conjugate models.
Approximate inference
Probabilistic model: $p(x, z \mid \theta)$, with observed variables $x$ and latent variables $z$.
Variational Inference:
Approximates the true posterior with a simpler distribution $q(z)$ from a tractable family
Biased but fast and more scalable
MCMC:
Samples from the unnormalized posterior
Unbiased, but needs a lot of samples
Some mathematical magic:
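In standard notation, for any distribution $q(z)$ the log-evidence decomposes as:

$$\log p(x \mid \theta) = \underbrace{\int q(z) \log \frac{p(x, z \mid \theta)}{q(z)}\, dz}_{\mathcal{L}(q, \theta)} + \underbrace{\int q(z) \log \frac{q(z)}{p(z \mid x, \theta)}\, dz}_{\mathrm{KL}(q(z) \,\|\, p(z \mid x, \theta))}$$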
The first term is the ELBO (evidence lower bound).
The second term is the Kullback-Leibler (KL) divergence.
Variational Inference: ELBO interpretation
Final optimisation problem
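In its usual form, the bound is maximized jointly over $q$ and $\theta$:

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q(z)} \log p(x \mid z, \theta) - \mathrm{KL}\big(q(z) \,\|\, p(z)\big) \to \max_{q, \theta}$$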
The first term is the data (reconstruction) term, the second term is a regularizer.
Mean field approximation
Then we can use the following replacement (a fully factorized $q$) to reformulate the problem:
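In the usual notation, with the latent variables split into groups $z_1, \dots, z_k$ (the grouping itself is model-specific):

$$q(z) = \prod_{j=1}^{k} q_j(z_j)$$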
Substituting this factorization, the optimal update for each factor becomes:
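In common notation, with $\mathbb{E}_{q_{-j}}$ denoting the expectation over all factors except $q_j$:

$$\log q_j(z_j) = \mathbb{E}_{q_{-j}} \log p(x, z) + \mathrm{const}$$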
Algorithm
Initialize the factors $q_j(z_j)$
Iterations:
Update each factor in turn, holding the others fixed (see the sketch below)
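A minimal sketch of this coordinate-ascent scheme on a toy conjugate model (a univariate Gaussian with a normal-gamma prior); the model, hyperparameters, and update formulas are illustrative assumptions, not taken from the original notes:

```python
import numpy as np

# Toy data: x_n ~ N(5, 2^2); we fit the mean-field approximation q(mu) q(tau)
# to the posterior over the mean mu and precision tau by coordinate ascent.
rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=200)
N, xbar = x.size, x.mean()

# Broad normal-gamma prior hyperparameters
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

E_tau = 1.0                                   # initialize E_q[tau]
for _ in range(50):
    # Update q(mu) = N(mu_N, 1 / lam_N), holding q(tau) fixed
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) = Gamma(a_N, b_N), holding q(mu) fixed
    E_mu, E_mu2 = mu_N, mu_N ** 2 + 1.0 / lam_N
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum(x ** 2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2))
    E_tau = a_N / b_N

print("E_q[mu]  ~", mu_N)    # close to the true mean 5
print("E_q[tau] ~", E_tau)   # close to the true precision 1/4
```

Each iteration updates one factor in closed form using the current moments of the other, which is exactly the update rule above.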
Parametric optimization
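A common way to make this concrete (the notation $\lambda$ is assumed here) is to restrict $q$ to a parametric family $q(z \mid \lambda)$ and maximize the lower bound over its parameters with gradient-based methods:

$$\max_{\lambda}\ \mathcal{L}(\lambda, \theta) = \max_{\lambda} \int q(z \mid \lambda) \log \frac{p(x, z \mid \theta)}{q(z \mid \lambda)}\, dz$$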
Inference Summary
Statistical Inference
Continuous latent variables can be regarded as defining a mixture of a continuum of distributions (see the integral below).
The E-step can be done in closed form only in the case of conjugate distributions; otherwise the true posterior is intractable.
Typically, continuous latent variables are used for dimensionality reduction, also known as representation learning.
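In symbols, marginalizing the continuous latent variable mixes a continuum of conditional distributions:

$$p(x \mid \theta) = \int p(x \mid z, \theta)\, p(z)\, dz$$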
Example: PCA model
Consider $x \in \mathbb{R}^D$ and $z \in \mathbb{R}^d$, such that $D \gg d$.
Joint distribution:
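Assuming the usual probabilistic PCA parameterization, the joint factorizes as:

$$p(x, z \mid \theta) = p(x \mid z, \theta)\, p(z) = \mathcal{N}\big(x \mid V z + \mu,\ \sigma^2 I\big)\, \mathcal{N}\big(z \mid 0, I\big)$$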
Here $\theta$ consists of the matrix $V \in \mathbb{R}^{D \times d}$, a $D$-dimensional vector $\mu$, and a scalar $\sigma^2$.
EM-PCA and Mixture of PCA
Joint distribution:
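A standard way to write the mixture case introduces a discrete component index $t$ with weights $\pi_t$ (notation assumed here):

$$p(x, z, t \mid \theta) = \mathcal{N}\big(x \mid V_t z + \mu_t,\ \sigma_t^2 I\big)\, \mathcal{N}\big(z \mid 0, I\big)\, \pi_t$$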
Variational autoencoder
EM for VAE
However, the denominator $p(x_i \mid \theta)$ is still intractable, since the likelihood $p(x_i \mid z_i, \theta)$ is now parameterized by a neural network.
Variational inference
Parametric variational inference
Instead of directly inferring $p(z_i \mid x_i, \theta)$, let us define a flexible variational approximation $q(z_i \mid x_i, \phi)$:
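A common choice is a Gaussian whose moments are produced by an encoder network ($\mu_\phi$ and $\sigma_\phi$ are assumed names for its outputs):

$$q(z_i \mid x_i, \phi) = \mathcal{N}\big(z_i \mid \mu_\phi(x_i),\ \mathrm{diag}(\sigma^2_\phi(x_i))\big)$$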
This additional neural network keeps the approximate posterior tractable while remaining very flexible.
Stochastic optimization
Problem 1: The training set is assumed to be large, which makes full-batch iterations expensive
Problem 2: The integral (expectation) in the ELBO is still intractable
Solution: Compute stochastic gradients by using mini-batching and Monte Carlo estimation
Optimization w.r.t. $\theta$
Mini-batching
However, if we use Monte Carlo estimation:
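Since $q$ does not depend on $\theta$, the gradient moves inside the expectation; mini-batching over a batch $B$ plus sampling $\hat z_i \sim q(z_i \mid x_i, \phi)$ then gives an unbiased estimate (standard form, notation assumed):

$$\nabla_\theta \sum_{i=1}^{N} \mathbb{E}_{q(z_i \mid x_i, \phi)} \log p(x_i \mid z_i, \theta) \approx \frac{N}{|B|} \sum_{i \in B} \nabla_\theta \log p(x_i \mid \hat z_i, \theta)$$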
However, when it comes to $\phi$, it is a different story:
We can no longer move the gradient inside the integral, because the distribution over which the expectation is taken itself depends on $\phi$.
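Writing $f(z)$ for the integrand, differentiation hits the density itself:

$$\nabla_\phi\, \mathbb{E}_{q(z \mid x, \phi)} f(z) = \int f(z)\, \nabla_\phi\, q(z \mid x, \phi)\, dz,$$

which is no longer an expectation with respect to $q$.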
Log-derivative trick
If we apply the trick, it yields:
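In standard notation, the trick is the identity $\nabla_\phi q = q\, \nabla_\phi \log q$, which turns the integral back into an expectation:

$$\nabla_\phi\, \mathbb{E}_{q(z \mid x, \phi)} f(z) = \mathbb{E}_{q(z \mid x, \phi)} \big[ f(z)\, \nabla_\phi \log q(z \mid x, \phi) \big]$$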
Then the expectations can be estimated using Monte Carlo methods.
Log-derivative trick for ELBO
Now consider its first term and apply mini-batching and the log-derivative trick:
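With one sample $\hat z_i \sim q(z_i \mid x_i, \phi)$ per object, the standard (REINFORCE-style) estimator reads:

$$\nabla_\phi\, \mathbb{E}_{q(z_i \mid x_i, \phi)} \log p(x_i \mid z_i, \theta) = \mathbb{E}_{q(z_i \mid x_i, \phi)} \big[ \log p(x_i \mid z_i, \theta)\, \nabla_\phi \log q(z_i \mid x_i, \phi) \big] \approx \log p(x_i \mid \hat z_i, \theta)\, \nabla_\phi \log q(\hat z_i \mid x_i, \phi)$$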
We can prove that the score function $\nabla_\phi \log q(z \mid x, \phi)$ is zero-mean:
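The derivation uses only the fact that $q$ is normalized:

$$\mathbb{E}_{q(z \mid x, \phi)} \nabla_\phi \log q(z \mid x, \phi) = \int q\, \nabla_\phi \log q\, dz = \int \nabla_\phi q\, dz = \nabla_\phi \int q\, dz = \nabla_\phi 1 = 0$$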
REINFORCE
However, the term $\log p(x_i \mid z_i, \theta)$ can be arbitrarily large and negative, which leads to very unstable stochastic gradients.
A partial solution is to use baselines.
Consider a baseline function $b(x_i)$ that does not depend on $z_i$, such that subtracting it does not change the expected gradient:
Recall that the zero-mean property of the score function is exactly what makes such baselines work.
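Because the score is zero-mean, subtracting any $b(x_i)$ that is independent of $z_i$ leaves the estimator unbiased while (for a good choice of $b$) reducing its variance:

$$\nabla_\phi\, \mathbb{E}_{q(z_i \mid x_i, \phi)} \log p(x_i \mid z_i, \theta) = \mathbb{E}_{q(z_i \mid x_i, \phi)} \big[ \big(\log p(x_i \mid z_i, \theta) - b(x_i)\big)\, \nabla_\phi \log q(z_i \mid x_i, \phi) \big]$$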
I am a lazy man.
Reparameterization trick
Consider differentiation of a complex expectation.
Express $z$ as a deterministic function $g(\varepsilon, \phi)$ of a random variable $\varepsilon \sim p(\varepsilon)$ and the parameters $\phi$, and perform the change-of-variables rule.
Then stochastic differentiation is simply:
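In standard form, with $z = g(\varepsilon, \phi)$ and $\varepsilon \sim p(\varepsilon)$ independent of $\phi$:

$$\nabla_\phi\, \mathbb{E}_{q(z \mid x, \phi)} f(z) = \nabla_\phi\, \mathbb{E}_{p(\varepsilon)} f\big(g(\varepsilon, \phi)\big) = \mathbb{E}_{p(\varepsilon)} \nabla_\phi f\big(g(\varepsilon, \phi)\big)$$

For the Gaussian encoder this is simply $z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$.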
I got lazy again~~
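Putting the encoder, the ELBO, and the reparameterization trick together, here is a minimal PyTorch sketch (layer sizes, the Bernoulli decoder, and all names are illustrative assumptions, not taken from the original notes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Gaussian encoder q(z|x, phi) + Bernoulli decoder p(x|z, theta)."""
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)        # mu_phi(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)    # log sigma^2_phi(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                   # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps       # reparameterization trick
        logits = self.dec(z)
        # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)), one-sample Monte Carlo estimate
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
        return -(rec - kl).mean()                    # negative ELBO as the loss

x = torch.rand(32, 784)   # a fake mini-batch of "pixel" vectors in [0, 1]
model = VAE()
loss = model(x)
loss.backward()           # stochastic gradients w.r.t. both theta (decoder) and phi (encoder)
```

A single reparameterized sample per object already gives low-variance stochastic gradients of the negative ELBO with respect to both sets of parameters.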
Conclusion
Good Good Study, Day Day Up
I share various algorithm write-ups and interview experiences from time to time, and I am also learning related distributed-systems technology. Feel free to reach out and exchange ideas.