Deep Bayes: Variational inference
Introduction
These notes give a fairly systematic walkthrough of several variational inference methods; enjoy. If you repost, please credit:
https://blog.nowcoder.net/n/c0e560ac1acb42f9a88b525da39b9189
Reference: Deep Bayes
Full Bayesian inference
Training Stage
Testing Stage
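In the standard discriminative setup, with training data $(X_{tr}, Y_{tr})$ and model weights $\theta$ (notation assumed here), the two stages read:

$$p(\theta \mid X_{tr}, Y_{tr}) = \frac{p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)}{\int p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)\, d\theta}$$

$$p(y \mid x, X_{tr}, Y_{tr}) = \int p(y \mid x, \theta)\, p(\theta \mid X_{tr}, Y_{tr})\, d\theta$$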
Comment: The denominator in the training stage may be intractable. Posterior distributions can be computed analytically only for simple conjugate models.
Approximate inference
Probabilistic model: $p(x, z \mid \theta)$, with observed variables $x$ and latent variables $z$.
Variational Inference:
Approximates the true posterior with a simpler distribution $q(z)$ from a tractable family
Biased but fast and more scalable
MCMC:
Samples from the unnormalized posterior
Unbiased, but needs a lot of samples
Some mathematical magic:
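In standard notation, for any distribution $q(z)$ the log-evidence decomposes as:

$$\log p(x \mid \theta) = \underbrace{\int q(z) \log \frac{p(x, z \mid \theta)}{q(z)}\, dz}_{\mathcal{L}(q, \theta)} + \underbrace{\int q(z) \log \frac{q(z)}{p(z \mid x, \theta)}\, dz}_{\mathrm{KL}(q(z) \,\|\, p(z \mid x, \theta))}$$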
The first term is the ELBO (evidence lower bound).
The second term is the Kullback-Leibler (KL) divergence.
Variational Inference: ELBO interpretation
Final optimisation problem
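In its usual form, the bound is maximized jointly over $q$ and $\theta$:

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q(z)} \log p(x \mid z, \theta) - \mathrm{KL}\big(q(z) \,\|\, p(z)\big) \to \max_{q, \theta}$$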
The first term is the data (reconstruction) term, the second term is a regularizer.
Mean field approximation
Then we can use the following replacement (a fully factorized $q$) to reformulate the problem:
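In the usual notation, with the latent variables split into groups $z_1, \dots, z_k$ (the grouping itself is model-specific):

$$q(z) = \prod_{j=1}^{k} q_j(z_j)$$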
Substituting this factorization, the optimal update for each factor becomes:
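In common notation, with $\mathbb{E}_{q_{-j}}$ denoting the expectation over all factors except $q_j$:

$$\log q_j(z_j) = \mathbb{E}_{q_{-j}} \log p(x, z) + \mathrm{const}$$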
Algorithm
Initialize the factors $q_j(z_j)$
Iterations:
Update each factor in turn, holding the others fixed (see the sketch below)
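A minimal sketch of this coordinate-ascent scheme on a toy conjugate model (a univariate Gaussian with a normal-gamma prior); the model, hyperparameters, and update formulas are illustrative assumptions, not taken from the original notes:

```python
import numpy as np

# Toy data: x_n ~ N(5, 2^2); we fit the mean-field approximation q(mu) q(tau)
# to the posterior over the mean mu and precision tau by coordinate ascent.
rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=200)
N, xbar = x.size, x.mean()

# Broad normal-gamma prior hyperparameters
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

E_tau = 1.0                                   # initialize E_q[tau]
for _ in range(50):
    # Update q(mu) = N(mu_N, 1 / lam_N), holding q(tau) fixed
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) = Gamma(a_N, b_N), holding q(mu) fixed
    E_mu, E_mu2 = mu_N, mu_N ** 2 + 1.0 / lam_N
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum(x ** 2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2))
    E_tau = a_N / b_N

print("E_q[mu]  ~", mu_N)    # close to the true mean 5
print("E_q[tau] ~", E_tau)   # close to the true precision 1/4
```

Each iteration updates one factor in closed form using the current moments of the other, which is exactly the update rule above.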
Parametric optimization
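A common way to make this concrete (the notation $\lambda$ is assumed here) is to restrict $q$ to a parametric family $q(z \mid \lambda)$ and maximize the lower bound over its parameters with gradient-based methods:

$$\max_{\lambda}\ \mathcal{L}(\lambda, \theta) = \max_{\lambda} \int q(z \mid \lambda) \log \frac{p(x, z \mid \theta)}{q(z \mid \lambda)}\, dz$$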
Inference Summary
Statistical Inference
Continuous latent variables can be regarded as defining a mixture of a continuum of distributions (see the integral below).
The E-step can be done in closed form only in the case of conjugate distributions; otherwise the true posterior is intractable.
Typically, continuous latent variables are used for dimensionality reduction, also known as representation learning.
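In symbols, marginalizing the continuous latent variable mixes a continuum of conditional distributions:

$$p(x \mid \theta) = \int p(x \mid z, \theta)\, p(z)\, dz$$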
Example: PCA model
Consider $x \in \mathbb{R}^D$ and $z \in \mathbb{R}^d$, such that $D \gg d$.
Joint distribution:
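Assuming the usual probabilistic PCA parameterization, the joint factorizes as:

$$p(x, z \mid \theta) = p(x \mid z, \theta)\, p(z) = \mathcal{N}\big(x \mid V z + \mu,\ \sigma^2 I\big)\, \mathcal{N}\big(z \mid 0, I\big)$$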
Here $\theta$ consists of the matrix $V \in \mathbb{R}^{D \times d}$, a $D$-dimensional vector $\mu$, and a scalar $\sigma^2$.
EM-PCA and Mixture of PCA
Joint distribution:
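A standard way to write the mixture case introduces a discrete component index $t$ with weights $\pi_t$ (notation assumed here):

$$p(x, z, t \mid \theta) = \mathcal{N}\big(x \mid V_t z + \mu_t,\ \sigma_t^2 I\big)\, \mathcal{N}\big(z \mid 0, I\big)\, \pi_t$$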
Variational autoencoder
EM for VAE
However, the denominator $p(x_i \mid \theta)$ is still intractable, since the likelihood $p(x_i \mid z_i, \theta)$ is now parameterized by a neural network.
Variational inference
Parametric variational inference
Instead of directly inferring $p(z_i \mid x_i, \theta)$, let us define a flexible variational approximation $q(z_i \mid x_i, \phi)$:
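A common choice is a Gaussian whose moments are produced by an encoder network ($\mu_\phi$ and $\sigma_\phi$ are assumed names for its outputs):

$$q(z_i \mid x_i, \phi) = \mathcal{N}\big(z_i \mid \mu_\phi(x_i),\ \mathrm{diag}(\sigma^2_\phi(x_i))\big)$$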
This additional neural network keeps the approximate posterior tractable while remaining very flexible.
Stochastic optimization
Problem 1: The training set is assumed to be large, which makes full-batch iterations expensive
Problem 2: The integral (expectation) in the ELBO is still intractable
Solution: Compute stochastic gradients by using mini-batching and Monte Carlo estimation
Optimization w.r.t. $\theta$
Mini-batching
However, if we use Monte Carlo estimation:
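Since $q$ does not depend on $\theta$, the gradient moves inside the expectation; mini-batching over a batch $B$ plus sampling $\hat z_i \sim q(z_i \mid x_i, \phi)$ then gives an unbiased estimate (standard form, notation assumed):

$$\nabla_\theta \sum_{i=1}^{N} \mathbb{E}_{q(z_i \mid x_i, \phi)} \log p(x_i \mid z_i, \theta) \approx \frac{N}{|B|} \sum_{i \in B} \nabla_\theta \log p(x_i \mid \hat z_i, \theta)$$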
However, when it comes to $\phi$, it is a different story:
We can no longer move the gradient inside the integral, because the distribution over which the expectation is taken itself depends on $\phi$.
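Writing $f(z)$ for the integrand, differentiation hits the density itself:

$$\nabla_\phi\, \mathbb{E}_{q(z \mid x, \phi)} f(z) = \int f(z)\, \nabla_\phi\, q(z \mid x, \phi)\, dz,$$

which is no longer an expectation with respect to $q$.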
Log-derivative trick
If we apply the trick, it yields:
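In standard notation, the trick is the identity $\nabla_\phi q = q\, \nabla_\phi \log q$, which turns the integral back into an expectation:

$$\nabla_\phi\, \mathbb{E}_{q(z \mid x, \phi)} f(z) = \mathbb{E}_{q(z \mid x, \phi)} \big[ f(z)\, \nabla_\phi \log q(z \mid x, \phi) \big]$$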
Then the expectations can be estimated using Monte Carlo methods.
Log-derivative trick for ELBO
Now consider its first term and apply mini-batching and the log-derivative trick:
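With one sample $\hat z_i \sim q(z_i \mid x_i, \phi)$ per object, the standard (REINFORCE-style) estimator reads:

$$\nabla_\phi\, \mathbb{E}_{q(z_i \mid x_i, \phi)} \log p(x_i \mid z_i, \theta) = \mathbb{E}_{q(z_i \mid x_i, \phi)} \big[ \log p(x_i \mid z_i, \theta)\, \nabla_\phi \log q(z_i \mid x_i, \phi) \big] \approx \log p(x_i \mid \hat z_i, \theta)\, \nabla_\phi \log q(\hat z_i \mid x_i, \phi)$$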
We can prove that the score function $\nabla_\phi \log q(z \mid x, \phi)$ is zero-mean:
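The derivation uses only the fact that $q$ is normalized:

$$\mathbb{E}_{q(z \mid x, \phi)} \nabla_\phi \log q(z \mid x, \phi) = \int q\, \nabla_\phi \log q\, dz = \int \nabla_\phi q\, dz = \nabla_\phi \int q\, dz = \nabla_\phi 1 = 0$$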
REINFORCE
However, the term $\log p(x_i \mid z_i, \theta)$ can be arbitrarily large and negative, which leads to very unstable stochastic gradients.
A partial solution is to use baselines.
Consider a baseline function $b(x_i)$ that does not depend on $z_i$, such that subtracting it does not change the expected gradient:
Recall that the zero-mean property of the score function is exactly what makes such baselines work.
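Because the score is zero-mean, subtracting any $b(x_i)$ that is independent of $z_i$ leaves the estimator unbiased while (for a good choice of $b$) reducing its variance:

$$\nabla_\phi\, \mathbb{E}_{q(z_i \mid x_i, \phi)} \log p(x_i \mid z_i, \theta) = \mathbb{E}_{q(z_i \mid x_i, \phi)} \big[ \big(\log p(x_i \mid z_i, \theta) - b(x_i)\big)\, \nabla_\phi \log q(z_i \mid x_i, \phi) \big]$$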
I am a lazy man.
Reparameterization trick
Consider differentiation of a complex expectation.
Express $z$ as a deterministic function $g(\varepsilon, \phi)$ of a random variable $\varepsilon \sim p(\varepsilon)$ and the parameters $\phi$, and perform the change-of-variables rule.
Then stochastic differentiation is simply:
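In standard form, with $z = g(\varepsilon, \phi)$ and $\varepsilon \sim p(\varepsilon)$ independent of $\phi$:

$$\nabla_\phi\, \mathbb{E}_{q(z \mid x, \phi)} f(z) = \nabla_\phi\, \mathbb{E}_{p(\varepsilon)} f\big(g(\varepsilon, \phi)\big) = \mathbb{E}_{p(\varepsilon)} \nabla_\phi f\big(g(\varepsilon, \phi)\big)$$

For the Gaussian encoder this is simply $z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$.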
I got lazy again~~
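Putting the encoder, the ELBO, and the reparameterization trick together, here is a minimal PyTorch sketch (layer sizes, the Bernoulli decoder, and all names are illustrative assumptions, not taken from the original notes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Gaussian encoder q(z|x, phi) + Bernoulli decoder p(x|z, theta)."""
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)        # mu_phi(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)    # log sigma^2_phi(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                   # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps       # reparameterization trick
        logits = self.dec(z)
        # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)), one-sample Monte Carlo estimate
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
        return -(rec - kl).mean()                    # negative ELBO as the loss

x = torch.rand(32, 784)   # a fake mini-batch of "pixel" vectors in [0, 1]
model = VAE()
loss = model(x)
loss.backward()           # stochastic gradients w.r.t. both theta (decoder) and phi (encoder)
```

A single reparameterized sample per object already gives low-variance stochastic gradients of the negative ELBO with respect to both sets of parameters.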
Conclusion
Good Good Study, Day Day Up
I share various algorithm write-ups and interview experiences from time to time, and I am also learning related distributed-systems technology. Feel free to reach out and exchange ideas.