This post completes the non-technical introduction to KL-divergence and gradient descent, showing how gradient descent performs on the more complex 6-variable model that we manually solved before.

Previously in this series:

- A non-technical introduction to KL-divergence, complete with a perform-your-own variational inference demo.
- An introduction to gradient descent using KL-divergence.

[Interactive demo: controls for the starting mean and variance of each of the three components, plus the step size and momentum.]

Remember that the step size is how far we move along the gradient in each iteration. A large step size means big steps that could overshoot the goal. A small step size means it will take longer to reach the solution. The momentum is the proportion of the previous movement that is carried over to the next update, and it should be set between 0 and 1.
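For readers who like to see the update written out, here is a minimal sketch of gradient descent with momentum. It is not the code behind the demo: the function name `momentum_descent`, the default values, and the one-variable quadratic objective are all assumptions chosen for illustration.

```python
def momentum_descent(grad, x0, step_size=0.1, momentum=0.9, iterations=300):
    """Minimise a one-variable function given its gradient `grad` (illustrative only)."""
    x = x0
    velocity = 0.0
    for _ in range(iterations):
        # Keep a proportion of the previous movement, then step against the gradient.
        velocity = momentum * velocity - step_size * grad(x)
        x = x + velocity
    return x


# Hypothetical objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = momentum_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With these settings `x_min` should end up very close to 3, the minimum of the hypothetical objective. Setting `momentum=0.0` reduces the loop to plain gradient descent, which makes it easy to compare the two.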

This quick and dirty implementation is surprisingly effective and really starts to show the benefits of using momentum. Even if you set the starting means and variances to be close (but not exactly equal), they will usually separate out and find the solution using \(KL(q,p)\). With \(KL(p,q)\), however, this doesn't work as well: the means converge in the center as the estimate spreads out to cover as much as possible.

The plan for these posts was to introduce KL-divergence and gradient descent by example, through interaction and exploration, and without much technical material. For completeness, the next entry in this series will finally turn to the mathematics behind implementing these examples. In the simpler examples, KL-divergence and gradients were calculated exactly, as the functions were known and easily differentiable. In the more complex examples, the KL values were estimated using uniform sampling.
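As a rough idea of what estimating a KL value by uniform sampling can look like, here is a small sketch. It makes several assumptions: that \(KL(a,b)\) means the integral of \(a(x)\log(a(x)/b(x))\), that "uniform sampling" means drawing points uniformly at random over a fixed interval, and that both densities are positive on that interval. The helper names `gaussian` and `kl_uniform` are hypothetical and are not taken from the demo.

```python
import math
import random


def gaussian(mean, variance):
    """Return the density function of a normal distribution."""
    def pdf(x):
        return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)
    return pdf


def kl_uniform(a, b, lo, hi, n=100_000, seed=0):
    """Estimate the integral of a(x) * log(a(x) / b(x)) over [lo, hi] (assumed convention)."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    total = 0.0
    for _ in range(n):
        x = rng.uniform(lo, hi)
        ax, bx = a(x), b(x)
        if ax > 0.0:  # points where a(x) == 0 contribute nothing; b(x) is assumed positive
            total += ax * math.log(ax / bx)
    # Average of the integrand times the interval width approximates the integral.
    return (hi - lo) * total / n


kl_estimate = kl_uniform(gaussian(0.0, 1.0), gaussian(1.0, 1.0), -10.0, 10.0)
```

For two Gaussians with equal variance the exact value is \((\mu_1 - \mu_2)^2 / (2\sigma^2)\), which is 0.5 here, so `kl_estimate` should land near that. The estimate gets noisier as the interval widens or the number of samples drops.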