The added inertia acts both as a smoother and an accelerator, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima.

This standard story isn’t wrong, but it fails to explain many important behaviors of momentum.


Momentum proposes the following tweak to gradient descent.

We give gradient descent a short-term memory:

$$
z^{k+1} = \beta z^{k} + \nabla f(w^{k})
$$
$$
w^{k+1} = w^{k} - \alpha z^{k+1}
$$

The change is innocent and costs almost nothing. When $\beta = 0$, we recover gradient descent. But for $\beta = 0.99$ (sometimes 0.999, if things are really bad), this appears to be the boost we need.
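In code, the tweak is a two-line change to the descent loop. Below is a minimal Python/NumPy sketch; the test objective and the particular values of α and β are illustrative placeholders rather than anything prescribed here.

```python
import numpy as np

def momentum_descent(grad_f, w0, alpha=0.01, beta=0.9, iters=500):
    """Gradient descent with momentum: z holds a decaying sum of past gradients."""
    w = np.asarray(w0, dtype=float)
    z = np.zeros_like(w)              # the short-term memory
    for _ in range(iters):
        z = beta * z + grad_f(w)      # z^{k+1} = beta * z^k + grad f(w^k)
        w = w - alpha * z             # w^{k+1} = w^k - alpha * z^{k+1}
    return w

# Illustrative use on a poorly scaled quadratic f(w) = 0.5 * w^T A w (placeholder example)
A = np.diag([1.0, 100.0])
w_min = momentum_descent(lambda w: A @ w, w0=[1.0, 1.0])
```

Setting `beta = 0` in this loop recovers plain gradient descent.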

In fact, momentum can be understood far more precisely if we study it on the right model. This model is rich enough to reproduce momentum’s local dynamics in real problems, and yet simple enough to be understood in closed form.

This balance gives us powerful traction for understanding this algorithm.
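One model that fits this description, and the one assumed in the sketch below, is the convex quadratic $f(w) = \tfrac{1}{2} w^{\mathsf{T}} A w - b^{\mathsf{T}} w$ with $A$ symmetric positive definite. Writing $w^{\star} = A^{-1} b$ and expressing $w^{k} - w^{\star}$ in the eigenbasis of $A$ (eigenvalues $\lambda_1 \leq \dots \leq \lambda_n$), plain gradient descent decouples into $n$ independent scalar recurrences:

$$
x_i^{k+1} = (1 - \alpha \lambda_i)\, x_i^{k}
\qquad \Longrightarrow \qquad
x_i^{k} = (1 - \alpha \lambda_i)^{k}\, x_i^{0}.
$$

Each coordinate contracts (or diverges) geometrically at its own rate $|1 - \alpha \lambda_i|$, which is the sense in which the dynamics can be read off in closed form.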

But it does satisfy some curiously beautiful mathematical properties which scratch a very human itch for perfection and closure. Let's say this for now: momentum is an algorithm for the book.

For most step-sizes, the eigenvectors with the largest eigenvalues converge the fastest. This triggers an explosion of progress in the first few iterations, before things slow down as the smaller eigenvectors' struggles are revealed. You start to get a nagging feeling you're not making as much progress as you should be. The problem could be the optimizer's old nemesis, pathological curvature. Pathological curvature is, simply put, regions of $f$ which aren't scaled properly.

The eigenfeatures are also much more informative. The observations in the above diagram can be justified mathematically. From a statistical point of view, we would like a model which is, in some sense, robust to noise.

We often think of momentum as a means of dampening oscillations and speeding up the iterations, leading to faster convergence. It allows a larger range of step-sizes to be used, and creates its own oscillations. If gradient descent is a man walking down a hill, momentum is a heavy ball rolling down the same hill.
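As a rough numerical check of that step-size claim, here is a minimal Python sketch; the eigenvalue λ, step-size α, and momentum β below are arbitrary placeholder values, not figures from the text. On the one-dimensional quadratic $f(w) = \tfrac{\lambda}{2} w^2$, plain gradient descent diverges once $\alpha\lambda$ exceeds 2, while the heavy-ball iteration keeps converging, through damped oscillations, for $\alpha\lambda$ up to roughly $2 + 2\beta$.

```python
def gd(lam, alpha, iters=60, w0=1.0):
    """Plain gradient descent on the 1-D quadratic f(w) = 0.5 * lam * w**2."""
    w = w0
    for _ in range(iters):
        w -= alpha * lam * w              # w <- w - alpha * f'(w)
    return w

def heavy_ball(lam, alpha, beta, iters=60, w0=1.0):
    """Momentum (heavy ball) on the same quadratic."""
    w, z = w0, 0.0
    for _ in range(iters):
        z = beta * z + lam * w            # accumulate gradient history
        w -= alpha * z
    return w

lam, beta, alpha = 1.0, 0.9, 3.0          # alpha * lam = 3 > 2: too big for plain GD
print(abs(gd(lam, alpha)))                # blows up: |1 - alpha*lam| = 2 > 1
print(abs(heavy_ball(lam, alpha, beta)))  # still shrinks: alpha*lam < 2 + 2*beta = 3.8
```

With these placeholder values the momentum iterates overshoot and change sign as they shrink, which is the sense in which momentum creates oscillations of its own even on a simple bowl.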