SciPy optimisation: Newton-CG vs BFGS vs L-BFGS

I am working on an optimisation problem using Scipy, where I take a flat network of vertices and bonds of size NNxNN, connect two sides of it (i.e., make it periodic), and minimise an energy function so that it curls up to form a cylinder. (See the links below.)

Since I have the function energy(xyz-position) and its gradient, I decided to use the three methods recommended in the Scipy manual -- Newton-CG, BFGS, L-BFGS-B -- and compare how they performed.

I call the optimisation function as follows, replacing 'Newton-CG' with 'BFGS' or 'L-BFGS-B' as appropriate:

```
from scipy.optimize import minimize
res = minimize(energy, xyzInit, method='Newton-CG', jac=energy_der, options={'disp': True})
```
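For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a harness that runs the three methods on the same problem and reports the counters. The energy function is not shown in this post, so SciPy's built-in `rosen` and `rosen_der` are used as a stand-in objective and gradient; the starting point `x0` is likewise an arbitrary assumption.

```python
import time

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Stand-in problem (assumption): the Rosenbrock function replaces the
# network energy, and x0 replaces xyzInit.
x0 = np.full(10, 1.5)

results = {}
for method in ("Newton-CG", "BFGS", "L-BFGS-B"):
    start = time.perf_counter()
    res = minimize(rosen, x0, method=method, jac=rosen_der)
    elapsed = time.perf_counter() - start
    results[method] = res
    # OptimizeResult is a dict subclass, so .get() tolerates counters
    # that a given method or SciPy version does not report.
    print(f"{method:10s} success={res.success} fun={res.fun:.3e} "
          f"nit={res.get('nit')} nfev={res.get('nfev')} "
          f"njev={res.get('njev')} time={elapsed:.3f}s")
```

Swapping `rosen`, `rosen_der` and `x0` for `energy`, `energy_der` and `xyzInit` gives the same table for the actual problem.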

I found the following general behaviour (the output shown is for NN=9, corresponding to a 3*9^2=243-dimensional parameter space):

BFGS systematically failed to find the correct minimum for low NN, and failed to converge at all for large NN. See https://plot.ly/~apal90/162/ for the end result.

```
 NN=9
 Method: BFGS
 Warning: Desired error not necessarily achieved due to precision loss.
 Current function value: 204.465912
 Iterations: 1239
 Function evaluations: 1520
 Gradient evaluations: 1508
 Time taken for minimisation: 340.728140116
```

Newton-CG found the correct minimum for small NN (<=8), but from NN=9 onwards it returned an incorrect minimum (viz., a cylinder squashed at one end), and for higher values it stopped converging altogether. Note: for some reason this behaviour was worse for odd NN. See https://plot.ly/~apal90/164/
```
 NN=9
 Method: Newton-CG
 Optimization terminated successfully.
 Current function value: 7.954412
 Iterations: 49
 Function evaluations: 58
 Gradient evaluations: 1654
 Hessian evaluations: 0
 Time taken for minimisation: 294.203114033
```
L-BFGS-B found the correct minimum, and did so very quickly, for every NN that I tested (up to NN=14). See https://plot.ly/~apal90/160/
```
 NN=9
 Method: L-BFGS-B
 Time taken for minimisation: 36.3749790192
```

Question: Why is L-BFGS-B superior to the other two methods in this case? In particular, why is it so much better than BFGS, when both are quasi-Newton methods that (to my understanding) work in exactly the same manner?

My thoughts on the situation: All three methods make a quadratic approximation at every point x. For this, they need a gradient and a Hessian. If the Hessian is not given, it must be calculated by the algorithm. In our case, where only the gradient is explicitly given, this is calculated numerically by the algorithm at every step. More specifically, what we require is the inverse of the Hessian, and this is a very expensive step, especially in higher dimensions. Now, Newton-CG calculates this inverse Hessian explicitly, hence its longer time requirements. The quasi-Newton methods like BFGS and L-BFGS calculate an approximation to the Hessian (i.e., the curvature) based on the gradient, which is cheaper, and which is also supposedly a better estimate of the curvature about a point. Thus, for quadratic functions, Newton-CG converges faster, whereas for non-quadratic functions, the quasi-Newton methods converge better. L-BFGS is a lower-memory version of BFGS that stores far less data at every step than the full NxN matrix, hence it is faster than BFGS.

This explanation shows a divergence between Newton-CG and the quasi-Newton methods. What it does not explain is the inability of the algorithms to find the true minimum, and especially the disparity between BFGS and L-BFGS, which are both supposed to function in the same manner.

My general hunch on the long convergence times is that the system is non-quadratic (i.e. flat) about the minimum, and thus the algorithm oscillates about without converging. Beyond that, if BFGS and L-BFGS truly work in the same manner, I believe there must be some difference between the convergence tolerance levels of the Scipy algorithms. Otherwise, BFGS and L-BFGS don't really work in the same manner, and the latter probably calculates the Hessian far more accurately.
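One way to test the tolerance hunch is to set the stopping criteria explicitly instead of relying on defaults. The sketch below assumes the documented option names of `scipy.optimize.minimize`: BFGS stops on the gradient norm (`gtol`), while L-BFGS-B stops on either a relative decrease in the function value (`ftol`) or the projected gradient (`gtol`), and keeps `maxcor` correction pairs. Default values can differ between SciPy versions, so check the documentation for yours. The Rosenbrock function is again only a stand-in for the energy.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.full(10, 1.5)  # stand-in starting point (assumption)

# BFGS: termination is controlled by the gradient norm only.
res_bfgs = minimize(rosen, x0, method="BFGS", jac=rosen_der,
                    options={"gtol": 1e-8, "maxiter": 5000})

# L-BFGS-B: terminates on a small relative decrease in f (ftol) or a
# small projected gradient (gtol). maxcor is the number of stored
# correction pairs used to build the Hessian approximation.
res_lbfgsb = minimize(rosen, x0, method="L-BFGS-B", jac=rosen_der,
                      options={"ftol": 1e-14, "gtol": 1e-8,
                               "maxcor": 10, "maxiter": 5000})

for name, res in (("BFGS", res_bfgs), ("L-BFGS-B", res_lbfgsb)):
    grad_norm = np.linalg.norm(res.jac)
    print(f"{name:9s} fun={res.fun:.3e} |grad|={grad_norm:.3e} "
          f"nit={res.nit} message={res.message}")
```

If the two methods still end at different energies once their tolerances are matched, the difference is in the path they take rather than in when they stop.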

References --

http://www.scipy-lectures.org/advanced/mathematical_optimization/#newton-and-quasi-newton-methods

https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization

https://en.wikipedia.org/wiki/Quasi-Newton_method

https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-bfgs.html#optimize-minimize-bfgs

https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb

Your question is missing two important pieces of information: the energy function and the initial guess. The energy function can be convex or non-convex, and smooth, piecewise smooth or discontinuous. This makes it hard to fully answer your question in your context. However, I can explain some key differences between BFGS and L-BFGS-B.

Both methods are iterative methods for solving nonlinear optimization problems. They both approximate the Newton method by using an approximation of the Hessian of the function at every iteration. The key difference with the Newton method is that instead of computing the full Hessian at a specific point, they accumulate the gradients at previous points and use the BFGS formula to put them together as an approximation of the Hessian. Newton and BFGS methods are not guaranteed to converge unless the function has a quadratic Taylor expansion near an optimum.
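To make "use the BFGS formula to put them together" concrete, here is a minimal NumPy sketch of a single BFGS update of the Hessian approximation. The function name `bfgs_update` and the toy quadratic are illustrative choices, not SciPy internals. The point is that one update already makes the approximation agree with the observed change in gradient along the step (the secant condition).

```python
import numpy as np

def bfgs_update(B, s, y):
    """Return the BFGS-updated Hessian approximation.

    B : current approximation (symmetric positive definite)
    s : step taken, x_new - x_old
    y : change in gradient, grad(x_new) - grad(x_old); needs y @ s > 0
    """
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Toy quadratic f(x) = 0.5 * x^T A x, whose gradient is A x (illustrative).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(x):
    return A @ x

x_old = np.array([1.0, 1.0])
x_new = np.array([0.5, 0.2])
s = x_new - x_old
y = grad(x_new) - grad(x_old)

B_new = bfgs_update(np.eye(2), s, y)

# Secant condition: the updated approximation maps the step to the
# observed gradient change.
print(np.allclose(B_new @ s, y))
```

Only gradients and steps enter the update, which is why no Hessian evaluations are needed.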

The original BFGS method folds every gradient difference seen since the initial guess into one dense approximation of the Hessian. There are two problems with this. First, that approximation is a full n-by-n matrix, so the memory it needs grows quadratically with the number of variables. Second, for nonlinear problems, the Hessian at the initial guess is often not representative of the Hessian at the solution. The approximated Hessian will thus be biased until enough gradients are accumulated close to the solution. This can slow down convergence, but in my experience BFGS should still converge with a good line search algorithm for energy functions that have a single local minimum.

L-BFGS is the same as BFGS but with limited memory, which means that after some time, old gradients are discarded to make room for freshly computed ones. This solves the memory problem, and it avoids the bias of the initial gradient. However, depending on the number of gradients kept in memory, the Hessian might never be precisely estimated, which can be another source of bias. This can also slow down convergence, but again, it should still converge with a good line search algorithm for energy functions that have a single local minimum.
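The limited-memory idea can be sketched with the standard two-loop recursion, which applies the inverse-Hessian approximation to a vector using only the last few (step, gradient change) pairs. This is a textbook-style illustration, not SciPy's Fortran implementation; `lbfgs_direction`, the memory size `m` and the diagonal test matrix are all illustrative assumptions.

```python
from collections import deque

import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Approximate (inverse Hessian) @ grad by the two-loop recursion.

    s_list, y_list hold the stored steps and gradient changes, ordered
    from oldest to newest. The search direction is minus the result.
    """
    q = np.array(grad, dtype=float)
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y
    # Scale by an estimate of the inverse curvature from the newest pair.
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to newest.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * (y @ r)
        r += s * (alpha - beta)
    return r

# Keep only the m most recent pairs; older ones fall off automatically.
m = 2
s_hist, y_hist = deque(maxlen=m), deque(maxlen=m)

A = np.diag([1.0, 4.0, 9.0])  # Hessian of a toy quadratic (illustrative)
rng = np.random.default_rng(0)
for _ in range(5):
    s = rng.standard_normal(3)
    s_hist.append(s)
    y_hist.append(A @ s)  # gradient change for a quadratic is A @ s

# The approximation reproduces the newest pair exactly: H @ y == s.
r = lbfgs_direction(y_hist[-1], list(s_hist), list(y_hist))
print(np.allclose(r, s_hist[-1]))
```

With `m` pairs of length n, storage is 2*m*n numbers instead of n*n, and pairs collected far from the current point are dropped.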

L-BFGS-B is the same as L-BFGS but with bound constraints on the input variables. L-BFGS-B will stop optimizing variables that are on the boundary of the domain. Since you did not specify any constraints, this aspect of the algorithm does not apply to your problem.

My hypothesis is that you are trying to solve a smooth but non-convex problem using an initial guess that is far from the solution, and that you end up in a local minimum. Since you mentioned that you start from a flat configuration, I would not be surprised if you start in a singularity that leads to a degenerate Hessian, which can cause trouble for the rest of the optimization. The only difference between BFGS and L-BFGS in your case is that every iteration will compute a slightly different gradient, and that the L-BFGS method happens to follow a path that leads to the global minimum.
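Here is a toy illustration of how a symmetric start can trap a gradient-based method. It says nothing about your actual energy; the double-well function `f`, its gradient and both starting points are made up for the example. Starting exactly on the symmetry line x = 0, the x-component of the gradient is zero, so BFGS is expected to slide down to the saddle point at the origin and stop there, whereas a slightly perturbed start is expected to fall into one of the true minima at x = ±1.

```python
import numpy as np
from scipy.optimize import minimize

def f(p):
    """Double well in x, parabola in y (hypothetical toy energy)."""
    x, y = p
    return (x**2 - 1.0)**2 + y**2

def grad_f(p):
    x, y = p
    return np.array([4.0 * x * (x**2 - 1.0), 2.0 * y])

# Start exactly on the symmetry line x = 0.
symmetric_start = np.array([0.0, 1.0])
res_sym = minimize(f, symmetric_start, method="BFGS", jac=grad_f)

# Same start, nudged slightly off the symmetry line.
perturbed_start = symmetric_start + np.array([1e-2, 0.0])
res_pert = minimize(f, perturbed_start, method="BFGS", jac=grad_f)

print("symmetric start :", res_sym.x, res_sym.fun)
print("perturbed start :", res_pert.x, res_pert.fun)
```

If your flat sheet is a stationary point of this kind, adding a small random displacement to `xyzInit` before calling `minimize` is a cheap thing to try.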