optax.scale_by_lbfgs

Contents

optax.scale_by_lbfgs#

optax.scale_by_lbfgs(memory_size: int = 10, scale_init_precond: bool = True) optax.GradientTransformation[source]#

Scales updates by L-BFGS.

L-BFGS is a quasi-Newton method that multiplies the update (gradient) with an approximation of the inverse Hessian. This algorithm does not need access to the Hessian, as this approximation is constructed from the gradient evaluations seen during optimization. L-BFGS is a limited-memory variant of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. The BFGS algorithm requires storing a matrix of size \(p \times p\) with \(p\) the dimension of the parameters. The limited variant circuments this issue by computing the approximation of the inverse using only \(m\) (memory_size) past differences of parameters/gradients. Namely, the approximation of the Hessian inverse is denoted \(P_k = P_{k, k}\), where

\[\begin{align*} P_{k, j+1} & = V_j^\top P_{k, j} V_j + \rho_j \delta w_j \delta w_j^\top \quad \text{for} \ j \in \{k-m, \ldots, k-1\}\\ P_{k, k-m} & = \gamma_k I \\ V_k & = I - \rho_k \delta u_k \delta w_k^\top \\ \rho_k & = 1/(\delta u_k^\top \delta w_k) \\ \delta w_k & = w_{k+1} - w_k \\ \delta u_k & = u_{k+1} - u_k \\ \gamma_k & = \begin{cases} (\delta w_{k-1}^\top \delta u_{k-1}) / (\delta u_{k-1}^\top \delta u_{k-1}) & \text{if} \ \texttt{scale\_init\_hess} \\ 1 & \text{otherwise} \end{cases}, \end{align*}\]

for \(u_k\) the gradients/updates at iteration \(k\), \(w_k\) the parameters at iteration \(k\).

The formula for updating \(P_k\) is obtained by computing the optimal preconditioning matrix subject to some secant condition, see references for more details. Computing \(P_k u_k\) can be done by a sequence of vector operations using past differences of parameters and gradients stored in a memory buffer.

The present function just outputs the LBFGS direction \(P_k u_k\). It can be chained with a linesearch ensuring sufficient decrease and low curvature, such as a zoom linesearch. The linesearch computes a stepsize \(\eta_k\), such that the updated parameters (using optax.apply_updates()) take the form \(w_{k+1} = w_k - \eta_k P_k u_k\).

Parameters:
  • memory_size โ€“ number of past parameters, gradients/updates to keep in memory to approximate the Hessian inverse.

  • scale_init_precond โ€“ whether to use a scaled identity as the initial preconditioner, see formula of \(\gamma_k\) above.

Returns:

A optax.GradientTransformation object.

References

Algorithms 7.4, 7.5 (page 199) of Nocedal et al, Numerical Optimization, 1999

Liu et al., On the limited memory BFGS method for large scale optimization, 1989.

Note

We initialize the scaling of the identity as a capped reciprocal of the gradient norm. This avoids wasting linesearch iterations for the first step by taking into account the magnitude of the gradients. In other words, we constrain the trust-region of the first step to an Euclidean ball of radius 1 at the first iteration. The choice of \(\gamma_0\) is not detailed in the references above, so this is a heuristic choice.