
Optimizers in PyPose - Gauss-Newton and Levenberg-Marquardt

July 14, 2023 · SLAM, Robotics

In deep learning, we usually use first-order optimizers like Stochastic Gradient Descent, Adam, or RMSProp.

However, in SLAM problems, we use Bundle Adjustment to jointly optimize camera poses and landmark coordinates in real time. Naive first order optimizers are not efficient enough (they require many iterations to converge) for this situation.

In this post, I will introduce the concept of second order optimizers. However, since a second order optimizer requires us to derive the Hessian matrix of the optimization target, it is usually impractical to use one directly. Therefore, we will use the Gauss-Newton and Levenberg-Marquardt optimizers instead.

Optimization Problem

The optimization problem, in general, can be represented in the form of

$$x^* = \arg\min_{x} f(x)$$

That is, we want to find some optimal input $x^*$ such that it minimizes the given expression $f(x)$.

Naive Optimization

When the function $f$ is simple, we can find its Jacobian matrix (first order derivative) $\mathbf{J}$ and Hessian matrix (second order derivative) $\mathbf{H}$ easily.

$$\mathbf{J} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix} \quad \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & & \vdots \\ \vdots & & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

We can then solve $\mathbf{J}(x) = \mathbf{0}$. For every solution $x$, if the Hessian matrix at $x$ is positive definite, then a local minimum is found.
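To make this concrete, here is a small sketch for a quadratic target $f(x) = \frac{1}{2}x^\top A x - b^\top x$ (the matrix $A$ and vector $b$ below are made-up examples): setting the gradient $Ax - b$ to zero gives a linear system, and the Hessian is $A$ itself.

```python
import numpy as np

# Hypothetical quadratic objective f(x) = 0.5 x^T A x - b^T x
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite; this is the Hessian
b = np.array([1.0, 1.0])

# Gradient: J(x) = A x - b; solving J(x) = 0 gives the stationary point
x_star = np.linalg.solve(A, b)   # x_star ≈ [0.2, 0.4]

# Hessian is positive definite => the stationary point is a minimum
assert np.all(np.linalg.eigvalsh(A) > 0)
```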

Iterative Optimization

However, when the function $f$ is complex or $x$ lives in a very high dimension (~10k in sparse 3D reconstruction problems, or ~10M in typical neural network models), it is practically impossible to solve $\mathbf{J}(x) = \mathbf{0}$ analytically.

In this case, we need to use iterative optimization. The general algorithm for such an approach can be summarized as follows:

  1. For some initial guess value $x_0$, set $x \gets x_0$

  2. While true:

    1. Use the Taylor expansion of $f$ around $x$.

      If using a second order optimizer, we have

      $$\hat{f}(x + \Delta x) = f(x) + \mathbf{J}(x)\Delta x + \frac{1}{2}\Delta x^\top \mathbf{H}(x)\Delta x$$

      If using a first order optimizer, we have

      $$\hat{f}(x + \Delta x) = f(x) + \mathbf{J}(x)\Delta x$$
    2. Solve for $\Delta x^*$ such that

    $$\Delta x^* = \arg\min_{\Delta x} \hat{f}(x + \Delta x)$$
    3. Update $x \gets x + \Delta x^*$
    4. If $\|\Delta x^*\|_2 < \varepsilon$, the solution "converges"; break out of the loop
  3. Return $x$
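In Python, the skeleton of this loop might look like the following sketch, using a plain first order step with a fixed step size (the toy objective and gradient are made-up examples):

```python
import numpy as np

def iterative_optimize(f, jac, x0, step=0.1, eps=1e-8, max_iter=10000):
    """Generic iterative optimization loop: repeat until the update is tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # First order model: f(x + dx) ~ f(x) + J(x) dx, so step against the gradient
        dx = -step * jac(x)
        x = x + dx
        if np.linalg.norm(dx) < eps:  # ||dx||_2 < eps => converged
            break
    return x

# Toy objective f(x) = (x0 - 1)^2 + (x1 + 2)^2, minimized at (1, -2)
f = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2
jac = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])

x_opt = iterative_optimize(f, jac, x0=[0.0, 0.0])
```

Swapping the `dx = ...` line for a second order step turns the same loop into Newton's method, as discussed below.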

First Order Optimizers

When using a first order optimizer, we only use the first order derivative of $f$ to calculate $\Delta x^*$. Then, we have

$$\Delta x^* = \arg\min_{\Delta x}\left(f(x) + \mathbf{J}(x)\Delta x\right) = \arg\min_{\Delta x} \mathbf{J}(x)\Delta x$$

This linear model decreases fastest against the gradient direction (it is unbounded below, so in practice the step length is controlled by a learning rate), giving $\Delta x^* = -\mathbf{J}^\top(x)$. This is aligned with the definition of naive Stochastic Gradient Descent (SGD).

Intuitively, we can interpret first order optimization as locally approximating the optimization target $f$ as a plane of the form $\mathbf{J}x$.

[Figure: First order Taylor approximation of $\sin(x) + \cos(y) + 2$ at $(2, 2)$. Generated by GPT-4 with code interpreter.]

Such optimizers and their variants like Adam, AdamW, and RMSProp are widely used in the field of deep learning and are supported by popular libraries like PyTorch.
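As a sketch of how these variants refine the basic rule, the Adam update keeps running averages of the gradient and its square, then takes a per-coordinate adaptive step. This is a minimal NumPy sketch of the published update rule, not PyTorch's implementation; the hyperparameters are the common defaults except for the learning rate.

```python
import numpy as np

def adam_step(x, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum plus per-coordinate adaptive step size."""
    m = beta1 * m + (1 - beta1) * grad        # 1st moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Minimize the toy objective f(x) = x^2, whose gradient is 2x
x = np.array([5.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```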

Why not First Order Optimizer?

While first order optimizers can support large scale optimization problems, they generally require more iterations to converge, since linearization (first order Taylor expansion) is not a good approximation.

In applications like SLAM, where bundle adjustment needs to run in real time, first order optimizers cannot fulfill our needs.

Therefore, we have the second order optimizers.

Second Order Optimizers

Second order optimizers, specifically Newton's method, use the second order Taylor expansion as the local approximation of the optimization target:

$$\hat{f}(x + \Delta x) = f(x) + \mathbf{J}(x)\Delta x + \frac{1}{2}\Delta x^\top \mathbf{H}(x)\Delta x$$

Then we have

$$\Delta x^* = \arg\min_{\Delta x} \hat{f}(x + \Delta x) = \arg\min_{\Delta x}\left(\mathbf{J}(x)\Delta x + \frac{1}{2}\Delta x^\top \mathbf{H}(x)\Delta x\right)$$

Solving

$$\frac{\partial}{\partial \Delta x}\left(\mathbf{J}(x)\Delta x + \frac{1}{2}\Delta x^\top \mathbf{H}(x)\Delta x\right) = \mathbf{0}$$

We have $\mathbf{H}(x)\Delta x^* = -\mathbf{J}^\top(x)$.

Intuitively, second order optimizers like Newton's method use a paraboloid to locally approximate the function surface.

[Figure: Second order Taylor approximation of $\sin(x) + \cos(y) + 2$ at $(2, 2)$. Generated by GPT-4 with code interpreter.]
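The derivation above can be sketched directly: each iteration solves $\mathbf{H}(x)\Delta x = -\mathbf{J}^\top(x)$ for the update. The test function below (the classic Rosenbrock function, with hand-derived gradient and Hessian) is an illustrative choice, and real solvers would exploit sparsity rather than calling a dense solve.

```python
import numpy as np

def newton(jac, hess, x0, eps=1e-10, max_iter=50):
    """Newton's method: solve H(x) dx = -J(x)^T at every iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        dx = np.linalg.solve(hess(x), -jac(x))  # second order step
        x = x + dx
        if np.linalg.norm(dx) < eps:
            break
    return x

# Rosenbrock function f = (1 - x)^2 + 100 (y - x^2)^2, minimized at (1, 1)
def jac(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
        200 * (x[1] - x[0] ** 2),
    ])

def hess(x):
    return np.array([
        [2 - 400 * (x[1] - 3 * x[0] ** 2), -400 * x[0]],
        [-400 * x[0], 200.0],
    ])

x_opt = newton(jac, hess, x0=[-1.2, 1.0])
```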

Why not Second Order Optimizer?

A second order optimizer provides a much better local approximation of the original function. Hence, a solver with a second order optimizer can converge in far fewer iterations than a first order one.

However, it is often not practical to calculate the $\mathbf{H}$ of $f$. If $x \in \mathbb{R}^d$, then $\mathbf{H}$ will have $d^2$ elements.

Pseudo-second-order Optimizers

Gauss-Newton Optimizer

The Gauss-Newton optimizer requires the expression being optimized to be in sum-of-squares form. That is, the optimization target must have the form

$$\mathbf{R}(x) = \sum f(x)^2$$

Then, consider the first order approximation of $f$ at $x$:

$$\hat{f}(x + \Delta x) \approx f(x) + \mathbf{J}(x)\Delta x$$

$$\begin{aligned} \Delta x^* &= \arg\min_{\Delta x} \mathbf{R}(x + \Delta x)\\ &\approx \arg\min_{\Delta x} (f(x) + \mathbf{J}(x)\Delta x)^2\\ &= \arg\min_{\Delta x} f^2(x) + 2f(x)\mathbf{J}(x)\Delta x + (\mathbf{J}(x)\Delta x)^\top (\mathbf{J}(x)\Delta x) \end{aligned}$$

Since the term inside $\arg\min$ is a square, the optimization target must be convex. Hence, we have $\Delta x = \Delta x^*$ when $\frac{\partial (f(x) + \mathbf{J}(x)\Delta x)^2}{\partial \Delta x} = \mathbf{0}$.

Hence, we have

$$\mathbf{J}^\top(x) \mathbf{J}(x) \Delta x^* = -\mathbf{J}^\top(x) f(x)$$

where we can solve for Δx\Delta x^*.
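Putting the normal equations to work, here is a sketch of Gauss-Newton on a small least-squares problem: fitting $y = e^{at + b}$ to data, with residuals $f(x) = e^{a t_i + b} - y_i$. The model and the true parameters $(a, b) = (0.3, 0.1)$ are made-up examples, not anything from PyPose.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, eps=1e-10, max_iter=100):
    """Gauss-Newton: linearize residuals, solve J^T J dx = -J^T f each step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        f = residual(x)                          # residual vector
        J = jacobian(x)                          # Jacobian of the residuals
        dx = np.linalg.solve(J.T @ J, -J.T @ f)  # normal equations
        x = x + dx
        if np.linalg.norm(dx) < eps:
            break
    return x

# Noise-free data from y = exp(a*t + b) with a = 0.3, b = 0.1
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.3 * t + 0.1)

residual = lambda x: np.exp(x[0] * t + x[1]) - y
jacobian = lambda x: np.stack([t * np.exp(x[0] * t + x[1]),   # d r / d a
                               np.exp(x[0] * t + x[1])],      # d r / d b
                              axis=1)

x_opt = gauss_newton(residual, jacobian, x0=[0.0, 0.0])
```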

Problem of Gauss-Newton Optimizer

In production environments, $\mathbf{J}$ may not be full rank. This causes the coefficient matrix $\mathbf{J}^\top \mathbf{J}$ on the left hand side to be only positive semi-definite (rank deficient). In this case, the equation system is singular and we cannot solve for $\Delta x^*$ reliably.

Also, in the Gauss-Newton method, the $\Delta x^*$ we solve for may be very large. Since we only use the first order Taylor expansion, a large $\Delta x^*$ usually indicates a poor approximation of $f(x + \Delta x^*)$. In these cases, the residual $\mathbf{R}$ may even increase as we update $x \gets x + \Delta x^*$.

Levenberg-Marquardt Optimizer

The Levenberg-Marquardt optimizer is a modified version of the Gauss-Newton optimizer.

The LM optimizer uses a metric $\rho$ to evaluate the quality of the approximation:

$$\rho(\Delta x) = \frac{f(x + \Delta x)}{f(x) + \mathbf{J}(x)\Delta x}$$

When the approximation is accurate, we have $\rho(\Delta x) \approx 1$.

The LM optimizer then uses this measure of approximation quality to control the "trust region". The trust region represents a domain where the first order approximation of $f(x)$ is acceptable.

The update vector $\Delta x$ must lie in the trust region. Hence, the entire optimization problem is formulated as

$$\arg\min_{\Delta x} (f(x) + \mathbf{J}(x)\Delta x)^2 \quad \text{s.t. } \|D\Delta x\| \leq \mu$$

where $D$ is a diagonal matrix constraining $\Delta x$ to an ellipsoidal domain.

Usually, $D$ is configured as $\mathrm{diag}(\mathbf{J}^\top(x)\mathbf{J}(x))$. This allows the update step to move further along directions with smaller gradient.
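A minimal sketch of this scheme: instead of enforcing the trust-region constraint exactly, a common formulation adds a damping term $\lambda\,\mathrm{diag}(\mathbf{J}^\top\mathbf{J})$ to the Gauss-Newton normal equations, decreasing $\lambda$ when a step reduces the residual (good approximation) and increasing it when it does not. This is an illustrative NumPy sketch of that idea, not PyPose's implementation; the exponential-fit problem is the same made-up example as before.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, lam=1e-3, eps=1e-10, max_iter=200):
    """LM sketch: Gauss-Newton with adaptive damping lambda * diag(J^T J)."""
    x = np.asarray(x0, dtype=float)
    cost = np.sum(residual(x) ** 2)
    for _ in range(max_iter):
        f, J = residual(x), jacobian(x)
        JTJ = J.T @ J
        # Damped normal equations; diag(J^T J) shapes the ellipsoidal trust region
        A = JTJ + lam * np.diag(np.diag(JTJ))
        dx = np.linalg.solve(A, -J.T @ f)
        new_cost = np.sum(residual(x + dx) ** 2)
        if new_cost < cost:            # good step: accept, trust the model more
            x, cost = x + dx, new_cost
            lam *= 0.5
            if np.linalg.norm(dx) < eps:
                break
        else:                          # bad step: reject, shrink the trust region
            lam *= 10.0
    return x

# Same exponential fit: data from y = exp(a*t + b) with a = 0.3, b = 0.1
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.3 * t + 0.1)
residual = lambda x: np.exp(x[0] * t + x[1]) - y
jacobian = lambda x: np.stack([t * np.exp(x[0] * t + x[1]),
                               np.exp(x[0] * t + x[1])], axis=1)

x_opt = levenberg_marquardt(residual, jacobian, x0=[1.0, 1.0])
```

Unlike plain Gauss-Newton, the damped system stays solvable even when $\mathbf{J}^\top\mathbf{J}$ is rank deficient, and rejected steps keep the residual from increasing.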