A b o u t   M e      |       P r o j e c t s     |       N o t e s       |       T h e   D a y    ︎ ︎

Nonlinear Least Squares

    Consider the residual function `r:mathbb{R}^n rightarrow mathbb{R}^m`, for `x in mathbb{R}^n` 
\begin{align} r_i (x)=y_i-f(t_i-x),  \qquad (i=1, 2, \cdots, m) \end{align}    where `y_i` is the ground truth and `t_i` is the observed value for a function `f:mathbb{R}^{n+1} rightarrow mathbb{R}`. Given `r`, we want to minimize the cost function `phi`, \begin{align} \phi(x)&=\frac{1}{2} r(x)^tr(x) =\frac{1}{2}\begin{pmatrix} r_1(x) & \cdots & r_m(x) \end{pmatrix} \begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} \\ &=\frac{1}{2}\left( r_1^2(x)+\cdots +r_m^2(x) \right). \end{align}    For minimizing `phi`, the gradient of `phi` is required. \begin{equation} \frac{\partial \phi}{\partial x_i} = r_1(x)\frac{\partial r_1}{\partial x_i} + \cdots +  r_m(x)\frac{\partial r_m}{\partial x_i} \end{equation} \begin{align} \Rightarrow \nabla \phi(x) &=  \begin{pmatrix} \frac{\partial \phi}{\partial x_1} \\ \vdots \\ \frac{\partial \phi}{\partial x_n} \end{pmatrix} = \begin{pmatrix}  r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}} + \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \\ \vdots \\ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}} + \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}} \end{pmatrix}  \\ \\ &= \begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\\  & \vdots \   \\ \frac{\partial r_1}{\partial x_i} & \cdots & \frac{\partial r_m}{\partial x_i} \\ \  & \vdots \   \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{pmatrix} \begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} = \color{limegreen}{J(x)^t r(x)} \end{align} where `J(x)` is the Jacobian matrix of `r(x)`. In addtion, the Hessian matrix `H_{phi}(x)` of `phi(x)` is as follows: \begin{align} H_{\phi}(x)&= \begin{pmatrix}  \frac{\partial^2 \phi}{\partial x_1^2} & \cdots &  \frac{\partial^2 \phi}{\partial x_1 \partial x_n} \\ \  & \vdots & \  \\ \frac{\partial^2 \phi}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 \phi}{\partial x_n^2} \end{pmatrix} \\ \\ &= \begin{pmatrix} \frac{\partial}{\partial \color{tomato}{x_1}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}}+ \cdots +  r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \right] & \cdots & \frac{\partial}{\partial \color{tomato}{x_n}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \right] \\ \  & \vdots & \   \\ \frac{\partial}{\partial \color{tomato}{x_1}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}} \right] & \cdots & \frac{\partial}{\partial \color{tomato}{x_n}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}} \right] \end{pmatrix} \\ \\ \Rightarrow (H_{\phi})_{ij} &= \frac{\partial}{\partial \color{tomato}{x_j}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_i}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_i}} \right]  \\ \\ &= \left( \frac{\partial r_1}{\partial \color{tomato}{x_j}}  \frac{\partial r_1}{\partial \color{limegreen}{x_i}} + r_1(x) \frac{\partial^2 r_1}{\partial \color{limegreen}{x_i} \partial \color{tomato}{x_j}}\right) + \cdots +  \left( \frac{\partial r_m}{\partial \color{tomato}{x_j}} \frac{\partial r_m}{\partial \color{limegreen}{x_i}} +r_m(x) \frac{\partial^2 r_m}{\partial \color{limegreen}{x_i} \partial \color{tomato}{x_j}} \right) \end{align} where `(H_{phi})_{ij}` is the `(i, j)` element of `H_{phi}`. So it can be rewritten as \begin{align} H_{\phi} &= \begin{pmatrix} \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_1}\frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n}\frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_n}\frac{\partial r_m}{\partial x_1} \\ \  & \vdots & \  \\ \frac{\partial r_1}{\partial x_1}\frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_1}\frac{\partial r_m}{\partial x_n} & \cdots &\frac{\partial r_1}{\partial x_n}\frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_n}\frac{\partial r_m}{\partial x_n} \end{pmatrix} \\ \\ &\quad +  \begin{pmatrix} r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_n} \\ \  & \vdots & \  \\ r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_n} \end{pmatrix} \\ \\ &=\begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ \  & \vdots & \  \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{pmatrix}  \begin{pmatrix}\frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \\\  & \vdots & \  \\\frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_n}\end{pmatrix} + \sum_{i=1}^m r_i(x) \begin{pmatrix} \frac{\partial^2 r_i}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_1 \partial x_n} \\ \  & \vdots & \  \\ \frac{\partial^2 r_i}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_n \partial x_n} \end{pmatrix} \\ \\ &= \color{limegreen}{J(x)^t J(x)+\sum_{i=1}^m r_i(x) H_{r_i}(x)} \end{align}

    Meanwhile, for a convex function `f: mathbb{R} rightarrow mathbb{R}`, 
\begin{align} f(x+h) \approx f(x)+f^{\prime}h + \frac{1}{2} f^{\prime \prime}(x)h^2 \end{align} by Taylor Theorem. It tells us that a good `h` can make `f(x)` move to the direction `f` is decreasing. For finding this value `h`, we can define the function `g` for `h` and find the critical point of `g`. \begin{align} g(h) &= h^2\left( \frac{1}{2} f^{\prime \prime}(x) \right) + h f^{\prime}(x) + f(x) \\ \\ g^{\prime}(h) &= h f^{\prime \prime}(x) + f^{\prime}(x)=0 \\ \\h &= -\frac{f^{\prime}(x)}{f^{\prime \prime}(x)} \qquad(f^{\prime \prime}(x) > 0)  \end{align}
    In other words, `x` should be moved by `h` for decreasing `f`. In a higher dimension, the minimum of `f` can be found in the same manner. Suppose that `f` is a function for `x in mathbb{R^n}` and its Hessian matrix `H_f` is positive definite. By Taylor Theorem,
\begin{align} f(x+s) \approx f(x) + \nabla f(x)^t s+ \frac{1}{2} s^t H_f(x)s\end{align}
    Following the process in one dimension, `x` should be moved by `s`, the solution of 
\begin{align} H_f(x) s = -\nabla f(x) \end{align} 

    Returning to the original problem, `phi(x)` can be minimized by finding `s` such that `H_{phi}(x)s = - nabla phi (x)`. More specifically, this linear system is as follows:
\begin{align}  H_{\phi}(x) s = - \nabla \phi (x) \Leftrightarrow \left( J(x)^tJ(x)+ \sum_{i=1}^m r_i(x)H_{r_i}(x) \right) s = -J(x)^t r(x) \end{align}
    This linear system can be solved by some famous methods.

1. Gauss-Newton Method
    This method assumes that `sum_{i=1}^m r_i(x)H_{r_i}(x)` will be negligible when the residual function `r` is small. It is possible when the model function `f` fits the data well so that `r_i(x)=y_i-f(t_i, x)` can be small. In this case, 
\begin{align} J(x)^tJ(x) s \approx -J(x)^t r(x) \end{align}
    It recalls the normal equation, so this problem can be viewed as solving the linear system `J(x)s=-r(x)`. However, there are some drawbacks. this problem may fail to converge when it is not started close enough to the real solution. Moreover, if  `sum_{i=1}^m r_i(x)H_{r_i}(x)` is not small, it cannot be negligible. Finally, the solution could hardly find when `H_{phi}(x)`, `J^t(x)J(x)`, or `J(x)` is ill-conditioned.

2. Levenberg-Marquardt Method

    This is useful even when `H_{phi}(x)`, `J^t(x)J(x)`, or `J(x)` is ill-conditioned, or rank-deficient. It does not ignore the term  `sum_{i=1}^m r_i(x)H_{r_i}(x)`, but replace this by `mu I` for `mu in mathbb{R}`.
\begin{align} \left( J(x)^tJ(x) + \mu I \right) s \approx -J(x)^t r(x) \end{align}
    This technique, which is called regularization, makes the pseudo-`H_f(x)` positive definite. Its characteristics is more obvious when it is considered as that of the lower dimension. `f(x+h)` is decreasing when its  second order derivative is potisitive. This property is corresponding to the Hessian matrix of a function in a higher dimension. If the second order derivative of `f` is not greater than zero, it can be biased by `mu` as follows:
\begin{align} \begin{cases} f^{\prime \prime}(x) > 0 \Rightarrow h=-\frac{f^{\prime}(x)}{ f^{\prime \prime}(x)} \\ \\ f^{\prime \prime}(x) \leq 0 \Rightarrow  h=-\frac{f^{\prime}(x)}{f^{\prime \prime}(x) + \mu} \quad \text{where} \   f^{\prime \prime}(x) + \mu > 0 \end{cases} \end{align}
    Considering this process and applying it to the higer dimension in the same manner, the following properties are induced.
\begin{equation} f^{\prime \prime}(x) + \mu > 0 \quad \text{where} \ f^{\prime \prime}(x) \leq 0 \\ \Updownarrow \\ J(x)^tJ(x) + \mu I \   \text{is positive definite} \\ \text{where} \   H_{\phi}(x) \left( \approx J(x)^tJ(x) \right) \text{is NOT positive definite}  \end{equation}
    Therefore, it can be said that the regularization term, `mu I` makes the pseudo-`H_f(x)`, which is ` J(x)^tJ(x)`, positive definite. Moreover, Levenberg-Marquardt method can be interpreted as the weighted combination of Gauss-Newton and gradient descent methods.
\begin{align} \begin{cases} J(x)^tJ(x) s \approx -J(x)^t r(x) \quad \left( \mu I = O\right) \quad \text{: Gauss-Newton} \\ \\s \approx -J(x)^t r(x) = -\nabla \phi(x) \quad \left( \mu I = I- J(x)^tJ(x) \right) \quad \text{: Gradient Descent} \end{cases} \end{align}
    Meanwhile, there exists a singular value of `J(x)` close to zero when `J(x)` is ill-conditioned. Using the fact that `J(x)` can be diagonalizable,
\begin{align} J(x)^tJ(x) &= (U\Sigma V^t)^t (U\Sigma V^t) = V\Sigma^t U^t U \Sigma V = V \Sigma^2 V \\ \\ \Rightarrow V^t(J^t(x)J(x))V &= \Sigma^2 = \text{diag}(\sigma_1^2, \cdots, \sigma_n^2)  \end{align}
where `sigma_i` is the `i`-th singular value of `J(x)`. It means that the eigenvalues of `J(x)^tJ(x)` are the square of the singular vaules of `J(x)`. In case that `J(x)` is ill-conditioned, for the eigenvalue `lambda` which is close to zero and the eigenvector `v` corresponding to `lambda`,
\begin{align} J(x)^tJ(x)v = \lambda v \Rightarrow (J(x)^tJ(x)+\mu I)v = \lambda v + \mu v = (\lambda + \mu)v \end{align}
    Note that `J(x)^tJ(x)+mu I` is symmetric since `J(x)^t J(x)` is symmetric and it is also positive definite if `mu` is selected well. Accordingly, the cholesky factorization can be applied to it as follows:
\begin{align} (J(x)^tJ(x)+\mu I)v = (\lambda + \mu)v = \hat{J}(x)^t \hat{J}(x)v\end{align}
where `hat{J}(x)^t hat{J}(x)` is positive definite and its all eigenvalues are positive. Then the eigenvalue of `hat{J}(x)^t hat{J}(x)` is `lambda + mu`, so the singular value of `hat{J}(x)` is `sqrt{lambda + mu}`. Therefore, the original singular value of `J(x)`, ` sqrt{lambda}`, is increased to `sqrt{lambda + mu}`, and it helps the ill-conditioned `J(x)` to be far from this condition.
    The corresponding linear least squares problem to Levenberg-Marquardt method is,
\begin{align} \begin{pmatrix} J(x) \\ \sqrt{\mu} I \end{pmatrix} s \cong \begin{pmatrix} -r(x) \\ 0 \end{pmatrix} \end{align}


[1] Michael T. Heath, Scientific Computing: An Introductory Survey. 2nd Edition, McGraw-Hill Higher Education.