
## Nonlinear Least Squares

Consider the residual function $r: \mathbb{R}^n \rightarrow \mathbb{R}^m$, for $x \in \mathbb{R}^n$,
\begin{align} r_i (x)=y_i-f(t_i, x),  \qquad (i=1, 2, \cdots, m) \end{align}    where $y_i$ is the observed value at $t_i$ for a model function $f: \mathbb{R}^{n+1} \rightarrow \mathbb{R}$. Given $r$, we want to minimize the cost function $\phi$, \begin{align} \phi(x)&=\frac{1}{2} r(x)^tr(x) =\frac{1}{2}\begin{pmatrix} r_1(x) & \cdots & r_m(x) \end{pmatrix} \begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} \\ &=\frac{1}{2}\left( r_1^2(x)+\cdots +r_m^2(x) \right). \end{align}    To minimize $\phi$, its gradient is required. \begin{equation} \frac{\partial \phi}{\partial x_i} = r_1(x)\frac{\partial r_1}{\partial x_i} + \cdots +  r_m(x)\frac{\partial r_m}{\partial x_i} \end{equation} \begin{align} \Rightarrow \nabla \phi(x) &=  \begin{pmatrix} \frac{\partial \phi}{\partial x_1} \\ \vdots \\ \frac{\partial \phi}{\partial x_n} \end{pmatrix} = \begin{pmatrix}  r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}} + \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \\ \vdots \\ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}} + \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}} \end{pmatrix}  \\ \\ &= \begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\\  & \vdots \   \\ \frac{\partial r_1}{\partial x_i} & \cdots & \frac{\partial r_m}{\partial x_i} \\ \  & \vdots \   \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{pmatrix} \begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} = \color{limegreen}{J(x)^t r(x)} \end{align} where $J(x)$ is the Jacobian matrix of $r(x)$. 
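As a sanity check, the identity $\nabla \phi(x) = J(x)^t r(x)$ can be verified numerically. The model $f(t, x) = x_1 e^{x_2 t}$ and the data below are made up purely for illustration.

```python
import numpy as np

# Hypothetical model f(t, x) = x0 * exp(x1 * t) with made-up data,
# used to check the identity  grad(phi) = J(x)^T r(x)  numerically.
t = np.array([0.0, 0.5, 1.0, 1.5])
y = np.array([2.0, 2.5, 3.1, 4.0])

def r(x):
    return y - x[0] * np.exp(x[1] * t)        # residuals r_i(x)

def J(x):
    # Jacobian of r: the (i, j) entry is dr_i/dx_j
    return np.column_stack([-np.exp(x[1] * t),
                            -x[0] * t * np.exp(x[1] * t)])

def phi(x):
    return 0.5 * r(x) @ r(x)

x = np.array([1.5, 0.4])
analytic = J(x).T @ r(x)

# central finite differences of phi, one coordinate at a time
eps = 1e-6
numeric = np.array([(phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```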
In addition, the Hessian matrix $H_{\phi}(x)$ of $\phi(x)$ is as follows: \begin{align} H_{\phi}(x)&= \begin{pmatrix}  \frac{\partial^2 \phi}{\partial x_1^2} & \cdots &  \frac{\partial^2 \phi}{\partial x_1 \partial x_n} \\ \  & \vdots & \  \\ \frac{\partial^2 \phi}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 \phi}{\partial x_n^2} \end{pmatrix} \\ \\ &= \begin{pmatrix} \frac{\partial}{\partial \color{tomato}{x_1}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}}+ \cdots +  r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \right] & \cdots & \frac{\partial}{\partial \color{tomato}{x_n}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \right] \\ \  & \vdots & \   \\ \frac{\partial}{\partial \color{tomato}{x_1}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}} \right] & \cdots & \frac{\partial}{\partial \color{tomato}{x_n}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}} \right] \end{pmatrix} \\ \\ \Rightarrow (H_{\phi})_{ij} &= \frac{\partial}{\partial \color{tomato}{x_j}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_i}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_i}} \right]  \\ \\ &= \left( \frac{\partial r_1}{\partial \color{tomato}{x_j}}  \frac{\partial r_1}{\partial \color{limegreen}{x_i}} + r_1(x) \frac{\partial^2 r_1}{\partial \color{limegreen}{x_i} \partial \color{tomato}{x_j}}\right) + \cdots +  \left( \frac{\partial r_m}{\partial \color{tomato}{x_j}} \frac{\partial r_m}{\partial \color{limegreen}{x_i}} +r_m(x) \frac{\partial^2 r_m}{\partial \color{limegreen}{x_i} \partial \color{tomato}{x_j}} \right) \end{align} where $(H_{\phi})_{ij}$ is the $(i, j)$ entry of $H_{\phi}$. 
So it can be rewritten as \begin{align} H_{\phi} &= \begin{pmatrix} \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_1}\frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n}\frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_n}\frac{\partial r_m}{\partial x_1} \\ \  & \vdots & \  \\ \frac{\partial r_1}{\partial x_1}\frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_1}\frac{\partial r_m}{\partial x_n} & \cdots &\frac{\partial r_1}{\partial x_n}\frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_n}\frac{\partial r_m}{\partial x_n} \end{pmatrix} \\ \\ &\quad +  \begin{pmatrix} r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_n} \\ \  & \vdots & \  \\ r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_n} \end{pmatrix} \\ \\ &=\begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ \  & \vdots & \  \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{pmatrix}  \begin{pmatrix}\frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \\\  & \vdots & \  \\\frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_n}\end{pmatrix} + \sum_{i=1}^m r_i(x) \begin{pmatrix} \frac{\partial^2 r_i}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_1 \partial x_n} \\ \  & \vdots & \  \\ \frac{\partial^2 r_i}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_n \partial x_n} 
\end{pmatrix} \\ \\ &= \color{limegreen}{J(x)^t J(x)+\sum_{i=1}^m r_i(x) H_{r_i}(x)} \end{align}
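The Hessian identity $H_{\phi} = J^t J + \sum_i r_i H_{r_i}$ can be checked the same way. A sketch, again using the made-up model $f(t, x) = x_1 e^{x_2 t}$, with the second derivatives of each $r_i$ worked out by hand:

```python
import numpy as np

# Checks  H_phi = J^T J + sum_i r_i * H_{r_i}  against a finite-difference
# Hessian of phi, for the made-up model f(t, x) = x0 * exp(x1 * t).
t = np.array([0.0, 0.5, 1.0, 1.5])
y = np.array([2.0, 2.5, 3.1, 4.0])

def r(x):   return y - x[0] * np.exp(x[1] * t)
def J(x):   return np.column_stack([-np.exp(x[1] * t),
                                    -x[0] * t * np.exp(x[1] * t)])
def phi(x): return 0.5 * r(x) @ r(x)

def H_phi(x):
    H = J(x).T @ J(x)
    for ti, ri in zip(t, r(x)):
        e = np.exp(x[1] * ti)
        # Hessian of the single residual r_i = y_i - x0 * exp(x1 * t_i)
        H_ri = np.array([[0.0,      -ti * e],
                         [-ti * e,  -x[0] * ti**2 * e]])
        H += ri * H_ri
    return H

x, eps = np.array([1.5, 0.4]), 1e-4
num = np.empty((2, 2))
for j, ej in enumerate(np.eye(2)):
    for k, ek in enumerate(np.eye(2)):
        # central finite difference for the (j, k) second partial of phi
        num[j, k] = (phi(x + eps*ej + eps*ek) - phi(x + eps*ej - eps*ek)
                     - phi(x - eps*ej + eps*ek) + phi(x - eps*ej - eps*ek)) / (4 * eps**2)
print(np.allclose(H_phi(x), num, atol=1e-4))   # True
```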

Meanwhile, for a convex function $f: \mathbb{R} \rightarrow \mathbb{R}$,
\begin{align} f(x+h) \approx f(x)+f^{\prime}(x)h + \frac{1}{2} f^{\prime \prime}(x)h^2 \end{align} by Taylor's theorem. This suggests that a well-chosen step $h$ can move $x$ in a direction along which $f$ decreases. To find such an $h$, define the function $g$ of $h$ and find its critical point. \begin{align} g(h) &= h^2\left( \frac{1}{2} f^{\prime \prime}(x) \right) + h f^{\prime}(x) + f(x) \\ \\ g^{\prime}(h) &= h f^{\prime \prime}(x) + f^{\prime}(x)=0 \\ \\h &= -\frac{f^{\prime}(x)}{f^{\prime \prime}(x)} \qquad(f^{\prime \prime}(x) > 0)  \end{align}
In other words, $x$ should be moved by $h$ to decrease $f$. In higher dimensions, the minimum of $f$ can be found in the same manner. Suppose that $f$ is a function of $x \in \mathbb{R}^n$ and its Hessian matrix $H_f$ is positive definite. By Taylor's theorem,
\begin{align} f(x+s) \approx f(x) + \nabla f(x)^t s+ \frac{1}{2} s^t H_f(x)s\end{align}
Following the one-dimensional process, $x$ should be moved by $s$, the solution of
\begin{align} H_f(x) s = -\nabla f(x) \end{align}
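For a quadratic $f$, the step $s$ solving this system lands exactly on the minimizer, which is a quick way to see why it is the right system. A small sketch with an arbitrary positive definite matrix as the Hessian:

```python
import numpy as np

# For f(x) = 0.5 x^T A x - b^T x with A positive definite, one Newton step
# s = -H^{-1} grad f(x) reaches the exact minimizer from any start.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                  # positive definite Hessian
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b                        # gradient of f

x = np.array([5.0, -7.0])                   # arbitrary starting point
s = np.linalg.solve(A, -grad(x))            # Newton step: H_f s = -grad f
x_new = x + s

print(np.allclose(grad(x_new), 0.0))        # True: x_new is the minimizer
```

For a non-quadratic $f$ the quadratic model only holds locally, so the step is repeated until convergence.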

Returning to the original problem, $\phi(x)$ can be minimized by finding $s$ such that $H_{\phi}(x)s = -\nabla \phi(x)$. More specifically, this linear system is as follows:
\begin{align}  H_{\phi}(x) s = - \nabla \phi (x) \Leftrightarrow \left( J(x)^tJ(x)+ \sum_{i=1}^m r_i(x)H_{r_i}(x) \right) s = -J(x)^t r(x) \end{align}
This linear system can be solved by several well-known methods.

### 1. Gauss-Newton Method
This method assumes that $\sum_{i=1}^m r_i(x)H_{r_i}(x)$ is negligible when the residual function $r$ is small. This holds when the model function $f$ fits the data well, so that $r_i(x)=y_i-f(t_i, x)$ is small. In this case,
\begin{align} J(x)^tJ(x) s \approx -J(x)^t r(x) \end{align}
This recalls the normal equations, so the problem can be viewed as solving the linear least squares problem $J(x)s \cong -r(x)$. However, there are some drawbacks. The method may fail to converge when it is not started close enough to the solution. Moreover, if $\sum_{i=1}^m r_i(x)H_{r_i}(x)$ is not small, it cannot be neglected. Finally, the solution is hard to find when $H_{\phi}(x)$, $J(x)^tJ(x)$, or $J(x)$ is ill-conditioned.
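A minimal Gauss-Newton iteration might look as follows. The exponential model and its data are made up; the data are generated from the model itself, so the residual at the solution is zero, exactly the small-residual regime the method assumes, and the start is chosen close to the solution, reflecting the convergence caveat above.

```python
import numpy as np

# Gauss-Newton sketch for the made-up model f(t, x) = x0 * exp(x1 * t).
t = np.linspace(0.0, 1.0, 8)
y = 2.0 * np.exp(0.5 * t)                   # data from true parameters (2, 0.5)

def r(x):  return y - x[0] * np.exp(x[1] * t)
def J(x):  return np.column_stack([-np.exp(x[1] * t),
                                   -x[0] * t * np.exp(x[1] * t)])

x = np.array([1.8, 0.45])                   # start close to the solution
for _ in range(20):
    # each step solves the linear least squares problem J(x) s = -r(x)
    s, *_ = np.linalg.lstsq(J(x), -r(x), rcond=None)
    x = x + s

print(np.allclose(x, [2.0, 0.5]))           # True
```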

### 2. Levenberg-Marquardt Method

This method is useful even when $H_{\phi}(x)$, $J(x)^tJ(x)$, or $J(x)$ is ill-conditioned or rank-deficient. It does not ignore the term $\sum_{i=1}^m r_i(x)H_{r_i}(x)$, but replaces it with $\mu I$ for some $\mu \in \mathbb{R}$.
\begin{align} \left( J(x)^tJ(x) + \mu I \right) s \approx -J(x)^t r(x) \end{align}
This technique, called regularization, makes the approximate Hessian positive definite. Its effect is easier to see in one dimension: the Newton step decreases $f(x+h)$ when the second-order derivative is positive, and this property corresponds to positive definiteness of the Hessian matrix in higher dimensions. If the second-order derivative of $f$ is not greater than zero, it can be biased by $\mu$ as follows:
\begin{align} \begin{cases} f^{\prime \prime}(x) > 0 \Rightarrow h=-\frac{f^{\prime}(x)}{ f^{\prime \prime}(x)} \\ \\ f^{\prime \prime}(x) \leq 0 \Rightarrow  h=-\frac{f^{\prime}(x)}{f^{\prime \prime}(x) + \mu} \quad \text{where} \   f^{\prime \prime}(x) + \mu > 0 \end{cases} \end{align}
Applying the same reasoning to higher dimensions, the following correspondence is induced.
\begin{equation} f^{\prime \prime}(x) + \mu > 0 \quad \text{where} \ f^{\prime \prime}(x) \leq 0 \\ \Updownarrow \\ J(x)^tJ(x) + \mu I \   \text{is positive definite} \\ \text{where} \   H_{\phi}(x) \left( \approx J(x)^tJ(x) \right) \text{is NOT positive definite}  \end{equation}
Therefore, the regularization term $\mu I$ makes the approximate Hessian $J(x)^tJ(x)$ positive definite. Moreover, the Levenberg-Marquardt method can be interpreted as a weighted combination of the Gauss-Newton and gradient descent methods.
\begin{align} \begin{cases} J(x)^tJ(x) s \approx -J(x)^t r(x) \quad \left( \mu I = O\right) \quad \text{: Gauss-Newton} \\ \\s \approx -J(x)^t r(x) = -\nabla \phi(x) \quad \left( \mu I = I- J(x)^tJ(x) \right) \quad \text{: Gradient Descent} \end{cases} \end{align}
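Putting the pieces together, a bare-bones Levenberg-Marquardt iteration might look like this. The model and data are the same made-up exponential fit as before; the doubling/halving rule for $\mu$ is one simple assumed strategy, not the only one used in practice.

```python
import numpy as np

# Levenberg-Marquardt sketch for the made-up model f(t, x) = x0 * exp(x1 * t).
t = np.linspace(0.0, 1.0, 8)
y = 2.0 * np.exp(0.5 * t)                   # data from true parameters (2, 0.5)

def r(x):   return y - x[0] * np.exp(x[1] * t)
def J(x):   return np.column_stack([-np.exp(x[1] * t),
                                    -x[0] * t * np.exp(x[1] * t)])
def phi(x): return 0.5 * r(x) @ r(x)

x, mu = np.array([1.0, 1.0]), 1.0           # rougher start than Gauss-Newton
for _ in range(100):
    Jx, rx = J(x), r(x)
    # regularized step: (J^T J + mu I) s = -J^T r
    s = np.linalg.solve(Jx.T @ Jx + mu * np.eye(2), -Jx.T @ rx)
    if phi(x + s) < phi(x):
        x, mu = x + s, mu * 0.5             # success: lean toward Gauss-Newton
    else:
        mu *= 2.0                           # failure: lean toward gradient descent

print(np.allclose(x, [2.0, 0.5], atol=1e-4))   # True
```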

Meanwhile, when $J(x)$ is ill-conditioned, some singular value of $J(x)$ is close to zero. Using the singular value decomposition $J(x) = U\Sigma V^t$,
\begin{align} J(x)^tJ(x) &= (U\Sigma V^t)^t (U\Sigma V^t) = V\Sigma^t U^t U \Sigma V^t = V \Sigma^t \Sigma V^t \\ \\ \Rightarrow V^t(J(x)^tJ(x))V &= \Sigma^t \Sigma = \text{diag}(\sigma_1^2, \cdots, \sigma_n^2)  \end{align}
where $\sigma_i$ is the $i$-th singular value of $J(x)$. This means that the eigenvalues of $J(x)^tJ(x)$ are the squares of the singular values of $J(x)$. In case $J(x)$ is ill-conditioned, for an eigenvalue $\lambda$ close to zero and the eigenvector $v$ corresponding to $\lambda$,
\begin{align} J(x)^tJ(x)v = \lambda v \Rightarrow (J(x)^tJ(x)+\mu I)v = \lambda v + \mu v = (\lambda + \mu)v \end{align}
Note that $J(x)^tJ(x)+\mu I$ is symmetric since $J(x)^tJ(x)$ is symmetric, and it is also positive definite if $\mu$ is chosen well. Accordingly, the Cholesky factorization can be applied to it as follows:
\begin{align} (J(x)^tJ(x)+\mu I)v = (\lambda + \mu)v = \hat{J}(x)^t \hat{J}(x)v\end{align}
where $\hat{J}(x)^t \hat{J}(x)$ is positive definite and all of its eigenvalues are positive. Then $\lambda + \mu$ is an eigenvalue of $\hat{J}(x)^t \hat{J}(x)$, so $\sqrt{\lambda + \mu}$ is a singular value of $\hat{J}(x)$. Therefore, the original singular value $\sqrt{\lambda}$ of $J(x)$ is increased to $\sqrt{\lambda + \mu}$, which moves the ill-conditioned $J(x)$ away from ill-conditioning.
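The eigenvalue shift can be observed directly. A sketch with an artificially ill-conditioned $J(x)$ assembled from an assumed SVD:

```python
import numpy as np

# Build J with singular values 1 and 1e-6 (nearly rank-deficient), then
# check that adding mu*I shifts every eigenvalue of J^T J by exactly mu.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 6)))
V, _ = np.linalg.qr(rng.standard_normal((2, 2)))
S = np.zeros((6, 2))
S[0, 0], S[1, 1] = 1.0, 1e-6
Jx = U @ S @ V.T

mu = 1e-2
lam = np.linalg.eigvalsh(Jx.T @ Jx)                     # sigma_i^2, ascending
lam_reg = np.linalg.eigvalsh(Jx.T @ Jx + mu * np.eye(2))

print(np.allclose(lam_reg, lam + mu))                   # True
# the regularized matrix is far better conditioned
print(np.linalg.cond(Jx.T @ Jx + mu * np.eye(2)) <
      np.linalg.cond(Jx.T @ Jx))                        # True
```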
The linear least squares problem corresponding to the Levenberg-Marquardt method is
\begin{align} \begin{pmatrix} J(x) \\ \sqrt{\mu} I \end{pmatrix} s \cong \begin{pmatrix} -r(x) \\ 0 \end{pmatrix} \end{align}
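One can check that this augmented system is equivalent to the regularized normal equations $(J(x)^tJ(x) + \mu I)s = -J(x)^t r(x)$: forming the normal equations of the stacked matrix gives $J^tJ + \mu I$ back. A sketch on random data:

```python
import numpy as np

# The stacked least squares problem [J; sqrt(mu) I] s = [-r; 0] has normal
# equations (J^T J + mu I) s = -J^T r, so both must give the same step.
rng = np.random.default_rng(1)
Jx = rng.standard_normal((5, 3))
rx = rng.standard_normal(5)
mu = 0.1

s_normal = np.linalg.solve(Jx.T @ Jx + mu * np.eye(3), -Jx.T @ rx)

A = np.vstack([Jx, np.sqrt(mu) * np.eye(3)])
b = np.concatenate([-rx, np.zeros(3)])
s_aug, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(s_normal, s_aug))         # True
```

Solving the stacked system by QR avoids forming $J(x)^tJ(x)$ explicitly, which is why this formulation is preferred when $J(x)$ is ill-conditioned.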

