# P r o j e c t s

N o t e s

## Nonlinear Least Squares

Consider the residual function r:mathbb{R}^n rightarrow mathbb{R}^m, for x in mathbb{R}^n
\begin{align}
r_i (x)=y_i-f(t_i-x),  \qquad (i=1, 2, \cdots, m)
\end{align}
where y_i is the ground truth and t_i is the observed value for a function f:mathbb{R}^{n+1} rightarrow mathbb{R}. Given r, we want to minimize the cost function phi,
\begin{align}
\phi(x)&=\frac{1}{2} r(x)^tr(x) =\frac{1}{2}
\begin{pmatrix} r_1(x) & \cdots & r_m(x) \end{pmatrix}
\begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} \\ &=\frac{1}{2}\left( r_1^2(x)+\cdots +r_m^2(x) \right).
\end{align}
For minimizing phi, the gradient of phi is required.
\begin{equation} \frac{\partial \phi}{\partial x_i} = r_1(x)\frac{\partial r_1}{\partial x_i} + \cdots +  r_m(x)\frac{\partial r_m}{\partial x_i}
\end{equation}
\begin{align}
\Rightarrow \nabla \phi(x) &=
\begin{pmatrix} \frac{\partial \phi}{\partial x_1} \\ \vdots \\ \frac{\partial \phi}{\partial x_n} \end{pmatrix} =
\begin{pmatrix}
r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}} + \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \\
\vdots \\
r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}} + \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}}
\end{pmatrix}  \\ \\ &=
\begin{pmatrix}
\frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\
\  & \vdots \   \\
\frac{\partial r_1}{\partial x_i} & \cdots & \frac{\partial r_m}{\partial x_i} \\
\  & \vdots \   \\
\frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n}
\end{pmatrix}
\begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} =
\color{limegreen}{J(x)^t r(x)}
\end{align}
where J(x) is the Jacobian matrix of r(x). In addtion, the Hessian matrix H_{phi}(x) of phi(x) is as follows:
\begin{align} H_{\phi}(x)&=
\begin{pmatrix}
\frac{\partial^2 \phi}{\partial x_1^2} & \cdots &  \frac{\partial^2 \phi}{\partial x_1 \partial x_n} \\
\  & \vdots & \  \\
\frac{\partial^2 \phi}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 \phi}{\partial x_n^2}
\end{pmatrix} \\ \\ &=
\begin{pmatrix}
\frac{\partial}{\partial \color{tomato}{x_1}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}}+ \cdots +  r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \right] & \cdots & \frac{\partial}{\partial \color{tomato}{x_n}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \right] \\
\  & \vdots & \   \\
\frac{\partial}{\partial \color{tomato}{x_1}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}} \right] & \cdots & \frac{\partial}{\partial \color{tomato}{x_n}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}} \right]
\end{pmatrix} \\ \\ \Rightarrow (H_{\phi})_{ij} &=
\frac{\partial}{\partial \color{tomato}{x_j}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_i}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_i}} \right]  \\ \\ &=
\left( \frac{\partial r_1}{\partial \color{tomato}{x_j}}  \frac{\partial r_1}{\partial \color{limegreen}{x_i}} +
r_1(x) \frac{\partial^2 r_1}{\partial \color{limegreen}{x_i} \partial \color{tomato}{x_j}}\right) + \cdots +
\left( \frac{\partial r_m}{\partial \color{tomato}{x_j}} \frac{\partial r_m}{\partial \color{limegreen}{x_i}} +
r_m(x) \frac{\partial^2 r_m}{\partial \color{limegreen}{x_i} \partial \color{tomato}{x_j}} \right)
\end{align}
where (H_{phi})_{ij} is the (i, j) element of H_{phi}. So it can be rewritten as
\begin{align} H_{\phi} &=
\begin{pmatrix}
\frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_1}\frac{\partial r_m}{\partial x_1} & \cdots &
\frac{\partial r_1}{\partial x_n}\frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_n}\frac{\partial r_m}{\partial x_1} \\
\  & \vdots & \  \\
\frac{\partial r_1}{\partial x_1}\frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_1}\frac{\partial r_m}{\partial x_n} & \cdots &
\frac{\partial r_1}{\partial x_n}\frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_n}\frac{\partial r_m}{\partial x_n}
\begin{pmatrix}
r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_n} \\
\  & \vdots & \  \\
r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_n}
\end{pmatrix} \\ \\ &=
\begin{pmatrix}
\frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\
\  & \vdots & \  \\
\frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n}
\end{pmatrix}
\begin{pmatrix}
\frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \\
\  & \vdots & \  \\
\frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_n}
\end{pmatrix} + \sum_{i=1}^m r_i(x)
\begin{pmatrix}
\frac{\partial^2 r_i}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_1 \partial x_n} \\
\  & \vdots & \  \\
\frac{\partial^2 r_i}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_n \partial x_n}
\end{pmatrix} \\ \\ &=
\color{limegreen}{J(x)^t J(x)+\sum_{i=1}^m r_i(x) H_{r_i}(x)}
\end{align}

Meanwhile, for a convex function f: mathbb{R} rightarrow mathbb{R},
\begin{align} f(x+h) \approx f(x)+f^{\prime}h + \frac{1}{2} f^{\prime \prime}(x)h^2
\end{align}
by Taylor Theorem. It tells us that a good h can make f(x) move to the direction f is decreasing. For finding this value h, we can define the function g for h and find the critical point of g.
\begin{align} g(h) &= h^2\left( \frac{1}{2} f^{\prime \prime}(x) \right) + h f^{\prime}(x) + f(x) \\ \\
g^{\prime}(h) &= h f^{\prime \prime}(x) + f^{\prime}(x)=0 \\ \\
(f^{\prime \prime}(x) > 0)
\end{align}
In other words, x should be moved by h for decreasing f. In a higher dimension, the minimum of f can be found in the same manner. Suppose that f is a function for x in mathbb{R^n} and its Hessian matrix H_f is positive definite. By Taylor Theorem,
\begin{align} f(x+s) \approx f(x) + \nabla f(x)^t s+ \frac{1}{2} s^t H_f(x)s
\end{align}
Following the process in one dimension, x should be moved by s, the solution of
\begin{align} H_f(x) s = -\nabla f(x) \end{align}

Returning to the original problem, phi(x) can be minimized by finding s such that H_{phi}(x)s = - nabla phi (x). More specifically, this linear system is as follows:
\begin{align}  H_{\phi}(x) s = - \nabla \phi (x) \Leftrightarrow
\left( J(x)^tJ(x)+ \sum_{i=1}^m r_i(x)H_{r_i}(x) \right) s = -J(x)^t r(x)
\end{align}
This linear system can be solved by some famous methods.

1. Gauss-Newton Method
This method assumes that sum_{i=1}^m r_i(x)H_{r_i}(x) will be negligible when the residual function r is small. It is possible when the model function f fits the data well so that r_i(x)=y_i-f(t_i, x) can be small. In this case,
\begin{align} J(x)^tJ(x) s \approx -J(x)^t r(x) \end{align}
It recalls the normal equation, so this problem can be viewed as solving the linear system J(x)s=-r(x). However, there are some drawbacks. this problem may fail to converge when it is not started close enough to the real solution. Moreover, if  sum_{i=1}^m r_i(x)H_{r_i}(x) is not small, it cannot be negligible. Finally, the solution could hardly find when H_{phi}(x), J^t(x)J(x), or J(x) is ill-conditioned.

# 2. Levenberg-Marquardt Method

This is useful even when H_{phi}(x), J^t(x)J(x), or J(x) is ill-conditioned, or rank-deficient. It does not ignore the term  sum_{i=1}^m r_i(x)H_{r_i}(x), but replace this by mu I for mu in mathbb{R}.
\begin{align} \left( J(x)^tJ(x) + \mu I \right) s \approx -J(x)^t r(x) \end{align}
This technique, which is called regularization, makes the pseudo-H_f(x) positive definite. Its characteristics is more obvious when it is considered as that of the lower dimension. f(x+h) is decreasing when its  second order derivative is potisitive. This property is corresponding to the Hessian matrix of a function in a higher dimension. If the second order derivative of f is not greater than zero, it can be biased by mu as follows:
\begin{align}
\begin{cases}
f^{\prime \prime}(x) > 0 \Rightarrow h=-\frac{f^{\prime}(x)}{ f^{\prime \prime}(x)} \\ \\
f^{\prime \prime}(x) \leq 0 \Rightarrow  h=-\frac{f^{\prime}(x)}{f^{\prime \prime}(x) + \mu} \quad \text{where} \   f^{\prime \prime}(x) + \mu > 0
\end{cases}
\end{align}
Considering this process and applying it to the higer dimension in the same manner, the following properties are induced.
\begin{equation}
f^{\prime \prime}(x) + \mu > 0 \quad \text{where} \ f^{\prime \prime}(x) \leq 0 \\
\Updownarrow \\
J(x)^tJ(x) + \mu I \   \text{is positive definite} \\ \text{where} \   H_{\phi}(x) \left( \approx J(x)^tJ(x) \right) \text{is NOT positive definite}
\end{equation}
Therefore, it can be said that the regularization term, mu I makes the pseudo-H_f(x), which is  J(x)^tJ(x), positive definite. Moreover, Levenberg-Marquardt method can be interpreted as the weighted combination of Gauss-Newton and gradient descent methods.
\begin{align}
\begin{cases}
J(x)^tJ(x) s \approx -J(x)^t r(x) \quad \left( \mu I = O\right) \quad \text{: Gauss-Newton} \\ \\
s \approx -J(x)^t r(x) = -\nabla \phi(x) \quad \left( \mu I = I- J(x)^tJ(x) \right) \quad \text{: Gradient Descent}
\end{cases}
\end{align}

Meanwhile, there exists a singular value of J(x) close to zero when J(x) is ill-conditioned. Using the fact that J(x) can be diagonalizable,
\begin{align} J(x)^tJ(x) &= (U\Sigma V^t)^t (U\Sigma V^t) = V\Sigma^t U^t U \Sigma V = V \Sigma^2 V \\ \\ \Rightarrow V^t(J^t(x)J(x))V &= \Sigma^2 = \text{diag}(\sigma_1^2, \cdots, \sigma_n^2)
\end{align}
where sigma_i is the i-th singular value of J(x). It means that the eigenvalues of J(x)^tJ(x) are the square of the singular vaules of J(x). In case that J(x) is ill-conditioned, for the eigenvalue lambda which is close to zero and the eigenvector v corresponding to lambda,
\begin{align} J(x)^tJ(x)v = \lambda v \Rightarrow (J(x)^tJ(x)+\mu I)v = \lambda v + \mu v = (\lambda + \mu)v
\end{align}
Note that J(x)^tJ(x)+mu I is symmetric since J(x)^t J(x) is symmetric and it is also positive definite if mu is selected well. Accordingly, the cholesky factorization can be applied to it as follows:
\begin{align} (J(x)^tJ(x)+\mu I)v = (\lambda + \mu)v = \hat{J}(x)^t \hat{J}(x)v
\end{align}
where hat{J}(x)^t hat{J}(x) is positive definite and its all eigenvalues are positive. Then the eigenvalue of hat{J}(x)^t hat{J}(x) is lambda + mu, so the singular value of hat{J}(x) is sqrt{lambda + mu}. Therefore, the original singular value of J(x),  sqrt{lambda}, is increased to sqrt{lambda + mu}, and it helps the ill-conditioned J(x) to be far from this condition.
The corresponding linear least squares problem to Levenberg-Marquardt method is,
\begin{align}
\begin{pmatrix} J(x) \\ \sqrt{\mu} I \end{pmatrix} s \cong
\begin{pmatrix} -r(x) \\ 0 \end{pmatrix}
\end{align}

︎Reference

 Michael T. Heath, Scientific Computing: An Introductory Survey. 2nd Edition, McGraw-Hill Higher Education.

emoy.net