Nonlinear Least Squares



    Consider the residual function `r:mathbb{R}^n rightarrow mathbb{R}^m` defined, for `x in mathbb{R}^n`, by
\begin{align}
r_i (x)=y_i-f(t_i, x),  \qquad (i=1, 2, \cdots, m)
\end{align}
    where `(t_i, y_i)` are the observed data points and `f:mathbb{R}^{n+1} rightarrow mathbb{R}` is the model function. Given `r`, we want to minimize the cost function `phi`,
\begin{align}
\phi(x)&=\frac{1}{2} r(x)^tr(x) =\frac{1}{2}
\begin{pmatrix} r_1(x) & \cdots & r_m(x) \end{pmatrix}
\begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} \\ &=\frac{1}{2}\left( r_1^2(x)+\cdots +r_m^2(x) \right).
\end{align}
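    To make the definitions concrete, the following is a minimal Python sketch of `r` and `phi`, assuming a hypothetical exponential model `f(t, x)=x_1 e^{x_2 t}` and made-up data `(t_i, y_i)`; both the model and the data are illustrative only.
```python
# A minimal sketch: residual and cost for a hypothetical exponential model
# f(t, x) = x[0] * exp(x[1] * t) with made-up data (t_i, y_i).
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0])   # data points t_i
y = np.array([2.0, 0.7, 0.3, 0.1])   # observed values y_i

def f(t, x):
    """Model function f(t, x) with parameters x in R^2."""
    return x[0] * np.exp(x[1] * t)

def r(x):
    """Residual vector r_i(x) = y_i - f(t_i, x), so r: R^2 -> R^4."""
    return y - f(t, x)

def phi(x):
    """Cost phi(x) = (1/2) r(x)^t r(x)."""
    rx = r(x)
    return 0.5 * rx @ rx
```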
    To minimize `phi`, its gradient is required.
\begin{equation} \frac{\partial \phi}{\partial x_i} = r_1(x)\frac{\partial r_1}{\partial x_i} + \cdots +  r_m(x)\frac{\partial r_m}{\partial x_i}
\end{equation}
\begin{align}
\Rightarrow \nabla \phi(x) &= 
\begin{pmatrix} \frac{\partial \phi}{\partial x_1} \\ \vdots \\ \frac{\partial \phi}{\partial x_n} \end{pmatrix} =
\begin{pmatrix} 
r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}} + \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \\
\vdots \\
r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}} + \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}}
\end{pmatrix}  \\ \\ &=
\begin{pmatrix}
\frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\
\  & \vdots \   \\
\frac{\partial r_1}{\partial x_i} & \cdots & \frac{\partial r_m}{\partial x_i} \\
\  & \vdots \   \\
\frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n}
\end{pmatrix}
\begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} =
\color{limegreen}{J(x)^t r(x)}
\end{align}
where `J(x)` is the Jacobian matrix of `r(x)`. In addition, the Hessian matrix `H_{phi}(x)` of `phi(x)` is as follows:
\begin{align} H_{\phi}(x)&=
\begin{pmatrix} 
\frac{\partial^2 \phi}{\partial x_1^2} & \cdots &  \frac{\partial^2 \phi}{\partial x_1 \partial x_n} \\
\  & \vdots & \  \\
\frac{\partial^2 \phi}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 \phi}{\partial x_n^2}
\end{pmatrix} \\ \\ &=
\begin{pmatrix}
\frac{\partial}{\partial \color{tomato}{x_1}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}}+ \cdots +  r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \right] & \cdots & \frac{\partial}{\partial \color{tomato}{x_n}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_1}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_1}} \right] \\
\  & \vdots & \   \\
\frac{\partial}{\partial \color{tomato}{x_1}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}} \right] & \cdots & \frac{\partial}{\partial \color{tomato}{x_n}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_n}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_n}} \right]
\end{pmatrix} \\ \\ \Rightarrow (H_{\phi})_{ij} &=
\frac{\partial}{\partial \color{tomato}{x_j}} \left[ r_1(x)\frac{\partial r_1}{\partial \color{limegreen}{x_i}}+ \cdots + r_m(x)\frac{\partial r_m}{\partial \color{limegreen}{x_i}} \right]  \\ \\ &=
\left( \frac{\partial r_1}{\partial \color{tomato}{x_j}}  \frac{\partial r_1}{\partial \color{limegreen}{x_i}} +
r_1(x) \frac{\partial^2 r_1}{\partial \color{limegreen}{x_i} \partial \color{tomato}{x_j}}\right) + \cdots + 
\left( \frac{\partial r_m}{\partial \color{tomato}{x_j}} \frac{\partial r_m}{\partial \color{limegreen}{x_i}} +
r_m(x) \frac{\partial^2 r_m}{\partial \color{limegreen}{x_i} \partial \color{tomato}{x_j}} \right)
\end{align}
where `(H_{phi})_{ij}` is the `(i, j)` element of `H_{phi}`. Thus `H_{phi}` can be rewritten as
\begin{align} H_{\phi} &=
\begin{pmatrix}
\frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_1}\frac{\partial r_m}{\partial x_1} & \cdots &
\frac{\partial r_1}{\partial x_n}\frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_n}\frac{\partial r_m}{\partial x_1} \\
\  & \vdots & \  \\
\frac{\partial r_1}{\partial x_1}\frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_1}\frac{\partial r_m}{\partial x_n} & \cdots &
\frac{\partial r_1}{\partial x_n}\frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_n}\frac{\partial r_m}{\partial x_n}
\end{pmatrix} \\ \\ &\quad + 
\begin{pmatrix}
r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_n} \\
\  & \vdots & \  \\
r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_n}
\end{pmatrix} \\ \\ &=
\begin{pmatrix}
\frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\
\  & \vdots & \  \\
\frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n}
\end{pmatrix} 
\begin{pmatrix}
\frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \\
\  & \vdots & \  \\
\frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_n}
\end{pmatrix} + \sum_{i=1}^m r_i(x)
\begin{pmatrix}
\frac{\partial^2 r_i}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_1 \partial x_n} \\
\  & \vdots & \  \\
\frac{\partial^2 r_i}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_n \partial x_n}
\end{pmatrix} \\ \\ &=
\color{limegreen}{J(x)^t J(x)+\sum_{i=1}^m r_i(x) H_{r_i}(x)}
\end{align}
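    Continuing the Python sketch, the gradient `J(x)^t r(x)` and the Hessian `J(x)^tJ(x)+sum_{i=1}^m r_i(x)H_{r_i}(x)` can be written down explicitly for the illustrative exponential model; the Jacobian and the per-residual Hessians below are hand-derived for that model only.
```python
# Gradient and Hessian of phi for the illustrative model, reusing t, y, r, phi above.
def J(x):
    """Jacobian of r(x): J[i, j] = d r_i / d x_j for r_i = y_i - x[0]*exp(x[1]*t_i)."""
    e = np.exp(x[1] * t)
    return np.column_stack([-e, -x[0] * t * e])

def grad_phi(x):
    """grad phi(x) = J(x)^t r(x)."""
    return J(x).T @ r(x)

def H_ri(x, i):
    """Hessian of a single residual r_i(x) = y_i - x[0]*exp(x[1]*t[i])."""
    e = np.exp(x[1] * t[i])
    return np.array([[0.0,       -t[i] * e],
                     [-t[i] * e, -x[0] * t[i] ** 2 * e]])

def H_phi(x):
    """H_phi(x) = J(x)^t J(x) + sum_i r_i(x) H_{r_i}(x)."""
    Jx, rx = J(x), r(x)
    return Jx.T @ Jx + sum(rx[i] * H_ri(x, i) for i in range(len(rx)))
```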

    Meanwhile, for a convex function `f: mathbb{R} rightarrow mathbb{R}`, 
\begin{align} f(x+h) \approx f(x)+f^{\prime}(x)h + \frac{1}{2} f^{\prime \prime}(x)h^2
\end{align}
by Taylor's theorem. This tells us that a well-chosen `h` moves `x` in a direction along which `f` decreases. To find such an `h`, we can define the following function `g` of `h` and find its critical point.
\begin{align} g(h) &= h^2\left( \frac{1}{2} f^{\prime \prime}(x) \right) + h f^{\prime}(x) + f(x) \\ \\
g^{\prime}(h) &= h f^{\prime \prime}(x) + f^{\prime}(x)=0 \\ \\
h &= -\frac{f^{\prime}(x)}{f^{\prime \prime}(x)} \qquad
(f^{\prime \prime}(x) > 0) 
\end{align}
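    As a sketch, this one-dimensional step can be iterated directly; the toy function and the starting point below are illustrative assumptions.
```python
# One-dimensional Newton-style minimization: repeatedly move x by h = -f'(x)/f''(x).
def minimize_1d(fp, fpp, x0, iters=10):
    """Assumes f''(x) > 0 along the way so each step is well-defined."""
    x = x0
    for _ in range(iters):
        x += -fp(x) / fpp(x)
    return x

# Toy example: f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2.
x_min = minimize_1d(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0)  # -> 3.0
```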
    In other words, `x` should be moved by `h` to decrease `f`. In higher dimensions, the minimum of `f` can be found in the same manner. Suppose that `f` is a function of `x in mathbb{R}^n` and that its Hessian matrix `H_f` is positive definite. By Taylor's theorem,
\begin{align} f(x+s) \approx f(x) + \nabla f(x)^t s+ \frac{1}{2} s^t H_f(x)s
\end{align}
    Following the process in one dimension, `x` should be moved by `s`, the solution of 
\begin{align} H_f(x) s = -\nabla f(x) \end{align} 
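    In code, a single higher-dimensional step is one linear solve. The following sketch assumes `H_f(x)` is positive definite so that `numpy.linalg.solve` applies.
```python
# One Newton step in R^n: solve H_f(x) s = -grad f(x), then move x by s.
def newton_step(grad_f, hess_f, x):
    s = np.linalg.solve(hess_f(x), -grad_f(x))
    return x + s
```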

    Returning to the original problem, `phi(x)` can be minimized by finding `s` such that `H_{phi}(x)s = - nabla phi (x)`. More specifically, this linear system is as follows:
\begin{align}  H_{\phi}(x) s = - \nabla \phi (x) \Leftrightarrow
\left( J(x)^tJ(x)+ \sum_{i=1}^m r_i(x)H_{r_i}(x) \right) s = -J(x)^t r(x)
\end{align}
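    For the illustrative model, one such step can reuse `grad_phi` and `H_phi` from the sketches above (the starting guess is arbitrary):
```python
x = np.array([1.0, -1.0])             # illustrative starting guess
x = newton_step(grad_phi, H_phi, x)   # one full-Newton step for phi
```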
    This linear system can be solved by several well-known methods.


1. Gauss-Newton Method
    This method assumes that `sum_{i=1}^m r_i(x)H_{r_i}(x)` is negligible when the residuals are small. This is reasonable when the model function `f` fits the data well, so that each `r_i(x)=y_i-f(t_i, x)` is small. In this case,
\begin{align} J(x)^tJ(x) s \approx -J(x)^t r(x) \end{align}
    This recalls the normal equations, so the problem can be viewed as solving the linear least squares problem `J(x)s cong -r(x)`. However, there are some drawbacks. The iteration may fail to converge when it is not started close enough to the solution. Moreover, if `sum_{i=1}^m r_i(x)H_{r_i}(x)` is not small, it cannot be neglected. Finally, the solution can hardly be found when `H_{phi}(x)`, `J(x)^tJ(x)`, or `J(x)` is ill-conditioned.
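    Despite these caveats, the iteration itself is short. The following is a sketch for the illustrative model, reusing `r` and `J` from above; the starting guess and the fixed iteration count are illustrative choices.
```python
# Gauss-Newton iteration: at each step solve the linear least squares problem
# J(x) s ~= -r(x) (whose normal equations are J^t J s = -J^t r) and update x.
def gauss_newton(x0, iters=20):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        s, *_ = np.linalg.lstsq(J(x), -r(x), rcond=None)
        x = x + s
    return x

x_gn = gauss_newton([1.0, -1.0])   # converges when started close enough
```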


2. Levenberg-Marquardt Method


    This method is useful even when `H_{phi}(x)`, `J(x)^tJ(x)`, or `J(x)` is ill-conditioned or rank-deficient. It does not ignore the term `sum_{i=1}^m r_i(x)H_{r_i}(x)`, but replaces it with `mu I` for `mu in mathbb{R}`.
\begin{align} \left( J(x)^tJ(x) + \mu I \right) s \approx -J(x)^t r(x) \end{align}
    This technique, which is called regularization, makes the pseudo-Hessian positive definite. Its effect is easier to see in the one-dimensional case: the step `h` decreases `f` only when the second-order derivative is positive, and this property corresponds to positive definiteness of the Hessian matrix in higher dimensions. If the second-order derivative of `f` is not greater than zero, it can be shifted by `mu` as follows:
\begin{align}
\begin{cases}
f^{\prime \prime}(x) > 0 \Rightarrow h=-\frac{f^{\prime}(x)}{ f^{\prime \prime}(x)} \\ \\
f^{\prime \prime}(x) \leq 0 \Rightarrow  h=-\frac{f^{\prime}(x)}{f^{\prime \prime}(x) + \mu} \quad \text{where} \   f^{\prime \prime}(x) + \mu > 0
\end{cases}
\end{align}
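    A sketch of this one-dimensional damping; the doubling schedule for `mu` below is an illustrative choice, not part of the derivation.
```python
# Damped one-dimensional step: shift f''(x) by mu until the denominator is positive.
def damped_step_1d(fp_x, fpp_x, mu=1.0):
    denom = fpp_x
    while denom <= 0.0:      # only triggers when f''(x) <= 0
        denom = fpp_x + mu
        mu *= 2.0
    return -fp_x / denom
```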
    Applying this process to the higher-dimensional case in the same manner, the following correspondence is obtained.
\begin{gather}
f^{\prime \prime}(x) + \mu > 0 \quad \text{where} \ f^{\prime \prime}(x) \leq 0 \\
\Updownarrow \\
J(x)^tJ(x) + \mu I \ \text{is positive definite} \quad \text{where} \ H_{\phi}(x) \left( \approx J(x)^tJ(x) \right) \text{is NOT positive definite}
\end{gather}
    Therefore, it can be said that the regularization term `mu I` turns the pseudo-Hessian `J(x)^tJ(x)` into the positive definite matrix `J(x)^tJ(x)+mu I`. Moreover, the Levenberg-Marquardt method can be interpreted as a weighted combination of the Gauss-Newton and gradient descent methods.
\begin{align}
\begin{cases}
J(x)^tJ(x) s \approx -J(x)^t r(x) \quad \left( \mu I = O\right) \quad \text{: Gauss-Newton} \\ \\
s \approx -J(x)^t r(x) = -\nabla \phi(x) \quad \left( \mu I = I- J(x)^tJ(x) \right) \quad \text{: Gradient Descent}
\end{cases}
\end{align}
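    Putting the pieces together, a basic Levenberg-Marquardt loop for the illustrative model might look as follows; the initial `mu`, its update schedule, and the iteration count are illustrative choices rather than part of the derivation above.
```python
# Basic Levenberg-Marquardt: solve (J^t J + mu I) s = -J^t r, then adapt mu.
def levenberg_marquardt(x0, mu=1e-2, iters=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        Jx, rx = J(x), r(x)
        A = Jx.T @ Jx + mu * np.eye(Jx.shape[1])   # regularized pseudo-Hessian
        s = np.linalg.solve(A, -Jx.T @ rx)
        if phi(x + s) < phi(x):   # step decreased the cost: accept, reduce damping
            x, mu = x + s, mu * 0.5
        else:                     # step failed: reject and increase damping
            mu *= 2.0
    return x

x_lm = levenberg_marquardt([1.0, -1.0])
```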
   
    Meanwhile, when `J(x)` is ill-conditioned, there exists a singular value of `J(x)` close to zero. Using the singular value decomposition `J(x)=U Sigma V^t`,
\begin{align} J(x)^tJ(x) &= (U\Sigma V^t)^t (U\Sigma V^t) = V\Sigma^t U^t U \Sigma V^t = V \Sigma^t \Sigma V^t \\ \\ \Rightarrow V^t(J(x)^tJ(x))V &= \Sigma^t \Sigma = \text{diag}(\sigma_1^2, \cdots, \sigma_n^2) 
\end{align}
where `sigma_i` is the `i`-th singular value of `J(x)`. This means that the eigenvalues of `J(x)^tJ(x)` are the squares of the singular values of `J(x)`. When `J(x)` is ill-conditioned, for an eigenvalue `lambda` close to zero and the corresponding eigenvector `v`,
\begin{align} J(x)^tJ(x)v = \lambda v \Rightarrow (J(x)^tJ(x)+\mu I)v = \lambda v + \mu v = (\lambda + \mu)v
\end{align}
    Note that `J(x)^tJ(x)+mu I` is symmetric since `J(x)^tJ(x)` is symmetric, and it is also positive definite if `mu` is chosen well. Accordingly, the Cholesky factorization can be applied to it as follows:
\begin{align} (J(x)^tJ(x)+\mu I)v = (\lambda + \mu)v = \hat{J}(x)^t \hat{J}(x)v
\end{align}
where `hat{J}(x)^t hat{J}(x)` is positive definite, so all of its eigenvalues are positive. Then `lambda + mu` is an eigenvalue of `hat{J}(x)^t hat{J}(x)`, and `sqrt{lambda + mu}` is the corresponding singular value of `hat{J}(x)`. Therefore, the original singular value `sqrt{lambda}` of `J(x)` is increased to `sqrt{lambda + mu}`, which moves the ill-conditioned `J(x)` away from singularity.
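    The eigenvalue shift is easy to observe numerically for the illustrative Jacobian (the evaluation point and the value of `mu` are arbitrary):
```python
# Every eigenvalue lambda of J^t J becomes lambda + mu after adding mu * I.
x0, mu = np.array([1.0, -1.0]), 1e-2
JtJ = J(x0).T @ J(x0)
print(np.linalg.eigvalsh(JtJ))                   # eigenvalues lambda_i
print(np.linalg.eigvalsh(JtJ + mu * np.eye(2)))  # shifted eigenvalues lambda_i + mu
```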
    The linear least squares problem corresponding to the Levenberg-Marquardt step is
\begin{align}
\begin{pmatrix} J(x) \\ \sqrt{\mu} I \end{pmatrix} s \cong
\begin{pmatrix} -r(x) \\ 0 \end{pmatrix}
\end{align}
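    The following sketch checks that this augmented formulation yields the same step as the regularized normal equations for the illustrative model (again with an arbitrary `mu`):
```python
# The augmented least squares problem [J; sqrt(mu) I] s ~= [-r; 0] has normal
# equations (J^t J + mu I) s = -J^t r, so both formulations yield the same step.
x0, mu, n = np.array([1.0, -1.0]), 1e-2, 2
Jx, rx = J(x0), r(x0)
A_aug = np.vstack([Jx, np.sqrt(mu) * np.eye(n)])
b_aug = np.concatenate([-rx, np.zeros(n)])
s_aug, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
s_reg = np.linalg.solve(Jx.T @ Jx + mu * np.eye(n), -Jx.T @ rx)
# s_aug and s_reg agree up to floating-point error.
```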


