A b o u t   M e      |       P r o j e c t s     |       N o t e s       |       T h e   D a y    ︎ ︎

Normal Equations

    For an `mtimes n` matrix `A` and a vector `b in mathbb{R^m}`, the linear system `Ax=b` is to find the optimal solution `x in mathbb{R^n}`. 

1. The Residual Minimization

    The problem `Ax=b` can be replaced by the problem minimizing the residual, `||b-Ax||_2^2`. Let `A_i` be the `i`-th row vector of `A`. Then the cost function `f(x)` can be defined as follows:
\begin{align} f(x)=\left\| b-Ax \right\|_2^2 = \left\| \begin{pmatrix} b_1 - A_1^t x \\ \vdots \\ b_m-A_m^t x \end{pmatrix} \right\| _2^2 = \sum_{k} (b_k-A_k^t x)^2 \end{align}    For minimizing `f`, finding the critical point is required. Let `a_{ij}` be the `(i, j)` element of `A`, then \begin{align} \frac{\partial f}{\partial x_i} &= \sum_{k} 2(b_k-A_k^tx) \frac{\partial}{\partial x_i}(b_k-A_k^t x) = 2\sum_{k} (b_k-A_k^tx) (-a_{ki}) = 0 \\ \\ &\Rightarrow \sum_{k} a_{ki} A_k^t x = \sum_{k} a_{ki} b_k =  \begin{pmatrix} a_{1i} & \cdots & a_{mi} \end{pmatrix} \begin{pmatrix} A_1^t \\ \vdots \\ A_m^t \end{pmatrix} x = \begin{pmatrix} a_{1i} & \cdots & a_{mi} \end{pmatrix} \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} \\ &\Rightarrow \begin{pmatrix} a_{11} & \cdots & a_{m1} \\ \  & \vdots & \   \\ a_{1n} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} A_1^t \\ \vdots \\ A_m^t \end{pmatrix} x = \begin{pmatrix} a_{11} & \cdots & a_{m1} \\ \  & \vdots & \   \\a_{1n} & \cdots & a_{mn}\end{pmatrix} \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} \\ \\ &\Rightarrow \color{red}{A^tAx=A^tb} \end{align}
    This equation is called the normal equation. Moreover, `A^tA` is positive (semi-)definite because `x^tA^tAx=(Ax)^t(Ax)=||Ax||_2^2 geq 0`, so it minimizes `f`. Especially, if `||Ax||_2^2 ne 0`, it tells that there is no `x` such that `Ax=0` except `x=0`.  It also means that `A` is nonsingular whose column vectors are linearly independent. Therefore, the solution of `A^tAx=A^tb` is the optimal solution of `Ax=b` when `A` is nonsingular.

2. Another Approach When `m>n`

    In this case, `A` maps the lower dimension to the higher dimension. For example, when `m=3` and `n=2`, there may be a point `b`  in the objective space which `Ax` cannot cover.

    So, the linear system `Ax=b` cannot be solved exactly. However, we can still find the point which is closest to `b`. Let `y=Ax` be the closest point to `b`.
    When `y` is closest to `b`, the vector `b-y=``b-Ax` is perpendicular to the normal of hyperplane `text{Im}  A`. Suppose that `e_1`, `cdots`, `e_n` are the standard basis in the original space. Then `Ae_1`, `cdots`, `Ae_n` consist of the basis of the objective space.
In other words, each column vector of `A` consists of the hyperplane `text{Im}  A`, which means it is perpendicular to `b-Ax`. Therefore, `A^t(b-Ax)=0`, it yields the normal equation. Accordingly, the normal equation estimates the optimal solution of `Ax=b` even though the exact solution cannot be found.

3. Another Approach Using Projector

  A square matrix `P` is idempotent if `P^2=P`, and it is called projector. `P` projects any vector to `text{Im}  P` and if the vectors are already on `text{Im}  P`, they would be still there after projected.
If `P` is symmetric, it is called orthogonal projector, and `I-P` is also orthogonal projector to the hyperplane which is perpendicular to `text{Im}  P`. Of course, `(I-P)^2=I-2P+P^2``=I-P`. For a vector `v`,

\begin{align} (Pv)^t((I-P)v) &= v^tP^t(v-Pv)=v^tP(v-Pv) \\&=v^tPv-v^tP^2v = v^tPv-v^tPv =0 \\ \\ & \Rightarrow Pv \  \bot \  (I-P)v \\ & \Rightarrow v=Pv+(I-P)v \end{align}    Now, to apply this to solving `Ax=b`, assume that `P` is the orthogonal projector onto `text{span}  A`, which means `PA=A`. Then the cost function `f(x)` can be rewritten as \begin{align} f(x)=\left\| b-Ax \right\|_2^2 &= \left\| P(b-Ax)+(I-P)(b-Ax) \right\|_2^2 \\ &=  \left\| P(b-Ax) \right\|_2^2 + \left\| (I-P)(b-Ax) \right\|_2^2 \\ &= \left\| Pb-PAx \right\|_2^2 + \left\| (I-P)b-(I-P)Ax \right\|_2^2 \\ &= \left\| Pb-Ax \right\|_2^2 + \left\| (I-P)b \right\|_2^2. \end{align}    Therefore, for minimizing this cost function, `|| Pb-Ax ||_2^2 ` should be zero. \begin{align} Pb-Ax &= A^tPb-A^tAx= A^tP^tb-A^tAx \\ &= (PA)^tb-A^tAx= A^tb-A^tAx=O \end{align}
    Note that it also produces the normal equation. 

4. Evaluation

    The solution from the normal equation can be evaluated by the condition number and angle between `b` and `Ax`.
    The condition number of `A` tells how close to singular `A` is. However, if `m ne n`, `A` is not invertible, so the pseudoinverse `A^+=``(A^tA)^{-1}A^t` should be introduced. In this case, the condition number of `A` is
\begin{align} \text{Cond}_2(A)=\left\| A \right\|_2  \left\| A^+ \right\|_2. \end{align}
    Meanwhile, as the angle between `b` and `Ax` is smaller, `b` is closer to `Ax`, so it is a good measure to find whether the solution is well-conditioned. This angle can be obtain as `cos theta = frac{||Ax||_2}{||b||_2}`.


[1] Michael T. Heath, Scientific Computing: An Introductory Survey. 2nd Edition, McGraw-Hill Higher Education.