**Normal Equations**

For an `mtimes n` matrix `A` and a vector `b in mathbb{R^m}`, the linear system `Ax=b` is to find the optimal solution `x in mathbb{R^n}`.

**1. The Residual
Minimization
**

The problem `Ax=b` can be replaced by the problem minimizing the residual, `||b-Ax||_2^2`. Let `A_i` be the `i`-th row vector of `A`. Then the cost function `f(x)` can be defined as follows:

\begin{align} f(x)=\left\| b-Ax \right\|_2^2 = \left\|

\begin{pmatrix} b_1 - A_1^t x \\ \vdots \\ b_m-A_m^t x \end{pmatrix} \right\| _2^2 = \sum_{k} (b_k-A_k^t x)^2

\end{align}

For minimizing `f`, finding the critical point is required. Let `a_{ij}` be the `(i, j)` element of `A`, then

\begin{align} \frac{\partial f}{\partial x_i} &= \sum_{k} 2(b_k-A_k^tx) \frac{\partial}{\partial x_i}(b_k-A_k^t x) =

2\sum_{k} (b_k-A_k^tx) (-a_{ki}) = 0 \\ \\ &\Rightarrow \sum_{k} a_{ki} A_k^t x = \sum_{k} a_{ki} b_k =

\begin{pmatrix} a_{1i} & \cdots & a_{mi} \end{pmatrix}

\begin{pmatrix} A_1^t \\ \vdots \\ A_m^t \end{pmatrix} x =

\begin{pmatrix} a_{1i} & \cdots & a_{mi} \end{pmatrix}

\begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix}

\\ &\Rightarrow

\begin{pmatrix}

a_{11} & \cdots & a_{m1} \\

\ & \vdots & \ \\

a_{1n} & \cdots & a_{mn}

\end{pmatrix}

\begin{pmatrix} A_1^t \\ \vdots \\ A_m^t \end{pmatrix} x =

\begin{pmatrix}

a_{11} & \cdots & a_{m1} \\

\ & \vdots & \ \\

a_{1n} & \cdots & a_{mn}

\end{pmatrix}

\begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} \\ \\ &\Rightarrow \color{red}{A^tAx=A^tb}

\end{align}

This equation is called the

*normal equation*. Moreover, `A^tA` is positive (semi-)definite because `x^tA^tAx=(Ax)^t(Ax)=||Ax||_2^2 geq 0`, so it minimizes `f`. Especially, if `||Ax||_2^2 ne 0`, it tells that there is no `x` such that `Ax=0` except `x=0`. It also means that `A` is nonsingular whose column vectors are linearly independent. Therefore, the solution of `A^tAx=A^tb` is the optimal solution of `Ax=b` when `A` is nonsingular.

**2. Another Approach When `m>n`**

In this case, `A` maps the lower dimension to the higher dimension. For example, when `m=3` and `n=2`, there may be a point `b` in the objective space which `Ax` cannot cover.

When `y` is closest to `b`, the vector `b-y=``b-Ax` is perpendicular to the normal of hyperplane `text{Im} A`. Suppose that `e_1`, `cdots`, `e_n` are the standard basis in the original space. Then `Ae_1`, `cdots`, `Ae_n` consist of the basis of the objective space.

In other words, each column vector of `A` consists of the hyperplane `text{Im} A`, which means it is perpendicular to `b-Ax`. Therefore, `A^t(b-Ax)=0`, it yields the normal equation. Accordingly, the normal equation estimates the optimal solution of `Ax=b` even though the exact solution cannot be found.

**3. Another Approach Using Projector**

A square matrix `P` is

*idempotent*if `P^2=P`, and it is called*projector*. `P` projects any vector to `text{Im} P` and if the vectors are already on `text{Im} P`, they would be still there after projected.If `P` is symmetric, it is called

\begin{align} (Pv)^t((I-P)v) &= v^tP^t(v-Pv)=v^tP(v-Pv) \\

&=v^tPv-v^tP^2v = v^tPv-v^tPv =0 \\ \\ & \Rightarrow

Pv \ \bot \ (I-P)v \\ & \Rightarrow

v=Pv+(I-P)v

\end{align}

Now, to apply this to solving `Ax=b`, assume that `P` is the orthogonal projector onto `text{span} A`, which means `PA=A`. Then the cost function `f(x)` can be rewritten as

\begin{align} f(x)=\left\| b-Ax \right\|_2^2 &=

\left\| P(b-Ax)+(I-P)(b-Ax) \right\|_2^2 \\ &=

\left\| P(b-Ax) \right\|_2^2 + \left\| (I-P)(b-Ax) \right\|_2^2 \\ &=

\left\| Pb-PAx \right\|_2^2 + \left\| (I-P)b-(I-P)Ax \right\|_2^2 \\ &=

\left\| Pb-Ax \right\|_2^2 + \left\| (I-P)b \right\|_2^2.

\end{align}

Therefore, for minimizing this cost function, `|| Pb-Ax ||_2^2 ` should be zero.

\begin{align} Pb-Ax &= A^tPb-A^tAx= A^tP^tb-A^tAx \\ &=

(PA)^tb-A^tAx= A^tb-A^tAx=O

\end{align}

Note that it also produces the normal equation.

*orthogonal projector*, and `I-P` is also orthogonal projector to the hyperplane which is perpendicular to `text{Im} P`. Of course, `(I-P)^2=I-2P+P^2``=I-P`. For a vector `v`,&=v^tPv-v^tP^2v = v^tPv-v^tPv =0 \\ \\ & \Rightarrow

Pv \ \bot \ (I-P)v \\ & \Rightarrow

v=Pv+(I-P)v

\end{align}

Now, to apply this to solving `Ax=b`, assume that `P` is the orthogonal projector onto `text{span} A`, which means `PA=A`. Then the cost function `f(x)` can be rewritten as

\begin{align} f(x)=\left\| b-Ax \right\|_2^2 &=

\left\| P(b-Ax)+(I-P)(b-Ax) \right\|_2^2 \\ &=

\left\| P(b-Ax) \right\|_2^2 + \left\| (I-P)(b-Ax) \right\|_2^2 \\ &=

\left\| Pb-PAx \right\|_2^2 + \left\| (I-P)b-(I-P)Ax \right\|_2^2 \\ &=

\left\| Pb-Ax \right\|_2^2 + \left\| (I-P)b \right\|_2^2.

\end{align}

Therefore, for minimizing this cost function, `|| Pb-Ax ||_2^2 ` should be zero.

\begin{align} Pb-Ax &= A^tPb-A^tAx= A^tP^tb-A^tAx \\ &=

(PA)^tb-A^tAx= A^tb-A^tAx=O

\end{align}

Note that it also produces the normal equation.

**4. Evaluation**

The solution from the normal equation can be evaluated by the condition number and angle between `b` and `Ax`.

The condition number of `A` tells how close to singular `A` is. However, if `m ne n`, `A` is not invertible, so the pseudoinverse `A^+=``(A^tA)^{-1}A^t` should be introduced. In this case, the condition number of `A` is

\begin{align} \text{Cond}_2(A)=\left\| A \right\|_2 \left\| A^+ \right\|_2. \end{align}

Meanwhile, as the angle between `b` and `Ax` is smaller, `b` is closer to `Ax`, so it is a good measure to find whether the solution is well-conditioned. This angle can be obtain as `cos theta = frac{||Ax||_2}{||b||_2}`.

**︎Reference**

[1] Michael T. Heath,

*Scientific Computing: An Introductory Survey*. 2nd Edition, McGraw-Hill Higher Education.

emoy.net