Normal Equations
For an `m times n` matrix `A` and a vector `b in mathbb{R}^m`, solving the linear system `Ax=b` means finding the optimal solution `x in mathbb{R}^n`.
1. Residual Minimization
The problem `Ax=b` can be replaced by the problem of minimizing the squared norm of the residual, `||b-Ax||_2^2`. Let `A_i^t` be the `i`-th row vector of `A`. Then the cost function `f(x)` can be defined as follows:
\begin{align} f(x)=\left\| b-Ax \right\|_2^2 = \left\| \begin{pmatrix} b_1 - A_1^t x \\ \vdots \\ b_m-A_m^t x \end{pmatrix} \right\|_2^2 = \sum_{k} (b_k-A_k^t x)^2 \end{align} To minimize `f`, a critical point must be found. Let `a_{ij}` be the `(i, j)` element of `A`. Then \begin{align} \frac{\partial f}{\partial x_i} &= \sum_{k} 2(b_k-A_k^tx) \frac{\partial}{\partial x_i}(b_k-A_k^t x) = -2\sum_{k} (b_k-A_k^tx)\, a_{ki} = 0 \\ \\ &\Rightarrow \sum_{k} a_{ki} A_k^t x = \sum_{k} a_{ki} b_k \ \Leftrightarrow\ \begin{pmatrix} a_{1i} & \cdots & a_{mi} \end{pmatrix} \begin{pmatrix} A_1^t \\ \vdots \\ A_m^t \end{pmatrix} x = \begin{pmatrix} a_{1i} & \cdots & a_{mi} \end{pmatrix} \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} \\ &\Rightarrow \begin{pmatrix} a_{11} & \cdots & a_{m1} \\ \ & \vdots & \ \\ a_{1n} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} A_1^t \\ \vdots \\ A_m^t \end{pmatrix} x = \begin{pmatrix} a_{11} & \cdots & a_{m1} \\ \ & \vdots & \ \\a_{1n} & \cdots & a_{mn}\end{pmatrix} \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} \\ \\ &\Rightarrow \color{red}{A^tAx=A^tb} \end{align}
This equation is called the normal equation. Moreover, `A^tA` is positive semi-definite because `x^tA^tAx=(Ax)^t(Ax)=||Ax||_2^2 geq 0`, so a critical point is a minimizer of `f`. In particular, if `||Ax||_2^2 ne 0` for every nonzero `x`, then `Ax=0` only when `x=0`, which means the column vectors of `A` are linearly independent and `A^tA` is positive definite, hence nonsingular. Therefore, the solution of `A^tAx=A^tb` is the optimal solution of `Ax=b` when `A` has full column rank.
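As a minimal sketch of this in code (using NumPy; the `A` and `b` below are arbitrary illustrative data, not taken from the text), the normal equation can be formed and solved directly, then checked against NumPy's built-in least-squares routine:

```python
import numpy as np

# Illustrative overdetermined system: m = 4 equations, n = 2 unknowns.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Form and solve the normal equation A^t A x = A^t b.
x = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check against NumPy's built-in least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x, x_ref)  # the two solutions agree up to rounding
```

Note that forming `A^tA` explicitly squares the condition number of the problem, so in practice factorization-based methods such as QR are usually preferred over solving the normal equation directly.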
2. Another Approach When `m>n`
In this case, `A` maps a lower-dimensional space to a higher-dimensional one. For example, when `m=3` and `n=2`, there may be a point `b` in the target space that `Ax` cannot reach.
So the linear system `Ax=b` cannot be solved exactly. However, we can still find the point in `text{Im} A` that is closest to `b`. Let `y=Ax` be that closest point.

When `y` is the closest point to `b`, the vector `b-y=b-Ax` is perpendicular to the hyperplane `text{Im} A`. Suppose that `e_1, cdots, e_n` is the standard basis of the original space. Then `Ae_1, cdots, Ae_n` span `text{Im} A`.

In other words, each column vector of `A` lies in the hyperplane `text{Im} A`, so it is perpendicular to `b-Ax`. Therefore `A^t(b-Ax)=0`, which yields the normal equation. Accordingly, the normal equation gives the optimal solution of `Ax=b` even though an exact solution cannot be found.
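As a quick numerical check of this orthogonality (a sketch reusing the same illustrative data as above), the residual of the least-squares solution is perpendicular to every column of `A`:

```python
import numpy as np

# Same illustrative data as above: overdetermined 4x2 system.
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])
x = np.linalg.solve(A.T @ A, A.T @ b)

# Each column of A has (numerically) zero inner product
# with the residual, i.e. A^t (b - Ax) = 0.
print(A.T @ (b - A @ x))  # ~ [0, 0] up to floating-point error
```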
3. Another Approach Using a Projector
A square matrix `P` is idempotent if `P^2=P`, and such a matrix is called a projector. `P` projects any vector onto `text{Im} P`, and vectors already in `text{Im} P` stay there after the projection.

If `P` is also symmetric, it is called an orthogonal projector, and `I-P` is then the orthogonal projector onto the subspace perpendicular to `text{Im} P`. Of course, `(I-P)^2=I-2P+P^2=I-P`. For a vector `v`,
\begin{align} (Pv)^t((I-P)v) &= v^tP^t(v-Pv)=v^tP(v-Pv) \\ &=v^tPv-v^tP^2v = v^tPv-v^tPv =0 \\ \\ & \Rightarrow Pv \ \bot \ (I-P)v \end{align} Hence `v=Pv+(I-P)v` decomposes any vector `v` into two orthogonal components. Now, to apply this to solving `Ax=b`, assume that `P` is the orthogonal projector onto `text{Im} A`, which means `PA=A`. Then the cost function `f(x)` can be rewritten as
\begin{align} f(x)=\left\| b-Ax \right\|_2^2 &= \left\| P(b-Ax)+(I-P)(b-Ax) \right\|_2^2 \\ &= \left\| P(b-Ax) \right\|_2^2 + \left\| (I-P)(b-Ax) \right\|_2^2 \\ &= \left\| Pb-PAx \right\|_2^2 + \left\| (I-P)b-(I-P)Ax \right\|_2^2 \\ &= \left\| Pb-Ax \right\|_2^2 + \left\| (I-P)b \right\|_2^2. \end{align}
The second term does not depend on `x`, so minimizing this cost function requires `||Pb-Ax||_2^2` to be zero, which is achievable because `Pb in text{Im} A`. Setting `Pb-Ax=0` and multiplying both sides by `A^t`,
\begin{align} A^t(Pb-Ax) &= A^tPb-A^tAx = A^tP^tb-A^tAx \\ &= (PA)^tb-A^tAx = A^tb-A^tAx = 0. \end{align}
Note that it also produces the normal equation.
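As a small sketch of this construction (assuming `A` has full column rank so that `A^tA` is invertible; same illustrative data as above), one standard explicit formula for the orthogonal projector onto `text{Im} A` is `P=A(A^tA)^{-1}A^t`, whose defining properties can be verified numerically:

```python
import numpy as np

# Same illustrative data; A must have full column rank for (A^t A)^{-1}.
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Orthogonal projector onto Im A: P = A (A^t A)^{-1} A^t.
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))   # idempotent: P^2 = P
print(np.allclose(P, P.T))     # symmetric, hence an orthogonal projector
print(np.allclose(P @ A, A))   # PA = A: vectors in Im A stay fixed

# Pb equals Ax at the least-squares solution x.
x = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(P @ b, A @ x))
```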

4. Evaluation
The solution from the normal equation can be evaluated by the condition number and by the angle between `b` and `Ax`.
The condition number of `A` tells how close `A` is to being singular. However, if `m ne n`, `A` has no inverse, so the pseudoinverse `A^+=(A^tA)^{-1}A^t` should be introduced instead. In this case, the condition number of `A` is
\begin{align} \text{Cond}_2(A)=\left\| A \right\|_2 \left\| A^+ \right\|_2. \end{align}
Meanwhile, the smaller the angle between `b` and `Ax` is, the closer `b` is to `Ax`, so it is a good measure of whether the solution is well-conditioned. Since `Ax` is perpendicular to `b-Ax`, this angle can be obtained as `cos theta = frac{||Ax||_2}{||b||_2}`.
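Both measures are straightforward to compute; the sketch below (same illustrative data) evaluates `text{Cond}_2(A)` via the pseudoinverse, checks it against NumPy's built-in routine, and computes the angle from `cos theta = ||Ax||_2 / ||b||_2`:

```python
import numpy as np

# Same illustrative data as in the earlier snippets.
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])
x = np.linalg.solve(A.T @ A, A.T @ b)

# Cond_2(A) = ||A||_2 ||A^+||_2, via the pseudoinverse,
# cross-checked against NumPy's built-in cond.
cond = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.pinv(A), 2)
print(cond, np.linalg.cond(A, 2))  # the two values agree

# Angle between b and Ax: cos(theta) = ||Ax||_2 / ||b||_2.
cos_theta = np.linalg.norm(A @ x) / np.linalg.norm(b)
print(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))
```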
