Floating-Point Arithmetic
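- When adding two floating-point numbers, the operand with the smaller exponent is shifted so that its exponent matches that of the other, and the digits shifted beyond the available precision are lost. Consider two decimal numbers whose precision is `6`: for instance, in `1.23456 \times 10^{2} + 6.54321 \times 10^{-2}` the second operand becomes `0.000654321 \times 10^{2}`, and the sum `1.235214321 \times 10^{2}` is rounded to `1.23521 \times 10^{2}`.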
- Cancellation: significant digits are lost in the subtraction of nearly equal numbers, in the addition/subtraction of a relatively large number and a relatively small number, and in division by a small number.
1. For a positive `\epsilon` well below the machine epsilon (for example, `\epsilon = 10^{-17}` in double precision), `(1+\epsilon)-(1-\epsilon)=1-1=0`, although it should be `2\epsilon` in exact arithmetic.
2. For the quadratic formula `\frac{-b \pm \sqrt{b^2-4ac}}{2a}`, when `b>0` and `b^2 \gg ac`, `\sqrt{b^2-4ac} \approx b`, so the `-b + \sqrt{b^2-4ac}` branch subtracts two nearly equal numbers and is numerically unstable.
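A minimal Python sketch of this cancellation in double precision. The function names `roots_naive` and `roots_stable` and the test coefficients are illustrative assumptions, not part of the original notes; the stable variant is the common rearrangement that avoids the cancelling branch and recovers the other root from the product of the roots `c/a`. It assumes real roots and `a \ne 0`.

```python
import math

def roots_naive(a, b, c):
    """Textbook quadratic formula; returns (root with +, root with -)."""
    d = math.sqrt(b * b - 4.0 * a * c)
    return (-b + d) / (2.0 * a), (-b - d) / (2.0 * a)

def roots_stable(a, b, c):
    """Rearranged formula: q adds two terms of the same sign, so no cancellation."""
    d = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(d, b))
    # The product of the roots is c / a, so the remaining root is c / q.
    return c / q, q / a

# b > 0 and b^2 >> ac; the exact small root is very close to -1e-8.
a, b, c = 1.0, 1e8, 1.0
small_naive, _ = roots_naive(a, b, c)
small_stable, _ = roots_stable(a, b, c)
print(small_naive, small_stable)
```

With these coefficients the naive small root should be visibly off from `-1e-8`, while the rearranged one agrees to nearly full precision.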
3. For performance, the standard deviation `\sigma`
\begin{align} \sigma= \sqrt{\frac{1}{n-1} \sum_{i=1}^n(x_i-\bar{x})^2}, \end{align} where `\bar{x}` is the mean of the `n` points `x_1`, `\cdots`, `x_n`, can be replaced by the single-pass form \begin{align} \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^nx_i^2-n\bar{x}^2 \right)}. \end{align}
However, the ` \left( \sum_{i=1}^nx_i^2-n\bar{x}^2 \right) ` part subtracts two nearly equal numbers when the mean is large relative to the spread, so it can be numerically unstable, and the computed value can even be negative.
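The two forms of the standard deviation can be compared with a short Python sketch. The function names and the sample data are assumptions chosen for illustration; the functions return the variance (the quantity under the square root), so that a negative result does not raise an error in `sqrt`.

```python
def variance_two_pass(xs):
    """Sample variance from squared deviations about the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_one_pass(xs):
    """Sample variance from sum(x^2) - n * mean^2 (prone to cancellation)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean * mean) / (n - 1)

# Large mean, small spread: the exact sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(variance_two_pass(data))  # 30.0
print(variance_one_pass(data))  # not 30; the low-order digits were lost
```

Here each `x * x` is about `10^{18}`, where adjacent double-precision numbers are 128 apart, so the difference of the two large sums cannot represent the exact value 90.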
4. In single precision, for `a=1.1` and `x=123456.789`, `(x+a)-x` is not the same as `a`.
\begin{align} a=1.1 &\Rightarrow \underbrace{\color{plum }{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{ \color{limegreen}{1}000}_{8} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1101}_{D} \color{salmon}{: (1.f_1\cdots f_{23})2^0 \text{ form}}\\ x=123456.789 &\Rightarrow \underbrace{\color{plum}{0}\color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \ \underbrace{\color{limegreen}{1}111}_{F} \ \underbrace{0001}_{1} \ \underbrace{0010}_{2} \ \underbrace{0000}_{0} \ \underbrace{0110}_{6} \ \underbrace{0101}_{5} \color{salmon}{: (1.f_1\cdots f_{23})2^{16} \text{ form}} \end{align} \begin{align} x+a = &\ \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}{1}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\ + &\ \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}{1} \underbrace{\color{orangered}{000\ 0000\ 0000\ 0000\ 1}}_{\text{appeared 16-bit number} \\ \text{shifting} \ (1.f_1\cdots f_{23}) \ \text{part}} 000 \ 1100 \ \color{cadetblue}{\underbrace{1100\ 1100\ 1100\ 1101}_{\text{loss}}} \\ \\ = &\ \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}{1}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\ + &\ \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}{1} \color{orangered}{000\ 0000\ 0000\ 0000\ 1} 000 \ 110\color{red}{1} \quad \text{(round to nearest)} \\ \\ = &\ \underbrace{\color{plum}{0}\color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \ \underbrace{\color{limegreen}{1}111}_{F} \ \underbrace{0001}_{1} \ \underbrace{0010}_{2} \ \underbrace{0000}_{0} \ \underbrace{1111}_{F} \ \underbrace{0010}_{2} \\ \\ \\ (x+a)-x = &\ \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}{1}111 \ 0001 \ 0010 \ 0000 \ 1111 \ 0010 \\ - &\ \color{plum}{0}\color{limegreen}{100} \ 
\color{limegreen}{0111} \ \color{limegreen}{1}111\ 0001\ 0010\ 0000\ 0110 \ 0101 \\ \\ = &\ \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}{1} \underbrace{\color{orangered}{000\ 0000\ 0000\ 0000\ 1}}_{\text{should be shifted}} 000 \ 1101 \ \\ \\ =&\ \underbrace{\color{plum}{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}{1}000}_{8} \ \underbrace{1101}_{D} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \\ \color{red}{\boldsymbol{\ne}} &\ \underbrace{\color{plum}{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}{1}000}_{8} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1101}_{D} = a \end{align}
In this example, `(x+a)-x \ne a`: the low-order bits of `a` are lost in the addition of a relatively large number and a relatively small number (the `x+a` part), and the subtraction of nearly equal numbers (the `(x+a)-x` part) then exposes that loss. Therefore, this calculation is numerically unstable.
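The bit patterns above can be reproduced with a small Python sketch. Python floats are double precision, so the helper `f32` (a hypothetical name, as is `bits`) rounds each intermediate result to single precision through the standard `struct` module; this assumes the usual round-to-nearest conversion.

```python
import struct

def f32(v):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def bits(v):
    """Hexadecimal bit pattern of v stored as a single-precision number."""
    return struct.pack(">f", v).hex().upper()

a = f32(1.1)
x = f32(123456.789)
s = f32(x + a)   # x + a, rounded to single precision
d = f32(s - x)   # (x + a) - x, rounded to single precision

print(bits(a))   # 3F8CCCCD
print(bits(x))   # 47F12065
print(bits(s))   # 47F120F2
print(bits(d))   # 3F8D0000
print(d)         # 1.1015625
```

The double-precision sum of two single-precision operands is exact here, so rounding it once with `f32` gives the same result as a native single-precision addition.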