



Floating-Point Arithmetic



  • When adding two floating-point numbers, the number with the smaller exponent is shifted so that its exponent matches the larger one, and the digits shifted out of its significand are rounded away. Consider two decimal numbers with a precision of `6` digits.
\begin{align} x&=1.92403 \times 10^2, \quad y=6.35782 \times 10^{-1} \\ \\ x+y&=(1.92403+0.00635782)\times 10^2 \\ &=  (1.92403+0.00636)\times 10^2 \qquad \text{(round to nearest)} \\ &= 1.93039\times 10^2 \end{align}
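The alignment step above can be sketched with Python's `decimal` module. This is only an illustration under stated assumptions: the helper name `add_with_alignment` is hypothetical, ties are assumed to round half to even, and the result is not renormalized if the sum of significands reaches `10`.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN


def add_with_alignment(sig_x, exp_x, sig_y, exp_y, digits=6):
    """Add sig_x * 10**exp_x and sig_y * 10**exp_y, assuming exp_x >= exp_y.

    Hypothetical helper: the significand of the smaller number is shifted
    to the larger exponent and rounded to the same number of digits.
    """
    shift = exp_x - exp_y
    # Smallest step of a `digits`-digit significand d.ddddd (0.00001 for 6 digits).
    grid = Decimal(1).scaleb(-(digits - 1))
    # Shift y's significand right, then round it onto the grid of x.
    aligned_y = Decimal(sig_y).scaleb(-shift).quantize(grid, rounding=ROUND_HALF_EVEN)
    return Decimal(sig_x) + aligned_y, exp_x


significand, exponent = add_with_alignment("1.92403", 2, "6.35782", -1)
print(significand, exponent)

# Cross-check: let a 6-digit decimal context round the exact sum directly.
six_digits = Context(prec=6, rounding=ROUND_HALF_EVEN)
print(six_digits.add(Decimal("192.403"), Decimal("0.635782")))
```

Rounding the shifted operand first and rounding the exact sum happen to agree for these two numbers; in general the two procedures can differ in the last digit.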

  • Cancellation: significant digits are lost in the subtraction of nearly equal numbers, the addition/subtraction of a relatively large number and a relatively small number, and the division by a small number.

    1. For a positive `epsilon` sufficiently smaller than the machine epsilon, both `1+epsilon` and `1-epsilon` round to `1`, so `(1+epsilon)-(1-epsilon)=1-1=0`, although the exact result is `2epsilon`.

    2. For the quadratic formula `frac{-b pm sqrt{b^2-4ac}}{2a}`, when `b>0` and `b^2 text{>>} ac`, `sqrt{b^2-4ac}` is close to `b`, so the `-b + sqrt{b^2-4ac}` branch subtracts nearly equal numbers and is numerically unstable.

    3. For performance, the standard deviation `sigma` 
\begin{align} \sigma= \sqrt{\frac{1}{n-1} \sum_{i=1}^n(x_i-\bar{x})^2}, \end{align}    where `bar{x}` is the mean of the `n` points `x_1`, `cdots`, `x_n`, can be replaced by the one-pass form \begin{align} \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^nx_i^2-n\bar{x}^2 \right)}. \end{align}
    However, the ` left( sum_{i=1}^nx_i^2-nbar{x}^2 right) ` part subtracts two nearly equal numbers when the mean is large relative to the spread, so it is numerically unstable, and the computed value can even be negative.

    4. For `a=1.1` and `x=123456.789` in single precision, `(x+a)-x` may not be the same as `a`.
\begin{align} a=1.1 &\Rightarrow \underbrace{\color{plum }{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \   \underbrace{ \color{limegreen}{1}000}_{8} \  \underbrace{1100}_{C} \   \underbrace{1100}_{C}  \   \underbrace{1100}_{C} \   \underbrace{1100}_{C} \  \underbrace{1101}_{D} \color{salmon}{: (1.f_1\cdots f_{23})2^0 \text{ form}}\\ x=123456.789 &\Rightarrow \underbrace{\color{plum}{0}\color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \   \underbrace{\color{limegreen}{1}111}_{F} \  \underbrace{0001}_{1} \  \underbrace{0010}_{2}  \ \underbrace{0000}_{0} \  \underbrace{0110}_{6} \  \underbrace{0101}_{5} \color{salmon}{: (1.f_1\cdots f_{23})2^{16} \text{ form}} \end{align} \begin{align} x+a = &\   \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \  \color{limegreen}{1}111 \  0001 \  0010 \  0000 \  0110 \  0101 \\ + &\   \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \  \color{limegreen}{1} \underbrace{\color{orangered}{000\  0000\  0000\  0000\  1}}_{\text{appeared 16-bit number} \\  \text{shifting} \ (1.f_1\cdots f_{23}) \  \text{part}} 000 \  1100 \   \color{cadetblue}{\underbrace{1100\  1100\  1100\  1101}_{\text{loss}}} \\ \\  = &\   \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \  \color{limegreen}{1}111 \  0001 \  0010 \  0000 \  0110 \  0101 \\ + &\  \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \  \color{limegreen}{1} \color{orangered}{000\  0000\  0000\  0000\  1} 000 \  110\color{red}{1} \quad \text{(round to nearest)} \\ \\  = &\  \underbrace{\color{plum}{0}\color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \   \underbrace{\color{limegreen}{1}111}_{F} \  \underbrace{0001}_{1} \  \underbrace{0010}_{2}  \  \underbrace{0000}_{0} \  \underbrace{1111}_{F} \  \underbrace{0010}_{2} \\ \\ \\ (x+a)-x = &\   \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \  \color{limegreen}{1}111 \  0001 \  0010 \  0000 \  1111 \  
0010 \\ - &\  \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \  \color{limegreen}{1}111\  0001\  0010\  0000\  0110 \  0101 \\ \\  = &\  \color{plum}{0}\color{limegreen}{100} \ \color{limegreen}{0111} \  \color{limegreen}{1} \underbrace{\color{orangered}{000\  0000\  0000\  0000\  1}}_{\text{should be shifted}} 000 \  1101 \   \\ \\ =&\  \underbrace{\color{plum}{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \   \underbrace{\color{limegreen}{1}000}_{8} \  \underbrace{1101}_{D} \  \underbrace{0000}_{0}  \  \underbrace{0000}_{0} \  \underbrace{0000}_{0} \  \underbrace{0000}_{0} \\ \color{red}{\boldsymbol{\ne}}  &\ \underbrace{\color{plum}{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \   \underbrace{\color{limegreen}{1}000}_{8} \  \underbrace{1100}_{C} \  \underbrace{1100}_{C}  \  \underbrace{1100}_{C} \  \underbrace{1100}_{C} \  \underbrace{1101}_{D} = a \end{align}
    In this example, `(x+a)-x ne a`: the addition of a relatively large number and a relatively small number (the `x+a` part) discards the low-order bits of `a`, and the subtraction of nearly equal numbers (the `(x+a)-x` part) exposes that loss. Therefore, this calculation is numerically unstable.
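The four cases above can be reproduced with a short Python sketch. Everything introduced here is an assumption for illustration, not part of the notes: the helper names, the choice of one eighth of the double-precision machine epsilon for item 1, the coefficients `a=1`, `b=1e8`, `c=1` for item 2, and the four data points for item 3. The rearranged root `2c / (-b - sqrt(b^2-4ac))` is a standard algebraic identity used only as a reference value. Single precision is emulated by rounding through `struct`, since Python floats are double precision.

```python
import math
import struct
import sys


def to_f32(value):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def f32_bits(value):
    """Hex string of the binary32 bit pattern of value."""
    return "0x%08X" % struct.unpack("<I", struct.pack("<f", value))[0]


# Item 1: a perturbation well below machine epsilon is absorbed by 1.
tiny = sys.float_info.epsilon / 8      # assumed choice of epsilon
item1 = (1.0 + tiny) - (1.0 - tiny)    # the exact value would be 2 * tiny


# Item 2: quadratic formula with b > 0 and b^2 >> ac.
def naive_small_root(a, b, c):
    """Root from the '+' branch; -b + sqrt(...) subtracts nearly equal numbers."""
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


def rearranged_small_root(a, b, c):
    """Same root written as 2c / (-b - sqrt(...)); no cancellation for b > 0."""
    return (2 * c) / (-b - math.sqrt(b * b - 4 * a * c))


# Item 3: sample variance (sigma squared), two-pass versus one-pass.
def two_pass_variance(xs):
    n = len(xs)
    total = 0.0
    for x in xs:
        total += x
    mean = total / n
    squared_dev = 0.0
    for x in xs:
        squared_dev += (x - mean) * (x - mean)
    return squared_dev / (n - 1)


def one_pass_variance(xs):
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:  # plain loop, so no compensated summation is involved
        total += x
        total_sq += x * x
    mean = total / n
    return (total_sq - n * (mean * mean)) / (n - 1)


data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]  # hypothetical sample

# Item 4: (x + a) - x in single precision.
x32 = to_f32(123456.789)
a32 = to_f32(1.1)
sum32 = to_f32(x32 + a32)
diff32 = to_f32(sum32 - x32)

if __name__ == "__main__":
    print("item 1:", item1, "instead of", 2 * tiny)
    print("item 2:", naive_small_root(1.0, 1e8, 1.0),
          "versus", rearranged_small_root(1.0, 1e8, 1.0))
    print("item 3:", one_pass_variance(data), "versus", two_pass_variance(data))
    print("item 4:", f32_bits(x32), f32_bits(sum32), f32_bits(diff32), f32_bits(a32))
```

On IEEE 754 hardware the bit patterns printed for item 4 should match the hexadecimal digits under the braces in the derivation above, and the one-pass variance for this sample comes out negative while the two-pass value is `30`.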
