Floating-Point Arithmetic



  • When adding two floating-point numbers, the number with the smaller exponent is first shifted to match the exponent of the other, and the digits that no longer fit are rounded away. Consider two decimal numbers with `6` significant digits.
\begin{align} x&=1.92403 \times 10^2, \quad y=6.35782 \times 10^{-1} \\ \\
x+y&=(1.92403+0.00635782)\times 10^2 \\ &=  (1.92403+0.00636)\times 10^2 \qquad \text{(round to nearest)} \\ &= 1.93039\times 10^2
\end{align}
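
The alignment step above can be reproduced with Python's `decimal` module. This is only a sketch: the names `ctx` and `y_aligned` are illustrative, and the 6-digit context is an assumption that mirrors the precision used in the example. Note that `decimal` rounds the exact sum rather than rounding the shifted operand first; for these inputs both routes give the same result.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Assumed working precision: 6 significant decimal digits, round to nearest.
ctx = Context(prec=6, rounding=ROUND_HALF_EVEN)

x = Decimal("1.92403E+2")   # 192.403
y = Decimal("6.35782E-1")   # 0.635782

# Shift y to the exponent of x: the last digit of x sits at 10^-3,
# so y keeps only three decimal places (0.635782 -> 0.636).
y_aligned = y.quantize(Decimal("0.001"), context=ctx)

# Sum computed directly by the 6-digit context (exact sum, then rounded).
total = ctx.add(x, y)

print(y_aligned)              # 0.636
print(total)                  # 193.039
print(ctx.add(x, y_aligned))  # 193.039
```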

  • Cancellation: significant digits can be lost in the subtraction of two nearly equal numbers, in the addition/subtraction of a relatively large number and a relatively small number, and in the division by a small number.

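    1. For a positive `\epsilon` well below the machine epsilon (for example `\epsilon=10^{-17}` in double precision), both `1+\epsilon` and `1-\epsilon` round to `1`, so `(1+\epsilon)-(1-\epsilon)=1-1=0`, although the exact result is `2\epsilon`.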

    2. For the quadratic formula `\frac{-b \pm \sqrt{b^2-4ac}}{2a}`, when `b>0` and `b^2 \gg |ac|`, `\sqrt{b^2-4ac}` is close to `b`, so the `-b + \sqrt{b^2-4ac}` branch subtracts two nearly equal numbers and is numerically unstable.

    3. For performance, the standard deviation `\sigma`
\begin{align} \sigma= \sqrt{\frac{1}{n-1} \sum_{i=1}^n(x_i-\bar{x})^2}, \end{align}
    where `\bar{x}` is the mean of the `n` points `x_1`, `\cdots`, `x_n`, can be replaced by the single-pass form
\begin{align} \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^nx_i^2-n\bar{x}^2 \right)}. \end{align}
    However, the ` \left( \sum_{i=1}^nx_i^2-n\bar{x}^2 \right) ` part subtracts two nearly equal numbers when the mean is large relative to the spread, so it can be numerically unstable, and the computed value can even be negative.

    4. In single precision, for `a=1.1` and `x=123456.789`, `(x+a)-x` is not the same as `a`.
\begin{align} a=1.1 &\Rightarrow \underbrace{\color{plum }{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \   \underbrace{ \color{limegreen}{1}000}_{8} \  \underbrace{1100}_{C} \   \underbrace{1100}_{C}  \   \underbrace{1100}_{C} \   \underbrace{1100}_{C} \  \underbrace{1101}_{D}
\color{salmon}{: (1.f_1\cdots f_{23})2^0 \text{ form}}\\
x=123456.789 &\Rightarrow
\underbrace{\color{plum}{0}\color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \   \underbrace{\color{limegreen}{1}111}_{F} \  \underbrace{0001}_{1} \  \underbrace{0010}_{2}  \ \underbrace{0000}_{0} \  \underbrace{0110}_{6} \  \underbrace{0101}_{5}
\color{salmon}{: (1.f_1\cdots f_{23})2^{16} \text{ form}}
\end{align}
\begin{align} x+a = &\   
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1}111 \ 
0001 \  0010 \  0000 \  0110 \  0101 \\ + &\  
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1} \underbrace{\color{orangered}{000\  0000\  0000\  0000\  1}}_{\substack{\text{16 bits introduced by shifting} \\ \text{the } (1.f_1\cdots f_{23}) \text{ part}}} 000 \  1100 \  
\color{cadetblue}{\underbrace{1100\  1100\  1100\  1101}_{\text{loss}}} \\ \\  = &\   
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1}111 \ 
0001 \  0010 \  0000 \  0110 \  0101 \\ + &\  
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1}
\color{orangered}{000\  0000\  0000\  0000\  1} 000 \  110\color{red}{1} \quad \text{(round to nearest)} \\ \\  = &\ 
\underbrace{\color{plum}{0}\color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \   \underbrace{\color{limegreen}{1}111}_{F} \  \underbrace{0001}_{1} \  \underbrace{0010}_{2}  \  \underbrace{0000}_{0} \  \underbrace{1111}_{F} \  \underbrace{0010}_{2} \\ \\ \\
(x+a)-x = &\   
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1}111 \ 
0001 \  0010 \  0000 \  1111 \  0010 \\ - &\  
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1}
111\  0001\  0010\  0000\  0110 \  0101 \\ \\  = &\  
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1} \underbrace{\color{orangered}{000\  0000\  0000\  0000\  1}}_{\text{should be shifted}} 000 \  1101 \   \\ \\ =&\ 
\underbrace{\color{plum}{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \   \underbrace{\color{limegreen}{1}000}_{8} \  \underbrace{1101}_{D} \  \underbrace{0000}_{0}  \  \underbrace{0000}_{0} \  \underbrace{0000}_{0} \  \underbrace{0000}_{0} \\ \color{red}{\boldsymbol{\ne}}  &\
\underbrace{\color{plum}{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \   \underbrace{\color{limegreen}{1}000}_{8} \  \underbrace{1100}_{C} \  \underbrace{1100}_{C}  \  \underbrace{1100}_{C} \  \underbrace{1100}_{C} \  \underbrace{1101}_{D} = a
\end{align}
    In this example, `(x+a)-x \ne a`: the addition of a relatively large number and a relatively small number (the `x+a` part) discards the low-order bits of `a`, and the subtraction of two nearly equal numbers (the `(x+a)-x` part) exposes that loss. Therefore, this calculation is numerically unstable.
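
The four cases above can be checked with a short Python sketch. Everything named here is hypothetical and chosen only for illustration: the helpers `f32`, `f32_hex`, `small_root_naive`, `small_root_stable`, `variance_two_pass` and `variance_one_pass`, and the sample inputs (`tiny = 1e-17`, `b = 1e8`, data near `1e9`). Python floats are double precision, so single precision is emulated by rounding through `struct`. The stable root uses the conjugate rewrite `2c / (-b - sqrt(b^2-4ac))`, one common remedy and not necessarily the one intended in these notes. The variance helpers return the value under the square root, since the single-pass result may be negative.

```python
import math
import struct


def f32(value: float) -> float:
    """Round a double-precision value to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


def f32_hex(value: float) -> str:
    """Single-precision bit pattern of value as eight hex digits."""
    return struct.pack(">f", value).hex().upper()


# Case 1: a perturbation far below the machine epsilon is absorbed by 1.
tiny = 1e-17
absorbed = (1.0 + tiny) - (1.0 - tiny)  # the exact answer would be 2e-17


# Case 2: small-magnitude root of a*x^2 + b*x + c = 0 when b > 0.
def small_root_naive(a, b, c):
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


def small_root_stable(a, b, c):
    # Conjugate form: -b and -sqrt(...) have the same sign, so nothing cancels.
    return (2.0 * c) / (-b - math.sqrt(b * b - 4.0 * a * c))


# Case 3: two-pass versus single-pass sample variance (plain loops on purpose,
# so that the accumulation order is explicit).
def variance_two_pass(xs):
    n = len(xs)
    total = 0.0
    for v in xs:
        total += v
    mean = total / n
    acc = 0.0
    for v in xs:
        acc += (v - mean) * (v - mean)
    return acc / (n - 1)


def variance_one_pass(xs):
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for v in xs:
        total += v
        total_sq += v * v
    mean = total / n
    return (total_sq - n * (mean * mean)) / (n - 1)


data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]  # exact variance is 30

# Case 4: (x + a) - x in emulated single precision.
a = f32(1.1)
x = f32(123456.789)
x_plus_a = f32(x + a)
recovered = f32(x_plus_a - x)

print(absorbed)                                   # 0.0
print(small_root_naive(1.0, 1e8, 1.0), small_root_stable(1.0, 1e8, 1.0))
print(variance_two_pass(data), variance_one_pass(data))
print(f32_hex(a), f32_hex(x))                     # 3F8CCCCD 47F12065
print(f32_hex(x_plus_a), f32_hex(recovered))      # 47F120F2 3F8D0000
```

The hex patterns printed for case 4 match the bit-level derivation above, and the two-pass variance returns `30.0` for this data while the single-pass form does not.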

emoy.net
Mark