## Floating-Point Arithmetic

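• When adding two floating-point numbers, the one with the smaller exponent must first be shifted so that its exponent matches the other's, and any digits shifted beyond the available precision are rounded away. Consider two decimal numbers with six significant digits of precision.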
\begin{align} x&=1.92403 \times 10^2, \quad y=6.35782 \times 10^{-1} \\ \\
x+y&=(1.92403+0.00635782)\times 10^2 \\ &=  (1.92403+0.00636)\times 10^2 \qquad \text{(round to nearest)} \\ &= 1.93039\times 10^2
\end{align}
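
The alignment step above can be reproduced with Python's `decimal` module. This is only a sketch: the names `PREC`, `ctx`, `quantum`, and `y_aligned` are chosen here for illustration, and round-half-even is assumed as the "round to nearest" mode.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

PREC = 6  # six significant decimal digits, as in the example
ctx = Context(prec=PREC, rounding=ROUND_HALF_EVEN)

x = Decimal("1.92403E+2")
y = Decimal("6.35782E-1")

# Step 1: shift y to the exponent of x. Only digits down to the last
# digit position of x (10^-3 here) survive; the rest are rounded away.
quantum = Decimal(1).scaleb(x.adjusted() - (PREC - 1))
y_aligned = y.quantize(quantum, rounding=ROUND_HALF_EVEN)

# Step 2: add the aligned operands in 6-digit arithmetic.
total = ctx.add(x, y_aligned)

print(y_aligned)  # 0.636   (that is, 0.00636 x 10^2)
print(total)      # 193.039 (that is, 1.93039 x 10^2)
```

Calling `ctx.add(x, y)` directly rounds the exact sum rather than the shifted operand; for these inputs both routes give 193.039.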

• Cancellation (loss of significant digits) typically arises in the subtraction of nearly equal numbers, the addition/subtraction of a relatively large number and a relatively small number, and division by a small number.

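1. For a positive $\epsilon$ sufficiently far below the machine epsilon (for example $\epsilon = 2^{-55}$ in double precision), $(1+\epsilon)-(1-\epsilon)$ evaluates to $1-1=0$, although it should be $2\epsilon$ in exact arithmetic.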
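
A quick check of item 1 in double precision. The value `eps = 2**-55` is an assumption made for this sketch; it is chosen well below the double-precision machine epsilon $2^{-52}$, because $1 \pm 2^{-52}$ are themselves exactly representable and would not show the effect.

```python
import sys

eps_machine = sys.float_info.epsilon  # 2**-52 for IEEE 754 doubles
eps = 2.0 ** -55                      # assumed value, well below eps_machine

# Both 1 + eps and 1 - eps round to 1.0, so the difference collapses.
lost = (1.0 + eps) - (1.0 - eps)
exact = 2.0 * eps                     # the mathematically correct result

print(lost)           # 0.0
print(exact > 0.0)    # True
```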

2. For the quadratic formula $\frac{-b \pm \sqrt{b^2-4ac}}{2a}$, when $b>0$ and $b^2 \gg |ac|$, we have $\sqrt{b^2-4ac} \approx b$, so the $-b + \sqrt{b^2-4ac}$ branch subtracts two nearly equal numbers and is numerically unstable.
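
The instability in item 2 can be observed with the sketch below. The rearranged version is a commonly used remedy, not something taken from these notes: it computes the large-magnitude root without cancellation and recovers the other one from the product of the roots, $c/a$. The function names and the test coefficients are hypothetical, and real roots are assumed.

```python
import math

def roots_naive(a, b, c):
    """Textbook formula; one branch cancels when b*b >> |a*c|."""
    d = math.sqrt(b * b - 4.0 * a * c)
    return (-b + d) / (2.0 * a), (-b - d) / (2.0 * a)

def roots_stable(a, b, c):
    """Rearranged formula: b and copysign(d, b) share a sign, so no cancellation."""
    d = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(d, b))
    # q / a is the large-magnitude root; c / q follows from x1 * x2 = c / a.
    return c / q, q / a

# Assumed test case: x^2 + 1e8 x + 1 = 0, whose small root is close to -1e-8.
a, b, c = 1.0, 1.0e8, 1.0
naive_small, _ = roots_naive(a, b, c)
stable_small, _ = roots_stable(a, b, c)

print(naive_small)   # visibly off from -1e-8
print(stable_small)  # close to -1e-8
```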

3. For performance, the standard deviation $\sigma$,
\begin{align} \sigma= \sqrt{\frac{1}{n-1} \sum_{i=1}^n(x_i-\bar{x})^2}, \end{align}
where $\bar{x}$ is the mean of the $n$ points $x_1, \cdots, x_n$, can be replaced by
\begin{align} \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^nx_i^2-n\bar{x}^2 \right)}. \end{align}
However, the $\sum_{i=1}^n x_i^2-n\bar{x}^2$ part subtracts two nearly equal numbers when the spread of the data is small relative to its mean, so it can be numerically unstable and can even become negative.
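
A small sketch comparing the two formulas in item 3. The data set is an assumption chosen for illustration (values near $10^9$ with a spread of about 10), and the functions return the variance $\sigma^2$ rather than $\sigma$ so that a negative result does not make `math.sqrt` raise an error.

```python
def variance_two_pass(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_one_pass(xs):
    """(sum of x_i^2 - n * mean^2) / (n - 1); cancels when mean^2 >> variance."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean * mean) / (n - 1)

# Assumed data: the deviations from the mean are -6, -3, 3, 6.
data = [1.0e9 + 4.0, 1.0e9 + 7.0, 1.0e9 + 13.0, 1.0e9 + 16.0]

two_pass = variance_two_pass(data)
one_pass = variance_one_pass(data)

print(two_pass)  # 30.0
print(one_pass)  # far from 30.0; the squares are only accurate to about 1e2
```

Each square is about $10^{18}$, where neighboring doubles are 128 apart, so the true value 90 of the bracketed sum cannot survive the subtraction.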

4. In single precision, for $a=1.1$ and $x=123456.789$, $(x+a)-x$ may not be the same as $a$.
\begin{align} a=1.1 &\Rightarrow \underbrace{\color{plum}{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \   \underbrace{ \color{limegreen}{1}000}_{8} \  \underbrace{1100}_{C} \   \underbrace{1100}_{C}  \   \underbrace{1100}_{C} \   \underbrace{1100}_{C} \  \underbrace{1101}_{D}
\color{salmon}{: (1.f_1\cdots f_{23})2^0 \text{ form}}\\
x=123456.789 &\Rightarrow
\underbrace{\color{plum}{0}\color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \   \underbrace{\color{limegreen}{1}111}_{F} \  \underbrace{0001}_{1} \  \underbrace{0010}_{2}  \ \underbrace{0000}_{0} \  \underbrace{0110}_{6} \  \underbrace{0101}_{5}
\color{salmon}{: (1.f_1\cdots f_{23})2^{16} \text{ form}}
\end{align}
\begin{align} x+a = &\
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1}111 \
0001 \  0010 \  0000 \  0110 \  0101 \\ + &\
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1} \underbrace{\color{orangered}{000\  0000\  0000\  0000\  1}}_{\substack{\text{16 bits introduced by} \\ \text{shifting the } (1.f_1\cdots f_{23}) \text{ part}}} 000 \  1100 \
\color{cadetblue}{\underbrace{1100\  1100\  1100\  1101}_{\text{loss}}} \\ \\  = &\
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1}111 \
0001 \  0010 \  0000 \  0110 \  0101 \\ + &\
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1}
\color{orangered}{000\  0000\  0000\  0000\  1} 000 \  110\color{red}{1} \quad \text{(round to nearest)} \\ \\  = &\
\underbrace{\color{plum}{0}\color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \   \underbrace{\color{limegreen}{1}111}_{F} \  \underbrace{0001}_{1} \  \underbrace{0010}_{2}  \  \underbrace{0000}_{0} \  \underbrace{1111}_{F} \  \underbrace{0010}_{2} \\ \\ \\
(x+a)-x = &\
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1}111 \
0001 \  0010 \  0000 \  1111 \  0010 \\ - &\
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1}
111\  0001\  0010\  0000\  0110 \  0101 \\ \\  = &\
\color{plum}{0}\color{limegreen}{100} \
\color{limegreen}{0111} \  \color{limegreen}{1} \underbrace{\color{orangered}{000\  0000\  0000\  0000\  1}}_{\text{should be shifted}} 000 \  1101 \   \\ \\ =&\
\underbrace{\color{plum}{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \   \underbrace{\color{limegreen}{1}000}_{8} \  \underbrace{1101}_{D} \  \underbrace{0000}_{0}  \  \underbrace{0000}_{0} \  \underbrace{0000}_{0} \  \underbrace{0000}_{0} \\ \color{red}{\boldsymbol{\ne}}  &\
\underbrace{\color{plum}{0}\color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \   \underbrace{\color{limegreen}{1}000}_{8} \  \underbrace{1100}_{C} \  \underbrace{1100}_{C}  \  \underbrace{1100}_{C} \  \underbrace{1100}_{C} \  \underbrace{1101}_{D} = a
\end{align}
In this example, $(x+a)-x \ne a$: rounding error is introduced by the addition of a relatively large number and a relatively small number (the $x+a$ part) and is then exposed by the subtraction of nearly equal numbers (the $(x+a)-x$ part). Therefore, this calculation is numerically unstable.
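
The bit patterns above can be checked with the standard `struct` module, which rounds a Python float (a double) to IEEE 754 single precision when packing with the `f` format. The helpers `f32` and `bits` are names introduced here for illustration; rounding after every operation is used as a stand-in for native single-precision arithmetic.

```python
import struct

def f32(value):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def bits(value):
    """Big-endian hex string of the single-precision bit pattern."""
    return struct.pack(">f", value).hex().upper()

a = f32(1.1)
x = f32(123456.789)
s = f32(x + a)   # the low 16 bits of a's significand are lost here
d = f32(s - x)   # the subtraction exposes that loss

print(bits(a), bits(x), bits(s), bits(d))
# 3F8CCCCD 47F12065 47F120F2 3F8D0000
print(d)         # 1.1015625
```

The result 1.1015625 agrees with $a$ only in the leading 7 fraction bits that survived the shift by 16 places.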
