**Floating-Point Number**

**1. System Format**

Suppose that `beta` is the radix, or base, `p` is the precision, and `[L, U]` is the range of the exponent `E`. Then any `x in mathbb{R}` can be written as

\begin{align}
x=\pm (d_0.d_1 d_2 \cdots d_{p-1})_{\beta} \beta^E = \pm \left( d_0 + \frac{d_1}{\beta}+ \frac{d_2}{\beta^2}+ \cdots + \frac{d_{p-1}}{\beta^{p-1}} \right) \beta^E
\end{align}

where `d_i` is an integer in `[0, beta - 1]`.

- `p`-digit base-`beta` number `d_0 d_1 cdots d_{p-1}`: *mantissa*, or *significand*
- `d_1 cdots d_{p-1}` of the mantissa: *fraction*
- `E`: *exponent*, or *characteristic*
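
As a quick sanity check, here is a minimal Python sketch (the function name `from_digits` is just illustrative, not from any library) that evaluates the expansion above term by term:

```python
# Evaluate x = +/-(d_0.d_1...d_{p-1})_beta * beta**E term by term.
def from_digits(sign, digits, beta, E):
    mantissa = sum(d / beta**i for i, d in enumerate(digits))  # d_0 + d_1/beta + ...
    return sign * mantissa * beta**E

# (1.01)_2 * 2**3 = (1 + 0/2 + 1/4) * 8 = 10.0
print(from_digits(+1, [1, 0, 1], beta=2, E=3))  # 10.0
```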

**2. Normalization**

Any nonzero `x in mathbb{R}` can be normalized so that `d_0 ne 0`, which puts the mantissa `m` in `[1, beta)`. This normalization is unique and wastes no digits on leading zeros. In particular, `d_0` is always `1` when `beta=2`, so it does not have to be stored, saving one more bit.
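
A small sketch of this in Python, using `math.frexp` (which returns `x = m * 2**e` with `m in [0.5, 1)`) and shifting into the `[1, 2)` convention used here:

```python
import math

# Normalize a nonzero float to m * 2**E with m in [1, 2),
# so that d_0 is always 1 when beta = 2 (the "hidden bit").
def normalize(x):
    m, e = math.frexp(x)   # x = m * 2**e, 0.5 <= |m| < 1
    return m * 2, e - 1    # shift one place: 1 <= |m| < 2

print(normalize(10.0))     # (1.25, 3): 10.0 = (1.01)_2 * 2**3
```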

- The number of normalized floating-point numbers is

\begin{align}
\underbrace{2}_{\pm} \times \underbrace{(\beta - 1)}_{d_0 \ne 0} \times \underbrace{\beta^{p-1}}_{d_1 \sim d_{p-1}} \times \underbrace{(U-L+1)}_{E} + \underbrace{1}_{\text{zero}}
\end{align}

- The smallest positive `x` is `(1.0 cdots 0)_{beta}beta^L=beta^L`
- The largest `x` is `((beta-1).(beta-1) cdots (beta-1))_{beta}beta^U=(1-beta^{-p})beta^{U+1}`

- In general, floating-point numbers are not uniformly distributed, but they are uniformly distributed within each interval `[beta^E, beta^{E+1})` for `E in mathbb{Z}`. In this range, the gap between consecutive representable numbers is `(0.0 cdots 1)_{beta}beta^E=beta^{1-p}beta^E=beta^{E-p+1}`. Moving up to `[beta^{E+1}, beta^{E+2})` multiplies this gap by `beta`, as the sketch below shows.
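
This spacing behavior can be observed directly in Python (3.9+) with `math.ulp`, which returns the gap from a positive number to the next representable double (`beta=2`, `p=53`):

```python
import math

# The gap is constant within [2**E, 2**(E+1)) and doubles as E grows by 1.
print(math.ulp(1.0))   # 2**-52 ~ 2.22e-16, spacing in [1, 2)
print(math.ulp(2.0))   # 2**-51, spacing in [2, 4): beta times larger
print(math.ulp(4.0))   # 2**-50, spacing in [4, 8)
```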

**3. Subnormal (Denormal) Numbers**

Looking at the numbers the floating-point system represents, there is an empty gap in `(0, beta^L)` between zero and the smallest normalized number. This gap can be filled with the same spacing `beta^{L-p+1}` that the system uses in `[beta^L, beta^{L+1})`, by allowing `d_0=0` with a nonzero fraction: numbers of the form `pm(0.d_1 cdots d_{p-1})_{beta}beta^L`, under conditions made precise in the IEEE format below.
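
Python doubles behave this way, as this small sketch shows: dividing the smallest normalized number by two yields a subnormal rather than zero, and the smallest positive subnormal is `2^{-1074}`:

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022 ~ 2.2e-308
print(smallest_normal / 2)             # subnormal, not 0.0
print(5e-324)                          # smallest positive subnormal, 2**-1074
print(5e-324 / 2)                      # ties to even: underflows to 0.0
```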

**4. Rounding**

A number that the floating-point system can represent exactly is called a *machine number*. Any other number must be rounded to one. There are rules for rounding, such as chopping (truncation) and round-to-nearest. Here are some examples of both rules when `p=2`.

\begin{align}
\begin{matrix}
\text{number} & \text{chop} & \text{round-to-nearest} \\
1.649 & 1.6 & 1.6 \\
1.650 & 1.6 & 1.6 \\
1.651 & 1.6 & 1.7 \\
1.699 & 1.6 & 1.7 \\
\end{matrix} \qquad
\begin{matrix}
\text{number} & \text{chop} & \text{round-to-nearest} \\
1.749 & 1.7 & 1.7 \\
1.750 & 1.7 & 1.8 \\
1.751 & 1.7 & 1.8 \\
1.799 & 1.7 & 1.8 \\
\end{matrix}
\end{align}

Round-to-nearest is also known as *round-to-even* because, in case of a tie, it rounds to the number whose last digit is even. This rule is the most accurate and unbiased, but also the most expensive; the IEEE standard system uses round-to-nearest as the default rule, as the sketch below reproduces.
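
The `p=2` table above can be reproduced with Python's `decimal` module, where `ROUND_DOWN` plays the role of chopping and `ROUND_HALF_EVEN` is round-to-nearest with ties to even:

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

for s in ["1.649", "1.650", "1.651", "1.749", "1.750"]:
    x = Decimal(s)
    chop = x.quantize(Decimal("0.1"), rounding=ROUND_DOWN)       # chopping
    near = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)  # ties to even
    print(s, chop, near)  # e.g. 1.650 -> 1.6 1.6 and 1.750 -> 1.7 1.8
```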

**5. Machine Precision**

The floating-point system can be measured by the machine precision, machine epsilon, or unit roundoff, denoted by `epsilon_{mach}`. It is the smallest number such that `1+epsilon_{mach}>1`. Since the gap between consecutive machine numbers in `[1, beta)` is `beta^{1-p}` (because `E=0` there), `epsilon_{mach}=beta^{1-p}` with chopping and `epsilon_{mach}=frac{1}{2}beta^{1-p}` with round-to-nearest.

Now consider a machine number `x`. Every real number within one gap of `x` (for chopping) or within half a gap (for round-to-nearest) rounds to `x`, so

\begin{align}
|\text{relative error}| \leq
\begin{cases}
\left| \frac{\beta^{E-p+1}}{x} \right| = \frac{\beta^{E-p+1}}{(d_0.d_1 \cdots d_{p-1})_{\beta}\beta^E} \leq \beta^{1-p} \quad \text{(chopping)} \\ \\
\left| \frac{\frac{1}{2}\beta^{E-p+1}}{x} \right| = \frac{\frac{1}{2}\beta^{E-p+1}}{(d_0.d_1 \cdots d_{p-1})_{\beta}\beta^E} \leq \frac{1}{2}\beta^{1-p} \quad \text{(round-to-nearest)}
\end{cases}
\end{align}

That is, `|\text{relative error}| leq epsilon_{mach}` in both cases.
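
A concrete check in Python (`beta=2`, `p=53`, so `epsilon_{mach}=2^{-53}` under round-to-nearest): the relative error of storing `0.1`, which is not a machine number in binary, stays below the bound:

```python
import sys
from fractions import Fraction

eps_mach = sys.float_info.epsilon / 2   # (1/2) * beta**(1-p) = 2**-53

exact = Fraction(1, 10)    # the real number 0.1
stored = Fraction(0.1)     # the machine number fl(0.1), converted exactly
rel_err = abs(stored - exact) / exact
print(float(rel_err))                  # ~5.55e-17
print(float(rel_err) <= eps_mach)      # True
```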

**6. IEEE Floating-Point Format**

This system has `beta=2`, `p=24`, `L=-126`, and `U=127` for 32-bit floating-point numbers.

\begin{align}
\begin{cases}
1 \leq E \leq 254 \Rightarrow \pm (1.d_1 \cdots d_{23})_2 2^{E-127} \quad \color{limegreen}{\text{(normalized)}} \\ \\
E=0 \quad
\begin{cases}
\text{mantissa} \ne 0 \Rightarrow \pm (0.d_1 \cdots d_{23})_2 2^{-126} \quad \color{plum}{\text{(subnormal)}} \\
\text{mantissa} = 0 \Rightarrow \pm 0
\end{cases} \\ \\
E=255 \quad
\begin{cases}
\text{mantissa} \ne 0 \Rightarrow \text{NaN} \\
\text{mantissa} = 0 \Rightarrow \pm \infty
\end{cases}
\end{cases}
\end{align}
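
These three cases can be inspected bit by bit with Python's `struct` module (the helper `decode32` is just for illustration):

```python
import struct

def decode32(x):
    """Split a single-precision float into (sign, biased exponent E, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31            # 1 bit
    E = (bits >> 23) & 0xFF      # 8 bits, 0..255
    frac = bits & 0x7FFFFF       # 23 bits: d_1 ... d_23
    return sign, E, frac

print(decode32(1.0))           # (0, 127, 0): +(1.0...0)_2 * 2**(127-127)
print(decode32(-2.5))          # (1, 128, 2097152): -(1.01)_2 * 2**1
print(decode32(float("inf")))  # (0, 255, 0)
```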

- The smallest positive number is

\begin{align}
\begin{cases}
(1.0\cdots 0)_2 2^{-126} \approx 1.2\times 10^{-38} \quad \color{limegreen}{\text{(normalized)}} \\
(0.0\cdots 1)_2 2^{-126} = 2^{-23} 2^{-126} = 2^{-149} \approx 1.4\times 10^{-45} \quad \color{plum}{\text{(subnormal)}}
\end{cases}
\end{align}

- The largest number is `(1.1 cdots 1)_2 2^{127}=(1-2^{-24})2^{128} approx 3.4 times 10^{38}`.
- The machine epsilon `epsilon_{mach}` is

\begin{align}
\frac{1}{2}\beta^{1-p}=\frac{1}{2}2^{1-24}=2^{-24}\approx 6\times 10^{-8}
\end{align}

since the IEEE standard system uses round-to-nearest as the default rounding rule. This corresponds to about `7` significant decimal digits of precision:

\begin{align}
\log \epsilon_{mach} &= \log 2^{-24} \approx -24\times 0.3010 = -8+\alpha, \quad \alpha \in [0, 1) \\
\Rightarrow \epsilon_{mach} &= 2^{-24} = 10^{-8+\alpha} < 10^{-7}
\end{align}
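
These single-precision constants can be confirmed with NumPy, assuming it is available (note that `np.finfo` reports the gap `beta^{1-p}=2^{-23}` at `1.0`, so the unit roundoff derived above is half of it):

```python
import numpy as np

f32 = np.finfo(np.float32)
print(f32.eps)       # 2**-23 ~ 1.19e-07, gap at 1.0
print(f32.eps / 2)   # eps_mach = 2**-24 ~ 5.96e-08
print(f32.tiny)      # 2**-126 ~ 1.18e-38, smallest normalized
print(f32.max)       # (1 - 2**-24) * 2**128 ~ 3.40e+38
```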
