Common Floating Point Representations

Number Bases
General Floating-Point Representations
"The Most Important Fact" About Floating-Point Number Systems
Tables of Common Floating-Point Representations
Sources of IEEE software and information

Number Bases

There are a variety of number bases. The most popular, base 10, corresponds to the number of fingers on a typical pair of human hands. Computers, however, have a natural base of 2, corresponding to on or off, high voltage or low. Representing a large number in binary often requires a long string of 1s and 0s that is cumbersome to write or work with on paper. The octal and hexadecimal representations serve as a compromise between humans and computers: each octal digit stands for three bits and each hexadecimal digit for four, so humans can write numbers compactly in a form that translates directly to the computer's binary representation.

	Base	Example (ten)	Example (one tenth)
Decimal	10	10	.1
Binary	2	1010₂	.0~~0011~~₂
Octal	8	12₈	.0~~6314~~₈
Hexadecimal	16	A₁₆	.1~~9~~₁₆

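A rational number has a non-repeating fractional part only if its denominator (in lowest terms) contains no prime factors other than those of the base. One tenth has the factor 5 in its denominator, so it terminates in base 10 but repeats in bases 2, 8, and 16. In the above example, repeating digits are shown in strike-through font.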
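The expansions in the table can be reproduced with a short Python sketch. The helper name `fraction_digits` is hypothetical and not part of the original page; it uses exact rational arithmetic so that the digits themselves are not affected by rounding.

```python
from fractions import Fraction

DIGITS = "0123456789ABCDEF"

def fraction_digits(value, base, count):
    """Return the first `count` fractional digits of `value` (0 <= value < 1) in `base`."""
    digits = []
    for _ in range(count):
        value *= base
        digit = int(value)          # the integer part is the next digit
        digits.append(DIGITS[digit])
        value -= digit              # keep only the fractional remainder
    return "".join(digits)

tenth = Fraction(1, 10)
binary = fraction_digits(tenth, 2, 9)
octal = fraction_digits(tenth, 8, 9)
hexadecimal = fraction_digits(tenth, 16, 5)
print(binary, octal, hexadecimal)   # 000110011 063146314 19999
```

The repeating groups 0011, 6314, and 9 are visible in the printed digits.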

The following Fortran code demonstrates one of the pitfalls of floating-point representation: it is only an approximation, so the programmer must be aware that the computed result may differ from the expected result.

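    program tenth_sum
      implicit none
      real, parameter :: zero = 0.0, one = 1.0, tenth = 0.1
      real :: a, x, error
      integer :: i
      a = tenth
      x = zero
      ! Add one tenth ten times; exact arithmetic would give x = 1.
      do i = 1, 10
        x = x + tenth
      end do
      error = x - one
      print *, 'a =', a, ' x =', x, ' error =', error
    end program tenth_sum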

These results were produced on a Sun workstation using the 32-bit IEEE representation.

a	=	0.10000000
x	=	1.00000012
error	=	1.1920929E-07

The expected result is zero. However, the computed result is small but definitely not zero: it equals 2^-23, the spacing between 1 and the next larger 32-bit IEEE value.
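The same experiment can be sketched in Python. Python floats are 64-bit, so this sketch assumes that rounding each intermediate sum to 32 bits with the standard `struct` module is an adequate stand-in for single-precision `REAL` arithmetic. The helper `to_single` is a hypothetical name introduced for illustration.

```python
import struct

def to_single(value):
    """Round a Python float (64-bit) to the nearest 32-bit IEEE value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

tenth = to_single(0.1)
x = 0.0
for _ in range(10):
    # Round after every addition, as 32-bit arithmetic would.
    x = to_single(x + tenth)
error = x - 1.0

print(error == 2.0 ** -23)   # True
```

The error is exactly one unit in the last place of 1.0 in the 32-bit format, which matches the value 1.1920929E-07 reported above.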

General Floating-Point Representations

The general floating-point representation can be summarized as:

$$x = \pm\, \beta^{e-b} \times 0.d_1 d_2 \ldots d_p$$

where

$\pm$	=	sign bit ($s$)
$\beta$	=	base or radix
$d_n$	=	digit ($0 \le d_n \le \beta - 1$)
		($d_1 \ne 0$)
$d$	=	$0.d_1 d_2 \ldots d_p$ (mantissa)
$p$	=	precision
$m$	=	minimum exponent
$M$	=	maximum exponent
$e - b$	=	exponent - bias ($m \le e - b \le M$)

where the following stipulation defines a normalized representation

$$\beta^{-1} \le d < 1$$

and the special values

$$\sigma = \beta^{m-1} = \beta^{m} \times 0.1_\beta$$

$$\lambda = \beta^{M} \left(1 - \beta^{-p}\right)$$

where $\sigma$ is the minimum positive value and $\lambda$ is the maximum positive value.

Example: β=2, p=3, m=-1, M=1
The only allowed non-negative values are:
0, σ=.25, .3125, .375, .4375, .5, .625, .75, .875, 1., 1.25, 1.5, λ=1.75
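The list of allowed values can be generated from the four parameters. This is a minimal sketch, assuming the normalized form with a nonzero leading digit; the function name `toy_floats` and the parameter names `beta`, `p`, `m`, `M` are chosen here for illustration.

```python
def toy_floats(beta=2, p=3, m=-1, M=1):
    """List the non-negative values of a normalized toy floating-point system."""
    values = {0.0}
    for exponent in range(m, M + 1):
        # Normalized mantissas 0.d1...dp with d1 != 0 are k / beta**p
        # for k from beta**(p-1) up to beta**p - 1.
        for k in range(beta ** (p - 1), beta ** p):
            values.add(k / beta ** p * float(beta) ** exponent)
    return sorted(values)

values = toy_floats()
smallest_positive = values[1]    # 0.25
largest = values[-1]             # 1.75
print(values)
```

With the default parameters the sketch yields 13 values: zero plus four mantissas for each of the three exponents.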

The assumption is made that each basic numeric operation (+, -, *, /) on any two floating-point values yields the floating-point value nearest to the "exact" result.

Take, for example, the values .875 and 1.25 from above. Their product is .875 * 1.25 = 1.09375; however, this value does not exist within the floating-point representation. The closest representable value is 1.0, which will (or should) be returned as the result of this operation.

If x is the exact result of a basic floating-point operation (+, -, *, /), and $\sigma$ and $\lambda$ are the minimum and maximum positive values, then the following assumptions hold:

If $\sigma \le |x| \le \lambda$ then x will be set to the nearest allowable floating-point representation.
If $|x| < \sigma$ then the result is 0, which gives a "silent and nondestructive underflow."
If $|x| > \lambda$ then the calculation is terminated with a "fatal overflow."
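The three rules can be modeled for the toy system with base 2, precision 3, and exponents from -1 to 1. This is only a sketch: the function name `fl` is hypothetical, the list of positive values is copied from the example above, ties are broken toward the smaller magnitude (an arbitrary choice the text does not specify), and a Python `OverflowError` stands in for the "fatal overflow."

```python
import math

# Positive values of the toy system (base 2, precision 3, exponents -1..1).
POSITIVE = [0.25, 0.3125, 0.375, 0.4375, 0.5, 0.625,
            0.75, 0.875, 1.0, 1.25, 1.5, 1.75]

def fl(x):
    """Map an exact result x onto the toy floating-point system."""
    smallest, largest = POSITIVE[0], POSITIVE[-1]
    if abs(x) < smallest:
        return 0.0                                  # silent underflow
    if abs(x) > largest:
        raise OverflowError("fatal overflow")       # calculation terminates
    # min() returns the first of two equally near values (the smaller one).
    nearest = min(POSITIVE, key=lambda v: abs(v - abs(x)))
    return math.copysign(nearest, x)

product = fl(0.875 * 1.25)
print(product)   # 1.0
```

Calling `fl(0.2)` returns 0.0, and `fl(2.0)` raises the overflow error.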

"The Most Important Fact" About Floating-Point Number Systems

The spacing between a floating-point number x and an adjacent floating-point number is at least $|x|\,\epsilon/\beta$ and at most $|x|\,\epsilon$ (unless x or the neighbor is 0), where $\epsilon$ is the machine epsilon and $\beta$ is the base.

Therefore, the floating-point representation corresponds to a discrete and finite set of points on the real number line, which gets denser nearer the origin until it reaches the minimum positive value $\sigma$. The relative spacing is proportional to $\epsilon$.
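The spacing bounds can be checked on the toy system with base 2 and precision 3. This sketch assumes the machine epsilon is the base raised to the power (1 - precision), which is 0.25 here; the text does not state that definition explicitly, and the constant and function names are illustrative.

```python
BETA, P = 2, 3
EPS = BETA ** (1 - P)    # assumed definition of machine epsilon: 0.25

# Positive values of the toy system (base 2, precision 3, exponents -1..1).
POSITIVE = [0.25, 0.3125, 0.375, 0.4375, 0.5, 0.625,
            0.75, 0.875, 1.0, 1.25, 1.5, 1.75]

def spacing_within_bounds(x, neighbor):
    """Check |x|*EPS/BETA <= |neighbor - x| <= |x|*EPS."""
    gap = abs(neighbor - x)
    return abs(x) * EPS / BETA <= gap <= abs(x) * EPS

checks = []
for lower, upper in zip(POSITIVE, POSITIVE[1:]):
    checks.append(spacing_within_bounds(lower, upper))   # neighbor above
    checks.append(spacing_within_bounds(upper, lower))   # neighbor below
all_ok = all(checks)
print(all_ok)   # True
```

The absolute gaps double at each power of two (.0625, .125, .25), while the gap relative to x stays between EPS/BETA and EPS.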

Tables of Common Floating-Point Representations

All the representations shown are radix 2. They correspond to Cray parallel vector processors (PVPs), Digital Equipment Corporation (DEC) VAX systems running VMS, and the IEEE representation commonly found on workstations and Linux systems. The first table gives the single-precision representation, and the second table gives the double-precision representation.

Fortran `REAL` and C `float` (Cray `double`)
REAL	CRAY	VAX	IEEE

length (bits)	64	32	32
sign bit (s)	yes	yes	yes
exponent (bits)	15	8	8
exponent bias (b)	3FFF	7F	7F
fraction (bits)	48	23	23
hidden bit normalization	no	yes	yes
range low (σ)	3.67x10^-2466	2.93x10^-39	1.175x10^-38
	2^-8189	2^-128	2^-126
range high (λ)	2.73x10^2465	1.701x10^38	3.403x10^38
	2^8190	2^127	2^128
machine epsilon (ε)	7.11x10^-15	5.96x10^-8	1.19x10^-7
digits accuracy	14	7	7

Fortran `DOUBLE PRECISION` and C `double` (Cray `long double`)
DOUBLE PRECISION	CRAY	VAX	IEEE

length (bits)	128	64	64
sign bit (s)	yes	yes	yes
exponent (bits)	15	8	11
exponent bias (b)	3FFF	7F	3FF
fraction (bits)	96	55	52
hidden bit normalization	no	yes	yes
range low (σ)	3.67x10^-2466	2.93x10^-39	2.23x10^-308
	2^-8189	2^-128	2^-1022
range high (λ)	2.73x10^2465	1.701x10^38	1.80x10^308
	2^8190	2^127	2^1024
machine epsilon (ε)	2.52x10^-29	1.39x10^-17	2.22x10^-16
digits accuracy	29	17	16
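The IEEE column of the double-precision table can be compared against a running system. The sketch below assumes the Python build stores `float` as a 64-bit IEEE value, which is the usual case but is platform dependent; `sys.float_info` is the standard-library record of the underlying C `double` limits.

```python
import sys

info = sys.float_info

range_low = info.min          # smallest positive normalized value
range_high = info.max         # largest finite value, just under 2**1024
epsilon = info.epsilon        # spacing between 1.0 and the next larger value
significand_bits = info.mant_dig   # 52 stored fraction bits plus the hidden bit

print(range_low, range_high, epsilon, significand_bits)
```

On such a system the values agree with the table: the range low is 2^-1022, the machine epsilon is 2^-52, and the range high is one rounding step below 2^1024.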

The VAX and IEEE formats use "hidden bit" normalization: the first digit of the mantissa is assumed to be 1 and does not need to be stored. The IEEE normalization is slightly different, as noted below:

Cray	(-1)^s x 2^(e-b) x .f
VAX	(-1)^s x 2^(e-b) x .1f
IEEE	(-1)^s x 2^(e-b) x 1.f
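The IEEE formula can be applied by hand to a 32-bit pattern. This is a minimal sketch for normalized single-precision values only, using the field widths and bias (7F) from the single-precision table; the function name `decode_ieee_single` is hypothetical.

```python
import struct

def decode_ieee_single(bits):
    """Evaluate (-1)^s x 2^(e-b) x 1.f for a normalized 32-bit IEEE pattern."""
    s = bits >> 31                 # 1 sign bit
    e = (bits >> 23) & 0xFF        # 8 exponent bits
    f = bits & 0x7FFFFF            # 23 fraction bits
    if e == 0 or e == 0xFF:
        raise ValueError("zero, denormal, Inf, or NaN: not a normalized value")
    mantissa = 1 + f / 2 ** 23     # restore the hidden bit: 1.f
    return (-1) ** s * 2.0 ** (e - 0x7F) * mantissa

# Bit pattern of the single-precision value nearest to 0.1.
bits = struct.unpack(">I", struct.pack(">f", 0.1))[0]
value = decode_ieee_single(bits)
print(hex(bits), value)
```

The decoded value is slightly larger than 0.1, which is the approximation behind the summation error shown earlier.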

The IEEE format also defines:

Inf	:	e = all ones and f = 0
NaN	:	e = all ones and f ≠ 0
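These two rules translate directly into a test on the exponent and fraction fields. The sketch below assumes the 32-bit single-precision layout (8 exponent bits, 23 fraction bits); the helper names `single_bits` and `classify_single` are illustrative.

```python
import struct

def single_bits(value):
    """Return the 32-bit IEEE pattern of a Python float as an integer."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def classify_single(bits):
    """Classify a 32-bit IEEE pattern as 'Inf', 'NaN', or 'finite'."""
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if e == 0xFF:                      # exponent field is all ones
        return "Inf" if f == 0 else "NaN"
    return "finite"

kinds = [classify_single(single_bits(v))
         for v in (float("inf"), float("-inf"), float("nan"), 1.0)]
print(kinds)   # ['Inf', 'Inf', 'NaN', 'finite']
```

The sign bit plays no part in the test, so both infinities are classified the same way.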

Sources of IEEE software and information

SoftFloat - freely available C code that implements the IEC/IEEE floating-point representation in software.
Sun Microsystems' Numerical Computation Guide, which covers the IEEE Standard in detail.

Brought to you by: R.K. Owen, Ph.D.
This page is http://rkowen.owentrek.com/howto/fltpt/index.html