E:\resonance-work\resowrk\2016\Jan-2016\1-generaleditorial.pmd

RESONANCE ¨ January 2016

GENERAL ¨ ARTICLE

IEEE Standard for Floatin

Point Numbers

V Rajaraman

Keywords

Floating point numbers, round-

ing, decimal floating point num-

bers, IEEE 754-2008 Standard.

V Rajaraman is at the

Indian Institute of Science,

Ben

aluru. Several

generations of scientists

and en

ineers in India

have learnt com

uter

science usin

his lucidl

written textbooks on

programming and

computer fundamentals.

His current research

interests are

arallel

com

utin

and histor

com

utin

Floating point numbers are an important data type in compu-

tation which is used extensively. Yet, many users do not know

the standard which is used in almost all computer hardware

to store and process these. In this article, we explain the

standards evolved by The Institute of Electrical and Elec-

tronic Engineers in 1985 and augmented in 2008 to represent

floating point numbers and process them. This standard is

now used by all computer manufacturers while designing

floating point arithmetic units so that programs are portable

among computers.

Introductio

There are two types of arithmetic which are performed in comput-

ers: integer arithmetic and real arithmetic. Integer arithmetic is

simple. A decimal number is converted to its binary equivalent

and arithmetic is performed using binary arithmetic operations.

The largest positive integer that may be stored in an 8-bit byte is

+127, if 1 bit is used for sign. If 16 bits are used, the largest

ositive integer is +32767 and with 32 bits, it is +2147483647,

quite large! Integers are used mostly for counting. Most scientific

computations are however performed using real numbers, that is,

numbers with a fractional part. In order to represent real numbers

in computers, we have to ask two questions. The first is to decide

how many bits are needed to encode real numbers and the secon

is to decide how to represent real numbers using these bits.

Normally, in numerical computation in science and engineering,

one would need at least 7 to 8 significant digits precision. Thus,

the number of bits needed to encode 8 decimal digits is approxi-

mately 26, as log

10 = 3.32 bits are needed on the average to

encode a digit. In computers, numbers are stored as a sequence of

8-bit bytes. Thus 32 bits (4 bytes) which is bigger than 26 bits is

a logical size to use for real numbers. Given 32 bits to encode real

12 RESONANCE ¨January 2016

GENERAL

ARTICLE

numbers, the next question is how to break it up into an integer

art and a fractional part. One method is to

divide the 32 bits into

2 parts, one part to represent an integer part of the number and the

other the fractional part as shown in Figure 1.

In this figure, we have arbitrarily fixed the (virtual) binary point

etween bits 7 and 8. With this representation, called fixed point

representation, the largest and the smallest positive binary num-

ers that may be represented using 32 bits are

Binary Floating Point Numbers

This range of real numbers, when fixed point representation is

used, is not sufficient in many practical problems. Therefore,

another representation called normalized floating point represen-

tation is used for real numbers in computers. In this representa-

tion, 32 bits are divided into two parts: a part called the mantissa

with its sign and the other called the exponent with its sign. The

mantissa represents fractions with a non-zero leading bit and the

exponent the power of 2 by which the mantissa is multiplied. This

method increases the range of numbers that may be represente

using 32 bits. In this method, a binary floating point number is

represented b

(sign)×mantissa ×2

± exponen

where the sign is one bit, the mantissa is a binary fraction with a

non-zero leading bit, and the exponent is a binary integer. If 32

Figure 1. Fixed point repre-

sentation of real numbers in

binary using 32 bits.

s the range of real

numbers

representable with

fixed point is not

sufficient, normalized

floating point is used to

represent real

numbers.

GENERAL



ARTICLE

RESONANCE



January 2016

bits are available to store floating point numbers, we have to

decide the following:

How many bits are to be used to represent the mantissa (1 bit

is reserved for the sign of the number).

How many bits are to be used for the exponent.

How to represent the sign of the exponent.

The number of bits to be used for the mantissa is determined by

the number of significant decimal digits required in computation,

which is at least seven. Based on experience in numerical

computation, it is found that at least seven significant decimal

digits are needed if results are expected without too much error.

The number of bits required to represent seven decimal digits is

approximately 23. The remaining 8 bits are allocated for the

exponent. The exponent may be represented in sign magnitude

form. In this case, 1 bit is used for sign and 7 bits for the

magnitude. The exponent will range from –127 to +127. The

only disadvantage to this method is that there are two representa-

tions for 0 exponent: + 0 and – 0. To avoid this, one may use an

excess representation

biased format

to represent the exponent.

The exponent has no sign in this format. The range 0 to 255 of an

8-bit number is divided into two parts 0 to 127 and 128 to 255 as

shown below:

All bit strings from 0 to 126 are considered negative, exponent

127 represents 0, and values greater than 127 are positive. Thus

the range of exponents is –127 to +128. Given an exponent string

exp

the value of the exponent will be taken as (

exp

–127). The

main advantage of this format is a unique representation for

exponent zero. With the representation of binary floating point

numbers explained above, the largest floating point number

which can be represented is

The number of bits

to be used for the

mantissa is

determined by the

number of

significant decimal

digits required in

computation,

which is at least

seven.

The main

advantage of

biased exponent

format is unique

representation for

exponent 0.

14 RESONANCE ¨January 2016

GENERAL

ARTICLE

0.1111 . . . 1111 × 2

11111111

|m 23 bits

=(1–2

–

)×2

255–127

# 3.4 × 10

The smallest floating point number is

0.10000 … 00 × 2

–

127

|m 23 bitso_

# 0.293 × 10

–

Example. Represent 52.21875 in 32-bit binary floating point

format.

52.21875 = 110100.00111 = .11010000111 × 2

Normalized 23 bit mantissa = 0.11010000111000000000000.

As excess representation is being used for exponent, it is equal to

127 + 6 = 133.

Thus the representation is

52.21875 = 0.11010000111 × 2

133

= 0.110100000111 × 2

10000101

The 32-bit string used to store 52.21875 in a computer will thus be

01000010111010000111000000000000.

Here, the most significant bit is the sign, the next 8 bits the

exponent, and the last 23 bits the mantissa.

IEEE Floating Point Standard 754-1985

Floating point binary numbers were beginning to be used in the

mid50s. Therewasnouniformityintheformatsusedtorepresent

floating point numbers and programs were not portable from one

manufacturer’s computer to another. By the mid 1980s, with the

advent of personal computers, the number of bits used to store

floating point numbers was standardized as 32 bits. A Standards

Committee was formed by the Institute of Electrical and Elec-

tronics Engineers to standardize how floating point binary num-

ers would be represented in computers. In addition, the standar

specified uniformity in rounding numbers, treating exception

conditions such as attempt to divide by 0, and representation of 0

and infinity (fThis standard, called IEEE Standard 754 for

Institute of Electrical

and Electronics

Engineers

standardized how

floating point binary

numbers would be

represented in

computers, how these

would be rounded,

how 0 and fwould be

represented and how

to treat exception

condition such as

attempt to divide by 0.

RESONANCE



January 2016

GENERAL



ARTICLE

floating point numbers, was adopted in 1985 by all computer

manufacturers. It allowed porting of programs from one com-

puter to another without the answers being different. This stan-

dard defined floating point formats for 32-bit and 64-bit numbers.

With improvement in computer technology it became feasible to

use a larger number of bits for floating point numbers. After the

standard was used for a number of years, many improvements

were suggested. The standard was updated in 2008. The current

standard is IEEE 754-2008 version. This version retained all the

features of the 1985 standard and introduced new standards for 16

and128-bit numbers. It also introduced standards for representing

decimal floating point numbers. We will describe these standards

later in this article. In this section, we will describe IEEE 754-

1985 Standard.

IEEE floating point representation for binary real numbers con-

sists of three parts. For a 32-bit (called single precision) number,

they are:

Sign, for which 1 bit is allocated.

Mantissa (called

significand

in the standard) is allocated 23

bits.

Exponent is allocated 8 bits. As both positive and negative

numbers are required for the exponent, instead of using a

separate sign bit for the exponent, the standard uses a biased

representation. The value of the bias is 127. Thus an exponent

0 means that –127 is stored in the exponent field. A stored

value 198 means that the exponent value is (198 – 127) = 71.

The exponents –127 (all 0s) and + 128 (all 1s) are reserved for

representing special numbers which we discuss later.

To increase the precision of the significand, the IEEE 754 Stan-

dard uses a normalized significand which implies that its most

significant bit is always 1. As this is implied, it is assumed to be

on the left of the (virtual) decimal point of the significand. Thus

in the IEEE Standard, the significand is 24 bits long – 23 bits of

the significand which is stored in the memory and an implied 1 as

the most significant 24th bit. The extra bit increases the number

The current standard

is IEEE 754-2008

version. This version

retained all the

features of the 1985

standard and

introduced new

standards for 16-bit

and128-bit numbers.

It also introduced

standards for

representing decimal

floating point

numbers.

16 RESONANCE ¨January 2016

GENERAL

ARTICLE

of significant digits in the significand. Thus, a floating point

number in the IEEE Standard is

(– 1)

×(1.f )

×2

exponen

–

127

where

is the sign bit;

= 0 is used for positive numbers and

= 1 for representing negative numbers. f

represents the bits in the

significand. Observe that the implied 1 as the most significant bit

of the significand is explicitly shown for clarity.

For a 32-bit word machine, the allocation of bits for a floating

oint number are given in Figure 2. Observe that the exponent is

laced before the significand (see Bo

1).

Example. Represent 52.21875 in IEEE 754 – 32-bit floating point

format.

52.21875 = 110100.00111

= 1.1010000111 × 2

Normalized significand = .1010000111.

Exponent: (e – 127) = 5 or, e = 132.

The bit representation in IEEE format is

0 10000100 10100001110000000000000

Sign Exponent Significan

1 bit 8 bits 23 bits

Figure 2. IEEE 754 repre-

sentation of 32-bit floating

point number.

Box 1.

Why does IEEE 754 standard use biased exponent representation and place it before the significand?

Computers normally have separate arithmetic units to compute with integer operands and floating point

operands. They are called FPU (Floating Point Unit) and IU (Integer Unit). An IU works faster and is cheaper

to build. In most recent computers, there are several IUs and a smaller number of FPUs. It is preferable to use

IUs whenever possible. By placing the sign bit and biased exponent bits as shown in Figure 2, operations such

<r.o.> y, where r.o. is a relational operator, namely, >, < , td,and

are reals, can be carried out by integer

arithmetic units as the exponents are integers. Sorting real numbers may also be carried out by IUs as biased

exponent are the most significant bits. This ensures lexicographical ordering of real numbers; the significand

need not be considered as a fraction.

RESONANCE ¨ January 2016

GENERAL ¨ ARTICLE

Special Values in IEEE 754-1985 Standard (32 Bits)

Representation of Zero: As the significand is assumed to have a

hidden 1 as the most significant bit, all 0s in the significand part

of the number will be taken as 1.00…0. Thus zero is represente

in the IEEE Standard by all 0s for the exponent and all 0s for the

significand. All 0s for the exponent is not allowed to be used for

any other number. If the sign bit is 0 and all the other bits 0, the

number is +0. If the sign bit is 1 and all the other bits 0, it is –0.

Even though +0 and –0 have distinct representations they are

assumed equal. Thus the representations of +0 and –0 are:

0 00000000 00000000000000000000000

Sign Exponent Significand

1 bit 8 bits 23 bits

–

1 00000000 00000000000000000000000

Sign Exponent Significand

1 bit 8 bits 23 bits

Representation of Infinit

: All 1s in the exponent field is

assumed to represent infinity (f Asignbit0represents+f an

asignbit1represents–f. Thus the representations of +f

and

–

f are:



0 11111111 00000000000000000000000

Sign Exponent Significand

1 bit 8 bits 23 bits

–

1 11111111 00000000000000000000000

Sig

Exponent Significan

1 bit 8 bits 23 bits

Representation of Non Numbers: When an operation is per-

formed by a computer on a pair of operands, the result may not be

mathematically defined. For example, if zero is divided by zero,

s the significand is

assumed to have a

hidden 1 as the most

significant bit, all 0s

in the significand part

of the number will be

taken as 1.00…0.

Thus zero is

represented in the

IEEE Standard by all

0s for the exponent

and all 0s for the

significand.

ll 1s in the

exponent field is

assumed to

represent infinity (f).

18 RESONANCE ¨January 2016

GENERAL

ARTICLE

the result is indeterminate. Such a result is called Not a Number

(NaN) in the IEEE Standard. In fact the IEEE Standard defines

two types of NaN. When the result of an operation is not define

(i.e., indeterminate) it is called a Quiet NaN (QNaN). Examples

are: 0/0, (f

–

f), . Quiet NaNs are normally carried over in the

computation. The other type of NaN is called a Signalling Nan

(SNaN). Thisisusedtogiveanerrormessage.Whenanoperation

leads to a floating point underflow, i.e., the result of a computation

is smaller than the smallest number that can be stored as a floating

oint number, or the result is an overflow, i.e., it is larger than the

largest number that can be stored, SNaN is used. When no vali

value is stored in a variable name (i.e., it is undefined) and an

attempt is made to use it in an arithmetic operation, SNaN woul

result. QNaN is represented by 0 or 1 as the sign bit, all 1s as

exponent, and a 0 as the left-most bit of the significand and at least

one 1 in the rest of the significand. SNaN is represented by 0 or

1 as the sign bit, all 1s as exponent, and a 1 as the left-most bit of

the significand and any string of bits for the remaining 22 bits. We

give below the representations of QNaN and SNaN.

QNaN

0 or 1 11111111 00010000000000000000000

Sig

Exponent

Significand

1 bit 8 bits 23 bits

The most significant bit of the significand is 0. There is at least

one 1 in the rest of the significand.

SNaN

0 or 1 11111111 10000000000001000000000

Sig

Exponent

Significand

1 bit 8 bits 23 bits

The

ost significant bit of the significand is 1. Any combination

of bits is allowed for the other bits of the significand.

When an arithmetic

operation is

performed on two

numbers which

results in an

indeterminate

answer, it is called

NaN (Not a Number)

in IEEE Standard.

RESONANCE ¨ January 2016

GENERAL ¨ ARTICLE

Largest and Smallest Positive Floating Point Numbers:

Largest Positive Number

0 11111110 11111111111111111111111

Sign Exponent Significand

1 bit 8 bits 23 bits

Significand: 1111 … 1 = 1 + (1 – 2

–

)=2–2

–

Exponent: (254 – 127) = 127.

Largest Number = (2 – 2

–

)×2

127

# 3.403 × 10

If the result of a computation exceeds the largest number that can

e stored in the computer, then it is called an overflo

Smallest Positive Number

0 00000001 00000000000000000000000

Sign Exponent Significand

1 bit 8 bits 23 bits

Significand = 1.0.

Exponent = 1 – 127 = –126.

The smallest normalized number is = 2

–

126

# 1.1755 × 10

–

ubnormal Numbers: When all the exponent bits are 0 and the

leading hidden bit of the siginificand is 0, then the floating point

number is called a

ubnormal number. Thus, one logical represen-

tation of a subnormal number is

(–1)

×0.f ×2

–

127

(all 0s for the exponent) ,

where f has at least one 1 (otherwise the number will be taken as

0). However, the standard uses –126, i.e., bias +1 for the exponent

rather than –127 which is the bias for some not so obvious reason,

ossibly because by using –126 instead of –127, the gap between

the largest subnormal number and the smallest normalized num-

er is smaller.

The largest subnormal number is 0.999999988×2

–126

.Itisclose

to the smallest normalized number 2

–

126

When all the

exponent bits are 0

and the leading

hidden bit of the

siginificand is 0, then

the floating point

number is called a

subnormal number.

20 RESONANCE ¨January 2016

GENERAL

ARTICLE

The smallest positive subnormal number is

0 00000000 00000000000000000000001

Sign Exponent Significan

1 bit 8 bits 23 bits

the value of which is 2

–23

×2

–126

–

149

A result that is smaller than the smallest number that can be store

in a computer is called an

underflow.

You may wonder why subnormal numbers are allowed in the

IEEE Standard. By using subnormal numbers, underflow which

may occur in some calculations are gradual. Also, the smallest

number that could be represented in a machine is closer to zero.

(See Bo

2.)

Box 2. Machine Epsilon and Dwarf

In any of the formats for representing floating point numbers, machine epsilon

is defined as the difference

etween 1 and the next larger number that can be stored in that format. For example, in 32-bit IEEE format with

a 23-bit significand, the machine epsilon is 2

–23

=1.19×10

–

. This essentially tells us that the precision of

decimal numbers stored in this format is 7 digits. The term precision and accuracy are not the same. Accuracy

implies correctness whereas precision does not. For 64-bit representation of IEEE floating point numbers the

significand length is 52 bits. Thus, machine epsilon is 2

–

=2.22×10

–

. Therefore, decimal calculations with

64 bits give 16 digit precision. This is a conservative definition used by industry. In MATLAB, machine epsilon

is as defined above. However, academics define machine epsilon as the upper bound of the relative error when

numbers are rounded. Thus, for 32-bit floating point numbers, the machine epsilon is 2

–23

/2 # 5.96 ×10

–

The machine epsilon is useful in iterative computation. When two successive iterates differ by less than |epsilon|

one may assume that the iteration has converged. The IEEE Standard does not define machine epsilon.

Tiny or dwar

is the minimum subnormal number that can be represented using the specified floating point

format. Thus, for IEEE 32-bit format, it is

0 00000000 00000000000000000000001

Sign Exponent Significand

1bit 8bits 23bits

whose value is: 2

–

126– 23

–

149

# 1.4 × 10

–

. Any number less than this will signal underflow. The advantage

of using subnormal numbers is that underflow is gradual as was stated earlier.

By using subnormal

numbers, underflow

which may occur in

some calculations are

gradual.

RESONANCE ¨ January 2016

GENERAL ¨ ARTICLE

When a mathematical operation such as add, subtract, multiply,

or divide is performed with two floating point numbers, the

significand of the result may exceed 23 bits after the adjustment

of the exponent. In such a case, there are two alternatives. One is

to truncate the result, i.e., ignore all bits beyond the 32nd bit. The

other is to round the result to the nearest significand. For

example, if a significand is 0.110 .. 01 and the overflow bit is 1,

a rounded value would be 0.111 … 10. In other words, if the first

it which is truncated is 1, add 1 to the least significant bit, else

ignore the bits. This is called rounding upwards.Iftheextrabits

are ignored, it is called rounding downwards. The IEEE Stan-

dard suggests to hardware designers to get the best possible result

while performing arithmetic that are reproducible across differ-

ent manufacturer’s computers using the standard. The Standar

suggests that in the equatio

c = a <op> b,

where a and b are operands and <op> an arithmetic operation, the

result c should be as if it was computed exactly and then rounded.

This is called correct rounding.

In Tables 1 and 2, we summarise the discussions in this section.

Table 1. IEEE 754-85 float-

ing point standard. We use

to represent the significand,

e to represent the exponent,

b the bias of the exponent,

and ± for the sign.

Value Sign Exponent (8 bits)

Significand (23 bits)

+0 0 00000000 00 ….00 (all 23 bits 0)

–

1 00000000 00 ….00 (all 23 bits 0)

+1.f u2

(e–

)

e exponent, b bias

0 00000001to

11111110

aa…. aa

(

a =0or1)

–

1.f u 2

(e–b)

1 00000001to

11111110

aa…. aa

(

a =0or1)

+ f

0 11111111

000 … 00

(all 23 bits 0)

–

1 11111111

000 … 00

(all 23 bits 0)

SNaN 0 or 1 11111111

000 … 01 to

leading bit 0

011… 11

(at least one 1 in

the rest)

QNaN 0 or 1 11111111 1000 … 10 leading bit 1

Positive subnormal

0.f u 2

x+1-b

(

the number of

leading 0s in

significand)

0 00000000

000 … 01 to

(at least one 1)

111… 11

22 RESONANCE ¨January 2016

GENERAL

ARTICLE

(–1)

×(1.f

)×2

(exponent – 1023)

where

= ± 1, and

is the value of the significand.

The largest positive number which can be represented in this

standard is

0 11111111110

1111………………………………………..1111

Sig

Exponent

Significand

1 bit 11 bits 52 bits

whichis =(2–2

–

)×2

(2046 – 1023)

=(2–2

–

)×2

1023

3083

The smallest positive number that can be represented using 64

Table 2. Operations on spe-

cial numbers. All NaNs are

Quiet NaNs.

IEEE 754 Floating Point 64-Bit Standard – 1985

The IEEE 754 floating point standard for 64-bit (called double

recision) numbers is very similar to the 32-bit standard. The

main difference is the allocation bits for the exponent and the

significand. With 64 bits available to store floating point num-

ers, there is more freedom to increase the significant digits in the

significand and increase the range by allocating more bits to the

exponent. The standard allocates 1 bit for sign, 11 bits for the

exponent, and 52 bits for the significand. The exponent uses a

iased representation with a bias of 1023. The representation is

shown below:

Sig

Exponent Significan

1 bit 11 bits 52 bits

A number in this standard is thus

Operation

Result

Operation Result

n /±f

±0/±0

± f /±f

ff– f

± n /0

f ± f /±f

f + ff± f ×0

With 64 bits

available to store

floating point

numbers, there is

more freedom to

increase the

significant digits in

the significand and

increase the range

by allocating more

bits to the exponent.

The standard

allocates 1 bit for

sign, 11 bits for the

exponent, and 52

bits for the

significand.

RESONANCE ¨ January 2016

GENERAL ¨ ARTICLE

its using this standard is

0 00000000001

0000………………………………….. 00000

Sig

Exponent Significand

1 bit 11 bits 52 bits

hose value is 2

1–1023

–

1022

The definitions of ± 0, ± f, QNaN, SNaN remain the same except

that the number of bits in the exponent and significand are now 11

and 52 respectively. Subnormal numbers have a similar definition

as in 32-bit standard.

IEEE 754 Floating Point Standard – 2008

This standard was introduced to enhance the scope of IEEE 754

floating point standard of 1985. The enhancements were intro-

duced to meet the demands of three sets of professionals:

1. Those working in the graphics area.

2. Those who use high performance computers for numeric

intensive computations.

3. Professionals carrying out financial transactions who require

exact decimal computation.

This standard has not changed the format of 32- and 64-bit

numbers. It has introduced floating point numbers which are 16

and 128 bits long. These are respectively called half and qua-

druple precision. Besides these, it has also introduced standards

for floating point decimal numbers.

The 16-Bit Standard

The 16-bit format for real numbers was introduced for pixel

storage in graphics and not for computation. (Some graphics

rocessing units use the 16-bit format for computation.) The 16-

it format is shown in Figure 3.

IEEE 2008 Standard

has introduced

standards for 16-bit

and 128-bit floating

point numbers. It

has also introduced

standards for

floating point

decimal numbers.

The 16-bit format

for real numbers

was introduced for

pixel storage in

graphics and not

for computation.

Figure 3. Representation o

16-bit floating point numbers.

24 RESONANCE ¨January 2016

GENERAL

ARTICLE

Figure 4. Representation o

128-bit floating point number.

With 16-bit representation, 5 bits are used for the exponent and 10

its for the significand. Biased exponent is used with a bias of 15.

Thus the exponent range is: –14 to +15. Remember that all 0s an

all 1s for the exponent are reserved to represent 0 and f respec-

tively.

The maximum positive integer that can be stored is

+ 1.111..1 × 2

=1+(1–2

–

)×2

=(2–2

–

)×2

# 65504.

The minimum positive number is

+1.000…0×2

–

–14

=0.61×10

–

The minimum subnormal 16-bit floating point number is

–

# 5.96 × 10

–

The definitions of ± 0, ±

, NaN, and SNaN follow the same ideas

as in 32-bit format.

The 128-Bit Standard

With improvements in computer technology, using 128 bits to

represent floating point numbers has become feasible. This is

sometimes used in numeric-intensive computation, where large

rounding errors may occur. In this standard, besides a sign bit, 14

its are used for the exponent, and 113 bits are used for the

significand. The representation is shown in

Figure

A number in this standard is

(–1)

×(1.f)×2

(

exponent

–

16384)

With improvements in

computer technology,

using 128 bits to

represent floating point

numbers has become

feasible. This is

sometimes used in

numeric-intensive

computation, where

large rounding errors

may occur.

RESONANCE ¨ January 2016

GENERAL ¨ ARTICLE

The largest positive number which can be represented in this

format is:

0 11111111111110

111………………………………………..111

Sign Exponent Significand

1bit 14bits 113bits

which equals (1 – 2

–

113

)×2

16383

= # 10

4932

The smallest normalized positive number which can be repre-

sented is

1–16383

–16382

# 10

–

4931

A subnormal number is represented b

(– 1)

×0.f ×2

–16382

The smallest positive subnormal number is

–

16382–113

–

16485

The definitions of ± 0, ± f, and QNaN, and SNaN are the same as

in the 1985 standard except for the increase of bits in the exponent

and significand. There are other minor changes which we will not

discuss.

The Decimal Standard

The main motivation for introducing a decimal standard is the

fact that a terminating decimal fraction need not give a terminat-

ingbinaryfraction. Forexample,(0.1)

= (0.00011 (0011)

recurring)

. If one computes 100000000 × 0.1 using binary arith-

metic and rounding the non-terminating binary fraction up to the

next larger number, the answer is: 10000001.490116 instead of

10000000. This is unacceptable in many situations, particularly

in financial transactions. There was a demand from professionals

carrying out financial transactions to introduce a standard for

representing decimal floating point numbers. Decimal floating

oint numbers are not new. They were available in COBOL.

There was, however, no standard and this resulted in non-portable

rograms. IEEE 754-2008 has thus introduced a standard for

The main motivation

for introducing a

decimal standard is

the fact that a

terminating decimal

fraction need not give

a terminating binary

fraction.

There was a demand

from professionals

carrying out financial

transactions to

introduce a standard

for representing

decimal floating point

numbers.

26 RESONANCE ¨January 2016

GENERAL

ARTICLE

Figure 6. Representation o

32-bit floating point numbers

in IEEE 2008 Standard.

decimal floating point numbers to facilitate portability of pro-

grams.

Decimal Representation

The idea of encoding decimal numbers using bits is simple. For

example, to represent 0.1 encoded (not converted) to binary, it is

written as 0.1u 10

. The exponent instead of being interpreted as

a power of 2 is now interpreted as a power of 10. Both the

significand and the exponent are now integers and their binary

representations have a finite number of bits. One method of

representing 0.1 u 10

using this idea in a 32-bit word is shown i

Figure 5.

We have assumed an 8-bit binary exponent in biased form. In this

representation, there is no hidden 1 in the significand. The

significand is an integer. (Fractions are represented by using an

appropriate power of 10 in the exponent field.) For example,

.000753658 will be written as 0.753658 u 10

–

and 753658, a

integer, will be converted to binary and stored as the significand.

The power of 10, namely –3, will be an integer in the exponent

art.

The IEEE Standard does not use this simple representation. This

is due to the fact that the largest decimal significand that may be

represented using this method is 0.8388607.

It is preferable to obtain 7 significant digits, i.e., significand up to

0.9999999. The IEEE Standard achieves this by borrowing some

its from the exponent field and reducing the range of the

exponent. In other words, instead of maximum exponent of 128, it

is reduced to 96 and the bits saved are used to increase the significant

digits in the significand. The coding of bits is as shown in Figur

Figure 5. Representation o

decimal floating point num-

bers.

RESONANCE ¨ January 2016

GENERAL ¨ ARTICLE

Observe the difference in terminology for exponent and significan

fields. In this coding, one part of the five combination bits is

interpreted as exponent and the other part as an extension of

significand bits using an ingenious method. This will be ex-

lained later in this section.

IEEE 2008 Standard defines two formats for representing deci-

mal floating point numbers. One of them converts

the binary

significand field to a decimal integer between 0 and 10

–

wher

is the number of significant bits in the significand. The other,

called densely packed decimal form, encodes the bits in the

significand field directly to decimal. Both methods represent

decimal floating point numbers as shown below.

Decimal floating point number = (–1)

× f ×10

exp – bias

,where

the sign bit, f is a decimal fraction (significand), exp and bias ar

decimal numbers. The fraction need not be normalized but is

usually normalized with a non-zero most significant digit. There

is no hidden bit as in the binary standard. The standard defines

decimal floating point formats for 32-bit, 64-bit and 128-bit

numbers.

Densely Packed Decimal Significand Representation

Normally,whendecimalnumbersarecodedtobinary,4bitsare

required to represent each digit, as 3 bits give only 8 combinations

that are not sufficient to represent 10 digits. Four bits give 16

combinations out of which we need only 10 combinations to

represent decimal numbers from 0 to 9. Thus, we waste 6 combi-

nations out of 16. For a 32-bit number, if 1 bit is used for sign, 8

its for the exponent and 23 bits for the significand, the maximu

ositive value of the significand will be +0.799999. The expo-

nent range will be 00 to 99. With a bias of 50, the maximu

ositive decimal number will be 0.799999 × 10

The number of significand digits in the above coding is not

sufficient. A clever coding scheme for decimal encoding of

inary numbers was discovered by Cowlinshaw. This metho

encodes 10 bits as 3 decimal digits (remember that 2

= 1024 an

IEEE 2008 Standard

defines two formats

for representing

decimal floating point

numbers. One of

them converts the

binary significand

field to a decimal

integer between 0

and 10

–

where

the number of

significant bits in the

significand. The

other, called densely

packed decimal form,

encodes the bits in

the significand field

directly to decimal.

28 RESONANCE ¨January 2016

GENERAL

ARTICLE

Table 3. Use of combination

bits in IEEE Decimal Stan-

dard.

one can encode 000 to 999 using the available 1024 combinations

of 10 bits). IEEE 754 Standard uses this coding scheme. We will

now describe the standard using this coding method for 32-bit

numbers.

The standard defines the following:

1. A

ign bit 0 for + and 1 for – .

2. A combination field of 5 bits. It is called a combination field as

one part of it is used for the exponent and another part for the

significand.

3. An exponent continuation field that is used for the exponent.

4. A coefficient field that encodes strings of 10 bits as 3 digits.

Thus, 20 bits are encoded as 6 digits. This is part of the

significand field.

For 32-bit IEEE word, the distribution of bits was given in Figure

6 which is reproduced below for ready reference.

Sig

Combination field

Exponent continuation field

Coefficient field

1bit 5bit

6bits 20bits

The 5 bits of the combination field are used to increase the

coefficient field by 1 digit. It uses the first ten 4-bit combinations

of 16 possible 4-bit combinations to represent this. As there are

32 combinations of 5 bits, out of which only 10 are needed to exten

the coefficient digits, the remaining 22-bit combinations are avail-

able to increase the exponent range and also encode fand NaN. This

is done by an ingenious method explained in Tabl

Combination bits Type

Most

significant

bits of

exponent

Most

significant

digit of

coefficient

xyabc

11abc

11110

11111

Finite

f

NaN

–

0abc

100c

–

RESONANCE ¨ January 2016

GENERAL ¨ ARTICLE

Table 4. Decimal represen-

tation for 64-bit and 128-bit

numbers.

Sign bit 1 1

Combination field (bits) 5 5

Exponent continuation field (bits) 8 12

Coefficient field (bits) 50 110

Number of bits in number 64 128

Significant digits 16 34

Exponent range (decimal) 768

12288

Exponent bias (decimal) 384 6144

Thus, the most significant 2 bits of the exponent are constraine

to take on values 00, 01, 10. Using the 6 bits of exponent

combination bits, we get an exponent range of 3 × 2

= 192. The

exponent bias is 96. (Remember that the exponent is an integer

that can be exactly represented in binary.) With the addition of 1

significant digit to the 6 digits of the coefficient bits, the decimal

representation of 32-bit numbers has 7 significant digits. The

largest number that can be stored is 0.9999999 × 10

. A similar

method is used to represent 64- and 128-bit floating point num-

ers. This is shown in Table 4.

In the IEEE 2008 Decimal Standard, the combination bits along

with exponent bits are used to represent NaN, signaling NaN an

± f as shown in Table 5.

Quantity Sign bit Combination bits

SNaN 0or1 11 11 11

QNaN 0or1 111110

+f 0 11110x

–

f 1 11110x

+0 0 00000x

–

0 1 00000x

(x is 0 or 1)

Table 5. Representation o

NaNs, ± f,and±0.

30 RESONANCE ¨January 2016

GENERAL

ARTICLE