RESONANCE | January 2016
GENERAL ARTICLE
IEEE Standard for Floating Point Numbers
V Rajaraman
Keywords: Floating point numbers, rounding, decimal floating point numbers, IEEE 754-2008 Standard.
Indian Institute of Science
Floating point numbers are an important and extensively used data
type in computation. Yet, many users are unaware of the standard
used in almost all computer hardware to store and process them. In
this article, we explain the standard evolved by the Institute of
Electrical and Electronics Engineers in 1985 and augmented in 2008
to represent floating point numbers and process them. This standard
is now used by all computer manufacturers while designing floating
point arithmetic units, so that programs are portable among
computers.
Introduction
There are two types of arithmetic which are performed in comput-
ers: integer arithmetic and real arithmetic. Integer arithmetic is
simple. A decimal number is converted to its binary equivalent
and arithmetic is performed using binary arithmetic operations.
The largest positive integer that may be stored in an 8-bit byte is
+127, if 1 bit is used for the sign. If 16 bits are used, the largest
positive integer is +32767 and with 32 bits, it is +2147483647,
quite large! Integers are used mostly for counting. Most scientific
computations are however performed using real numbers, that is,
numbers with a fractional part. In order to represent real numbers
in computers, we have to ask two questions. The first is to decide
how many bits are needed to encode real numbers and the second
is to decide how to represent real numbers using these bits.
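The signed-integer limits quoted above follow directly from the representation: with n bits, one bit holds the sign, leaving n−1 bits for the magnitude, so the largest value is 2^(n−1) − 1. A minimal sketch in Python (the language choice is ours, for illustration only) reproduces the figures in the text:

```python
# Largest positive integer in an n-bit signed representation:
# one bit is reserved for the sign, leaving n-1 bits for the
# magnitude, so the maximum value is 2**(n-1) - 1.
def max_signed(bits):
    return 2 ** (bits - 1) - 1

for bits in (8, 16, 32):
    print(f"{bits} bits -> +{max_signed(bits)}")
# prints: 8 bits -> +127, 16 bits -> +32767, 32 bits -> +2147483647
```

These match the values given in the text for 8-, 16-, and 32-bit integers.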
Normally, in numerical computation in science and engineering,
one would need at least 7 to 8 significant digits of precision. Thus,
the number of bits needed to encode 8 decimal digits is approximately
26, as log₂ 10 ≈ 3.32 bits are needed on average to encode a digit.
In computers, numbers are stored as a sequence of
8-bit bytes. Thus 32 bits (4 bytes) which is bigger than 26 bits is
a logical size to use for real numbers. Given 32 bits to encode real