High Performance Computing - Charles Severance [23]
Table 1.2. Parameters of IEEE 32- and 64-Bit Formats

IEEE754           FORTRAN   C             Bits   Exponent Bits   Mantissa Bits
Single            REAL*4    float         32     8               24
Double            REAL*8    double        64     11              53
Double-Extended   REAL*10   long double   >=80   >=15            >=64
In FORTRAN, the 32-bit format is usually called REAL, and the 64-bit format is usually called DOUBLE PRECISION. However, some FORTRAN compilers double the sizes for these data types. For that reason, it is safest to declare your FORTRAN variables as REAL*4 or REAL*8. The double-extended format is not as well supported in compilers and hardware as the single- and double-precision formats. The bit arrangements for the single and double formats are shown in Figure 1.15.
Based on the storage layouts in Table 1.2, we can derive the ranges and accuracy of these formats, as shown in Table 1.3.
Figure 1.15. IEEE754 floating-point formats
Table 1.3. Range and Accuracy of IEEE 32- and 64-Bit Formats

IEEE754           Minimum Normalized Number   Largest Finite Number   Base-10 Accuracy
Single            1.2E-38                     3.4E+38                 6-9 digits
Double            2.2E-308                    1.8E+308                15-17 digits
Extended Double   3.4E-4932                   1.2E+4932               18-21 digits
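The single and double rows of Table 1.3 can be reproduced from the field widths in Table 1.2. The following is a minimal sketch, not part of the original text: the function name ieee_parameters is hypothetical, and the two digit-count formulas are the commonly used round-trip bounds, assumed here to be what the "Base-10 Accuracy" column reports. Because Python floats are themselves 64-bit, the sketch cannot evaluate the extended-double range.

```python
import math

def ieee_parameters(exponent_bits, mantissa_bits):
    """Derive range and decimal accuracy from a binary format's field widths.

    mantissa_bits counts the implicit leading 1, as in Table 1.2.
    """
    bias = 2 ** (exponent_bits - 1) - 1      # 127 for single, 1023 for double
    min_exp = 1 - bias                       # smallest normalized exponent
    max_exp = bias                           # largest finite exponent
    min_normal = 2.0 ** min_exp
    # Largest finite value: every mantissa bit set, largest exponent.
    max_finite = (2.0 - 2.0 ** (1 - mantissa_bits)) * 2.0 ** max_exp
    # Decimal digits that always survive decimal -> binary -> decimal.
    digits_low = math.floor((mantissa_bits - 1) * math.log10(2))
    # Decimal digits needed so that binary -> decimal -> binary is exact.
    digits_high = math.ceil(1 + mantissa_bits * math.log10(2))
    return min_normal, max_finite, digits_low, digits_high

for name, e_bits, m_bits in [("Single", 8, 24), ("Double", 11, 53)]:
    lo, hi, d_lo, d_hi = ieee_parameters(e_bits, m_bits)
    print(f"{name}: {lo:.1E} {hi:.1E} {d_lo}-{d_hi} digits")
```

The exponent range runs from 1 - bias to bias because the all-zeros and all-ones exponent fields are reserved for special values.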
Converting from Base-10 to IEEE Internal Format
We now examine how a 32-bit floating-point number is stored. The high-order bit is the sign of the number. Numbers are stored in a sign-magnitude format (i.e., not 2's complement). The exponent is stored in the 8-bit field biased by adding 127 to the exponent. This results in an exponent ranging from -126 through +127.
The mantissa is converted into base-2 and normalized so that there is one nonzero digit to the left of the binary point, adjusting the exponent as necessary. The digits to the right of the binary point are then stored in the low-order 23 bits of the word. Because all numbers are normalized, there is no need to store the leading 1.
This gives a free extra bit of precision. Because this bit is dropped, it’s no longer proper to refer to the stored value as the mantissa. In IEEE parlance, this mantissa minus its leading digit is called the significand.
Figure 1.16 shows an example conversion from base-10 to IEEE 32-bit format.
Figure 1.16. Converting from base-10 to IEEE 32-bit format
The 64-bit format is similar, except the exponent is 11 bits long, biased by adding 1023 to the exponent, and the significand is 52 bits long.
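The 32-bit layout described above can be inspected directly. The sketch below is an illustration rather than the book's own example: the function names float32_fields and float32_value are hypothetical, and the sample value -6.5 is chosen here only because its binary form is short (110.1 in base 2, or 1.101 x 2^2). It uses Python's standard struct module to reinterpret the four bytes of a 32-bit float as an integer.

```python
import struct

def float32_fields(x):
    """Split x, stored in IEEE 32-bit format, into (sign, biased exponent, stored fraction)."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31                  # high-order bit
    biased_exp = (word >> 23) & 0xFF   # 8-bit field holding exponent + 127
    fraction = word & 0x7FFFFF         # low-order 23 bits, leading 1 not stored
    return sign, biased_exp, fraction

def float32_value(sign, biased_exp, fraction):
    """Rebuild the value of a normalized number from its three fields."""
    mantissa = 1 + fraction / 2 ** 23  # restore the implicit leading 1
    return (-1) ** sign * mantissa * 2.0 ** (biased_exp - 127)

sign, biased_exp, fraction = float32_fields(-6.5)
print(sign, biased_exp, f"{fraction:023b}")
print(float32_value(sign, biased_exp, fraction))
```

For -6.5 the sign bit is 1, the stored exponent is 2 + 127 = 129, and the stored fraction begins with the bits 101, which are the digits to the right of the binary point in 1.101.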
IEEE Operations
The IEEE standard specifies how the following operations are to be performed on floating-point values:
Addition
Subtraction
Multiplication
Division
Square root
Remainder (modulo)
Conversion to/from integer
Conversion to/from printed base-10
These operations are specified in a machine-independent manner, giving flexibility to the CPU designers to implement the operations as efficiently as possible while maintaining compliance with the standard. During operations, the IEEE standard requires the maintenance of two guard digits and a sticky bit for intermediate values. The guard digits extend the precision of intermediate values, and the sticky bit indicates whether any of the bits beyond the second guard digit is nonzero.
Figure 1.17. Computation using guard and sticky bits
In Figure 1.17, we have five bits of normal precision, two guard digits, and a sticky bit. Guard bits simply operate as normal bits, as if the significand were 25 bits (the 23 stored bits of the 32-bit format plus the two guard bits). Guard bits participate in rounding as the extended operands are added. The sticky bit is set to 1 if any of the bits beyond the guard bits is nonzero in either operand.[13] Once the extended sum is computed, it is rounded so that the value stored in memory is the closest possible value to the extended sum, including the guard digits. Table 1.4 shows all eight possible values of the two guard digits and the sticky bit, the resulting stored value, and an explanation as to why.
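The rounding step can be sketched in a few lines. This is an illustration under stated assumptions, not the standard's wording: the function names round_extended and show are hypothetical, values are held as plain integers, and an exact tie (guard digits 10, sticky bit 0) is assumed to round to the value whose last bit is 0, which is the usual default rounding mode.

```python
def round_extended(kept, extra):
    """Round an extended sum to the stored precision.

    kept  -- the bits of normal precision as an integer (0b10100 for 1.0100)
    extra -- three bits: the two guard digits followed by the sticky bit
    """
    half = 0b100                # guard digits 10, sticky 0: exactly halfway
    if extra > half:
        return kept + 1         # closer to the next stored value up
    if extra < half:
        return kept             # closer to the current stored value
    # Exact tie: assumed round-half-to-even, so go up only if the last kept bit is 1.
    # A carry out of the top bit would need renormalizing; that is omitted here.
    return kept + (kept & 1)

def show(bits):
    """Format a 5-bit integer as d.dddd."""
    text = f"{bits:05b}"
    return f"{text[0]}.{text[1:]}"

for extra in range(8):
    print(f"1.0100 {extra:03b} -> {show(round_extended(0b10100, extra))}")
```

The sticky bit matters only when the guard digits are 10: without it, a sum slightly above the halfway point could not be told apart from an exact tie.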
Table 1.4. Extended Sums and Their Stored Values

Extended Sum   Stored Value   Why
1.0100 000 1.0100 Truncated based on guard digits
1.0100 001 1.0100 Truncated based on guard digits