When storing floating-point numbers in digital computers, typically the mantissa is normalized, and then the mantissa and exponent are converted to base-2 and packed into a 32- or 64-bit word. If more bits were allocated to the exponent, the overall range of the format would increase and the number of digits of accuracy would decrease. The base of the exponent can also be either 2 or 16. Using 16 as the base increases the overall range of exponents, but because normalization must occur on four-bit boundaries, the available digits of accuracy are reduced on average. Later we will see how the IEEE 754 floating-point standard represents numbers.
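As a preview of that discussion, the following sketch shows one way to look at the packed fields of a REAL*4 value. It assumes an IEEE-style single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits) and uses the Fortran 90 intrinsics TRANSFER and IBITS; the layout is an assumption here, since the standard is taken up later:
      REAL*4 X
      INTEGER*4 BITS
      X = 0.1
C     Copy the bit pattern of X into an integer so the fields
C     can be examined without conversion.
      BITS = TRANSFER(X,BITS)
      PRINT *,'Sign bit :',IBITS(BITS,31,1)
      PRINT *,'Exponent :',IBITS(BITS,23,8)
      PRINT *,'Mantissa :',IBITS(BITS,0,23)
      END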
Effects of Floating-Point Representation*
One problem with the mantissa/base/exponent representation is that not all base-10 numbers can be expressed exactly in base-2. For example, 1/2 and 0.25 can be represented perfectly as base-2 values, while 1/3 and 0.1 produce infinitely repeating base-2 fractions. These values must be rounded to be stored in the floating-point format. With sufficient digits of precision, this generally is not a problem for computations. However, it does lead to some anomalies where algebraic rules do not appear to apply. Consider the following example:
      REAL*4 X,Y
      X = 0.1
      Y = 0.0
      DO I=1,10
         Y = Y + X
      ENDDO
      IF ( Y .EQ. 1.0 ) THEN
         PRINT *,'Algebra is truth'
      ELSE
         PRINT *,'Not here'
      ENDIF
      PRINT *,1.0-Y
      END
At first glance, this appears simple enough. Mathematics tells us that ten times 0.1 should be one. Unfortunately, because 0.1 cannot be represented exactly in base-2, it must be rounded, and the stored value is off in the last bit. When ten of these slightly inaccurate numbers are added together, the sum does not come out to exactly 1.0. When X and Y are REAL*4, the difference is about 10^-7, and when they are REAL*8, the difference is about 10^-16.
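To see the double-precision behavior, the same loop can be run with REAL*8 variables. Here is a minimal sketch; the D0 suffixes are needed so that the constants themselves are double precision, rather than being rounded to REAL*4 first:
      REAL*8 X,Y
      X = 0.1D0
      Y = 0.0D0
      DO I=1,10
         Y = Y + X
      ENDDO
C     The difference from 1.0 is now on the order of 10^-16.
      PRINT *,1.0D0-Y
      END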
One possible method for comparing computed values to constants is to subtract them and test how close the difference comes to zero. For example, the test in the above code can be rewritten as:
      IF ( ABS(1.0-Y) .LT. 1E-6 ) THEN
         PRINT *,'Close enough for government work'
      ELSE
         PRINT *,'Not even close'
      ENDIF
The appropriate tolerance depends on the type of the variables involved and on the expected error in the computation that produces Y.
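When the magnitudes of the values being compared are not known in advance, a tolerance scaled to the operands is often more robust than a fixed one. The following is a minimal sketch of such a relative comparison; the 1E-6 factor is an assumption suited to REAL*4 and should be tuned to the expected error of the actual computation:
      REAL*4 A,B,TOL
      A = 1.0
      B = 0.9999999
C     Scale the tolerance by the larger magnitude so the test
C     works whether the values are near 1.0 or near 1.0E8.
      TOL = 1E-6 * MAX(ABS(A),ABS(B))
      IF ( ABS(A-B) .LE. TOL ) THEN
         PRINT *,'Close enough (relative test)'
      ELSE
         PRINT *,'Not close'
      ENDIF
      END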
Another area where inexact representation becomes a problem is that algebraic inverses do not hold for all floating-point numbers. For example, using REAL*4, the value (1.0/X) * X does not evaluate to 1.0 for 135 values of X from one to 1000. This can be a problem when computing the inverse of a matrix using LU-decomposition, which repeatedly performs division, multiplication, addition, and subtraction. If you do a straightforward LU-decomposition on a matrix with integer coefficients that has an integer solution, there is a pretty good chance you won't get the exact solution when you run your algorithm. Techniques for improving the accuracy of matrix inverse computations are best left to a numerical analysis text.
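You can reproduce the inverse experiment with a short loop like the sketch below; the exact count (the text reports 135) may vary with the hardware, compiler, and rounding behavior:
      REAL*4 X
      INTEGER I,N
      N = 0
      DO I=1,1000
         X = REAL(I)
C        Count the values whose computed inverse does not
C        multiply back to exactly 1.0.
         IF ( (1.0/X)*X .NE. 1.0 ) N = N + 1
      ENDDO
      PRINT *,'Inexact inverses:',N
      END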
More Algebra That Doesn't Work*
While the examples in the preceding section focused on the limitations of multiplication and division, addition and subtraction are not, by any means, perfect. Because of the limited number of digits of precision, certain additions or subtractions have no effect. Consider the following example using REAL*4 with 7 digits of precision:
      REAL*4 X,Y
      X = 1.25E8
      Y = X + 7.5E-3
      IF ( X .EQ. Y ) THEN
         PRINT *,'Am I nuts or what?'
      ENDIF
      END
While both of these numbers are representable in floating-point, adding them is problematic. Before they can be added, their decimal points must be aligned, as in Figure 1.13.
Figure 1.13: Loss of accuracy while aligning decimal points
Unfortunately, while we have computed the exact result, it cannot fit back into a REAL*4 variable (7 digits of accuracy), so the low-order digits must be dropped when the sum is stored.
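In decimal terms, the alignment in Figure 1.13 looks roughly like this:
        1.250000E+08  =  125000000.0000000
      + 7.500000E-03  =          0.0075000
                         -----------------
                         125000000.0075000
Rounded back to 7 significant digits, the sum becomes 1.250000E+08 again, which is why X and Y compare equal.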