High Performance Computing - Charles Severance [54]
z0 = 0.;                        /* assumes the values are non-negative */
z1 = 0.;
for (i = 0; i < n-1; i += 2) {  /* assumes n is even */
    z0 = z0 < a[i]   ? a[i]   : z0;
    z1 = z1 < a[i+1] ? a[i+1] : z1;
}
z = z0 < z1 ? z1 : z0;
Do you see how the new loop calculates two new maximum values each iteration? These maximums are then compared with one another, and the winner becomes the new official max. It’s analogous to a play-off arrangement in a Ping-Pong tournament. Whereas the old loop was more like two players competing at a time while the rest sat around, the new loop runs several matches side by side. In general this particular optimization is not a good one to code by hand. On parallel processors, the compiler performs the reduction in its own way. If you hand-code something similar to this example, you may inadvertently limit the compiler’s flexibility on a parallel system.
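If you do hand-code the reduction anyway, two details the loop above glosses over need attention: seeding the accumulators from the data (so negative values work) and cleaning up the leftover element when n is odd. A minimal sketch, with a function name of our own choosing:

```c
/* Two-way max reduction as in the text, plus a cleanup step for odd n.
   Seeding z0 and z1 from a[0] (rather than 0) makes it correct for
   arrays containing negative values as well. */
double max_two_way(const double *a, int n)
{
    double z0 = a[0], z1 = a[0];
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        z0 = z0 < a[i]   ? a[i]   : z0;
        z1 = z1 < a[i+1] ? a[i+1] : z1;
    }
    if (i < n)                      /* leftover element when n is odd */
        z0 = z0 < a[i] ? a[i] : z0;
    return z0 < z1 ? z1 : z0;      /* the "play-off" between the two */
}
```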
Conditionals That Transfer Control
Let’s step back a second. Have you noticed a similarity among all the loops so far? We have looked only at a particular type of conditional, conditional assignments — based on the outcome of the test, a variable gets reassigned. Of course, not every conditional ends up in an assignment. You can have statements that transfer flow of control, such as subroutine calls or goto statements. In the following example, the programmer is carefully checking before dividing by zero.
However, this test has an extremely negative impact on the performance because it forces the iterations to be done precisely in order:
      DO I=1,N
        DO J=1,N
          IF (B(J,I) .EQ. 0) THEN
            PRINT *,I,J
            STOP
          ENDIF
          A(J,I) = A(J,I) / B(J,I)
        ENDDO
      ENDDO
Avoiding these tests is one of the reasons that the designers of the IEEE floating-point standard added the trap feature for operations such as dividing by zero. These traps allow the programmer in a performance-critical section of the code to achieve maximum performance yet still detect when an error occurs.
Other Clutter*
Clutter comes in many forms. Consider the previous sections as having dealt with large pieces of junk you might find in the front hall closet: an ironing board, hockey sticks, and pool cues. Now we are down to the little things: a widowed checker, a tennis ball, and a hat nobody owns. We want to mention a few of them here. We apologize in advance for changing subjects a lot, but that’s the nature of cleaning out a closet!
Data Type Conversions
Statements that contain runtime type conversions suffer a little performance penalty each time the statement is executed. If the statement is located in a portion of the program where there is a lot of activity, the total penalty can be significant.
People have their reasons for writing applications with mixed typing. Often it is a matter of saving memory space, memory bandwidth, or time. In the past, for instance, double-precision calculations took twice as long as their single-precision counterparts, so if some of the calculations could be arranged to take place in single precision, there could be a performance win.[40] But any time saved by performing part of the calculations in single precision and part in double precision has to be measured against the additional overhead caused by the runtime type conversions. In the following code, the addition of A(I) to B(I) is mixed type:
      INTEGER NUMEL, I
      PARAMETER (NUMEL = 1000)
      REAL*8 A(NUMEL)
      REAL*4 B(NUMEL)
      DO I=1,NUMEL
        A(I) = A(I) + B(I)
      ENDDO
In each iteration, B(I) has to be promoted to double precision before the addition can occur. You don’t see the promotion in the source code, but it’s there, and it takes time.
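The same hidden promotion happens in C when `float` and `double` are mixed in one expression. A minimal sketch (names are ours):

```c
/* Mixed-type loop: each b[i] is implicitly promoted from float to
   double before the addition, exactly as in the Fortran example. */
#include <stddef.h>

void mixed_add(double *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i];    /* conversion float -> double happens here */
}
```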
C programmers beware: in Kernighan and Ritchie (K&R) C, all floating-point calculations take place in double precision, even if all the variables involved are declared float.