High Performance Computing - Charles Severance [50]
      DO I=1,N
        A(I) = A(I) + B(I) * C
      ENDDO
The code below performs the same calculations, but look at what we have done:
      DO I=1,N
        CALL MADD (A(I), B(I), C)
      ENDDO

      SUBROUTINE MADD (A,B,C)
      A = A + B * C
      RETURN
      END
Each iteration calls a subroutine to do a small amount of work that was formerly within the loop. This is a particularly painful example because it involves floating-point calculations. The resulting loss of parallelism, coupled with the procedure call overhead, might produce code that runs 100 times slower. Remember, these operations are pipelined, and it takes a certain amount of “wind-up” time before the throughput reaches one operation per clock cycle. If there are few floating-point operations to perform between subroutine calls, the time spent winding up and winding down pipelines figures prominently.
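For readers more at home in C, the two loops above can be sketched as follows. This is only an illustration under stated assumptions: the names madd, update_inline, and update_with_calls are hypothetical, and a is passed by pointer to imitate FORTRAN's pass-by-reference arguments. The sketch shows the two shapes of the code, not a measured slowdown; an optimizing compiler that can see the body of madd may well inline it on its own.

```c
#include <stddef.h>

/* Hypothetical helper mirroring MADD: one multiply-add per call.
   a is a pointer because FORTRAN passes arguments by reference. */
void madd(double *a, double b, double c)
{
    *a = *a + b * c;
}

/* Version 1: the multiply-add stays inside the loop body. */
void update_inline(double *a, const double *b, double c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i] * c;
}

/* Version 2: every iteration makes a call to madd() to do the
   same small amount of work. */
void update_with_calls(double *a, const double *b, double c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        madd(&a[i], b[i], c);
}
```

Both functions leave the same values in a. The difference is only in how much call overhead surrounds each floating-point operation.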
Subroutine and function calls also complicate the compiler’s ability to manage COMMON and external variables efficiently. Ordinarily, the compiler uses registers to hold the “live” values of many variables and delays storing them in memory until the last possible moment. When you make a call, however, the compiler cannot tell whether the subroutine will be changing variables that are declared as external or COMMON. Therefore, it’s forced to store any modified external or COMMON variables back into memory so that the callee can find them. Likewise, after the call has returned, the same variables have to be reloaded into registers because the compiler can no longer trust the old, register-resident copies. The penalty for saving and restoring variables can be substantial, especially if you are using lots of them. It can also be unwarranted if variables that ought to be local are specified as external or COMMON, as in the following code:
      COMMON /USELESS/ K
      DO K=1,1000
        IF (K .EQ. 1) CALL AUX
      ENDDO
In this example, K has been declared as a COMMON variable. It is used only as a do-loop counter, so there really is no reason for it to be anything but local. However, because it is in a COMMON block, the call to AUX forces the compiler to store and reload K each iteration. This is because the side effects of the call are unknown.
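A rough C analogue of the same situation uses a file-scope (global) variable in place of the COMMON block. This is a sketch under assumptions: aux, aux_calls, loop_global_counter, and loop_local_counter are hypothetical names, and aux is defined here only so the example links. To match the text, picture aux living in a separate source file where the compiler cannot see its body.

```c
/* Hypothetical stand-in for AUX. It counts its own calls so the
   behavior of the two loops can be compared. */
int aux_calls = 0;

void aux(void)
{
    aux_calls++;
}

/* Global loop counter, playing the role of COMMON /USELESS/ K. */
int k;

/* k is visible to other routines, so a compiler that cannot see
   inside aux() has to assume the call may read or modify k. */
void loop_global_counter(void)
{
    for (k = 1; k <= 1000; k++)
        if (k == 1)
            aux();
}

/* i is local and its address is never taken, so no call can touch
   it and the compiler is free to keep it in a register. */
void loop_local_counter(void)
{
    for (int i = 1; i <= 1000; i++)
        if (i == 1)
            aux();
}
```

Both loops call aux exactly once. Only the first one ties the counter to memory that the callee could, in principle, reach.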
So far, it looks as if we are preparing a case for huge main programs without any subroutines or functions! Not at all. Modularity is important for keeping source code compact and understandable. And frankly, the need for maintainability and modularity is always more important than the need for small performance improvements. However, there are a few approaches for streamlining subroutine calls that don’t require you to scrap modular coding techniques: macros and procedure inlining.
Remember, if the function or subroutine does a reasonable amount of work, procedure call overhead isn’t going to matter very much. However, if one small routine appears as a leaf node in one of the busiest sections of the call graph, you might want to think about inserting it in appropriate places in the program.
Macros
Macros are little procedures that are substituted inline at compile time. Unlike subroutines or functions, which are included once during the link, macros are replicated every place they are used. When the compiler makes its first pass through your program, it looks for patterns that match previous macro definitions and expands them inline. In fact, in later stages, the compiler sees an expanded macro as source code.
Macros are part of both C and FORTRAN (although the FORTRAN notion of a macro, the statement function, is reviled by the FORTRAN community, and won’t survive much longer).[37] For C programs, macros are created with a #define construct, as demonstrated here:
#include <stdio.h>

#define average(x,y) ((x+y)/2)

int main (void)
{
    float q = 100, p = 50;
    float a;
    a = average(p,q);    /* the macro is substituted inline here */
    printf ("%f\n",a);
    return 0;
}
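The same mechanism could be applied to the multiply-add loop from earlier in this section. The following is a minimal sketch, assuming a hypothetical macro named MADD and a hypothetical wrapper update_with_macro; neither appears in the original FORTRAN. Because the macro is expanded as text before compilation, the loop body contains no call at all. The extra parentheses around each argument are a common precaution so that expression arguments expand as intended.

```c
#include <stddef.h>

/* Hypothetical macro form of MADD. The preprocessor substitutes the
   text of the expression wherever MADD(...) appears. */
#define MADD(a, b, c) ((a) = (a) + (b) * (c))

void update_with_macro(double *a, const double *b, double c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        MADD(a[i], b[i], c);  /* becomes ((a[i]) = (a[i]) + (b[i]) * (c)) */
}
```

The source still reads as if a small routine were being called, which preserves some of the modularity, while the compiler sees an ordinary loop body.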
The first compilation step for a C program is a pass through the C preprocessor, cpp. This happens automatically when you invoke the compiler. cpp expands #define