High Performance Computing - Charles Severance [56]
Sometimes you want to sink an operation below the loop. Usually, it’s some calculation performed each iteration but whose result is only needed for the last. To illustrate, here’s a sort of loop that is different from the ones we have been looking at. It searches for the final character in a character string:
while (*p != ’ ’)
c = *p++;
becomes:
while (*p++ != ’ ’);
c = *(p-1);
The new version of the loop moves the assignment of c beyond the last iteration. Admittedly, this transformation would be a reach for a compiler and the savings wouldn’t even be that great. But it illustrates the notion of sinking an operation very well.
Again, hoisting or sinking instructions to get them out of loops is something your compiler should be capable of doing. But often you can slightly restructure the calculations yourself when you move them to get an even greater benefit.
Handling Array Elements in Loops
Here’s another area where you would like to trust the compiler to do the right thing. When making repeated use of an array element within a loop, you want to be charged just once for loading it from memory. Take the following loop as an example. It reuses X(I) twice:
DO I=1,N
XOLD(I) = X(I)
X(I)= X(I) + XINC(I)
ENDDO
In reality, the steps that go into retrieving X(I) are just additional common subex- pressions: an address calculation (possibly) and a memory load operation. You can see that the operation is repeated by rewriting the loop slightly:
DO I=1,N
TEMP= X(I)
XOLD(I) = TEMP
X(I)= TEMP + XINC(I)
ENDDO
FORTRAN compilers should recognize that the same X(I) is being used twice and that it only needs to be loaded once, but compilers aren’t always so smart. You sometimes have to create a temporary scalar variable to hold the value of an array element over the body of a loop. This is particularly true when there are subroutine calls or functions in the loop, or when some of the variables are external or COMMON. Make sure to match the types between the temporary variables and the other variables. You don’t want to incur type conversion overhead just because you are “helping” the compiler. For C compilers, the same kind of indexed expres- sions are an even greater challenge. Consider this code:
doinc(int xold[],int x[],int xinc[],int n)
{
for (i=0; i x[i]= x[i] + xinc[i]; } } Unless the compiler can see the definitions of x, xinc, and xold, it has to assume that they are pointers leading back to the same storage, and repeat the loads and stores. In this case, introducing temporary variables to hold the values x, xinc, and xold is an optimization the compiler wasn’t free to make. Interestingly, while putting scalar temporaries in the loop is useful for RISC and superscalar machines, it doesn’t help code that runs on parallel hardware. A parallel compiler looks for opportunities to eliminate the scalars or, at the very least, to replace them with temporary vectors. If you run your code on a parallel machine from time to time, you might want to be careful about introducing scalar temporary variables into a loop. A dubious performance gain in one instance could be a real performance loss in another. In this chapter, we introduced tuning techniques for eliminating program clutter — anything that contributes to the runtime without contributing to the answer. We saw many examples of tuning techniques — enough that you may be asking yourself, “What’s left?” Well, as we will see in the upcoming chapters, there are a couple of ways we can help the compiler: Find more parallelism Use memory as effectively
Closing Notes*