High Performance Computing - Charles Severance [62]
Loop Interchange to Move Computations to the Center
When someone writes a program that represents some kind of real-world model, they often structure the code in terms of the model. This makes perfect sense. The computer is an analysis tool; you aren’t writing the code on the computer’s behalf. However, a model expressed naturally often works on one point in space at a time, which tends to give you insignificant inner loops — at least in terms of the trip count. For performance, you might want to interchange inner and outer loops to pull the activity into the center, where you can then do some unrolling. Let’s illustrate with an example. Here’s a loop where KDIM time-dependent quantities for points in a two-dimensional mesh are being updated:
PARAMETER (IDIM = 1000, JDIM = 1000, KDIM = 3)
...
DO I=1,IDIM
DO J=1,JDIM
DO K=1,KDIM
D(K,J,I) = D(K,J,I) + V(K,J,I) * DT
ENDDO
ENDDO
ENDDO
In practice, KDIM is probably equal to 2 or 3, where J or I, representing the number of points, may be in the thousands. The way it is written, the inner loop has a very low trip count, making it a poor candidate for unrolling.
By interchanging the loops, you update one quantity at a time, across all of the points. For tuning purposes, this moves larger trip counts into the inner loop and allows you to do some strategic unrolling:
DO K=1,KDIM
DO J=1,JDIM
DO I=1,IDIM
D(K,J,I) = D(K,J,I) + V(K,J,I) * DT
ENDDO
ENDDO
ENDDO
This example is straightforward; it’s easy to see that there are no inter-iteration dependencies. But how can you tell, in general, when two loops can be inter- changed? Interchanging loops might violate some dependency, or worse, only violate it occasionally, meaning you might not catch it when optimizing. Can we interchange the loops below?
DO I=1,N-1
DO J=2,N
A(I,J) = A(I+1,J-1) * B(I,J)
C(I,J) = B(J,I)
ENDDO
ENDDO
While it is possible to examine the loops by hand and determine the dependencies, it is much better if the compiler can make the determination. Very few single-processor compilers automatically perform loop interchange. However, the compilers for high-end vector and parallel computers generally interchange loops if there is some benefit and if interchanging the loops won’t alter the program results.[45]
Memory Access Patterns*
The best pattern is the most straightforward: increasing and unit sequential. For an array with a single dimension, stepping through one element at a time will accomplish this. For multiply-dimensioned arrays, access is fastest if you iterate on the array subscript offering the smallest stride or step size. In FORTRAN programs, this is the leftmost subscript; in C, it is the rightmost. The FORTRAN loop below has unit stride, and therefore will run quickly:
DO J=1,N
DO I=1,N
A(I,J) = B(I,J) + C(I,J) * D
ENDDO
ENDDO
In contrast, the next loop is slower because its stride is N (which, we assume, is greater than 1). As N increases from one to the length of the cache line (adjusting for the length of each element), the performance worsens. Once N is longer than the length of the cache line (again adjusted for element size), the performance won’t decrease:
DO J=1,N
DO I=1,N
A(J,I) = B(J,I) + C(J,I) * D
ENDDO
ENDDO
Here’s a unit-stride loop like the previous one, but written in C:
for (i=0; i Unit stride gives you the best performance because it conserves cache entries. Recall how a data cache works.[46] Your program makes a memory reference; if the data is in the cache, it gets returned immediately. If not, your program suffers a cache miss while a new cache line is fetched from main memory, replacing an old one. The line holds