High Performance Computing - Charles Severance [65]
DO I=1,N,2
  DO J=1,N/2,2
    A(J,I)     = A(J,I)     + B(I,J)
    A(J+1,I)   = A(J+1,I)   + B(I,J+1)
    A(J,I+1)   = A(J,I+1)   + B(I+1,J)
    A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
  ENDDO
ENDDO
DO I=1,N,2
  DO J=N/2+1,N,2
    A(J,I)     = A(J,I)     + B(I,J)
    A(J+1,I)   = A(J+1,I)   + B(I,J+1)
    A(J,I+1)   = A(J,I+1)   + B(I+1,J)
    A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
  ENDDO
ENDDO
You might guess that adding more loops would be the wrong thing to do. But if you work with a reasonably large value of N, say 512, you will see a significant increase in performance. This is because when N is 512, the arrays A and B each hold 512 × 512 = 262,144 eight-byte elements — 2 MB per array — more than the TLBs and caches of most processors can cover.
The two boxes in Figure 2.12 illustrate how the first few references to A and B look superimposed upon one another in the blocked and unblocked cases. Unblocked references to B zing off through memory, eating through cache and TLB entries. Blocked references are more sparing with the memory system.
Figure 2.12. Picture of unblocked versus blocked references
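The halving idea can be sketched in C as well, with the caveat that C is row-major where Fortran is column-major, so it is the `b` reference that becomes the strided one; the function names and the tiny N below are ours, chosen only for illustration:

```c
#define N 8   /* small for illustration; the text uses N = 512 */

/* Unblocked: each pass streams through all of both arrays, and the
   strided references (b here, because C is row-major) keep eating
   through cache lines and TLB entries. */
static void naive_update(double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[j][i] += b[i][j];
}

/* Blocked: sweep the first half of the j range for all i, then the
   second half, so each sweep touches only half of each array at a
   time -- the same split as the two Fortran loop nests above. */
static void blocked_update(double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N / 2; j++)
            a[j][i] += b[i][j];
    for (int i = 0; i < N; i++)
        for (int j = N / 2; j < N; j++)
            a[j][i] += b[i][j];
}
```

Both versions compute exactly the same result; only the order in which memory is visited changes, which is the whole point of blocking.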
You can take blocking even further for larger problems. This code shows another method that limits the size of the inner loop and visits it repeatedly:
II = MOD (N,16)
JJ = MOD (N,4)

DO I=1,N
  DO J=1,JJ
    A(J,I) = A(J,I) + B(I,J)
  ENDDO
ENDDO

DO I=1,II
  DO J=JJ+1,N
    A(J,I) = A(J,I) + B(I,J)
  ENDDO
ENDDO

DO I=II+1,N,16
  DO J=JJ+1,N,4
    DO K=I,I+15
      A(J,K)   = A(J,K)   + B(K,J)
      A(J+1,K) = A(J+1,K) + B(K,J+1)
      A(J+2,K) = A(J+2,K) + B(K,J+2)
      A(J+3,K) = A(J+3,K) + B(K,J+3)
    ENDDO
  ENDDO
ENDDO
Where the inner I loop used to execute N iterations at a time, the new K loop executes only 16 iterations. This divides and conquers a large memory address space by cutting it into little pieces.
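The same preconditioning-plus-blocked-core structure can be sketched in C (row-major, so the strided reference flips relative to Fortran). The function names are ours, and the block sizes 16 and 4 are carried over from the Fortran above rather than tuned for any particular cache:

```c
#define N 37   /* deliberately not a multiple of 16 or 4 */

/* Reference version: plain mixed-stride update. */
static void reference_update(double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[j][i] += b[i][j];
}

/* Blocked version: two small preconditioning loops mop up the
   leftover iterations, then a blocked, unrolled core handles the
   bulk 16 i-iterations and 4 j-iterations at a time. */
static void blocked_unrolled_update(double a[N][N], const double b[N][N]) {
    int ii = N % 16, jj = N % 4;

    /* leftover j iterations, for every i */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < jj; j++)
            a[j][i] += b[i][j];

    /* leftover i iterations, for the remaining j */
    for (int i = 0; i < ii; i++)
        for (int j = jj; j < N; j++)
            a[j][i] += b[i][j];

    /* blocked core: the k loop runs only 16 iterations at a time */
    for (int i = ii; i < N; i += 16)
        for (int j = jj; j < N; j += 4)
            for (int k = i; k < i + 16; k++) {
                a[j][k]     += b[k][j];
                a[j + 1][k] += b[k][j + 1];
                a[j + 2][k] += b[k][j + 2];
                a[j + 3][k] += b[k][j + 3];
            }
}
```

Because `ii` and `jj` absorb the remainders, the stepped core loops always land exactly on `N`, so no iteration is done twice or skipped.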
While these blocking techniques begin to have diminishing returns on single-processor systems, on large multiprocessor systems with nonuniform memory access (NUMA), there can be significant benefit in carefully arranging memory accesses to maximize reuse of both cache lines and main memory pages.
Again, the combined unrolling and blocking techniques we just showed you are for loops with mixed stride expressions. They work very well for loop nests like the one we have been looking at. However, if all array references are strided the same way, you will want to try loop unrolling or loop interchange first.
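For contrast with the mixed-stride case, here is a minimal C sketch of loop interchange for a nest where every reference is strided the same way (the identifiers are illustrative; in C's row-major layout the inner loop should vary the last subscript):

```c
#define N 64

/* Strided: with j in the outer loop, the inner loop walks a[i][j]
   down a column, jumping N doubles between consecutive references. */
static double strided_sum(const double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Interchanged: swapping the loops makes the inner loop unit stride.
   The interchange is legal here because every array reference uses
   the same (i, j) subscript pattern. */
static double interchanged_sum(const double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```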
Programs That Require More Memory Than You Have
People occasionally have programs whose memory size requirements are so great that the data can’t fit in memory all at once. At any time, some of the data has to reside outside of main memory on secondary (usually disk) storage. These out-of-core solutions fall into two categories:
Software-managed, out-of-core solutions
Virtual memory–managed, out-of-core solutions
With a software-managed approach, the programmer has recognized that the problem is too big and has modified the source code to move sections of the data out to disk for retrieval at a later time. The other method depends on the computer’s memory system handling the secondary storage requirements on its own, sometimes at a great cost in runtime.
Software-Managed, Out-of-Core Solutions
Most codes with software-managed, out-of-core solutions have adjustments: you can tell the program how much memory it has to work with, and it takes care of the rest. It is important to make sure the adjustment is set correctly. Code that was tuned for a machine with limited memory could have been ported to another machine without the adjustment ever being updated for the storage actually available — perhaps the whole problem now fits in memory easily.
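As a sketch of the software-managed idea — with a made-up panel size standing in for the memory adjustment, and no names borrowed from any real out-of-core package — a program can spill its data to a scratch file and then revisit it one panel at a time:

```c
#include <stdio.h>
#include <stdlib.h>

#define ROWS 1000
#define COLS 100
#define PANEL_ROWS 64   /* the "adjustment": rows held in memory at once */

/* Sum a matrix too large for the in-memory buffer by staging it on
   disk and reading it back one panel (group of rows) at a time. */
double out_of_core_sum(void) {
    FILE *f = tmpfile();            /* stand-in for the scratch file */
    if (f == NULL)
        return -1.0;
    double row[COLS];

    /* Phase 1: generate the data row by row and spill it to disk. */
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++)
            row[j] = 1.0;           /* dummy data: every element is 1 */
        fwrite(row, sizeof(double), COLS, f);
    }

    /* Phase 2: process the data panel by panel; only PANEL_ROWS
       rows ever reside in main memory at the same time. */
    double *panel = malloc(sizeof(double) * PANEL_ROWS * COLS);
    if (panel == NULL) {
        fclose(f);
        return -1.0;
    }
    double sum = 0.0;
    rewind(f);
    for (int i = 0; i < ROWS; i += PANEL_ROWS) {
        int rows = (ROWS - i < PANEL_ROWS) ? ROWS - i : PANEL_ROWS;
        size_t got = fread(panel, sizeof(double), (size_t)rows * COLS, f);
        for (size_t k = 0; k < got; k++)
            sum += panel[k];
    }
    free(panel);
    fclose(f);
    return sum;                     /* ROWS * COLS elements of 1.0 */
}
```

Raising `PANEL_ROWS` on a machine with more memory cuts the number of disk passes — which is exactly the adjustment the text says must be kept in sync with the storage actually available.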
If we are writing an out-of-core solution, the trick is to group memory references together so that they are localized. This usually occurs naturally as a side effect of partitioning, say, a matrix factorization into groups of columns. Blocking references