High Performance Computing - Charles Severance [60]
Loops Containing Procedure Calls
As with fat loops, loops containing subroutine or function calls generally aren’t good candidates for unrolling. There are several reasons. First, they often contain a fair number of instructions already. And if the subroutine being called is fat, it makes the loop that calls it fat as well. The size of the loop may not be apparent when you look at the loop; the function call can conceal many more instructions.
Second, when the calling routine and the subroutine are compiled separately, the compiler can’t intermix instructions from the two. A loop that is unrolled into a series of function calls behaves much like the original loop did before unrolling.
Last, function call overhead is expensive. Registers have to be saved; argument lists have to be prepared. The time spent calling and returning from a subroutine can be much greater than that of the loop overhead. Unrolling to amortize the cost of the loop structure over several calls doesn’t buy you enough to be worth the effort.
The general rule when dealing with procedures is to first try to eliminate them in the “remove clutter” phase, and when this has been done, check to see if unrolling gives an additional performance improvement.
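As a minimal sketch of that rule, the two routines below compute the same result. The names AXPB, WITHCALL, and INLINED are hypothetical, invented here for illustration, and the code is written in free-form Fortran. The first routine calls a small function on every iteration. The second has the function body substituted into the loop, which leaves a lean loop that may then be worth testing with unrolling. Many compilers can perform this inlining themselves when asked, so read this as an illustration of the transformation rather than a required manual step.

```fortran
! Hypothetical helper: one multiply-add per call.
REAL FUNCTION AXPB(X, C, D)
  IMPLICIT NONE
  REAL X, C, D
  AXPB = X * C + D
END FUNCTION AXPB

! Version 1: every iteration pays the call and return overhead.
SUBROUTINE WITHCALL(A, B, N, C, D)
  IMPLICIT NONE
  INTEGER N, I
  REAL A(N), B(N), C, D
  REAL AXPB          ! external function, typed here for IMPLICIT NONE
  DO I = 1, N
    A(I) = AXPB(B(I), C, D)
  ENDDO
END SUBROUTINE WITHCALL

! Version 2: the call is eliminated by substituting the function body.
SUBROUTINE INLINED(A, B, N, C, D)
  IMPLICIT NONE
  INTEGER N, I
  REAL A(N), B(N), C, D
  DO I = 1, N
    A(I) = B(I) * C + D
  ENDDO
END SUBROUTINE INLINED
```

Both routines take the same arguments, so a call such as CALL INLINED(A, B, N, 2.0, 1.0) can replace CALL WITHCALL(A, B, N, 2.0, 1.0) directly when timing the two.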
Loops with Branches in Them
In the section titled “Introduction,” we showed you how to eliminate certain types of branches, but of course we couldn’t get rid of them all. Where the branches that remain are iteration-independent, loop unrolling might still offer some benefit. The IF test simply becomes one of the operations you count when deciding whether unrolling is worthwhile. Below is a doubly nested loop whose inner loop tests the value of B(J,I):
      DO I=1,N
        DO J=1,N
          IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
        ENDDO
      ENDDO
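Each iteration is independent of every other, so unrolling it won’t be a problem. We’ll unroll the inner loop by four, with a short preconditioning loop to pick up the leftover iterations when N is not a multiple of four, and leave the outer loop undisturbed: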
      II = MOD (N,4)
      DO I=1,N
        DO J=1,II
          IF (B(J,I) .GT. 1.0)
     +      A(J,I) = A(J,I) + B(J,I) * C
        ENDDO
        DO J=II+1,N,4
          IF (B(J,I) .GT. 1.0)
     +      A(J,I) = A(J,I) + B(J,I) * C
          IF (B(J+1,I) .GT. 1.0)
     +      A(J+1,I) = A(J+1,I) + B(J+1,I) * C
          IF (B(J+2,I) .GT. 1.0)
     +      A(J+2,I) = A(J+2,I) + B(J+2,I) * C
          IF (B(J+3,I) .GT. 1.0)
     +      A(J+3,I) = A(J+3,I) + B(J+3,I) * C
        ENDDO
      ENDDO
This approach works particularly well if the processor you are using supports conditional execution. As described earlier, conditional execution can replace a branch and an operation with a single conditionally executed assignment. On a superscalar processor with conditional execution, this unrolled loop executes quite nicely.
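To make the idea of a conditionally executed assignment more concrete, here is a sketch of the same unrolled-by-four update written without IF statements. It uses the Fortran 90 MERGE intrinsic to select between B(J,I)*C and zero. The subroutine name UPDSEL is hypothetical, and the code is in free-form Fortran. This is only a source-level analogy: whether a compiler turns either form into conditional moves or predicated instructions depends on the compiler and the target processor.

```fortran
! Hypothetical routine: the unrolled update with the IF tests replaced
! by MERGE, so every statement is an unconditional assignment.
SUBROUTINE UPDSEL(A, B, N, C)
  IMPLICIT NONE
  INTEGER N, I, J, II
  REAL A(N,N), B(N,N), C
  II = MOD(N, 4)
  DO I = 1, N
    ! Preconditioning loop: the MOD(N,4) leftover iterations.
    DO J = 1, II
      A(J,I) = A(J,I) + MERGE(B(J,I)*C, 0.0, B(J,I) .GT. 1.0)
    ENDDO
    ! Main loop, unrolled by four: each line adds B*C or adds zero.
    DO J = II+1, N, 4
      A(J,I)   = A(J,I)   + MERGE(B(J,I)*C,   0.0, B(J,I)   .GT. 1.0)
      A(J+1,I) = A(J+1,I) + MERGE(B(J+1,I)*C, 0.0, B(J+1,I) .GT. 1.0)
      A(J+2,I) = A(J+2,I) + MERGE(B(J+2,I)*C, 0.0, B(J+2,I) .GT. 1.0)
      A(J+3,I) = A(J+3,I) + MERGE(B(J+3,I)*C, 0.0, B(J+3,I) .GT. 1.0)
    ENDDO
  ENDDO
END SUBROUTINE UPDSEL
```

One difference from the IF version is worth keeping in mind: this form stores to every element of A, including those whose value does not change, so it trades the branches for extra arithmetic and memory writes.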
Nested Loops
When you embed loops within other loops, you create a loop nest. The loop or loops in the center are called the inner loops. The surrounding loops are called outer loops. Depending on the construction of the loop nest, we may have some flexibility in the ordering of the loops. At times, we can swap the outer and inner loops with great benefit. In the next sections we look at some common loop nestings and the optimizations that can be performed on these loop nests.
Often when we are working with nests of loops, we are working with multidimensional arrays. Computing in multidimensional arrays can lead to non-unit-stride memory access. Many of the optimizations we perform on loop nests are meant to improve the memory access patterns.
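As a small illustration of stride, the two routines below perform the same scaling of a two-dimensional array and differ only in loop order. SCALJI and SCALIJ are names invented here, and the code is in free-form Fortran. Fortran stores arrays in column-major order, so the version whose inner loop runs over the first subscript touches consecutive memory locations, while the other steps through memory N elements at a time. The results are identical; only the access pattern differs, and how much that matters depends on the array size and the memory system.

```fortran
! Inner loop over the first subscript: unit-stride access.
SUBROUTINE SCALJI(A, N, C)
  IMPLICIT NONE
  INTEGER N, I, J
  REAL A(N,N), C
  DO I = 1, N
    DO J = 1, N
      A(J,I) = A(J,I) * C
    ENDDO
  ENDDO
END SUBROUTINE SCALJI

! Loops interchanged: the inner loop now strides by N elements.
SUBROUTINE SCALIJ(A, N, C)
  IMPLICIT NONE
  INTEGER N, I, J
  REAL A(N,N), C
  DO J = 1, N
    DO I = 1, N
      A(J,I) = A(J,I) * C
    ENDDO
  ENDDO
END SUBROUTINE SCALIJ
```

Because every element is updated independently, the two loop orders are interchangeable here; that will not be true of every loop nest.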
We examine the computation-related optimizations first, and then the memory optimizations.
Outer Loop Unrolling
If you are faced with a loop nest, one simple