Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

High Performance Computing - Charles Severance [60]

By Root 1310 0

helps performance because it fattens up a loop with more calculations per iteration. By the same token, if a particular loop is already fat, unrolling isn’t going to help. The loop overhead is already spread over a fair number of instructions. In fact, unrolling a fat loop may even slow your program down because it increases the size of the text segment, placing an added burden on the memory system (we’ll explain this in greater detail shortly). A good rule of thumb is to look elsewhere for performance when the loop innards exceed three or four statements.

Loops Containing Procedure Calls

As with fat loops, loops containing subroutine or function calls generally aren’t good candidates for unrolling. There are several reasons. First, they often contain a fair number of instructions already. And if the subroutine being called is fat, it makes the loop that calls it fat as well. The size of the loop may not be apparent when you look at the loop; the function call can conceal many more instructions.

Second, when the calling routine and the subroutine are compiled separately, it’s impossible for the compiler to intermix instructions. A loop that is unrolled into a series of function calls behaves much like the original loop, before unrolling.

Last, function call overhead is expensive. Registers have to be saved; argument lists have to be prepared. The time spent calling and returning from a subroutine can be much greater than that of the loop overhead. Unrolling to amortize the cost of the loop structure over several calls doesn’t buy you enough to be worth the effort.

The general rule when dealing with procedures is to first try to eliminate them in the “remove clutter” phase, and when this has been done, check to see if unrolling gives an additional performance improvement.

Loops with Branches in Them

In The Section Called “Introduction” we showed you how to eliminate certain types of branches, but of course, we couldn’t get rid of them all. In cases of iteration-independent branches, there might be some benefit to loop unrolling. The IF test becomes part of the operations that must be counted to determine the value of loop unrolling. Below is a doubly nested loop. The inner loop tests the value of B(J,I):

DO I=1,N

DO J=1,N

IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C

ENDDO

Each iteration is independent of every other, so unrolling it won’t be a problem. We’ll just leave the outer loop undisturbed:

II = IMOD (N,4)

DO I=1,N

DO J=1,II

IF (B(J,I) .GT. 1.0)

+ A(J,I) = A(J,I) + B(J,I) * C

ENDDO

DO J=II+1,N,4

IF (B(J,I) .GT. 1.0)

+ A(J,I) = A(J,I) + B(J,I) * C

IF (B(J+1,I) .GT. 1.0)

+ A(J+1,I) = A(J+1,I) + B(J+1,I) * C

IF (B(J+2,I) .GT. 1.0)

+ A(J+2,I) = A(J+2,I) + B(J+2,I) * C

IF (B(J+3,I) .GT. 1.0)

+ A(J+3,I) = A(J+3,I) + B(J+3,I) * C

ENDDO

This approach works particularly well if the processor you are using supports conditional execution. As described earlier, conditional execution can replace a branch and an operation with a single conditionally executed assignment. On a superscalar processor with conditional execution, this unrolled loop executes quite nicely.

Nested Loops*

When you embed loops within other loops, you create a loop nest. The loop or loops in the center are called the inner loops. The surrounding loops are called outer loops. Depending on the construction of the loop nest, we may have some flexibility in the ordering of the loops. At times, we can swap the outer and inner loops with great benefit. In the next sections we look at some common loop nestings and the optimizations that can be performed on these loop nests.

Often when we are working with nests of loops, we are working with multidimensional arrays. Computing in multidimensional arrays can lead to non-unit-stride memory access. Many of the optimizations we perform on loop nests are meant to improve the memory access patterns.

First, we examine the computation-related optimizations followed by the memory optimizations.

Outer Loop Unrolling

If you are faced with a loop nest, one simple

Online Book Reader

High Performance Computing - Charles Severance [60]

®Online Book Reader