High Performance Computing - Charles Severance [57]
Sometimes this means we make changes that are not beautiful. However, they are often quick.
Exercises*
Exercise 2.22.1.
How would you simplify the following loop conditional?
DO I=1,N
  A(I) = A(I) * B
  IF (I .EQ. N/2) A(I) = 0.
ENDDO
Exercise 2.22.2.
Time this loop on your computer, both with and without the test. Run it with three sets of data: one with all A(I)s less than SMALL, one with all A(I)s greater than SMALL, and one with an even split. When is it better to leave the test in the loop, if ever?
PARAMETER (SMALL = 1.E-20)
DO I=1,N
  IF (ABS(A(I)) .GE. SMALL) THEN
    B(I) = B(I) + A(I) * C
  ENDIF
ENDDO
Exercise 2.22.3.
Write a simple program that calls a simple subroutine in its inner loop. Time the program execution. Then tell the compiler to inline the routine and test the performance again. Finally, modify the code to perform the operations in the body of the loop and time the code. Which option ran faster? You may have to look at the generated machine code to figure out why.
2.4. Loop Optimizations
Introduction*
In nearly all high performance applications, loops are where the majority of the execution time is spent. In the section called “Introduction” we examined ways in which application developers introduce clutter into loops, possibly slowing those loops down. In this chapter we focus on techniques used to improve the performance of these “clutter-free” loops. Sometimes the compiler is clever enough to generate the faster versions of the loops, and other times we have to do some rewriting of the loops ourselves to help the compiler.
It’s important to remember that one compiler’s performance-enhancing modifications are another compiler’s clutter. When you make modifications in the name of performance, you must make sure you’re actually helping by testing the performance with and without them. Likewise, when you move to another architecture, you need to make sure your modifications aren’t hindering performance there. For these reasons, choose your performance-related modifications wisely, and keep the original (simple) version of the code for testing on new architectures. If the benefit of a modification is small, you should probably leave the code in its simplest and clearest form.
We look at a number of different loop optimization techniques, including:
Loop unrolling
Nested loop optimization
Loop interchange
Memory reference optimization
Blocking
Out-of-core solutions
Someday, it may be possible for a compiler to perform all these loop optimizations automatically. Typically loop unrolling is performed as part of the normal compiler optimizations. Other optimizations may have to be triggered using explicit compile-time options. As you contemplate making manual changes, look carefully at which of these optimizations can be done by the compiler. Also run some tests to determine if the compiler optimizations are as good as hand optimizations.
Operation Counting*
Before you begin to rewrite a loop body or reorganize the order of the loops, you must have some idea of what the body of the loop does for each iteration. Operation counting is the process of surveying a loop to understand the operation mix. You need to count the number of loads, stores, floating-point operations, integer operations, and library calls per iteration of the loop. From the count, you can see how well the operation mix of a given loop matches the capabilities of the processor. Of course, operation counting doesn’t guarantee that the compiler will generate an efficient representation of a loop.[42] But it generally provides enough insight into the loop to direct tuning efforts.
Bear in mind that an instruction mix that is balanced for one machine may be imbalanced for another. Processors on the market today can generally issue some combination of one to four operations per clock cycle. Address arithmetic is often embedded in the instructions that reference memory. Because the compiler can replace complicated loop address calculations with simple expressions (provided