High Performance Computing - Charles Severance [74]
Output Dependencies
The third class of data dependencies, output dependencies, is of particular interest to users of parallel computers, especially multiprocessors. Output dependencies involve getting the right values into the right variables when all calculations are complete; if the wrong value is written last, an output dependency has been violated. The loop below assigns new values to two elements of the vector A with each iteration:
DO I=1,N
A(I) = C(I) * 2.
A(I+2) = D(I) + E
ENDDO
As always, we won’t have any problems if we execute the code sequentially. But if several iterations are performed together, and statements are reordered, then incorrect values can be assigned to the last elements of A. For example, in the naive vectorized equivalent below, A(I+2) takes the wrong value because the assignments occur out of order:
A(I) = C(I) * 2.
A(I+1) = C(I+1) * 2.
A(I+2) = C(I+2) * 2.
A(I+2) = D(I) + E ← Output dependency violated
A(I+3) = D(I+1) + E
A(I+4) = D(I+2) + E
Whether or not you have to worry about output dependencies depends on whether you are actually parallelizing the code. Your compiler will be conscious of the danger, and will be able to generate legal code — and possibly even fast code, if it’s clever enough. But output dependencies occasionally become a problem for programmers.
Dependencies Within an Iteration
We have looked at dependencies that cross iteration boundaries but we haven’t looked at dependencies within the same iteration. Consider the following code fragment:
DO I = 1,N
D = B(I) * 17
A(I) = D + 14
ENDDO
When we look at the loop, the variable D has a flow dependency. The second statement cannot start until the first statement has completed. At first glance this might appear to limit parallelism significantly. When we look closer and manually unroll several iterations of the loop, the situation gets worse:
D = B(I) * 17
A(I) = D + 14
D = B(I+1) * 17
A(I+1) = D + 14
D = B(I+2) * 17
A(I+2) = D + 14
Now, the variable D has flow, output, and antidependencies. It looks like this loop has no hope of running in parallel. However, there is a simple solution to this problem at the cost of some extra memory space, using a technique called promoting a scalar to a vector. We define D as an array with N elements and rewrite the code as follows:
DO I = 1,N
D(I) = B(I) * 17
A(I) = D(I) + 14
ENDDO
Now the iterations are all independent and can be run in parallel. Within each iteration, the first statement must run before the second statement.
Reductions
The sum of an array of numbers is one example of a reduction — so called because it reduces a vector to a scalar. The following loop to determine the total of the values in an array certainly looks as though it could be run in parallel:
SUM = 0.0
DO I=1,N
SUM = SUM + A(I)
ENDDO
However, if we perform our unrolling trick, it doesn’t look very parallel:
SUM = SUM + A(I)
SUM = SUM + A(I+1)
SUM = SUM + A(I+2)
This loop also has all three types of dependencies and looks impossible to parallelize. If we are willing to accept the potential effect of rounding, we can add some parallelism to this loop as follows (again, we have omitted the preconditioning loop that would handle the leftover iterations when N is not a multiple of 4):
SUM0 = 0.0
SUM1 = 0.0
SUM2 = 0.0
SUM3 = 0.0
DO I=1,N,4
SUM0 = SUM0 + A(I)
SUM1 = SUM1 + A(I+1)
SUM2 = SUM2 + A(I+2)
SUM3 = SUM3 + A(I+3)
ENDDO
SUM = SUM0 + SUM1 + SUM2 + SUM3
Again, this is not precisely the same computation, but all four partial sums can be computed independently. The partial sums are combined at the end of the loop.
Loops that look for the maximum or minimum element in an array, or multiply all the elements of an array, are also reductions. Likewise, some of these can be reorganized into partial results, as with the sum, to expose more parallelism.