High Performance Computing - Charles Severance [73]
Flow Dependencies
For comparison, look at the next code fragment:
DO I=2,N
A(I) = A(I-1) + B(I)
ENDDO
This loop has the regularity of the previous example, but one of the subscripts is changed. Again, it’s useful to manually unroll the loop and look at several iterations together:
A(I) = A(I-1) + B(I)
A(I+1) = A(I) + B(I+1)
A(I+2) = A(I+1) + B(I+2)
In this case, there is a dependency problem. The value of A(I+1) depends on the value of A(I), the value of A(I+2) depends on A(I+1), and so on; every iteration depends on the result of a previous one. Dependencies that extend back to a previous calculation and perhaps a previous iteration (like this one), are loop carried flow dependencies or backward dependencies. You often see such dependencies in applications that perform Gaussian elimination on certain types of matrices, or numerical solutions to systems of differential equations. However, it is impossible to run such a loop in parallel (as written); the processor must wait for intermediate results before it can proceed.
In some cases, flow dependencies are impossible to fix; calculations are so dependent upon one another that we have no choice but to wait for previous ones to complete. Other times, dependencies are a function of the way the calculations are expressed. For instance, the loop above can be changed to reduce the dependency. By replicating some of the arithmetic, we can make it so that the second and third iterations depend on the first, but not on one another. The operation count goes up — we have an extra addition that we didn’t have before — but we have reduced the dependency between iterations:
DO I=2,N,2
A(I) = A(I-1) + B(I)
A(I+1) = A(I-1) + B(I) + B(I+1)
ENDDO
The speed increase on a workstation won’t be great (most machines run the recast loop more slowly). However, some parallel computers can trade off additional calculations for reduced dependency and chalk up a net win.
Antidependencies
It’s a different story when there is a loop-carried antidependency, as in the code below:
DO I=1,N
A(I) = B(I) * E
B(I) = A(I+2) * C
ENDDO
In this loop, there is an antidependency between the variable A(I) and the variable A(I+2). That is, you must be sure that the instruction that uses A(I+2) does so before the previous one redefines it. Clearly, this is not a problem if the loop is executed serially, but remember, we are looking for opportunities to overlap instructions. Again, it helps to pull the loop apart and look at several iterations together. We have recast the loop by making many copies of the first statement, followed by copies of the second:
A(I) = B(I) * E
A(I+1) = B(I+1) * E
A(I+2) = B(I+2) * E
...
B(I) = A(I+2) * C ← assignment makes use of the new
B(I+1) = A(I+3) * C value of A(I+2) incorrect.
B(I+2) = A(I+4) * C
The reference to A(I+2) needs to access an “old” value, rather than one of the new ones being calculated. If you perform all of the first statement followed by all of the second statement, the answers will be wrong. If you perform all of the second statement followed by all of the first statement, the answers will also be wrong. In a sense, to run the iterations in parallel, you must either save the A values to use for the second statement or store all of the B value in a temporary area until the loop completes.
We can also directly unroll the loop and find some parallelism:
1 A(I) = B(I) * E
2 B(I) = A(I+2) * C →
3 A(I+1) = B(I+1) * E | Output dependency
4 B(I+1) = A(I+3) * C |
5 A(I+2) = B(I+2) * E ←
6 B(I+2) = A(I+4) * C
Statements 1–4 could all be executed simultaneously. Once