High Performance Computing - Charles Severance [74]
Output Dependencies
The third class of data dependencies, output dependencies, is of particular interest to users of parallel computers, especially multiprocessors. Output dependencies involve getting the right values into the right variables when all calculations are complete; if the wrong value is written last, an output dependency has been violated. The loop below assigns new values to two elements of the vector A with each iteration:
DO I=1,N
A(I) = C(I) * 2.
A(I+2) = D(I) + E
ENDDO
As always, we won’t have any problems if we execute the code sequentially. But if several iterations are performed together, and statements are reordered, then incorrect values can be assigned to the last elements of A. For example, in the naive vectorized equivalent below, A(I+2) takes the wrong value because the assignments occur out of order:
A(I) = C(I) * 2.
A(I+1) = C(I+1) * 2.
A(I+2) = C(I+2) * 2.
A(I+2) = D(I) + E ← Output dependency violated
A(I+3) = D(I+1) + E
A(I+4) = D(I+2) + E
Whether or not you have to worry about output dependencies depends on whether you are actually parallelizing the code. Your compiler will be conscious of the danger, and will be able to generate legal code — and possibly even fast code, if it’s clever enough. But output dependencies occasionally become a problem for programmers.
Dependencies Within an Iteration
We have looked at dependencies that cross iteration boundaries but we haven’t looked at dependencies within the same iteration. Consider the following code fragment:
DO I = 1,N
D = B(I) * 17
A(I) = D + 14
ENDDO
When we look at the loop, the variable D has a flow dependency. The second statement cannot start until the first statement has completed. At first glance this might appear to limit parallelism significantly. When we look closer and manually unroll several iterations of the loop, the situation gets worse:
D = B(I) * 17
A(I) = D + 14
D = B(I+1) * 17
A(I+1) = D + 14
D = B(I+2) * 17
A(I+2) = D + 14
Now, the variable D has flow, output, and antidependencies. It looks like this loop has no hope of running in parallel. However, there is a simple solution to this problem at the cost of some extra memory space, using a technique called promoting a scalar to a vector. We define D as an array with N elements and rewrite the code as follows:
DO I = 1,N
D(I) = B(I) * 17
A(I) = D(I) + 14
ENDDO
Now the iterations are all independent and can be run in parallel. Within each iteration, the first statement must run before the second statement.
Reductions
The sum of an array of numbers is one example of a reduction — so called because it reduces a vector to a scalar. The following loop to determine the total of the values in an array certainly looks as though it could be run in parallel:
SUM = 0.0
DO I=1,N
SUM = SUM + A(I)
ENDDO
However, if we perform our unrolling trick, it doesn’t look very parallel:
SUM = SUM + A(I)
SUM = SUM + A(I+1)
SUM = SUM + A(I+2)
This loop also has all three types of dependencies and looks impossible to parallelize. If we are willing to accept the potential effect of rounding, we can add some parallelism to this loop as follows (again, we have omitted the preconditioning loop that would handle the leftover iterations when N is not a multiple of 4):
SUM0 = 0.0
SUM1 = 0.0
SUM2 = 0.0
SUM3 = 0.0
DO I=1,N,4
SUM0 = SUM0 + A(I)
SUM1 = SUM1 + A(I+1)
SUM2 = SUM2 + A(I+2)
SUM3 = SUM3 + A(I+3)
ENDDO
SUM = SUM0 + SUM1 + SUM2 + SUM3
Again, this is not precisely the same computation, but all four partial sums can be computed independently. The partial sums are combined at the end of the loop.
Loops that look for the maximum or minimum element in an array, or multiply all the elements of an array, are also reductions. Likewise, some of these can be reorganized into partial results, as with the sum, to expose more parallelism.