High Performance Computing - Charles Severance [96]
ENDDO
To manually parallelize this loop, we insert a directive at the beginning of the loop:
C$OMP PARALLEL DO
DO I=1,1000000
TMP1 = ( A(I) ** 2 ) + ( B(I) ** 2 )
TMP2 = SQRT(TMP1)
B(I) = TMP2
ENDDO
C$OMP END PARALLEL DO
When this directive is encountered at runtime, the single thread again summons the other threads to join the computation. However, before the threads can start working on the loop, a few details must be handled. The PARALLEL DO directive accepts the same data classification and scoping clauses as the parallel section directive described earlier. We must indicate which variables are shared across all threads and which variables have a separate copy in each thread. It would be a disaster to have TMP1 and TMP2 shared across threads: while one thread is taking the square root of TMP1, another thread could overwrite the contents of TMP1. The arrays A and B come from outside the loop, so they must be shared. We need to augment the directive as follows:
C$OMP PARALLEL DO SHARED(A,B) PRIVATE(I,TMP1,TMP2)
DO I=1,1000000
TMP1 = ( A(I) ** 2 ) + ( B(I) ** 2 )
TMP2 = SQRT(TMP1)
B(I) = TMP2
ENDDO
C$OMP END PARALLEL DO
The iteration variable I also must be a thread-private variable. As the different threads increment their way through their particular subset of the arrays, they don’t want to be modifying a global value for I.
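As a minimal sketch, the loop above can be wrapped in a complete fixed-form program so that it can be compiled and run. The program name HYPOT, the PARAMETER N, and the 3-4-5 initial values are assumptions added for illustration and do not come from the original listing.

```fortran
      PROGRAM HYPOT
      IMPLICIT NONE
C     N and the initial values below are illustrative assumptions.
      INTEGER N
      PARAMETER (N = 1000000)
      REAL A(N), B(N), TMP1, TMP2
      INTEGER I
C     Fill A and B with a 3-4-5 triangle so every result is 5.0.
      DO I = 1, N
         A(I) = 3.0
         B(I) = 4.0
      ENDDO
C     A and B are shared; I, TMP1 and TMP2 get one copy per thread.
C$OMP PARALLEL DO SHARED(A,B) PRIVATE(I,TMP1,TMP2)
      DO I = 1, N
         TMP1 = ( A(I) ** 2 ) + ( B(I) ** 2 )
         TMP2 = SQRT(TMP1)
         B(I) = TMP2
      ENDDO
C$OMP END PARALLEL DO
      PRINT *, 'B(1) =', B(1), ' B(N) =', B(N)
      END
```

With a compiler that supports OpenMP (for example, gfortran with the -fopenmp option), every element of B should end up as 5.0 regardless of the number of threads. Without OpenMP enabled, the directives are treated as comments and the loop runs serially with the same result.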
There are a number of other options that control how data is handled across the threads. The following list summarizes some of the other data semantics available:
Firstprivate: These are thread-private variables that take an initial value from the global variable of the same name immediately before the loop begins executing.
Lastprivate: These are thread-private variables except that the thread that executes the last iteration of the loop copies its value back into the global variable of the same name.
Reduction: This indicates that a variable participates in a reduction operation that can be safely done in parallel. This is done by forming a partial reduction using a local variable in each thread and then combining the partial results at the end of the loop.
Each vendor may have different terms to indicate these data semantics, but most support all of these common semantics. Figure 3.23 shows how the different types of data semantics operate.
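The three data semantics listed above might be combined on a single loop as in the following sketch. It uses the OpenMP clause spellings FIRSTPRIVATE, LASTPRIVATE, and REDUCTION. The program name and the variables FACTOR, LAST, and TOTAL are hypothetical and chosen only to make the effect of each clause visible.

```fortran
      PROGRAM CLAUSE
      IMPLICIT NONE
C     All names and values here are illustrative assumptions.
      INTEGER N
      PARAMETER (N = 1000)
      REAL A(N), FACTOR, LAST, TOTAL
      INTEGER I
      DO I = 1, N
         A(I) = 1.0
      ENDDO
      FACTOR = 2.0
      LAST = 0.0
      TOTAL = 0.0
C     FIRSTPRIVATE: each thread's FACTOR starts as the global 2.0.
C     LASTPRIVATE:  LAST from iteration N is copied back afterward.
C     REDUCTION:    per-thread partial sums are combined into TOTAL.
C$OMP PARALLEL DO SHARED(A) PRIVATE(I) FIRSTPRIVATE(FACTOR)
C$OMP& LASTPRIVATE(LAST) REDUCTION(+:TOTAL)
      DO I = 1, N
         LAST = FACTOR * A(I) + REAL(I)
         TOTAL = TOTAL + FACTOR * A(I)
      ENDDO
C$OMP END PARALLEL DO
      PRINT *, 'LAST  =', LAST
      PRINT *, 'TOTAL =', TOTAL
      END
```

After the loop, LAST should hold the value computed in iteration N (2.0 * 1.0 + 1000.0 = 1002.0) and TOTAL should be 2000.0, the same values a serial run produces.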
Now that we have the data environment set up for the loop, the only remaining problem that must be solved is which threads will perform which iterations. It turns out that this is not a trivial task, and a wrong choice can have a significant negative impact on our overall performance.
Iteration scheduling
There are two basic techniques (along with a few variations) for dividing the iterations in a loop between threads. We can look at two extreme examples to get an idea of how this works:
C VECTOR ADD
DO IPROB=1,10000
A(IPROB) = B(IPROB) + C(IPROB)
ENDDO
C PARTICLE TRACKING
DO IPROB=1,10000
RANVAL = RAND(IPROB)
CALL ITERATE_ENERGY(RANVAL)
ENDDO
Figure 3.23. Variables during a parallel region
In both loops, all the computations are independent, so if there were 10,000 processors, each processor could execute a single iteration. In the vector-add example, each iteration would be relatively short, and the execution time would be relatively constant from iteration to iteration. In the particle tracking example, each iteration chooses a random number for an initial particle position and iterates to find the minimum energy. Each iteration takes a relatively long time to complete, and there will be a wide variation of completion times from iteration to iteration.
These two examples are effectively the ends of a continuous spectrum of the iteration scheduling challenges facing the FORTRAN parallel runtime environment:
Static
At the beginning of a parallel loop, each thread takes a fixed, contiguous portion of the loop's iterations, based on the number of threads executing the loop.
Dynamic
With dynamic scheduling, each thread processes a chunk of data and when it has completed processing, a new chunk