High Performance Computing - Charles Severance [95]
C$ASSERT NO_SIDE_EFFECTS
DO I=1,N
CALL BIGSTUFF (A,B,C,I,J,K)
END DO
Even if the compiler has all the source code, use of common variables or equivalences may mask call independence.
Manual Parallelism
At some point, you get tired of giving the compiler advice and hoping that it will reach the conclusion to parallelize your loop. At that point you move into the realm of manual parallelism. Luckily the programming model provided in FORTRAN insulates you from much of the details of exactly how multiple threads are managed at runtime. You generally control explicit parallelism by adding specially formatted comment lines to your source code. There are a wide variety of formats of these directives. In this section, we use the syntax that is part of the OpenMP (see http://cnx.org/content/m32814/1.3/www.openmp.org) standard. You generally find similar capabilities in each of the vendor compilers. The precise syntax varies slightly from vendor to vendor. (That alone is a good reason to have a standard.)
The basic programming model is that you are executing a section of code with either a single thread or multiple threads. The programmer adds a directive to summon additional threads at various points in the code. The most basic construct is called the parallel region.
Parallel regions
In a parallel region, the threads simply appear between two statements of straight-line code. A very trivial example might be the following using the OpenMP directive syntax:
PROGRAM ONE
EXTERNAL OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS
INTEGER OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS
IGLOB = OMP_GET_MAX_THREADS()
PRINT *,’Hello There’
C$OMP PARALLEL PRIVATE(IAM), SHARED(IGLOB)
IAM = OMP_GET_THREAD_NUM()
PRINT *, ’I am ’, IAM, ’ of ’, IGLOB
C$OMP END PARALLEL
PRINT *,’All Done’
END
The C$OMP is the sentinel that indicates that this is a directive and not just another comment. The output of the program when run looks as follows:
% setenv OMP_NUM_THREADS 4
% a.out
Hello There
I am 0 of 4
I am 3 of 4
I am 1 of 4
I am 2 of 4
All Done
%
Execution begins with a single thread. As the program encounters the PARALLEL directive, the other threads are activated to join the computation. So in a sense, as execution passes the first directive, one thread becomes four. Four threads execute the two statements between the directives. As the threads are executing independently, the order in which the print statements are displayed is somewhat random. The threads wait at the END PARALLEL directive until all threads have arrived. Once all threads have completed the parallel region, a single thread continues executing the remainder of the program.
In Figure 3.22, the PRIVATE(IAM) indicates that the IAM variable is not shared across all the threads but instead, each thread has its own private version of the variable. The IGLOB variable is shared across all the threads. Any modification of IGLOB appears in all the other threads instantly, within the limitations of the cache coherency.
Figure 3.22. Data interactions during a parallel region
During the parallel region, the programmer typically divides the work among the threads. This pattern of going from single-threaded to multithreaded execution may be repeated many times throughout the execution of an application.
Because input and output are generally not thread-safe, to be completely correct, we should indicate that the print statement in the parallel section is only to be executed on one processor at any one time. We use a directive to indicate that this section of code is a critical section. A lock or other synchronization mechanism ensures that no more than one processor is executing the statements in the critical section at any one time:
C$OMP CRITICAL
PRINT *, ’I am ’, IAM, ’ of ’, IGLOB
C$OMP END CRITICAL
Parallel loops
Quite often the areas of the code that are most valuable to execute in parallel are loops. Consider the following loop:
DO I=1,1000000
TMP1 = ( A(I) ** 2 ) + ( B(I) ** 2 )
TMP2 = SQRT(TMP1)
B(I)