* Broadcast the master copy of BLACK to all of the processes
      CALL MPI_BCAST(BLACK,(ROWS+2)*(COLS+2),MPI_DOUBLE_PRECISION,
     +    0,MPI_COMM_WORLD,IERR)
Now we perform the subset computation on each process. Note that we are using global coordinates because the array has the same shape on each of the processes. All we need to do is make sure we set up our particular strip of columns according to S and E:
* Perform the flow on our subset
      DO C=S,E
        DO R=1,ROWS
          RED(R,C) = ( BLACK(R,C) +
     +      BLACK(R,C-1) + BLACK(R-1,C) +
     +      BLACK(R+1,C) + BLACK(R,C+1) ) / 5.0
        ENDDO
      ENDDO
Now we need to gather each process's strip back into the appropriate strip of the master array for rebroadcast in the next time step. We could also change the loop in the master to receive the messages in any order and check the STATUS variable to see which strip it received; a sketch of that variant appears after the code:
* Gather back up into the BLACK array in master (INUM = 0)
      IF ( INUM .EQ. 0 ) THEN
        DO C=S,E
          DO R=1,ROWS
            BLACK(R,C) = RED(R,C)
          ENDDO
        ENDDO
        DO I=1,NPROC-1
          CALL MPE_DECOMP1D(COLS, NPROC, I, LS, LE, IERR)
          MYLEN = ( LE - LS ) + 1
          SRC = I
          TAG = 0
          CALL MPI_RECV(BLACK(0,LS),MYLEN*(ROWS+2),
     +      MPI_DOUBLE_PRECISION, SRC, TAG,
     +      MPI_COMM_WORLD, STATUS, IERR)
*         Print *,'Recv',I,MYLEN
        ENDDO
      ELSE
        MYLEN = ( E - S ) + 1
        DEST = 0
        TAG = 0
        CALL MPI_SEND(RED(0,S),MYLEN*(ROWS+2),MPI_DOUBLE_PRECISION,
     +    DEST, TAG, MPI_COMM_WORLD, IERR)
*       Print *,'Send',INUM,MYLEN
      ENDIF
      ENDDO
We use MPE_DECOMP1D to determine which strip we’re receiving from each process.
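Here is a minimal sketch of the out-of-order variant mentioned above (it is not part of the original program): the master probes for whichever strip arrives first, learns the sender from STATUS, and only then receives the data directly into the correct columns:
* Wait for any strip to arrive; MPI_PROBE fills in STATUS
* without consuming the message, so we can identify the
* sender before posting the matching receive
      DO I=1,NPROC-1
        CALL MPI_PROBE(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
     +    STATUS, IERR)
        SRC = STATUS(MPI_SOURCE)
* Recompute the sender's strip and receive it into place
        CALL MPE_DECOMP1D(COLS, NPROC, SRC, LS, LE, IERR)
        MYLEN = ( LE - LS ) + 1
        CALL MPI_RECV(BLACK(0,LS),MYLEN*(ROWS+2),
     +    MPI_DOUBLE_PRECISION, SRC, 0,
     +    MPI_COMM_WORLD, STATUS, IERR)
      ENDDO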
In some applications, the value that must be gathered is a sum or another single value. To accomplish this, you can use one of the MPI reduction routines that coalesce a set of distributed values into a single value using a single call.
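For example, here is a minimal sketch (not part of the original program; LOCSUM and GLOBSUM are hypothetical variables) that sums one DOUBLE PRECISION value from every process into GLOBSUM on the master:
* Hypothetical example: combine one local value from each
* process into a single global sum on rank 0
      DOUBLE PRECISION LOCSUM, GLOBSUM
      CALL MPI_REDUCE(LOCSUM, GLOBSUM, 1, MPI_DOUBLE_PRECISION,
     +    MPI_SUM, 0, MPI_COMM_WORLD, IERR)
Every process must make the call, but only the root (process 0 here) receives the combined result.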
Again at the end, we dump out the data for testing. However, since it has all been gathered back onto the master process, we only need to dump it on one process:
* Dump out data for verification
      IF ( INUM .EQ. 0 .AND. ROWS .LE. 20 ) THEN
        FNAME = '/tmp/mheatout'
        OPEN(UNIT=9,NAME=FNAME,FORM='formatted')
        DO C=1,COLS
          WRITE(9,100)(BLACK(R,C),R=1,ROWS)
  100     FORMAT(20F12.6)
        ENDDO
        CLOSE(UNIT=9)
      ENDIF
      CALL MPI_FINALIZE(IERR)
      END
When this program executes with four processes, it produces the following output:
% mpif77 -c mheat.f
mheat.f:
MAIN mheat:
% mpif77 -o mheat mheat.o -lmpe
% mheat -np 4
Calling MPI_INIT
My Share 1 4 51 100
My Share 0 4 1 50
My Share 3 4 151 200
My Share 2 4 101 150
%
Each "My Share" line shows a process's rank, the total number of processes, and the first and last columns of the strip assigned to that process.
So that is a somewhat contrived example of the broadcast/gather approach to parallelizing an application. If the data structures are the right size and the amount of computation relative to communication is appropriate, this can be a very effective approach that may require the smallest number of code modifications compared to a single-processor version of the code.
MPI Summary
Whether you choose PVM or MPI depends on which library the vendor of your system prefers. Sometimes MPI is the better choice because it contains the newest features, such as support for hardware-supported multicast or broadcast, that can significantly improve the overall performance of a scatter-gather application.
A good text on MPI is Using MPI — Portable Parallel Programming with the Message-Passing Interface, by William Gropp, Ewing Lusk, and Anthony Skjellum (MIT Press). You may also want to retrieve and print the MPI specification from http://www.netlib.org/mpi/.
Closing Notes
In this chapter we have looked at the “assembly language” of parallel programming. While it can seem daunting to rethink your application, there are often some simple changes you can make to port your code to message passing. Depending on the application, a master-slave, broadcast-gather, or decomposed data approach might be most appropriate.
It’s important to realize that some applications just don’t decompose into message passing very well. You may be working with just such an application. Once you have some experience with message passing, it becomes easier to identify the critical points where data must be communicated.