High Performance Computing - Charles Severance [101]
A person familiar with the theory of heat flow will also point out that the above loop doesn't exactly implement the heat flow model. The problem is that the values on the right side of the assignment in the ROD loop are supposed to be from the previous time step, and that the value on the left side is the next time step. Because of the way the loop is written, the ROD(I-1) value is from the next time step, as shown in The Section Called “Data-Parallel Problem: Heat Flow”.
This can be solved using a technique called red-black, where we alternate between two arrays. Figure 4.3 shows how the red-black version of the computation operates. This kills two birds with one stone! Now the mathematics is precisely correct, and there is no recurrence. Sounds like a real win-win situation.
Figure 4.2. Computing the new value for a cell
Figure 4.3. Using two arrays to eliminate a dependency
The only downside to this approach is that it takes twice the memory storage and twice the memory bandwidth.[65] The modified code is as follows:
PROGRAM HEATRED
PARAMETER(MAXTIME=200)
INTEGER TICKS,I,MAXTIME
REAL*4 RED(10),BLACK(10)
RED(1) = 100.0
BLACK(1) = 100.0
DO I=2,9
RED(I) = 0.0
ENDDO
RED(10) = 0.0
BLACK(10) = 0.0
DO TICKS=1,MAXTIME,2
IF ( MOD(TICKS,20) .EQ. 1 ) PRINT 100,TICKS,(RED(I),I=1,10)
DO I=2,9
BLACK(I) = (RED(I-1) + RED(I+1) ) / 2
ENDDO
DO I=2,9
RED(I) = (BLACK(I-1) + BLACK(I+1) ) / 2
ENDDO
ENDDO
100 FORMAT(I4,10F7.2)
END
The output for the modified program is:
% f77 heatred.f
heatred.f:
MAIN heatred:
% a.out
1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
21 100.00 82.38 66.34 50.30 38.18 26.06 18.20 10.35 5.18 0.00
41 100.00 87.04 74.52 61.99 50.56 39.13 28.94 18.75 9.38 0.00
61 100.00 88.36 76.84 65.32 54.12 42.91 32.07 21.22 10.61 0.00
81 100.00 88.74 77.51 66.28 55.14 44.00 32.97 21.93 10.97 0.00
101 100.00 88.84 77.70 66.55 55.44 44.32 33.23 22.14 11.07 0.00
121 100.00 88.88 77.76 66.63 55.52 44.41 33.30 22.20 11.10 0.00
141 100.00 88.89 77.77 66.66 55.55 44.43 33.32 22.22 11.11 0.00
161 100.00 88.89 77.78 66.66 55.55 44.44 33.33 22.22 11.11 0.00
181 100.00 88.89 77.78 66.67 55.55 44.44 33.33 22.22 11.11 0.00
%
Interestingly, the modified program takes longer to converge than the first version. It converges at Time step 181 rather than 101. If you look at the first version, because of the recurrence, the heat ended up flowing up faster from left to right because the left element of each average was the next-time-step value. It may seem nifty, but it's wrong.[66] Generally, in this problem, either approach converges to the same eventual values within the limits of floating-point representation.
This heat flow problem is extremely simple, and in its red-black form, it's inherently very parallel with very simple data interactions. It's a good model for a wide range of problems where we are discretizing two-dimensional or three-dimensional space and performing some simple simulations in that space.
This problem can usually be scaled up by making a finer grid. Often, the benefit of scalable processors is to allow a finer grid rather than a faster time to solution. For example, you might be able to to a worldwide weather simulation using a 200-mile grid in four hours on one processor. Using 100 processors, you may be able to do the simulation using a 20-mile grid in four hours with much more accurate results. Or, using 400 processors, you can do the finer grid simulation in one hour.
Explicity Parallel Languages*
As we've seen throughout this book, one of biggest tuning challenges is getting the compiler to recognize that a particular code segment can be parallelized. This is particularly true for numerical codes, where the potential payback is greatest. Think about this: if you know that something is parallel, why should there be any difficulty getting the compiler to recognize it? Why can't you just write it down, and