High Performance Computing - Charles Severance [105]
The output of this program is as follows:
E6000: f90 heat90.f
E6000:a.out
1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
21 100.00 82.38 66.34 50.30 38.18 26.06 18.20 10.35 5.18 0.00
41 100.00 87.04 74.52 61.99 50.56 39.13 28.94 18.75 9.38 0.00
61 100.00 88.36 76.84 65.32 54.12 42.91 32.07 21.22 10.61 0.00
81 100.00 88.74 77.51 66.28 55.14 44.00 32.97 21.93 10.97 0.00
101 100.00 88.84 77.70 66.55 55.44 44.32 33.23 22.14 11.07 0.00
121 100.00 88.88 77.76 66.63 55.52 44.41 33.30 22.20 11.10 0.00
141 100.00 88.89 77.77 66.66 55.55 44.43 33.32 22.22 11.11 0.00
161 100.00 88.89 77.78 66.66 55.55 44.44 33.33 22.22 11.11 0.00
181 100.00 88.89 77.78 66.67 55.55 44.44 33.33 22.22 11.11 0.00
E6000:
If you look closely, this output is the same as the red-black implementation. That is because in FORTRAN 90:
ROD(2:9) = (ROD(1:8) + ROD(3:10) ) / 2
is a single assignment statement. As shown in Figure 4.4, the right side is completely evaluated before the resulting array section is assigned into ROD(2:9). For a moment, that might seem unnatural, but consider the following statement:
I = I + 1
We know that if I starts with 5, it's incremented up to six by this statement. That happens because the right side (5+1) is evaluated before the assignment of 6 into I is performed. In FORTRAN 90, a variable can be an entire array. So, this is a red-black operation. There is an "old" ROD on the right side and a "new" ROD on the left side!
To really "think" FORTRAN 90, it's good to pretend you are on an SIMD system with millions of little CPUs. First we carefully align the data, sliding it around, and then— wham— in a single instruction, we add all the aligned values in an instant. Figure 4.4 shows graphically this act of "aligning" the values and then adding them. The data flow graph is extremely simple. The top two rows are read-only, and the data flows from top to bottom. Using the temporary space eliminates the seeming dependency. This approach of "thinking SIMD" is one of the ways to force ourselves to focus our thoughts on the data rather than the control. SIMD may not be a good architecture for your problem but if you can express it so that SIMD could work, a good SPMD environment can take advantage of the data parallelism that you have identified.
The above example actually highlights one of the challenges in producing an efficient implementation of FORTRAN 90. If these arrays contained 10 million elements, and the compiler used a simple approach, it would need 30 million elements for the old "left" values, the old "right" values, and for the new values. Data flow optimization is needed to determine just how much extra data must be maintained to give the proper results. If the compiler is clever, the extra memory can be quite small:
Figure 4.4. Data alignment and computations
SAVE1 = ROD(1)
DO I=2,9
SAVE2 = ROD(I)
ROD(I) = (SAVE1 + ROD(I+1) ) / 2
SAVE1 = SAVE2
ENDDO
This does not have the parallelism that the full red-black implementation has, but it does produce the correct results with only two extra data elements. The trick is to save the old "left" value just before you wipe it out. A good FORTRAN 90 compiler uses data flow analysis, looking at a template of how the computation moves across the data to see if it can save a few elements for a short period of time to alleviate the need for a complete extra copy of the data.
The advantage of the FORTRAN 90 language is that it's up to the compiler whether it uses a complete copy of the array or a few data elements to insure that the program executes properly. Most importantly, it can change its approach as you move from one architecture to another.
FORTRAN 90 Versus FORTRAN 77
Interestingly, FORTRAN 90 has never been fully embraced by the high performance community. There are a few reasons why:
There is a concern that the use of pointers and dynamic data