High Performance Computing - Charles Severance [138]
fmoves a1@(-4,d0:l:4),fp0 ! Load of B(I)
To compute its memory address, this instruction multiplies d0 by 4, adds the contents of a1, and then subtracts 4. The resulting address is used to load 4 bytes into floating-point register fp0. This is almost a literal translation of fetching B(I). You can see how the assembly is set up to track high-level constructs.
It is almost as if the compiler were “trying” to show off and make use of the nifty assembly language instructions.
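The address arithmetic described above can be sketched in C. This is only an illustration: the function names and parameters are hypothetical, and addresses are modeled as plain 32-bit integers rather than real pointers. It shows why the operand a1@(-4,d0:l:4) lands on B(I) when FORTRAN arrays start at index 1.

```c
#include <stdint.h>

/* Effective address of an indexed operand such as a1@(-4,d0:l:4):
   base + index * scale + displacement.
   Names here are illustrative, not taken from the compiler output. */
static uint32_t effective_address(uint32_t base, int32_t index,
                                  int32_t scale, int32_t displacement)
{
    return base + (uint32_t)(index * scale) + (uint32_t)displacement;
}

/* Address of B(I) for a 1-based array of 4-byte elements:
   b_base + (I - 1) * 4, which is the same as b_base + I * 4 - 4. */
static uint32_t address_of_b(uint32_t b_base, int32_t i)
{
    return effective_address(b_base, i, 4, -4);
}
```

With I = 1 the result is the base address itself, so the -4 displacement is what absorbs FORTRAN's 1-based indexing.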
Like the Intel, this is not a load-store architecture. The fadds instruction adds a value from memory to a value in a register (fp0) and leaves the result of the addition in the register. Unlike the Intel 8088, we have enough registers to keep quite a few of the values used throughout the loop (I, N, and the addresses of A, B, and C) in registers, which saves memory operations.
C on the MC68020
In the next example, we compiled the C version of the loop with the normal optimization (-O) turned on. We see the C perspective on arrays in this code. C treats arrays as extensions of pointers; the loop index advances as an offset from a pointer to the beginning of the array:
! d3 = I
! d1 = Address of A
! d2 = Address of B
! d0 = Address of C
! a6@(20) = N
moveq #0,d3 ! Initialize I
bras L5 ! Jump to End of the loop
L1: movl d3,a1 ! Make copy of I
movl a1,d4 ! Again
asll #2,d4 ! Multiply by 4 (word size)
movl d4,a1 ! Put back in an address register
fmoves a1@(0,d2:l),fp0 ! Load B(I)
movl a6@(16),d0 ! Get address of C
fadds a1@(0,d0:l),fp0 ! Add C(I)
fmoves fp0,a1@(0,d1:l) ! Store into A(I)
addql #1,d3 ! Increment I
L5:
cmpl a6@(20),d3
blts L1
We first see the value of I being copied into several registers and multiplied by 4 (using a left shift of 2, an instance of strength reduction). Interestingly, the value in register a1 is I multiplied by 4. Registers d0, d1, and d2 hold the addresses of C, A, and B respectively. In the load, add, and store, a1 is the base of the address computation and d0, d1, and d2 are added as an offset to a1 to compute each address.
This is a simplistic optimization that is primarily trying to maximize the values that are kept in registers during loop execution. Overall, it’s a relatively literal translation of the C semantics into assembly. In many ways, C was designed to generate relatively efficient code without requiring a highly sophisticated optimizer.
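The addressing pattern in this listing can be written back out in C as a rough sketch. The function name addem_offsets and its signature are assumptions made for illustration, as is the 4-byte float size (checked at compile time). Each iteration scales the index to a byte offset with a shift and adds that offset to each array's base address, much as the generated code does with a1 and d0, d1, and d2.

```c
#include <stddef.h>

/* Assumption: float occupies 4 bytes, matching the shift by 2 below. */
_Static_assert(sizeof(float) == 4, "sketch assumes 4-byte float");

/* Hypothetical helper mirroring the -O listing: index scaled to a byte
   offset, then added to the base address of each array. */
void addem_offsets(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++) {
        size_t offset = (size_t)i << 2;  /* I * 4 via left shift */
        const float *bp = (const float *)((const char *)b + offset);
        const float *cp = (const float *)((const char *)c + offset);
        float *ap = (float *)((char *)a + offset);
        *ap = *bp + *cp;                 /* load B, add C, store A */
    }
}
```

Written this way, the per-iteration work of recomputing the offset is visible, which is the work the more aggressive optimization levels try to remove.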
More optimization
In this example, we are back to the FORTRAN version on the MC68020. We have compiled it with the highest level of optimization (-OLM) available on this compiler. Now we see a much more aggressive approach to the loop:
! a0 = Address of C(I)
! a1 = Address of B(I)
! a2 = Address of A(I)
L3:
fmoves a1@,fp0 ! Load B(I)
fadds a0@,fp0 ! Add C(I)
fmoves fp0,a2@ ! Store A(I)
addql #4,a0 ! Advance by 4
addql #4,a1 ! Advance by 4
addql #4,a2 ! Advance by 4
subql #1,d0 ! Decrement I
tstl d0
bnes L3
First off, the compiler is smart enough to do all of its address adjustment outside the loop and store the adjusted addresses of A, B, and C in registers. We do the load, add, and store in quick succession. Then we advance the array addresses by 4 and perform the subtraction to determine when the loop is complete.
This is very tight code and bears little resemblance to the original FORTRAN code.
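For comparison, here is a hedged C rendering of what the optimized loop does. The name addem_pointers and the guard for an empty loop are assumptions (the listing shows only the loop body, not the setup code). The three pointers play the roles of a0, a1, and a2, and count plays the role of d0.

```c
/* Hypothetical C equivalent of the optimized loop: three pointers are
   advanced each iteration and a counter runs down to zero. */
void addem_pointers(float *a, const float *b, const float *c, int n)
{
    if (n <= 0) {
        return;          /* assumed guard; not visible in the listing */
    }
    int count = n;       /* counts down, like d0 */
    do {
        *a = *b + *c;    /* load B, add C, store A */
        a++;             /* each increment advances the address by */
        b++;             /* sizeof(float) bytes, 4 on this machine */
        c++;
        count--;
    } while (count != 0);
}
```

No index variable survives in this form: the array subscript has been replaced entirely by pointer increments, and the loop test compares a counter against zero.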
SPARC Architecture
These next examples were compiled from FORTRAN on a SPARC architecture system. The SPARC architecture is a classic RISC design using load-store access to memory, many registers, and delayed branching. We first examine the code at the lowest optimization:
.L18: ! Top of the loop
ld [%fp-4],%l2 ! Address of B
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0 ! Load I
sll %l0,2,%l1 ! Multiply by 4
add %l2,%l1,%l0 ! Figure effective