High Performance Computing - Charles Severance [139]
ld [%l0+0],%f3 ! Load B(I)
ld [%fp-8],%l2 ! Address of C
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0 ! Load I
sll %l0,2,%l1 ! Multiply by 4
add %l2,%l1,%l0 ! Figure effective address of C(I)
ld [%l0+0],%f2 ! Load C(I)
fadds %f3,%f2,%f2 ! Do the Floating Point Add
ld [%fp-12],%l2 ! Address of A
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0 ! Load I
sll %l0,2,%l1 ! Multiply by 4
add %l2,%l1,%l0 ! Figure effective address of A(I)
st %f2,[%l0+0] ! Store A(I)
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0 ! Load I
add %l0,1,%l1 ! Increment I
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
st %l1,[%l0+0] ! Store I
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l1 ! Load I
ld [%fp-20],%l0 ! Load N
cmp %l1,%l0 ! Compare
ble .L18
nop ! Branch Delay Slot
This is some pretty poor code. We don’t need to go through it line by line, but there are a few quick observations we can make. The value of I is loaded from memory five times in the loop. The address of I is computed six times throughout the loop (two instructions each time). There are no tricky memory addressing modes, so multiplying I by 4 to get a byte offset is done explicitly three times (at least the compiler uses a shift rather than a multiply). To add insult to injury, it even puts a no-op in the branch delay slot.
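To make the count of loads concrete, here is a minimal Python sketch that models one trip through the loop the way the unoptimized code behaves, fetching I from memory before every use. It is an illustration only, not compiler output: `CountingMemory`, `unoptimized_iteration`, and `run_loop` are hypothetical names, and the arrays are plain lists with index 0 unused to mimic FORTRAN's 1-based indexing.

```python
from collections import Counter


class CountingMemory:
    """Toy memory for scalars that tallies loads and stores by name."""

    def __init__(self, **initial):
        self.cells = dict(initial)
        self.loads = Counter()
        self.stores = Counter()

    def load(self, name):
        self.loads[name] += 1
        return self.cells[name]

    def store(self, name, value):
        self.stores[name] += 1
        self.cells[name] = value


def unoptimized_iteration(mem, a, b, c):
    """One trip through A(I) = B(I) + C(I), reloading I before every use."""
    f3 = b[mem.load("I")]                  # load I, then load B(I)
    f2 = c[mem.load("I")]                  # load I again, then load C(I)
    a[mem.load("I")] = f3 + f2             # load I again, then store A(I)
    mem.store("I", mem.load("I") + 1)      # load I, increment, store I
    return mem.load("I") <= mem.load("N")  # load I and N for the loop test


def run_loop(b, c):
    """Run the loop over 1-based lists (index 0 unused); assumes N >= 1."""
    n = len(b) - 1
    mem = CountingMemory(I=1, N=n)
    a = [0.0] * (n + 1)
    trips = 1
    while unoptimized_iteration(mem, a, b, c):
        trips += 1
    return a, mem, trips
```

Calling `run_loop([0.0, 1.0, 2.0, 3.0], [0.0, 10.0, 20.0, 30.0])` makes three trips and records 15 loads of I, five per trip, even though a single register copy of I would have served the whole iteration.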
One might ask, “Why would a compiler ever generate code this bad?” Well, it’s not because the compiler isn’t capable of generating efficient code, as we shall see below. One explanation is that at this optimization level, the compiler simply does a one-to-one translation of the tuples (intermediate code) into machine language. You can almost draw lines in the above example and precisely identify which instructions came from which tuples.
One reason to generate the code using this simplistic approach is to guarantee that the program will produce the correct results. Looking at the above code, it’s pretty easy to argue that it indeed does exactly what the FORTRAN code does. You can track every single assembly statement directly back to part of a FORTRAN statement.
It’s pretty clear that you don’t want to execute this code in a high performance production environment without some more optimization.
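The one-to-one translation described above can be sketched as a tiny template expander: each tuple is replaced by a fixed instruction sequence, and nothing is remembered about what earlier tuples left in registers. This is only an illustration under stated assumptions. The tuple format and function names are hypothetical, and the stack slot for B (`[%fp-4]`) is assumed, since it does not appear in the unoptimized excerpt; the other operands are copied from the listing.

```python
# Stack slots and symbol name as they appear in the listing; B's slot is assumed.
SLOT = {"A": "[%fp-12]", "B": "[%fp-4]", "C": "[%fp-8]", "N": "[%fp-20]"}
SYM = "GPB.addem.i"


def address_of_i():
    """Fixed two-instruction template that forms the address of I in %l0."""
    return [f"sethi %hi({SYM}),%l0", f"or %l0,%lo({SYM}),%l0"]


def element_address(array):
    """Fixed template: effective address of array(I), left in %l0."""
    return ([f"ld {SLOT[array]},%l2"] + address_of_i()
            + ["ld [%l0+0],%l0", "sll %l0,2,%l1", "add %l2,%l1,%l0"])


def translate(tup):
    """Expand one tuple from its template, ignoring all earlier tuples."""
    op, *args = tup
    if op == "load_elem":
        array, freg = args
        return element_address(array) + [f"ld [%l0+0],{freg}"]
    if op == "store_elem":
        array, freg = args
        return element_address(array) + [f"st {freg},[%l0+0]"]
    if op == "fadd":
        x, y, dest = args
        return [f"fadds {x},{y},{dest}"]
    if op == "increment_i":
        return (address_of_i() + ["ld [%l0+0],%l0", "add %l0,1,%l1"]
                + address_of_i() + ["st %l1,[%l0+0]"])
    if op == "branch_le_n":
        (label,) = args
        return (address_of_i()
                + ["ld [%l0+0],%l1", f"ld {SLOT['N']},%l0",
                   "cmp %l1,%l0", f"ble {label}", "nop"])
    raise ValueError(f"unknown tuple: {tup!r}")


# Hypothetical tuples for the loop body A(I) = B(I) + C(I).
LOOP_TUPLES = [
    ("load_elem", "B", "%f3"),
    ("load_elem", "C", "%f2"),
    ("fadd", "%f3", "%f2", "%f2"),
    ("store_elem", "A", "%f2"),
    ("increment_i",),
    ("branch_le_n", ".L18"),
]


def translate_loop(tuples=LOOP_TUPLES):
    """Concatenate the expansions, one tuple at a time."""
    return [line for tup in tuples for line in translate(tup)]
```

Because every template starts from scratch, the expansion forms the address of I six times and shifts I three times, matching the redundancy in the listing. The same property makes the output easy to check: each run of instructions maps back to exactly one tuple.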
Moderate optimization
In this example, we enable some optimization (-O1):
save %sp,-120,%sp ! Rotate the register window
add %i0,-4,%o0 ! Address of A(0)
st %o0,[%fp-12] ! Store on the stack
add %i1,-4,%o0 ! Address of B(0)
st %o0,[%fp-4] ! Store on the stack
add %i2,-4,%o0 ! Address of C(0)
st %o0,[%fp-8] ! Store on the stack
sethi %hi(GPB.addem.i),%o0 ! Address of I (top portion)
add %o0,%lo(GPB.addem.i),%o2 ! Address of I (lower portion)
ld [%i3],%o0 ! %o0 = N (fourth parameter)
or %g0,1,%o1 ! %o1 = 1 (for addition)
st %o0,[%fp-20] ! store N on the stack
st %o1,[%o2] ! Set memory copy of I to 1
ld [%o2],%o1 ! o1 = I (kind of redundant)
cmp %o1,%o0 ! Check I > N (zero-trip?)
bg .L12 ! Don’t do loop at all
nop ! Delay Slot
ld [%o2],%o0 ! Pre-load for Branch Delay Slot
.L900000110: ! Top of the loop
ld [%fp-4],%o1 ! o1 = Address of B(0)
sll %o0,2,%o0 ! Multiply I by 4
ld [%o1+%o0],%f2 ! f2 = B(I)
ld [%o2],%o0 ! Load I from memory
ld [%fp-8],%o1 ! o1 = Address of C(0)
sll %o0,2,%o0 ! Multiply I by 4
ld [%o1+%o0],%f3 ! f3 = C(I)
fadds %f2,%f3,%f2 ! Register-to-register add
ld [%o2],%o0 ! Load I from memory (not again!)
ld [%fp-12],%o1 ! o1 = Address of A(0)
sll %o0,2,%o0 ! Multiply I by 4 (yes, again)
st %f2,[%o1+%o0] ! A(I) = f2
ld [%o2],%o0 ! Load I from memory
add %o0,1,%o0 ! Increment I in register
st %o0,[%o2] ! Store I back into memory
ld [%o2],%o0 ! Load I back into a register
ld [%fp-20],%o1 ! Load N into a register
cmp %o0,%o1 ! I > N ??
ble,a .L900000110
ld [%o2],%o0 ! Branch Delay Slot
This is a significant improvement over the previous example. Some loop constant computations (subtracting 4) were