High Performance Computing - Charles Severance [139]

address of B(I)

ld     [%l0+0],%f3                 ! Load B(I)
ld     [%fp-8],%l2                 ! Address of C
sethi  %hi(GPB.addem.i),%l0        ! Address of I in %l0
or     %l0,%lo(GPB.addem.i),%l0
ld     [%l0+0],%l0                 ! Load I
sll    %l0,2,%l1                   ! Multiply by 4
add    %l2,%l1,%l0                 ! Figure effective address of C(I)
ld     [%l0+0],%f2                 ! Load C(I)
fadds  %f3,%f2,%f2                 ! Do the Floating Point Add
ld     [%fp-12],%l2                ! Address of A
sethi  %hi(GPB.addem.i),%l0        ! Address of I in %l0
or     %l0,%lo(GPB.addem.i),%l0
ld     [%l0+0],%l0                 ! Load I
sll    %l0,2,%l1                   ! Multiply by 4
add    %l2,%l1,%l0                 ! Figure effective address of A(I)
st     %f2,[%l0+0]                 ! Store A(I)
sethi  %hi(GPB.addem.i),%l0        ! Address of I in %l0
or     %l0,%lo(GPB.addem.i),%l0
ld     [%l0+0],%l0                 ! Load I
add    %l0,1,%l1                   ! Increment I
sethi  %hi(GPB.addem.i),%l0        ! Address of I in %l0
or     %l0,%lo(GPB.addem.i),%l0
st     %l1,[%l0+0]                 ! Store I
sethi  %hi(GPB.addem.i),%l0        ! Address of I in %l0
or     %l0,%lo(GPB.addem.i),%l0
ld     [%l0+0],%l1                 ! Load I
ld     [%fp-20],%l0                ! Load N
cmp    %l1,%l0                     ! Compare
ble    .L18
nop                                ! Branch Delay Slot

This is some pretty poor code. We don’t need to go through it line by line, but a few quick observations are in order. The value of I is loaded from memory five times in the loop. The address of I is computed six times (each computation takes two instructions, a sethi followed by an or). There are no tricky memory addressing modes, so multiplying I by 4 to get a byte offset is done explicitly three times (at least the compiler uses a shift). To add insult to injury, it even puts a no-op in the branch delay slot.
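The cost of those repeated loads can be made concrete with a small model. The sketch below is not SPARC code. It is a minimal Python model, assuming the loop computes A(I) = B(I) + C(I) for I from 1 to N, as the comments in the listing suggest. The Memory class and both function names are hypothetical helpers introduced only for illustration.

```python
class Memory:
    """Toy memory that counts how often each named location is loaded."""

    def __init__(self):
        self.cells = {}
        self.loads = {}

    def load(self, name):
        self.loads[name] = self.loads.get(name, 0) + 1
        return self.cells[name]

    def store(self, name, value):
        self.cells[name] = value


def addem_naive(a, b, c, n):
    """Loop with I kept in memory; assumes n >= 1 (the test is at the bottom)."""
    mem = Memory()
    mem.store("I", 1)
    while True:
        t = b[mem.load("I") - 1]            # load I, fetch B(I)
        t = t + c[mem.load("I") - 1]        # load I again, fetch C(I)
        a[mem.load("I") - 1] = t            # load I again, store A(I)
        mem.store("I", mem.load("I") + 1)   # load I, increment, store back
        if not mem.load("I") <= n:          # load I once more to compare with N
            break
    return mem.loads["I"]                   # total loads of I


def addem_register(a, b, c, n):
    """Same computation with I held in a local variable, so I is never reloaded."""
    for i in range(1, n + 1):
        a[i - 1] = b[i - 1] + c[i - 1]


if __name__ == "__main__":
    b, c = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
    a1, a2 = [0.0] * 4, [0.0] * 4
    print("loads of I:", addem_naive(a1, b, c, 4))
    addem_register(a2, b, c, 4)
    print("same result:", a1 == a2)
```

For N = 4 the naive version performs 20 loads of I, five per iteration, while both versions leave the same values in A.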

One might ask, “Why would a compiler ever generate code this bad?” Well, it’s not because the compiler isn’t capable of generating efficient code, as we shall see below. One explanation is that at this optimization level, it simply does a one-to-one translation of the tuples (intermediate code) into machine language. You can almost draw lines in the assembly listing above and identify precisely which instructions came from which tuples.

One reason to generate code with this simplistic approach is to guarantee that the program produces correct results. Looking at the unoptimized assembly above, it’s pretty easy to argue that it does exactly what the FORTRAN code does. You can trace every single assembly statement directly back to part of a FORTRAN statement.
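To illustrate what a one-to-one translation looks like, here is a minimal sketch of a translator that expands each tuple in isolation. Everything in it is an assumption made for illustration: the tuple format, the temporary names, and the pseudo-instruction spellings are hypothetical and do not reproduce any real compiler's intermediate code or the SPARC instruction set.

```python
# Hypothetical tuples for A(I) = B(I) + C(I). The tuple format and the
# pseudo-instruction names are illustrative assumptions, not a real compiler IR.
TUPLES = [
    ("load_elem", "t1", "B"),    # t1 = B(I)
    ("load_elem", "t2", "C"),    # t2 = C(I)
    ("fadd", "t3", "t1", "t2"),  # t3 = t1 + t2
    ("store_elem", "A", "t3"),   # A(I) = t3
]


def address_of_element(array):
    """Pseudo-instructions that recompute the address of array(I) from scratch."""
    return [
        f"load  base_{array} -> r2",
        "load  I -> r0",
        "shift r0 << 2 -> r1",
        "add   r2 + r1 -> r0",
    ]


def translate(tup):
    """Expand one tuple on its own, with no knowledge of neighbouring tuples."""
    op = tup[0]
    if op == "load_elem":
        _, dst, array = tup
        return address_of_element(array) + [f"load  [r0] -> {dst}"]
    if op == "store_elem":
        _, array, src = tup
        return address_of_element(array) + [f"store {src} -> [r0]"]
    if op == "fadd":
        _, dst, lhs, rhs = tup
        return [f"fadd  {lhs} + {rhs} -> {dst}"]
    raise ValueError(f"unknown tuple: {tup!r}")


def translate_all(tuples):
    """Return (tuple, instructions) pairs so each instruction traces to its tuple."""
    return [(tup, translate(tup)) for tup in tuples]


if __name__ == "__main__":
    for tup, instructions in translate_all(TUPLES):
        print(tup)
        for line in instructions:
            print("   ", line)
```

Because each tuple is expanded without reference to the others, I is reloaded and rescaled for every array reference. The same independence is what makes the output easy to check: every pseudo-instruction belongs to exactly one tuple.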

It’s pretty clear that you don’t want to run code like the unoptimized listing above in a high performance production environment without further optimization.

Moderate optimization

In this example, we enable some optimization (-O1):

save   %sp,-120,%sp                ! Rotate the register window
add    %i0,-4,%o0                  ! Address of A(0)
st     %o0,[%fp-12]                ! Store on the stack
add    %i1,-4,%o0                  ! Address of B(0)
st     %o0,[%fp-4]                 ! Store on the stack
add    %i2,-4,%o0                  ! Address of C(0)
st     %o0,[%fp-8]                 ! Store on the stack
sethi  %hi(GPB.addem.i),%o0        ! Address of I (top portion)
add    %o0,%lo(GPB.addem.i),%o2    ! Address of I (lower portion)
ld     [%i3],%o0                   ! %o0 = N (fourth parameter)
or     %g0,1,%o1                   ! %o1 = 1 (for addition)
st     %o0,[%fp-20]                ! store N on the stack
st     %o1,[%o2]                   ! Set memory copy of I to 1
ld     [%o2],%o1                   ! o1 = I (kind of redundant)
cmp    %o1,%o0                     ! Check I > N (zero-trip?)
bg     .L12                        ! Don’t do loop at all
nop                                ! Delay Slot
ld     [%o2],%o0                   ! Pre-load for Branch Delay Slot
.L900000110:                       ! Top of the loop
ld     [%fp-4],%o1                 ! o1 = Address of B(0)
sll    %o0,2,%o0                   ! Multiply I by 4
ld     [%o1+%o0],%f2               ! f2 = B(I)
ld     [%o2],%o0                   ! Load I from memory
ld     [%fp-8],%o1                 ! o1 = Address of C(0)
sll    %o0,2,%o0                   ! Multiply I by 4
ld     [%o1+%o0],%f3               ! f3 = C(I)
fadds  %f2,%f3,%f2                 ! Register-to-register add
ld     [%o2],%o0                   ! Load I from memory (not again!)
ld     [%fp-12],%o1                ! o1 = Address of A(0)
sll    %o0,2,%o0                   ! Multiply I by 4 (yes, again)
st     %f2,[%o1+%o0]               ! A(I) = f2
ld     [%o2],%o0                   ! Load I from memory
add    %o0,1,%o0                   ! Increment I in register
st     %o0,[%o2]                   ! Store I back into memory
ld     [%o2],%o0                   ! Load I back into a register
ld     [%fp-20],%o1                ! Load N into a register
cmp    %o0,%o1                     ! I > N ??
ble,a  .L900000110
ld     [%o2],%o0                   ! Branch Delay Slot

This is a significant improvement over the previous example. Some loop constant computations (subtracting 4) were
