Online Book Reader


High Performance Computing - Charles Severance [140]

hoisted out of the loop. We only loaded I 4 times during a loop iteration. Strangely, the compiler didn’t choose to store the addresses of A(0), B(0), and C(0) in registers at all even though there were plenty of registers. Even more perplexing is the fact that it loaded a value from memory immediately after it had stored it from the exact same register!

But one bright spot is the branch delay slot. For the first iteration, the load was done before the loop started. For the successive iterations, the first load was done in the branch delay slot at the bottom of the loop.

Comparing this code to the moderate optimization code on the MC68020, you can begin to get a sense of why RISC was not an overnight sensation. It turned out that an unsophisticated compiler could generate much tighter code for a CISC processor than a RISC processor. RISC processors are always executing extra instructions here and there to compensate for the lack of slick features in their instruction set. If a processor has a faster clock rate but has to execute more instructions, it does not always have better performance than a slower, more efficient processor.
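The clock-rate point can be made concrete with the usual relation: time equals instruction count times cycles per instruction divided by clock rate. The sketch below is only an illustration. The function name `iteration_time_ns` and every number in the usage note are hypothetical and are not measurements of any processor discussed here.

```c
/* Hypothetical helper: time for one loop iteration, in nanoseconds.
   time = instructions * cycles-per-instruction / clock rate.
   With the clock in MHz, cycles / MHz gives microseconds, so the
   result is scaled by 1000 to get nanoseconds. */
double iteration_time_ns(int instructions, double cpi, double clock_mhz)
{
    return (double)instructions * cpi / clock_mhz * 1000.0;
}
```

With made-up figures, a 40 MHz machine that needs 14 instructions per iteration takes about 350 ns (`iteration_time_ns(14, 1.0, 40.0)`), while a 25 MHz machine that needs only 8 takes about 320 ns (`iteration_time_ns(8, 1.0, 25.0)`). The slower clock wins because it executes fewer instructions.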

But as we shall soon see, this CISC advantage is about to evaporate in this particular example.


Higher optimization

We now increase the optimization to -O2, and the compiler generates much better code. It’s important to remember that the same compiler is being used for all three examples.

At this optimization level, the compiler looked through the code sufficiently well to know it didn’t even need to rotate the register windows (no save instruction). Clearly the compiler looked at the register usage of the entire routine:

! Note, didn’t even rotate the register window
! We just use the %o registers from the caller
! %o0 = Address of first element of A (from calling convention)
! %o1 = Address of first element of B (from calling convention)
! %o2 = Address of first element of C (from calling convention)
! %o3 = Address of N (from calling convention)
addem_:
        ld      [%o3],%g2       ! Load N
        cmp     %g2,1           ! Check to see if it is <1
        bl      .L77000006      ! Check for zero trip loop
        or      %g0,1,%g1       ! Delay slot - Set I to 1
.L77000003:
        ld      [%o1],%f0       ! Load B(I) First time Only
.L900000109:
        ld      [%o2],%f1       ! Load C(I)
        fadds   %f0,%f1,%f0     ! Add
        add     %g1,1,%g1       ! Increment I
        add     %o1,4,%o1       ! Increment Address of B
        add     %o2,4,%o2       ! Increment Address of C
        cmp     %g1,%g2         ! Check Loop Termination
        st      %f0,[%o0]       ! Store A(I)
        add     %o0,4,%o0       ! Increment Address of A
        ble,a   .L900000109     ! Branch w/ annul
        ld      [%o1],%f0       ! Load the B(I)
.L77000006:
        retl                    ! Leaf Return (No window)
        nop                     ! Branch Delay Slot

This is tight code. The registers o0, o1, and o2 contain the addresses of the first elements of A, B, and C respectively. They already point to the right value for the first iteration of the loop. The value for I is never stored in memory; it is kept in global register g1. Instead of multiplying I by 4, we simply advance the three addresses by 4 bytes each iteration.
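The replacement of the multiply-by-4 with address increments can be sketched at the source level. The C below is an illustration only, not the source the compiler was given. The function names `addem_indexed` and `addem_pointers` are hypothetical, and `float` is assumed to be 4 bytes, as on the machines discussed.

```c
/* Indexed form: each access conceptually computes base + i * 4,
   which is the scaling the less optimized code performed. */
void addem_indexed(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Pointer-advancing form: the three addresses each move forward by
   one float per iteration, mirroring the three add instructions on
   %o0, %o1, and %o2. I is kept only as a loop counter. */
void addem_pointers(float *a, const float *b, const float *c, int n)
{
    for (int i = 1; i <= n; i++) {
        *a = *b + *c;   /* A(I) = B(I) + C(I) */
        a++;            /* advance address of A */
        b++;            /* advance address of B */
        c++;            /* advance address of C */
    }
}
```

Both functions compute the same result. An optimizing compiler will typically turn the first form into something like the second on its own, so there is rarely a reason to write the pointer version by hand.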

The branch delay slots are utilized for both branches. The branch at the bottom of the loop uses the annul feature to cancel the following load if the branch falls through.
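The shape of this loop, with B(I) loaded once before the loop and then reloaded in the annulled delay slot, can be approximated in C. This is a rough sketch under the assumption that the annulled load maps onto "load only if we are going around again". The name `addem_rotated` and the local variable names are hypothetical.

```c
/* Loop with the first load of B hoisted ahead of the loop and the
   remaining loads of B done at the bottom, only when the loop repeats. */
void addem_rotated(float *a, const float *b, const float *c, int n)
{
    if (n < 1) {
        return;              /* zero-trip check, like the bl at the top */
    }
    int i = 1;               /* I lives in a register, never in memory */
    float bval = *b;         /* load B(I), first time only */
    for (;;) {
        float cval = *c;     /* load C(I) */
        float sum = bval + cval;
        i++;                 /* increment I */
        b++;                 /* advance address of B */
        c++;                 /* advance address of C */
        *a = sum;            /* store A(I) */
        a++;                 /* advance address of A */
        if (i > n) {
            break;           /* branch falls through: load is annulled */
        }
        bval = *b;           /* delay-slot load for the next iteration */
    }
}
```

Because the reload of B happens only when the branch is taken, the final iteration never reads past the end of B, which is the effect the annul bit provides in the assembly.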

The most interesting observation regarding this code is its striking similarity to the code generated for the MC68020 at its top optimization level:

L3:
        fmoves  a1@,fp0         ! Load B(I)
        fadds   a0@,fp0         ! Add C(I)
        fmoves  fp0,a2@         ! Store A(I)
        addql   #4,a0           ! Advance by 4
        addql   #4,a1           ! Advance by 4
        addql   #4,a2           ! Advance by 4
        subql   #1,d0           ! Decrement I
        tstl    d0
        bnes    L3

The two code sequences are nearly identical! The SPARC does an extra load because of its load-store architecture: it must bring C(I) into a register before adding, whereas the MC68020 adds it directly from memory. On the SPARC, I is incremented and compared to N, while on the MC68020, I is decremented and compared to zero.
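The two loop-control styles can be compared side by side in C. This is a sketch only. The function names are hypothetical, and the zero-trip guard in the count-down version is an assumption, since the MC68020 listing shows only the loop body.

```c
/* SPARC style: increment I and compare it against N each iteration. */
void addem_count_up(float *a, const float *b, const float *c, int n)
{
    for (int i = 1; i <= n; i++) {
        *a++ = *b++ + *c++;
    }
}

/* MC68020 style: start a counter at N, decrement it, and loop while
   it is nonzero. N does not need to be kept for the comparison. */
void addem_count_down(float *a, const float *b, const float *c, int n)
{
    int d0 = n;              /* counter, named after the d0 register */
    if (d0 < 1) {
        return;              /* assumed guard against a zero-trip loop */
    }
    do {
        *a++ = *b++ + *c++;
        d0--;                /* subql #1,d0 */
    } while (d0 != 0);       /* tstl d0 ; bnes L3 */
}
```

Both versions execute the loop body exactly N times. Counting down simply lets the termination test compare against zero instead of against a second register.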

This aptly shows how advancing compiler optimization capabilities quickly made the “nifty” features of the CISC architectures rather useless. Even on the CISC processor, the post-optimization code used the simple forms of the instructions because they produce the fastest
