High Performance Computing - Charles Severance [137]

learn from looking at the output of the compiler.

Intel 8088

The Intel 8088 processor used in the original IBM Personal Computer is a very traditional CISC processing system with features severely limited by its transistor count. It has very few registers, and the registers generally have rather specific functions. To support a large memory model, it must set its segment register leading up to each memory operation. This limitation means that every memory access takes a minimum of three instructions. Interestingly, a similar pattern occurs on RISC processors.
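To make the segment requirement concrete, here is a minimal C sketch of how the 8088 forms a 20-bit physical address from a 16-bit segment and a 16-bit offset. The function names physical_address and element_address are hypothetical, chosen only for illustration. The second function mirrors the add / mov es / access pattern visible in the listing that follows, assuming 2-byte array elements.

```c
#include <assert.h>
#include <stdint.h>

/* 8088 address formation: the 16-bit segment value is shifted left by
   4 bits and added to the 16-bit offset. The chip has 20 address lines,
   so the sum wraps at 1 MB. */
uint32_t physical_address(uint16_t segment, uint16_t offset)
{
    uint32_t addr = ((uint32_t)segment << 4) + offset;
    return addr & 0xFFFFFu;
}

/* Hypothetical helper: the three steps needed to reach element `index`
   of an array of 2-byte values addressed by a segment:offset pair. */
uint32_t element_address(uint16_t segment, uint16_t base_offset,
                         uint16_t index)
{
    uint16_t bx = (uint16_t)(index << 1);   /* shl: multiply index by 2  */
    bx = (uint16_t)(bx + base_offset);      /* add the offset half       */
    return physical_address(segment, bx);   /* combine with segment half */
}
```

For example, element_address(0x2000, 0x0100, 3) yields 0x20106: the index is scaled to 6, added to the offset 0x0100, and combined with the segment 0x2000 shifted left by 4 bits.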

Notice that at one point the code moves a value from the ax register to the bx register because it needs to perform another computation that can only be done in the ax register. Note that this is only an integer computation, as the Intel

        mov     word ptr -2[bp],0       # I is stored at -2[bp]; set I = 0
$11:
        mov     ax,word ptr -2[bp]      # Load I
        cmp     ax,word ptr 18[bp]      # Check I>=N
        bge     $10
        shl     ax,1                    # Multiply I by 2
        mov     bx,ax                   # Done - now move to bx
        add     bx,word ptr 10[bp]      # bx = Address of B + Offset
        mov     es,word ptr 12[bp]      # Top part of address
        mov     ax,es: word ptr [bx]    # Load B(I)
        mov     bx,word ptr -2[bp]      # Load I
        shl     bx,1                    # Multiply I by 2
        add     bx,word ptr 14[bp]      # bx = Address of C + Offset
        mov     es,word ptr 16[bp]      # Top part of address
        add     ax,es: word ptr [bx]    # Load C(I) and add to B(I)
        mov     bx,word ptr -2[bp]      # Load I
        shl     bx,1                    # Multiply I by 2
        add     bx,word ptr 6[bp]       # bx = Address of A + Offset
        mov     es,word ptr 8[bp]       # Top part of address
        mov     es: word ptr [bx],ax    # Store A(I)
$9:
        inc     word ptr -2[bp]         # Increment I in memory
        jmp     $11
$10:

Because there are so few registers, the variable I is kept in memory and loaded several times throughout the loop. The inc instruction at the end of the loop actually updates the value in memory. Interestingly, at the top of the loop, the value is then reloaded from memory.
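A rough C rendering of this behavior may help. It assumes the source loop is an element-wise add of 16-bit integers, which is inferred from the listing's comments and its multiply-by-2 indexing, not stated in the text. The function name addem_unoptimized is hypothetical. Declaring the index volatile is used here only to mimic the compiler keeping I in memory: every use of i becomes a fresh load, and the increment is a read-modify-write on the stored value.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the unoptimized loop: i is volatile, so each reference
   to it reads memory again, as the 8088 listing does. */
void addem_unoptimized(int16_t *a, const int16_t *b, const int16_t *c,
                       int16_t n)
{
    volatile int16_t i;                 /* I stays in memory               */

    for (i = 0; i < n; i++) {           /* load I, compare with N          */
        int16_t ax = b[i];              /* load I again, then load B(I)    */
        ax = (int16_t)(ax + c[i]);      /* load I again, add C(I)          */
        a[i] = ax;                      /* load I again, store A(I)        */
    }                                   /* i++ updates the value in memory */
}
```

With a non-volatile index and enough registers, a compiler would normally load i once per iteration at most, which is the optimization the 8088's register set makes hard to apply.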

In this type of architecture, the small number of available registers constrains the compiler so severely that there is often little optimization that is practical.

Motorola MC68020

In this section, we examine another classic CISC processor, the Motorola MC68020, which was used to build Macintosh computers and Sun workstations. We happened to run this code on a BBN GP-1000 Butterfly parallel processing system made up of 96 MC68020 processors.

The Motorola architecture is relatively easy to program in assembly language. It has plenty of 32-bit registers, and they are relatively easy to use. Its CISC instruction set keeps assembly language programming quite simple, since a single instruction can often perform several operations.

We use this example to show a progression of optimization levels, using an f77 compiler on a floating-point version of the loop. Our first example is with no optimization:

! Note d0 contains the value I
L5:
        movl    d0,L13                  ! Store I to memory if loop ends
        lea     a1@(-4),a0              ! a1 = address of B
        fmoves  a0@(0,d0:l:4),fp0       ! Load of B(I)
        lea     a3@(-4),a0              ! a3 = address of C
        fadds   a0@(0,d0:l:4),fp0       ! Load of C(I) (And Add)
        lea     a2@(-4),a0              ! a2 = address of A
        fmoves  fp0,a0@(0,d0:l:4)       ! Store of A(I)
        addql   #1,d0                   ! Increment I
        subql   #1,d1                   ! Decrement "N"
        tstl    d1
        bnes    L5

The value for I is stored in the d0 register. Each time through the loop, it’s incremented by 1. At the same time, register d1 is initialized to the value for N and decremented each time through the loop. Each time through the loop, I is stored into memory, so the proper value for I ends up in memory when the loop terminates. Registers a1, a2, and a3 are preloaded to be the first address of the arrays B, A, and C respectively. However, since FORTRAN arrays begin at 1, we must subtract 4 from each of these addresses before we can use I as the offset. The lea instructions are effectively subtracting 4 from one address register and storing it in another.
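The address arithmetic and loop control described above can be sketched in C. This is an illustration under stated assumptions, not the compiler's actual source: the names effective_offset and addem_68k_style are hypothetical, float is assumed to be 4 bytes, and the early return for n <= 0 is an added guard because the listing shows only the bottom-tested loop body. Offsets are computed relative to the start of each array, so the -4 adjustment combined with a 1-based I always lands inside the array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset produced by an indexed mode such as a0@(0,d0:l:4):
   base + displacement + index * scale. */
ptrdiff_t effective_offset(ptrdiff_t base, ptrdiff_t disp,
                           long index, int scale)
{
    return base + disp + (ptrdiff_t)index * scale;
}

void addem_68k_style(float *a, const float *b, const float *c, long n)
{
    long d0 = 1;                        /* I, starting at 1 (FORTRAN)     */
    long d1 = n;                        /* countdown copy of N            */

    assert(sizeof(float) == 4);         /* assumption: 4-byte elements    */
    if (n <= 0)
        return;                         /* added guard, not in listing    */

    do {
        /* lea aX@(-4),a0 gives base -4; the indexed mode adds I * 4. */
        ptrdiff_t off = effective_offset(-4, 0, d0, 4);
        float fp0, operand;

        memcpy(&fp0, (const char *)b + off, sizeof fp0);         /* B(I)  */
        memcpy(&operand, (const char *)c + off, sizeof operand); /* C(I)  */
        fp0 += operand;                                          /* add   */
        memcpy((char *)a + off, &fp0, sizeof fp0);               /* A(I)  */

        d0++;                           /* addql #1,d0                    */
        d1--;                           /* subql #1,d1                    */
    } while (d1 != 0);                  /* tstl d1 ; bnes L5              */
}
```

With I = 1 the offset is -4 + 1 * 4 = 0, the first element, which is why subtracting 4 from each base address lets I be used directly as the index.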

The following instruction performs an address computation that is almost a one-to-one translation of an array reference:

fmoves a0@(0,d0:l:4),fp0 ! Load of B(I)

This instruction retrieves a floating-point value from the
