Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

High Performance Computing - Charles Severance [141]

By Root 1413 0

execution time.

Note that these code sequences were generated on an MC68020. An MC68060 should be able to eliminate the three addql instructions by using post-increment, saving three instructions. Add a little loop unrolling, and you have some very tight code. Of course, the MC68060 was never a broadly deployed workstation processor, so we never really got a chance to take it for a test drive.

Convex C-240

This section shows the results of compiling on the Convex C-Series of parallel/vector supercomputers. In addition to their normal registers, vector computers have vector registers that contain up to 256 64-bit elements. These processors can perform operations on any subset of these registers with a single instruction.

It is hard to claim that these vector supercomputers are more RISC or CISC. They have simple lean instruction sets and, hence, are RISC-like. However, they have instructions that implement loops, and so they are somewhat CISC-like.

The Convex C-240 has scalar registers (s2), vector registers (v2), and address registers (a3). Each vector register has 128 elements. The vector length register controls how many of the elements of each vector register are processed by vector instructions. If vector length is above 128, the entire register is processed.

The code to implement our loop is as follows:

L4: mov.ws 2,vl ; Set the Vector length to N

ld.w 0(a5),v0 ; Load B into Vector Register

ld.w 0(a2),v1 ; Load C into Vector Register

add.s v1,v0,v2 ; Add the vector registers

st.w v2,0(a3) ; Store results into A

add.w #-128,s2 ; Decrement "N"

add.w #512,a2 ; Advance address for A

add.w #512,a3 ; Advance address for B

add.w #512,a5 ; Advance address for C

lt.w #0,s2 ; Check to see if "N" is < 0

jbrs.t L4

Initially, the vector length register is set to N. We assume that for the first iteration, N is greater than 128. The next instruction is a vector load instruction into register v0. This loads 128 32-bit elements into this register. The next instruction also loads 128 elements, and the following instruction adds those two registers and places the results into a third vector register. Then the 128 elements in Register v2 are stored back into memory. After those elements have been processed, N is decremented by 128 (after all, we did process 128 elements). Then we add 512 to each of the addresses (4 bytes per element) and loop back up. At some point, during the last iteration, if N is not an exact multiple of 128, the vector length register is less than 128, and the vector instructions only process those remaining elements up to N.

One of the challenges of vector processors is to allow an instruction to begin executing before the previous instruction has completed. For example, once the load into Register v1 has partially completed, the processor could actually begin adding the first few elements of v0 and v1 while waiting for the rest of the elements of v1 to arrive. This approach of starting the next vector instruction before the previous vector instruction has completed is called chaining. Chaining is an important feature to get maximum performance from vector processors.

IBM RS-6000

The IBM RS-6000 is generally credited as the first RISC processor to have cracked the Linpack 100×100 benchmark. The RS-6000 is characterized by strong floating-point performance and excellent memory bandwidth among RISC workstations. The RS-6000 was the basis for IBM’s scalable parallel processor: the IBM-SP1 and SP2.

When our example program is run on the RS-6000, we can see the use of a CISC- style instruction in the middle of a RISC processor. The RS-6000 supports a branch- on-count instruction that combines the decrement, test, and branch operations into a single instruction. Moreover, there is a special register (the count register) that is part of the instruction fetch unit that stores the current value of the counter. The fetch unit also has its own add unit to perform the decrements for this instruction.

These types of features creeping into RISC architectures are occuring because there is plenty of

Online Book Reader

High Performance Computing - Charles Severance [141]

®Online Book Reader