Beautiful Code [150]
There is another way to use the CSA operation that results in a more efficient and slightly more compact program. This is shown in Example 10-4. It takes (nc + np + 1) / 2 = 10.5 instructions per word (ignoring loop control and loads).
Example 10-4. Array population count, processing elements in groups of two
#define CSA(h,l, a,b,c) \
{unsigned u = a ^ b; unsigned v = c; \
h = (a & b) | (u & v); l = u ^ v;}
int popArray(unsigned A[], int n) {
int tot, i;
unsigned ones, twos;
tot = 0; // Initialize.
ones = 0;
for (i = 0; i <= n - 2; i = i + 2) {
CSA(twos, ones, ones, A[i], A[i+1])
tot = tot + pop(twos);
}
tot = 2*tot + pop(ones);
if (n & 1) // If there's a last one,
tot = tot + pop(A[i]); // add it in.
return tot;
}
When Example 10-4 is compiled, the CSA operation expands into:
u = ones ^ A[i];
v = A[i+1];
twos = (ones & A[i]) | (u & v);
ones = u ^ v;
The code relies on the compiler to omit subsequent loads of a quantity that has already been loaded (a process known as commoning).
There are ways to use the CSA operation to further reduce the number of instructions required to compute the population count of an array. They are most easily understood by means of a circuit diagram. For example, Figure 10-2 illustrates a way to code a loop that takes array elements eight at a time and compresses them into four quantities, labeled eights, fours, twos, and ones. The fours, twos, and ones are fed back into the CSAs on the next loop iteration, the 1-bits in eights are counted by an execution of the word-level population count function, and this count is accumulated. When the entire array has been processed, the total population count is:
8 x pop(eights) + 4 x pop(fours) + 2 x pop(twos) + pop(ones)
The code is shown in Example 10-5, which uses the CSA macro defined in Example 10-4. The numbering of the CSA blocks in Figure 10-2 corresponds to the order of the CSA macro calls in Example 10-5. The execution time of the loop, exclusive of array loads and loop control, is (7nc + np + 1) / 8 = 6.375 instructions per word of the array.
Example 10-5. Array population count, processing elements in groups of eight
int popArray(unsigned A[], int n) {
int tot, i;
unsigned ones, twos, twosA, twosB,
fours, foursA, foursB, eights;
tot = 0; // Initialize.
fours = twos = ones = 0;
for (i = 0; i <= n - 8; i = i + 8) {
CSA(twosA, ones, ones, A[i], A[i+1])
CSA(twosB, ones, ones, A[i+2], A[i+3])
CSA(foursA, twos, twos, twosA, twosB)
CSA(twosA, ones, ones, A[i+4], A[i+5])
CSA(twosB, ones, ones, A[i+6], A[i+7])
CSA(foursB, twos, twos, twosA, twosB)
CSA(eights, fours, fours, foursA, foursB)
tot = tot + pop(eights);
}
tot = 8*tot + 4*pop(fours) + 2*pop(twos) + pop(ones);
for (i = i; i < n; i++) // Simply add in the last
tot = tot + pop(A[i]); // 0 to 7 elements.
return tot;
}
The CSAs may be connected in many arrangements other than that shown in Figure 10-2. For example, increased instruction-level parallelism might result from feeding the first three array elements into one CSA, and the next three into a second CSA, which allows the instructions of these two CSAs to execute in parallel. One might also be able to permute the three input operands of the CSA macros for increased parallelism. With the plan shown in Figure 10-2, one can easily see how to use only the first three CSAs to construct a program that processes array elements in groups of four, and also how to expand it to construct programs that process array elements in groups of 16 or more. The plan shown also spreads out the loads somewhat, which is advantageous