Beautiful Code [149]
Another way is to use the first two executable lines of Example 10-1 on groups of three words in the array, adding the three partial results. Because each partial result has a maximum value of 4 in each 4-bit field, the sum of the three has a maximum value of 12 in each 4-bit field, so no overflow occurs. This idea can be applied to the 8-and 16-bit fields. Coding and compiling this method indicates that it gives about a 20 percent reduction over the naíve method in total number of instructions executed on a basic RISC. Much of the savings is canceled by the additional housekeeping instructions required. We will not dwell on this method because there is a much better way to do it.
The better way seems to have been invented by Robert Harley and David Seal in about 1996.[||]It is based on a circuit called a carry-save adder (CSA) or 3:2 compressor. A CSA is simply a sequence of independent full adders[#] and is often used in binary multiplier circuits.
[||] David Seal, Newsgroup comp.arch.arithmetic, May 13, 1997. Robert Harley was the first person known to this writer to apply the CSA to this problem, and David Seal showed a particularly good way to use it for counting the bits in a large array (as illustrated in Figure 10-2 and Example 10-5), and also for an array of size 7 (similar to the plan in Figure 10-3).
[#] A full adder is a circuit with three 1-bit inputs (the bits to be added) and two 1-bit outputs (the sum and carry). See John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990.
In Boolean algebra notation (juxtaposition denotes and, + denotes or, and denotes exclusive or), the logic for each full adder is:
h ab + ac + bc = ab + (a + b)c = ab + (a b)c
l (a b) c
where a, b, and c are the 1-bit inputs, l is the low-bit output (sum), and h is the high-bit output (carry). Changing a + b on the first line to a b is justified because when a and b are both 1, the term ab makes the value of the whole expression 1. By first assigning a b to a temporary, the full adder logic can be evaluated in five logical instructions, each operating on 32 bits in parallel (on a 32-bit machine). We will refer to these five instructions as CSA(h, l, a, b, c). This is a "macro," with h and l being outputs.
One way to use the CSA operation is to process elements of the array A in groups of three, reducing each group of three words to two and applying the population count operation to these two words. In the loop, these two population counts are summed. After executing the loop, the total population count of the array is twice the accumulated population count of the CSA's high-bit outputs plus the accumulated population count of the low-bit outputs.
The following sequence illustrates the process for a 16-bit word:
a = 0110 1001 1110 0101 9
b = 1000 1000 0100 0111 6
c = 1100 1010 0011 0101 8
-------------------------
l = 0010 1011 1001 0111 9
h = 1100 1000 0110 0101 7*2 = 14
Observe that in each column, the (h, l) pair, written in that order, is a two-bit binary number whose value is the number of 1-bits in a, b, and c, in the column. Thus each 1-bit in h represents two 1-bits in a, b, and c, and each 1-bit in l represents one 1-bit in a, b, and c. Therefore the total population (shown at the right) is twice the number of 1-bits in h plus the number of 1-bits in l, which totals to 23 in the illustration.
Let nc be the number of instructions required for the CSA steps, and