Beautiful Code [145]
This chapter discusses how to compute the population count function on a machine that does not have that instruction, but which has the fundamental instructions generally found on a RISC or CISC computer: shift, add, and, load, conditional branch, and so forth. For illustration, we assume the computer has a 32-bit word size, but most of the techniques discussed here can be easily adapted to other word sizes.
Two problems in population counting are addressed: counting the 1-bits in a single computer word, and counting the 1-bits in a large number of words, perhaps arranged in an array. In each case, we show that the obvious solution, even if carefully honed, can be beaten substantially by very different algorithms that take some imagination to find. The first is an application of the divide-and-conquer strategy, and the second is an application of a certain logic circuit that is familiar to computer logic designers but not so familiar to programmers.
10.1. Basic Methods
As a first cut, a programmer might count the 1-bits in a word x, as illustrated in the following C-language solution. Here x is an unsigned integer, so the right shift is with 0-fill:
pop = 0;
for (i = 0; i < 32; i++){
if (x & 1) pop = pop + 1;
x = x >> 1;
}
On a typical RISC computer, the loop might compile into about seven instructions, two of which are conditional branches. (One of the conditional branches is for loop control.) These seven instructions are executed 32 times, but one of them is bypassed about half the time (we might presume), so that it executes about 32 x 6.5 = 208 instructions.
It would probably not take the programmer long to realize that this code can be easily improved. For one thing, on many computers, counting down from 31 to 0 is more efficient than counting up, because it saves a compare instruction. Better yet, why count at all? Just let the loop go until x is 0. This eliminates some iterations if x has high-order 0-bits. Another optimization is to replace the if test with code that simply adds the rightmost bit of x to the count. This leads to the code:
pop = 0;
while (x) {
pop = pop + (x & 1);
x = x >> 1;
}
This has only four or five RISC instructions in the loop, depending upon whether or not a compare of x to 0 is required, and only one branch. (We assume the compiler rearranges the loop so that the conditional branch is at the bottom.) Thus, it takes a maximum of 128 to 160 instructions. The maximum occurs if x begins with a 1-bit, but it will take far fewer instructions if x has many leading 0s.
Some readers may recall that the simple expression x&(x-1) is x with its least significant 1-bit turned off, or is 0 if x= 0. Thus, to count the 1-bits in x, one can turn them off one at a time until the result is 0, keeping count of how many were turned off. This leads to the code:
pop = 0;
while (x) {
pop = pop + 1;
x = x & (x - 1);
}
Like the preceding code, this takes four or five instructions in the loop, but the loop runs only the same number of times as the number of 1s in x. This is surely an improvement.
A complementary approach, applicable if the number of 1-bits is expected to be large, is to keep turning on the rightmost 0-bit with x = x|(x+1) until the result is all 1s (-1). Count the number of iterations executed in a variable n, and return 32 - n. (Alternatively, the original number x can be complemented, or n can be initialized to 32 and counted down.)
The first program in this series is rather dull, but the others might be considered to have some beauty to an eye that values efficiency, conciseness, and useful cleverness. The first program can be made to run substantially faster by unrolling the loop, but the other two programs would be improved very little, if at all, by this change.
One can also employ table lookup, translating perhaps a byte of x at a time to the number of 1-bits in that byte. The code is