Beautiful Code [151]
Figure 10-2. A circuit for the array population count
Table 10-1 summarizes the number of instructions executed by generalizations of the plan of Figure 10-2 for various group sizes. The values in the middle two columns ignore loads and loop control. The third column gives the total loop instruction execution count per word of the input array, produced by a compiler for a basic RISC machine that does not have indexed loads.
Table 10-1. Instructions per word for the array population count
Program Instructions exclusive of loads and loop control All instructions in loop (compiler output)
Formula For nc = 5, np = 15
Naive method np + 1 16 21
Groups of 2 (nc + np + 1 ) / 2 10.5 14
Groups of 4 (3nc + np + 1 ) / 4 7.75 10
Groups of 8 (7nc + np + 1 ) / 8 6.38 8
Groups of 16 15nc + np + 1 ) / 16 5.69 7
Groups of 32 31nc + np + 1 ) / 32 5.34 6.5
Groups of 2n
It is a pleasant surprise that in the limit, the number of computational instructions required to compute the population count of n words is reduced from the naïve method's 16n to the CSA method's 5n, where the 5 is the number of instructions required to implement one CSA circuit.
For small arrays, there are better plans than that of Figure 10-2. For example, for an array of seven words, the plan of Figure 10-3 is quite efficient.[**]It executes in 4nc +3np +4=69 instructions, or 9.86 instructions per word. Similar plans exist that apply to arrays of size 2k–1 words, for any positive integer k. The plan for 15 words executes in 11nc +4np +6= 121 instructions, or 8.07 instructions per word.
[**] Seal, op. cit.
The Quest for an Accelerated Population Count > Applications
10.7. Applications
The population count instruction has a miscellany of uses. As mentioned at the beginning of this chapter, one use is to compute the size of a set when sets are represented by bit strings. In this representation, there is a "universe" set whose members are numbered sequentially. A set is represented by a bit string in which bit i is 1 if and only if member i is in the set.
Figure 10-3. A circuit for the total population count of seven words
Another simple application is to compute the Hamming distance between two bit vectors, a concept from the theory of error-correcting codes. The Hamming distance is simply the number of places where the vectors differ, that is:[]
[] See, for example, the chapter on error-correcting codes in A. K. Dewdney, The Turing Omnibus. Computer Science Press, 1989.
dist(x, y) = pop(x y)
The population count instruction may be used to compute the number of trailing 0s in a word, using relations such as:
ntz(x) = pop(¬x & (x – 1)) = 32 – pop(x | –x)
(The reader who is not familiar with these mixtures of arithmetic and logical operations might pause for a few moments to discover why they work.) The function ntz(x) also has a miscellany of uses. For example, some early computers, upon interrupt, would store a "reason for interrupt" bit in a special register. The bits were placed in a position that identified which type of interrupt occurred. The positions were chosen in priority order, usually with the higher-priority interrupts in the less significant positions. Two or more bits could be set at the same time. To determine which interrupt to process, the supervisor program would execute the ntz function on the quantity in the special register.
Another application of population count is to allow reasonably fast direct indexed access to a moderately sparse array A that is represented in a certain compact way. In the compact representation, only the defined, or nonzero, elements of the array are stored. There is an auxiliary bit string bits that has a 1-bit for each bit position i for which A[i] is defined. Since bits is generally quite long, it is broken up into 32-bit words, with the first bit of the long string being at bit 0 (the least significant bit) of the first word of bits.
As a speedup device, there