            /**********************************************************************
            *                                                                     *
            *  Shift the current byte to the left.                                *
            *                                                                     *
            **********************************************************************/

            bits[i] = bits[i] << 1;
         }

         /*************************************************************************
         *                                                                        *
         *  Set the rightmost bit of the buffer to the bit shifted off the        *
         *  first byte.                                                           *
         *                                                                        *
         *************************************************************************/

         bit_set(bits, size - 1, fbit);
      }
   }

   return;
}
Description of Huffman Coding
One of the oldest and most elegant forms of data compression is Huffman coding, an algorithm based on minimum redundancy coding. Minimum redundancy coding suggests that if we know how often different symbols occur in a set of data, we can represent the symbols in a way that makes the data require less space. The idea is to encode symbols that occur more frequently with fewer bits than those that occur less frequently. It is important to realize that a symbol is not necessarily a character of text: a symbol can be any amount of data we choose, but it is often one byte's worth.
Entropy and Minimum Redundancy
To begin, let's revisit the concept of entropy introduced at the beginning of the chapter. Recall that every set of data has some informational content, which is called its entropy. The entropy of a set of data is the sum of the entropies of each of its symbols. The entropy S_z of a symbol z is defined as:
S_z = -lg(P_z)
where P_z is the probability of z being found in the data. When it is known exactly how many times z occurs, P_z is referred to as the frequency of z. As an example, if z occurs 8 times in 32 symbols, or one-fourth of the time, the entropy of z is:
-lg(1/4) = 2 bits
This means that using any more than two bits to represent z is more than we need. If we consider that normally we represent a symbol using eight bits (one byte), we see that compression here has the potential to improve the representation a great deal.
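To see this arithmetic in code, the following short example (not from the book; the name symbol_entropy is simply an illustrative choice) computes -lg(P_z) for a symbol occurring 8 times out of 32 and prints 2.000000:

#include <math.h>
#include <stdio.h>

/* Entropy, in bits, of one instance of a symbol occurring with probability p. */
static double symbol_entropy(double p) {
   return -log2(p);
}

int main(void) {
   /* A symbol occurring 8 times out of 32 has probability 1/4. */
   printf("%f\n", symbol_entropy(8.0 / 32.0));
   return 0;
}

On most systems this must be linked with the math library (-lm), since log2 lives there.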
Table 14.1 presents an example of calculating the entropy of some data containing 72 instances of five different symbols. Using "U" as an example, the total entropy contributed by a symbol is computed as follows. Since "U" occurs 12 times out of the 72 total, each instance of "U" has an entropy calculated as:
-lg(12/72) = 2.584963 bits
Consequently, because "U" occurs 12 times in the data, its contribution to the entropy of the data is calculated as:
(2.584963)(12) = 31.01955 bits
In order to calculate the overall entropy of the data, we sum the total entropies contributed by each symbol. To do this for the data in Table 14.1, we have:
31.01955 + 36.00000 + 23.53799 + 33.94552 + 36.95994 = 161.46300 bits
Since using 8 bits to represent each symbol yields a data size of (72)(8) = 576 bits, we should be able to compress this data, in theory, by up to:
1 - (161.46300/576) = 72.0%
Table 14.1. The Entropy of a Set of Data Containing 72 Instances of 5 Different Symbols
Symbol   Probability   Entropy of Each Instance   Total Entropy
U        12/72         2.584963                   31.01955
V        18/72         2.000000                   36.00000
W        7/72          3.362570                   23.53799
X        15/72         2.263034                   33.94552
Y        20/72         1.847997                   36.95994
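As a cross-check of these figures, here is a small sketch (again not from the book; the variable names are illustrative) that recomputes the per-symbol contributions, the 161.463-bit total, and the 72% bound from the frequencies in Table 14.1:

#include <math.h>
#include <stdio.h>

int main(void) {
   /* Symbols and frequencies from Table 14.1 (72 instances in all). */
   const char symbols[5] = {'U', 'V', 'W', 'X', 'Y'};
   const int  freqs[5]   = {12, 18, 7, 15, 20};
   const int  total      = 72;
   double     entropy    = 0.0;
   int        i;

   for (i = 0; i < 5; i++) {
      /* Each instance contributes -lg(P) bits; multiply by the frequency. */
      double contribution = -log2((double)freqs[i] / total) * freqs[i];
      printf("%c: %f bits\n", symbols[i], contribution);
      entropy = entropy + contribution;
   }

   /* At 8 bits per symbol the data occupies 576 bits; compare with its entropy. */
   printf("Total entropy: %f bits\n", entropy);
   printf("Potential compression: %.1f%%\n", (1.0 - entropy / (total * 8.0)) * 100.0);

   return 0;
}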
Building a Huffman Tree
Huffman coding presents a way to approximate the optimal representation of data based on its entropy. It works by building a data structure called a Huffman tree, which is a binary tree (see Chapter 9) organized to generate Huffman codes. Huffman codes are the codes assigned to symbols in the data to achieve compression. However, Huffman codes result in compression that only approximates the data's entropy because, as you may have noticed in Table 14.1, the entropies of symbols often come out to be fractions of bits. Since the actual number of bits used in Huffman codes cannot be fractions in practice, some codes end up with slightly too many bits to be optimal.
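To make the shape of such a tree concrete, one possible node layout is sketched below. This is an illustration only, not the structure used later in the chapter, which could just as well be built from the binary tree type of Chapter 9:

/* One possible Huffman tree node (illustrative only). Leaf nodes carry a
 * symbol; internal nodes carry only the combined frequency of the leaves
 * beneath them. Following a left branch appends a 0 to a symbol's code,
 * and following a right branch appends a 1. */
typedef struct HuffmanNode_ {
   unsigned char        symbol;  /* Meaningful only at a leaf.           */
   int                  freq;    /* Frequency of the symbol or subtree.  */
   struct HuffmanNode_  *left;   /* Subtree for codes extended with a 0. */
   struct HuffmanNode_  *right;  /* Subtree for codes extended with a 1. */
} HuffmanNode;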
Figure 14.1 illustrates the process of building a Huffman tree from the data in Table 14.1. Building a Huffman tree proceeds from its leaf nodes upward. To begin, we place each symbol