Figure 14.1. Building a Huffman tree from the symbols and frequencies in Table 14.1
Compressing and Uncompressing Data
Building a Huffman tree is part of both compressing and uncompressing data. To compress data using a Huffman tree, given a specific symbol, we start at the root of the tree and trace a path to the symbol's leaf. As we descend along the path, whenever we move to the left, we append 0 to the current code; whenever we move to the right, we append 1. Thus, in Figure 14.1, step 6, to determine the Huffman code for "U" we move to the right (1), then to the left (10), and then to the right again (101). The Huffman codes for all of the symbols in the figure are:
U = 101, V = 01, W = 100, X = 00, Y = 11
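This descent can be expressed directly in code. The following is a minimal sketch, assuming a hypothetical HuffNode structure with a symbol at each leaf and left and right child pointers; it is illustrative only, not the implementation developed later in this chapter. It prints every symbol's code by appending 0 on each move to the left and 1 on each move to the right:

#include <stdio.h>

typedef struct HuffNode {
    int              symbol;              /* meaningful only at leaves */
    struct HuffNode *left, *right;        /* both NULL at leaves       */
} HuffNode;

/* Walk the tree; moving left appends 0 to the code, moving right appends 1. */
static void print_codes(const HuffNode *node, char *code, int depth) {
    if (node->left == NULL && node->right == NULL) {   /* reached a leaf */
        code[depth] = '\0';
        printf("%c = %s\n", node->symbol, code);
        return;
    }
    code[depth] = '0';
    print_codes(node->left, code, depth + 1);
    code[depth] = '1';
    print_codes(node->right, code, depth + 1);
}

int main(void) {
    /* Hand-built copy of the tree in Figure 14.1, step 6. */
    HuffNode X = {'X', NULL, NULL}, V = {'V', NULL, NULL},
             W = {'W', NULL, NULL}, U = {'U', NULL, NULL},
             Y = {'Y', NULL, NULL};
    HuffNode xv   = {0, &X,  &V};         /* yields codes 00 and 01    */
    HuffNode wu   = {0, &W,  &U};         /* yields codes 100 and 101  */
    HuffNode wuy  = {0, &wu, &Y};         /* yields 10x and 11         */
    HuffNode root = {0, &xv, &wuy};
    char code[32];

    print_codes(&root, code, 0);          /* prints the codes above    */
    return 0;
}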
To uncompress data using a Huffman tree, we read the compressed data bit by bit. Starting at the tree's root, whenever we encounter 0 in the data, we move to the left in the tree; whenever we encounter 1, we move to the right. Once we reach a leaf node, we generate the symbol it contains, move back to the root of the tree, and repeat the process until we exhaust the compressed data. Uncompressing data in this manner is possible because Huffman codes are prefix free, which means that no code is a prefix of any other. This ensures that once we encounter a sequence of bits that matches a code, there is no ambiguity as to the symbol it represents. For example, notice that 01, the code for "V," is not a prefix of any of the other codes. Thus, as soon as we encounter 01 in the compressed data, we know that the code must represent "V."
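Decoding can be sketched the same way. Assuming the same hypothetical HuffNode structure as above, and representing the compressed data as a string of '0' and '1' characters purely for illustration, the loop below follows each bit down the tree and emits a symbol whenever it reaches a leaf:

/* Decode a bit string: 0 moves left, 1 moves right; at a leaf, emit the
 * symbol and restart at the root. */
static void decode(const HuffNode *root, const char *bits) {
    const HuffNode *node = root;

    for (; *bits != '\0'; bits++) {
        node = (*bits == '0') ? node->left : node->right;
        if (node->left == NULL && node->right == NULL) {   /* leaf */
            putchar(node->symbol);
            node = root;
        }
    }
    putchar('\n');
}

For the tree of Figure 14.1, decode(&root, "0110111") prints VUY: 01 decodes to "V," 101 to "U," and 11 to "Y," with no ambiguity because the codes are prefix free.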
Effectiveness of Huffman Coding
To determine the reduced size of data compressed using Huffman coding, we multiply each symbol's frequency by the number of bits in its Huffman code and then sum the products. Thus, to calculate the compressed size of the data presented in Table 14.1 and Figure 14.1, we have:
(12)(3) + (18)(2) + (7)(3) + (15)(2) + (20)(2) = 163 bits
Assuming that without compression each of the 72 symbols would be represented with 8 bits, for a total data size of 576 bits, we end up with the following compression ratio:
1 - (163/576) = 71.7%
Once again, because Huffman coding cannot assign fractional bits, in many cases the compressed size will not be quite as small as the data's entropy suggests, although in this case it comes very close.
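The calculation is easy to verify. The short program below, with the frequencies from Table 14.1 and the code lengths from Figure 14.1 hard-coded, reproduces both the 163-bit compressed size and the 71.7% ratio:

#include <stdio.h>

int main(void) {
    int freq[] = {12, 18, 7, 15, 20};   /* frequencies of U, V, W, X, Y */
    int bits[] = { 3,  2, 3,  2,  2};   /* lengths of their Huffman codes */
    int i, compressed = 0, symbols = 0;

    for (i = 0; i < 5; i++) {
        compressed += freq[i] * bits[i];
        symbols    += freq[i];
    }

    printf("compressed size = %d bits\n", compressed);   /* 163 */
    printf("compression ratio = %.1f%%\n",
           100.0 * (1.0 - (double)compressed / (8.0 * symbols)));  /* 71.7 */
    return 0;
}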
In general, Huffman coding is not the most effective form of compression, but it runs fast both when compressing and uncompressing data. The most time-consuming aspect of compressing data with Huffman coding is usually the need to scan the data twice: once to gather frequencies, and a second time to perform the compression itself. Uncompressing the data is particularly efficient because decoding each symbol's sequence of bits requires only a brief descent in the Huffman tree, whose depth is bounded.
Interface for Huffman Coding
Name
huffman_compress
Synopsis
int huffman_compress(const unsigned char *original, unsigned char **compressed,
int size);
Return Value
Number of bytes in the compressed data if compressing the data is successful, or -1 otherwise.
Description
Uses Huffman coding to compress a buffer of data specified by original, which contains size bytes. The compressed data is written to a buffer returned in compressed. Since the amount of storage required in compressed is unknown to the caller, huffman_compress dynamically allocates this storage itself; it is the responsibility of the caller to free it using free when it is no longer needed.
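A brief usage sketch follows. The header name huffman.h is an assumption made here for illustration (the actual header declaring huffman_compress may differ); the call and cleanup follow the allocation behavior described above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "huffman.h"   /* assumed to declare huffman_compress */

int main(void) {
    const unsigned char *original = (const unsigned char *)"some sample data";
    unsigned char       *compressed = NULL;
    int                  csize;

    csize = huffman_compress(original, &compressed,
                             (int)strlen((const char *)original));
    if (csize < 0) {
        fprintf(stderr, "huffman_compress failed\n");
        return 1;
    }

    printf("compressed %d bytes into %d bytes\n",
           (int)strlen((const char *)original), csize);
    free(compressed);   /* storage was allocated by huffman_compress */
    return 0;
}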