Mastering Algorithms With C - Kyle Loudon [169]
free(compressed);
return 0;
}
/*****************************************************************************
* *
* ------------------------------- recv_comp ------------------------------ *
* *
*****************************************************************************/
int recv_comp(int s, unsigned char **data, int *size, int flags) {
unsigned char *compressed;
int size_comp;
/*****************************************************************************
* *
* Receive the compressed data preceded by its size. *
* *
*****************************************************************************/
if (recv(s, (char *)&size_comp, sizeof(int), flags) != sizeof(int))
return -1;
if ((compressed = (unsigned char *)malloc(size_comp)) == NULL)
return -1;
if (recv(s, (char *)compressed, size_comp, flags) != size_comp) {
free(compressed);
return -1;
}
/*****************************************************************************
* *
* Uncompress the data. *
* *
*****************************************************************************/
if ((*size = huffman_uncompress(compressed, data)) < 0)
return -1;
/*****************************************************************************
* *
* Free the buffer of compressed data. *
* *
*****************************************************************************/
free(compressed);
return 0;
}
Description of LZ77
LZ77 ( Lempel-Ziv-1977) is a simple but surprisingly effective form of data compression that takes an entirely different approach from Huffman coding. LZ77 is a dictionary-based method, which means that it tries to compress data by encoding long strings of symbols, called phrases, as small tokens that reference entries in a dictionary. Compression is achieved by using relatively small tokens in place of longer phrases that appear several times in the data. As with Huffman coding, it is important to realize that a symbol is not necessarily a character of text: a symbol can be any amount of data we choose, but it is often one byte's worth.
Maintaining a Dictionary of Phrases
Different dictionary-based compression methods use various approaches for maintaining their dictionaries. LZ77 uses a look-ahead buffer and a sliding window . LZ77 works by first loading a portion of the data into the look-ahead buffer. To understand how the look-ahead buffer stores phrases that effectively form a dictionary, picture the buffer as a sequence of symbols s 1, . . . , sn , and Pb as a set of phrases constructed from the symbols. From the sequence s 1, . . . , sn , we form n phrases, defined as:
Pb = {(s1), (s1, s2), . . . ,(s1, . . . ,sn )}
This means that if the look-ahead buffer contains the symbols (A, B, D), for example, the phrases in the buffer are {(A), (A, B), (A, B, D)}. Once data passes through the look-ahead buffer, it moves into the sliding window and becomes part of the dictionary. To understand how phrases are represented in the sliding window, consider the window to be a sequence of symbols s 1, . . ., sm , and Pw to be a set of phrases constructed from these symbols. From the sequence s 1, . . ., sm , we form the set of phrases as follows:
Pw = {p1, p2, . . . , pm}, where pi = {(si), (si, si+1), . . . , si, si+1, . . . , sm)}
Thus, if the sliding window contains the symbols (A, B, C), the phrases in the window, and hence the dictionary, are {(A), (A, B), (A, B, C), (B), (B, C), (C)}. The main idea behind LZ77 is to look continually for the longest phrase in the look-ahead buffer that matches a phrase currently in the dictionary. In the look-ahead buffer and sliding window just described, the longest match is the two-symbol phrase (A, B).
Compressing and Uncompressing Data
As we compress the data, two situations can exist between the look-ahead buffer and the sliding window at any given moment: there can either be a phrase of some length that matches, or there may be no match at all. When there is at least one match, we encode the longest match as a phrase token . Phrase tokens contain