Mastering Algorithms With C - Kyle Loudon [171]
Complexity
O (n), where n is the number of symbols in the original data.
Implementation and Analysis of LZ77
With LZ77, we try to compress data by encoding phrases from a look-ahead buffer as tokens referencing phrases in a sliding window. To uncompress the data, we decode each token into the phrase or symbol it represents. To do this, we must continually update the sliding window so that at any one time it looks the same as it did during the compression process. In the implementation presented here, a symbol in the original data is one byte.
lz77_compress
The lz77_compress operation (see Example 14.5) compresses data using LZ77. It begins by writing the number of symbols in the data to the buffer of compressed data and initializing the sliding window and look-ahead buffer. The look-ahead buffer is then loaded with symbols.
Compression takes place inside of a loop that iterates until there are no more symbols to process. We use ipos to keep track of the current byte being processed in the original data, and opos to keep track of the current bit we are writing to the buffer of compressed data. During each iteration of the loop, we call compare_win to determine the longest phrase in the look-ahead buffer that matches one in the sliding window. The compare_win function returns the length of the longest match.
When a match is found, compare_win sets offset to the position of the match in the sliding window and next to the symbol in the look-ahead buffer immediately after the match. In this case, we write a phrase token to the compressed data (see Figure 14.4a). Phrase tokens in the implementation presented here require 12 bits for offsets because the size of the sliding window is 4K (4096 bytes). Phrase tokens require 5 bits for lengths because no match will exceed the length of the look-ahead buffer, which is 32 bytes. If a match is not found, compare_win returns and sets next to the unmatched symbol at the start of the look-ahead buffer. In this case, we write a symbol token to the compressed data (see Figure 14.4b). Whether we write a phrase or symbol token to the compressed data, before actually writing the token, we call the network function htonl as a convenient way to ensure that the token is in big-endian format. This is the format required when we actually store the compressed data as well as when we uncompress it.
Figure 14.4. The structure of (a) a phrase token and (b) a symbol token in LZ77
Once we write the appropriate token to the buffer of compressed data, we adjust the sliding window and the look-ahead buffer. To move the data through the sliding window, we shift data in from the right side of the window and out the left. We do the same for the look-ahead buffer. The number of bytes we move is equal to the number of symbols we encode in the token.
The runtime complexity of lz77_compress is O (n), where n is the number of symbols in the original data. This is because for each of the n/c tokens in which the data is encoded, where 1/c is a constant factor that represents how efficiently symbols are encoded in phrase tokens, we call compare_win once. The compare_win function runs in a constant amount of time because the size of the sliding window and look-ahead buffer are both constant. However, these constants are large and contribute significantly to the overall running time of lz77_compress. Thus, the runtime complexity of lz77_compress is O (n), but its actual running time is greatly affected by constant factors. This explains the generally slow performance of LZ77 when compressing data.
lz77_uncompress
The lz77_uncompress operation (see Figure 14.4) uncompresses data previously compressed with lz77_compress. It begins by reading the number of symbols in the compressed data and initializing the sliding window and look-ahead buffer.
Uncompressing the data takes place inside a loop that iterates