Webbots, Spiders, and Screen Scrapers - Michael Schrenk [35]
$uncompressed_file = gzuncompress($compressed_file);
Listing 6-9: Decompressing a compressed file
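As a minimal sketch of the round trip, the following compresses a string and restores it with gzuncompress(). The sample string is hypothetical and stands in for a downloaded file; the sketch assumes PHP's zlib extension is available.

```php
<?php
// Hypothetical sample data; a real webbot would use a downloaded page here.
$original = str_repeat("<p>Hello, webbot!</p>\n", 50);

// Compress at the maximum level (9), then restore the original string.
$compressed_file   = gzcompress($original, 9);
$uncompressed_file = gzuncompress($compressed_file);

// gzuncompress() returns false if the input is not valid compressed data.
if ($uncompressed_file === false) {
    echo "Decompression failed\n";
} else {
    echo "Original: " . strlen($original) . " bytes\n";
    echo "Compressed: " . strlen($compressed_file) . " bytes\n";
    echo "Round trip intact: " . ($uncompressed_file === $original ? "yes" : "no") . "\n";
}
```

Checking the return value against false before using the result is a cheap safeguard when the compressed data comes from disk or a database.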
Compressing Files on Your Hard Drive
PHP provides a variety of built-in functions for compressing data. Listing 6-10 demonstrates these functions. This script downloads the default HTML file from http://www.schrenk.com, compresses the file, and shows the difference between the compressed and uncompressed files. The PHP sections of this script appear in bold.
<?php
# Demonstration of PHP file compression
# Include cURL library
include("LIB_http.php");
# Get web page
$target = "http://www.schrenk.com";
$ref = "";
$method = "GET";
$data_array = "";
$web_page = http_get($target, $ref, $method, $data_array, EXCL_HEAD);
# Get sizes of compressed and uncompressed versions of web page
$uncompressed_size = strlen($web_page['FILE']);
$compressed_size = strlen(gzcompress($web_page['FILE'], 9)); # 9 = maximum compression level
$noformat_size = strlen(strip_tags($web_page['FILE']));
# Report the sizes of compressed and uncompressed versions of web page
?>
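<table border="1">
<tr><th colspan="2">Compression report for <?php echo $target ?></th></tr>
<tr><th>Uncompressed</th><th>Compressed</th></tr>
<tr><td><?php echo $uncompressed_size ?> bytes</td><td><?php echo $compressed_size ?> bytes</td></tr>
</table>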
Listing 6-10: Compressing a downloaded file
Running the script from Listing 6-10 in a browser provides the results shown in Figure 6-6.
Before you start compressing everything your webbot finds, you should be aware of the disadvantages of file compression. In this example, compression resulted in files roughly 20 percent of the original size. While this is impressive, the biggest drawback to compression is that you can't do much with a compressed file. You can't perform searches, sorts, or comparisons on the contents of a compressed file. Nor can you modify the contents of a file while it's compressed. Furthermore, while text files (like HTML files) compress effectively, many media files like JPG, GIF, WMF, or MOV are already compressed and will not compress much further. If your webbot application needs to analyze or manipulate downloaded files, file compression may not be for you.
Figure 6-6. The script from Listing 6-10, showing the value of compressing files
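The trade-offs described above can be illustrated with a short sketch. The HTML string is hypothetical, and random bytes are used only as a stand-in for already-compressed media such as JPG files; that substitution is an assumption made for illustration, as is the availability of random_bytes() (PHP 7 or later).

```php
<?php
// Hypothetical page content standing in for a downloaded HTML file.
$page = str_repeat("<tr><td>Widget</td><td>19.99</td></tr>\n", 200);
$compressed = gzcompress($page, 9);

// Searching the original text works as expected.
$found_in_original = strpos($page, "19.99") !== false;

// The compressed bytes are an encoded form of the text, so a plain-text
// search on them is not meaningful; decompress before searching.
$restored = gzuncompress($compressed);
$found_after_restore = strpos($restored, "19.99") !== false;

// Random bytes stand in for already-compressed media (an assumption for
// illustration); compressing them again typically saves nothing.
$media = random_bytes(4096);
$media_recompressed = gzcompress($media, 9);

printf("HTML: %d -> %d bytes\n", strlen($page), strlen($compressed));
printf("High-entropy data: %d -> %d bytes\n", strlen($media), strlen($media_recompressed));
```

If a webbot must search or edit stored pages often, the cost of decompressing on every access may outweigh the storage savings.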
Removing Formatting
Removing unneeded HTML formatting instructions is much more useful for reducing the size of a downloaded file than compressing it, since it still facilitates access to the useful information in the file. Formatting instructions like
$noformat = strip_tags($web_page['FILE']); // Remove HTML tags
$noformat_size = strlen($noformat); // Get size of new string
Listing 6-11: Removing HTML formatting using the strip_tags() function
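To see what strip_tags() does to a specific string, here is a small self-contained sketch. The HTML fragment is hypothetical and takes the place of $web_page['FILE'] from the listing.

```php
<?php
// Hypothetical HTML fragment standing in for $web_page['FILE'].
$html = '<p class="intro"><b>Price:</b> <i>19.99</i></p>';

$noformat = strip_tags($html);       // Remove HTML tags
$noformat_size = strlen($noformat);  // Get size of new string

echo $noformat . "\n";
echo strlen($html) . " bytes before, " . $noformat_size . " bytes after\n";
```

The text between the tags survives unchanged, which is why the result can still be searched and compared.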
If you run the program in Listing 6-10 again and modify the output to also show the size of the unformatted file, you will see that the unformatted web page is nearly as compact as the compressed version. The results are shown in Figure 6-7.
Figure 6-7. Comparison of uncompressed, compressed, and unformatted file sizes
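The three-way comparison can be sketched without a network request by substituting a hypothetical HTML string for the downloaded page. The function name size_report() is invented for this illustration, and the actual ratios will depend heavily on the page being measured.

```php
<?php
// Hypothetical helper (not part of LIB_http) that measures one string three ways.
function size_report(string $file): array
{
    return [
        'uncompressed' => strlen($file),
        'compressed'   => strlen(gzcompress($file, 9)),
        'unformatted'  => strlen(strip_tags($file)),
    ];
}

// Hypothetical page content; in Listing 6-10 this would be $web_page['FILE'].
$file = "<html><body>\n"
      . str_repeat("<p class=\"item\"><b>Widget</b> costs <i>19.99</i></p>\n", 100)
      . "</body></html>\n";

$report = size_report($file);
foreach ($report as $label => $bytes) {
    printf("%-13s %d bytes\n", $label . ":", $bytes);
}
```

Highly repetitive sample data like this compresses far better than a typical web page, so treat the numbers as a demonstration of the method rather than a representative result.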
Unlike the compressed data, the unformatted data can still be sorted, modified, and searched. You can make the file even smaller by removing unneeded white space, without reducing your ability to manipulate the data later. Note that PHP's trim() function strips spaces, line feeds, and other white space only from the beginning and end of a string; it does not collapse runs of white space inside the text. As an added