
Webbots, Spiders, and Screen Scrapers - Michael Schrenk [35]

data was compressed by the server, you can decompress the files with the function gzuncompress() in PHP, as shown in Listing 6-9.

$uncompressed_file = gzuncompress($compressed_file);

Listing 6-9: Decompressing a compressed file
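As a minimal sketch of the full round trip, the snippet below compresses a string with gzcompress() and restores it with gzuncompress(). The sample text is made up for illustration. Note that gzuncompress() expects the zlib format that gzcompress() produces; a response delivered with gzip content encoding would normally be decoded with gzdecode() instead.

```php
<?php
// Hypothetical sample data standing in for a downloaded file
$original = str_repeat("<p>Webbots save bandwidth by compressing text.</p>\n", 50);

// Compress in zlib format (level 9 = maximum compression)
$compressed_file = gzcompress($original, 9);

// Restore the original bytes
$uncompressed_file = gzuncompress($compressed_file);

echo "Original:   " . strlen($original) . " bytes\n";
echo "Compressed: " . strlen($compressed_file) . " bytes\n";
echo "Round trip intact: " . ($uncompressed_file === $original ? "yes" : "no") . "\n";
```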

Compressing Files on Your Hard Drive

PHP provides a variety of built-in functions for compressing data. Listing 6-10 demonstrates these functions. This script downloads the default HTML file from http://www.schrenk.com, compresses the file, and shows the difference in size between the compressed and uncompressed files.

<?php

# Demonstration of PHP file compression

# Include cURL library

include("LIB_http.php");

# Get web page

$target = "http://www.schrenk.com";

$ref = "";

$method = "GET";

$data_array = "";

$web_page = http_get($target, $ref, $method, $data_array, EXCL_HEAD);

# Get sizes of compressed and uncompressed versions of web page

$uncompressed_size = strlen($web_page['FILE']);

$compressed_size = strlen(gzcompress($web_page['FILE'], 9)); // 9 = maximum compression level

$noformat_size = strlen(strip_tags($web_page['FILE']));

# Report the sizes of compressed and uncompressed versions of web page

?>

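<table border="1">

<tr><th colspan="2">Compression report for <?php echo $target; ?></th></tr>

<tr><th>Uncompressed</th><th>Compressed</th></tr>

<tr><td><?php echo $uncompressed_size; ?> bytes</td><td><?php echo $compressed_size; ?> bytes</td></tr>

</table>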

Listing 6-10: Compressing a downloaded file
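If LIB_http is not at hand, the same size comparison can be tried without any network access. The sketch below is only an approximation of Listing 6-10: the $html string is a made-up stand-in for $web_page['FILE'], and because it is highly repetitive it will compress far better than a typical real page.

```php
<?php
// Hypothetical stand-in for $web_page['FILE']; no download required
$html = "<html><body><table>\n"
      . str_repeat("<tr><td align=\"center\">Item</td><td>Price</td></tr>\n", 200)
      . "</table></body></html>\n";

// Sizes before and after compression
$uncompressed_size = strlen($html);
$compressed_size   = strlen(gzcompress($html, 9));

// Compressed size as a percentage of the original size
$percent = round(100 * $compressed_size / $uncompressed_size, 1);

echo "Uncompressed: $uncompressed_size bytes\n";
echo "Compressed:   $compressed_size bytes ($percent% of original)\n";
```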

Running the script from Listing 6-10 in a browser provides the results shown in Figure 6-6.

Before you start compressing everything your webbot finds, you should be aware of the disadvantages of file compression. In this example, compression resulted in files roughly 20 percent of the original size. While this is impressive, the biggest drawback to compression is that you can't do much with a compressed file. You can't perform searches, sorts, or comparisons on the contents of a compressed file. Nor can you modify the contents of a file while it's compressed. Furthermore, while text files (like HTML files) compress effectively, many media files like JPG, GIF, WMF, or MOV are already compressed and will not compress much further. If your webbot application needs to analyze or manipulate downloaded files, file compression may not be for you.

Figure 6-6. The script from Listing 6-10, showing the value of compressing files
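The drawbacks described above can be sketched in a few lines. In this illustration (the sample text is made up, and random bytes are used as a rough stand-in for an already-compressed media file), the data has to be decompressed before it can be searched, and data with no remaining redundancy does not shrink when compressed again.

```php
<?php
// Hypothetical text file
$text       = str_repeat("The quick brown webbot parses the lazy page. ", 100);
$compressed = gzcompress($text, 9);

// 1. Searching: the compressed bytes must be restored before strpos() is meaningful
$position = strpos(gzuncompress($compressed), "webbot");
echo "Found 'webbot' at offset $position after decompressing\n";

// 2. Already-compressed media: random bytes approximate a JPG or MOV payload
//    (an assumption made for illustration; real files will differ in detail)
$media_like        = random_bytes(10000);
$recompressed_size = strlen(gzcompress($media_like, 9));

echo "Media-like data: " . strlen($media_like) . " bytes\n";
echo "After gzcompress(): $recompressed_size bytes\n";
```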

Removing Formatting

Removing unneeded HTML formatting instructions is often more useful than compression for reducing the size of a downloaded file, because the useful information in the file stays accessible. Formatting tags are of little use to a webbot: they control only format, not content, and they can be removed without harming your data. Removing formatting reduces the size of downloaded HTML files while still leaving the option of manipulating the data later. Fortunately, PHP provides strip_tags(), a built-in function that automatically strips HTML tags from a document. For example, if I add the lines shown in Listing 6-11 to the previous script, we can see the effect of stripping the HTML formatting.

$noformat = strip_tags($web_page['FILE']); // Remove HTML tags

$noformat_size = strlen($noformat); // Get size of new string

Listing 6-11: Removing HTML formatting using the strip_tags() function
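To see what strip_tags() leaves behind, here is a small self-contained sketch. The HTML fragment is hypothetical and simply stands in for $web_page['FILE']; the point is that the text content survives and remains searchable.

```php
<?php
// Hypothetical fragment standing in for $web_page['FILE']
$html = '<div align="center"><b>Price:</b> <i>$19.95</i></div>';

$noformat      = strip_tags($html);   // Remove HTML tags
$noformat_size = strlen($noformat);   // Get size of new string

echo "With tags:    " . strlen($html) . " bytes\n";
echo "Without tags: $noformat_size bytes\n";
echo "Content:      $noformat\n";

// The stripped text can still be searched
echo "Contains 'Price': " . (strpos($noformat, "Price") !== false ? "yes" : "no") . "\n";
```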

If you run the program in Listing 6-10 again and modify the output to also show the size of the unformatted file, you will see that the unformatted web page is nearly as compact as the compressed version. The results are shown in Figure 6-7.

Figure 6-7. Comparison of uncompressed, compressed, and unformatted file sizes

Unlike the compressed data, the unformatted data can still be sorted, modified, and searched. You can make the file even smaller by removing excess white space, without reducing your ability to manipulate the data later: PHP's trim() function strips spaces, line feeds, and other white space from the beginning and end of a string, and a regular-expression function such as preg_replace() can collapse runs of white space inside it. As an added
