Online Book Reader

Webbots, Spiders, and Screen Scrapers - Michael Schrenk [36]

benefit, unformatted pages may be easier to manipulate, since parsing routines won't mistake HTML for the content you're acting on. Remember that removing the HTML tags also removes all links, JavaScript, image references, and CSS information. If any of that is important, you should extract it before removing a page's formatting.
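The order of operations described above can be sketched as follows. This is a minimal illustration, not part of the book's libraries: the function name extract_references() and the sample page are hypothetical, and the regular expressions assume attribute values wrapped in quotes. PHP's built-in strip_tags() removes the formatting afterward.

```php
<?php
# Hypothetical helper (not from the book's libraries): collect href and src
# attribute values from raw HTML. Assumes attribute values are quoted.
function extract_references($html)
{
    $refs = array('links' => array(), 'images' => array());

    # Anchor tags: capture the quoted href value
    if (preg_match_all('/<a\b[^>]*\bhref\s*=\s*["\']([^"\']*)["\']/i', $html, $m)) {
        $refs['links'] = $m[1];
    }

    # Image tags: capture the quoted src value
    if (preg_match_all('/<img\b[^>]*\bsrc\s*=\s*["\']([^"\']*)["\']/i', $html, $m)) {
        $refs['images'] = $m[1];
    }

    return $refs;
}

# Sample page invented for this illustration
$page = '<p>See <a href="http://www.example.com/a.html">this page</a> '
      . 'and <img src="logo.gif" alt="Logo"> here.</p>';

$refs = extract_references($page);   # First, save what the tags carry
$text = strip_tags($page);           # Then remove the formatting

print_r($refs);
echo $text . "\n";
```

If the two steps were reversed, the link and image addresses would already be gone by the time you looked for them.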

* * *

[22] For more information on agent name spoofing, please review Chapter 3.

Thumbnailing Images

The most effective method of decreasing the size of an image is to create smaller versions, or thumbnails, of the original. You may easily create thumbnails with the LIB_thumbnail library, which you can download from this book's website. To use this library, you will have to verify that your configuration uses the gd (revision 2.0 or higher) graphics module.[23] The script in Listing 6-12 demonstrates how to use LIB_thumbnail to create a miniature version of a larger image. The PHP sections of this script appear in bold.

<?php
# Demonstration of LIB_thumbnail.php

# Include libraries
include("LIB_http.php");
include("LIB_thumbnail.php");

# Get image from the Internet
$target = "http://www.schrenk.com/north_beach.jpg";
$ref = "";
$method = "GET";
$data_array = "";
$image = http_get($target, $ref, $method, $data_array, EXCL_HEAD);

# Store captured image file to local hard drive
$handle = fopen("test.jpg", "w");
fputs($handle, $image['FILE']);
fclose($handle);

# Create thumbnail image from image that was just stored locally
$org_file = "test.jpg";
$new_file_name = "thumbnail.jpg";

# Set the maximum width and height of the thumbnailed image
$max_width = 90;
$max_height = 90;

# Create the thumbnailed image
create_thumbnail($org_file, $new_file_name, $max_width, $max_height);
?>
Full-size image<br>
<img src="test.jpg">
<p>
Thumbnail image<br>
<img src="thumbnail.jpg">

Listing 6-12: Demonstration of how LIB_thumbnail may create a thumbnailed image

The script in Listing 6-12 fetches an image from the Internet, writes a copy of the original to a local file, defines the maximum dimensions of the thumbnail, creates the thumbnail, and finally displays both the original image and the thumbnail image.

The product of running the script in Listing 6-12 is shown in Figure 6-8.

The thumbnailed image shown in Figure 6-8 consumes roughly 30 percent as much space as the original file. If the original file were larger, or the maximum dimensions specified for the thumbnail were smaller, the savings would be even greater.
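The internals of LIB_thumbnail are not shown in this section, so the following is only a sketch of how a maximum width and height might be turned into a thumbnail with the gd module. The function names scaled_dimensions() and make_thumbnail_sketch() are hypothetical, and the sketch assumes that the aspect ratio is preserved and that images are never enlarged; the library itself may behave differently. The gd calls (imagecreatefromjpeg(), imagecreatetruecolor(), imagecopyresampled(), and imagejpeg()) are standard PHP functions.

```php
<?php
# Hypothetical helper: fit an image inside a bounding box.
# Assumes the aspect ratio is kept and the image is never enlarged.
function scaled_dimensions($width, $height, $max_width, $max_height)
{
    # The smaller ratio guarantees that both limits are respected
    $scale = min($max_width / $width, $max_height / $height, 1);

    return array(
        max(1, (int) round($width * $scale)),
        max(1, (int) round($height * $scale))
    );
}

# Hypothetical stand-in for a thumbnailing routine; requires the gd module
function make_thumbnail_sketch($src_file, $dst_file, $max_width, $max_height)
{
    $size = getimagesize($src_file);
    if ($size === false) {
        return false;                       # Not a readable image
    }
    list($width, $height) = $size;
    list($new_w, $new_h) = scaled_dimensions($width, $height, $max_width, $max_height);

    $src = imagecreatefromjpeg($src_file);
    $dst = imagecreatetruecolor($new_w, $new_h);

    # Resample the whole source image into the smaller canvas
    imagecopyresampled($dst, $src, 0, 0, 0, 0, $new_w, $new_h, $width, $height);

    return imagejpeg($dst, $dst_file);      # Write the thumbnail as a JPEG
}

# The dimension calculation alone, for a 600 x 400 image and a 90 x 90 limit
list($w, $h) = scaled_dimensions(600, 400, 90, 90);
echo $w . " x " . $h . "\n";
```

Because the pixel count falls with the square of the scale factor, even a modest reduction in the maximum dimensions shrinks the file considerably.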

Figure 6-8. Output of Listing 6-12, making thumbnails with LIB_thumbnail

* * *

[23] If the gd module is not installed in your configuration, please refer to http://www.php.net/manual/en/ref.image.php for further instructions.

Final Thoughts

When storing information, you need to consider what is being stored and how that information will be used later. Furthermore, if the data isn't going to be used later, you need to ask yourself why you need to save it.

Sometimes it is easier to parse text before the HTML tags are removed. This is especially true with regard to data in tables, where rows and columns are parsed.
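To see why tables are easier to handle while the tags are still present, consider this sketch. The function name parse_table() and the sample table are hypothetical, and the regular expressions assume simple, well-formed markup without nested tables; this is not the parsing approach of any library from this book.

```php
<?php
# Hypothetical helper: convert a simple HTML table into an array of rows,
# where each row is an array of cell values. Nested tables are not handled.
function parse_table($html)
{
    $rows = array();

    # Use the <tr> tags to find row boundaries
    preg_match_all('/<tr\b[^>]*>(.*?)<\/tr>/is', $html, $tr_matches);

    foreach ($tr_matches[1] as $row_html) {
        # Use the <td> and <th> tags to find column boundaries
        preg_match_all('/<t[dh]\b[^>]*>(.*?)<\/t[dh]>/is', $row_html, $cell_matches);

        # Only now is the remaining formatting removed from each cell
        $cells = array();
        foreach ($cell_matches[1] as $cell) {
            $cells[] = trim(strip_tags($cell));
        }
        $rows[] = $cells;
    }

    return $rows;
}

# Sample table invented for this illustration
$table = '<table><tr><th>Item</th><th>Price</th></tr>'
       . '<tr><td>Widget</td><td><b>$4.50</b></td></tr></table>';

$rows = parse_table($table);
print_r($rows);

# For comparison: stripping the tags first discards the row and column boundaries
echo strip_tags($table) . "\n";
```

The final line prints the cell contents run together, which shows that the row and column structure cannot be recovered once the tags are gone.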

While unformatted pages are stripped of presentation, colors, and images, the remaining text is enough to represent the original file. Without the HTML, it is actually easier to characterize, manipulate, or search for the presence of keywords.
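As a small illustration of keyword searching, the sketch below counts occurrences of a word in a page before and after the tags are removed. The function name count_keyword() and the sample page are hypothetical; strip_tags(), html_entity_decode(), and substr_count() are standard PHP functions.

```php
<?php
# Hypothetical helper: count case-insensitive occurrences of a keyword
# in the visible text of a page.
function count_keyword($html, $keyword)
{
    $text = strip_tags($html);           # Remove the formatting
    $text = html_entity_decode($text);   # Turn entities such as &amp; into characters

    return substr_count(strtolower($text), strtolower($keyword));
}

# Sample page invented for this illustration; note the keyword inside a tag
$page = '<div class="webbot">A <b>webbot</b> is an automated agent. '
      . 'Webbots download pages.</div>';

$in_text = count_keyword($page, 'webbot');
$in_raw  = substr_count(strtolower($page), 'webbot');

echo "Occurrences in unformatted text: " . $in_text . "\n";
echo "Occurrences in raw HTML: " . $in_raw . "\n";
```

The raw count is higher because the class attribute also contains the keyword, which is the kind of confusion that unformatted text avoids.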

Before you continue, this is a good time to download LIB_mysql, LIB_http, and LIB_thumbnail from this book's website. You will need all of these libraries to program later examples in this book.

Part II. PROJECTS

This section expands on the concepts you learned in the previous section with simple yet demonstrative projects. Any of these projects, with further development, could be transformed from a simple webbot concept into a potentially marketable product.

Chapter 7

The first project describes webbots that collect and analyze online prices from a mock store that exists on this book's website. The prices change periodically, creating an opportunity for your webbots to analyze and make purchase
