
Webbots, Spiders, and Screen Scrapers - Michael Schrenk [40]


echo"PROCESSED: Item #$product_count\n";

}

Listing 7-7: Assigning parsed data to an array

Once the prices are parsed into an array, the webbot script can do anything it wants with the data. In this case, it simply displays what it collected, as shown in Listing 7-8.

# Display the collected data

for($xx=0; $xx<count($product_array); $xx++) {

echo "$xx. ";

echo "ID: ".$product_array[$xx]['ID'].", ";

echo "NAME: ".$product_array[$xx]['NAME'].", ";

echo "PRICE: ".$product_array[$xx]['PRICE']."\n";

}

Listing 7-8: Displaying the parsed product pricing data

As shown in Figure 7-2, the webbot indicates when it finds landmarks and prices. This not only tells the operator how the webbot is running, but also provides important diagnostic information, making both debugging and maintenance easier.

Since prices are almost always in HTML tables, you will usually parse price information in a manner similar to that shown here. Occasionally, pricing information may be contained in other tags (like <div> tags, for example), but this is less likely. When you encounter <div> tags, you can easily parse the data they contain into arrays using similar methods.
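As a rough illustration of parsing prices held in tags other than table cells, the sketch below assumes a hypothetical page that wraps each name and price in <div> tags. The class names "name" and "price" and the sample markup are assumptions made for this example, and it uses PHP's built-in preg_match_all() rather than the parsing routines used in the listings above.

```php
<?php
// Hypothetical markup: the class names and values are assumptions for this sketch.
$html = '<div class="name">Widget</div> <div class="price">$4.50</div>'
      . '<div class="name">Gadget</div> <div class="price">$1,200.00</div>';

// Capture each name and the numeric part of the price that follows it.
$pattern = '#<div class="name">(.*?)</div>\s*<div class="price">\$([0-9.,]+)</div>#s';

$product_array = array();
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    $product_array[] = array(
        'NAME'  => trim($match[1]),
        // Remove thousands separators before converting to a number
        'PRICE' => (float) str_replace(',', '', $match[2]),
    );
}

// Display the collected data, as in the earlier listing
for ($xx = 0; $xx < count($product_array); $xx++) {
    echo "$xx. NAME: " . $product_array[$xx]['NAME'] . ", ";
    echo "PRICE: " . $product_array[$xx]['PRICE'] . "\n";
}
```

A pattern this strict breaks as soon as the markup changes, so on a real page it is worth echoing a diagnostic when no matches are found.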

Figure 7-2. The price-monitoring webbot, as run in a shell

Further Exploration

Now you know how to parse pricing information from a web page—what you do with this information is up to you. If you are so inclined, you can expand your experience with some of the following suggestions.

Since the prices in the example store change on a daily basis, monitor the daily price changes for a month and save your parsed results in a database.

Develop scripts that graph price fluctuations.

Read about sending email with webbots in Chapter 16, and develop scripts that notify you when prices hit preset high or low thresholds.
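One way to approach the threshold suggestion is sketched below. It assumes the parsed prices have already been reduced to plain numbers, and the function name find_price_alerts() and the layout of the $thresholds array are hypothetical. The sketch only builds the alert messages; how they are delivered is left open.

```php
<?php
// Return one message for every product whose price has reached a preset threshold.
// $thresholds maps a product name to array('low' => ..., 'high' => ...) (assumed layout).
function find_price_alerts(array $product_array, array $thresholds)
{
    $alerts = array();
    foreach ($product_array as $product) {
        $name = $product['NAME'];
        if (!isset($thresholds[$name])) {
            continue; // no thresholds were set for this product
        }
        $price = (float) $product['PRICE'];
        $low   = $thresholds[$name]['low'];
        $high  = $thresholds[$name]['high'];

        if ($price <= $low) {
            $alerts[] = sprintf('%s dropped to %.2f (low threshold %.2f)', $name, $price, $low);
        } elseif ($price >= $high) {
            $alerts[] = sprintf('%s rose to %.2f (high threshold %.2f)', $name, $price, $high);
        }
    }
    return $alerts;
}

// Sample data, invented for the example
$product_array = array(
    array('ID' => 1, 'NAME' => 'Widget', 'PRICE' => '4.50'),
    array('ID' => 2, 'NAME' => 'Gadget', 'PRICE' => '12.00'),
);
$thresholds = array(
    'Widget' => array('low' => 5.00, 'high' => 9.00),
    'Gadget' => array('low' => 8.00, 'high' => 20.00),
);

$alerts = find_price_alerts($product_array, $thresholds);
foreach ($alerts as $alert) {
    echo "ALERT: $alert\n";
}
```

Keeping the comparison separate from the notification makes the script easy to test from a shell before any email is sent.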

While this chapter covers monitoring prices online, you can use similar parsing techniques to monitor and parse other types of data found in HTML tables. Consider using the techniques you learned here to monitor things like baseball scores, stock prices, weather forecasts, census data, banner ad rotation statistics,[27] and more.

* * *

[27] You can use webbots to perform a variety of diagnostic functions. For example, a webbot may repeatedly download a web page to gather metrics on how banner ads are rotated.

Chapter 8. IMAGE-CAPTURING WEBBOTS

In this chapter, I'll describe a webbot that identifies and downloads all of the images on a web page. This webbot also stores images in a directory structure similar to the directory structure on the target website. This project will show how a seemingly simple webbot can be made more complex by addressing these common problems:

Finding the page base, the address from which all relative addresses on the page are resolved

Dealing with changes to the page base, caused by page redirection

Converting relative addresses into fully resolved URLs

Replicating complex directory structures

Properly downloading image files with binary formats
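To make the idea of a page base concrete, here is a simplified sketch of converting a relative address into a fully resolved URL. The function name resolve_address() and the example.com addresses are hypothetical, and the sketch ignores details such as query strings and trailing slashes, so treat it as an illustration and not a complete resolver.

```php
<?php
// Resolve $link against $page_base (the address of the page that contained the link).
function resolve_address($link, $page_base)
{
    // Links that already include a scheme are fully resolved
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }

    $parts = parse_url($page_base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    if (substr($link, 0, 2) === '//') {
        // Protocol-relative link: reuse the scheme of the page base
        return $parts['scheme'] . ':' . $link;
    }

    if (substr($link, 0, 1) === '/') {
        // Root-relative link: replaces the whole path
        $path = $link;
    } else {
        // Relative link: append to the directory part of the page base
        $base_path = isset($parts['path']) ? $parts['path'] : '/';
        $path = substr($base_path, 0, strrpos($base_path, '/') + 1) . $link;
    }

    // Collapse "." and ".." segments
    $segments = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    return $root . '/' . implode('/', $segments);
}

$page_base = 'http://www.example.com/missions/viking/index.html';
echo resolve_address('../images/logo.gif', $page_base) . "\n";
echo resolve_address('/templates/logo.gif', $page_base) . "\n";
```

If the download was redirected, the page base should be the final address and not the one originally requested. With PHP's cURL extension, curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) reports that final address.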

In Chapter 18, you'll expand on these concepts to develop a spider that downloads images from an entire website, not just one page.

Example Image-Capturing Webbot

Our image-capturing webbot downloads a target web page (in this case, the Viking Mission web page on the NASA website) and parses all references to images on the page. The webbot downloads each image, echoes the image's name and size to the console, and stores the file on the local hard drive. Figure 8-1 shows what the webbot's output looks like when executed from a shell.

Figure 8-1. The image-capturing bot, when executed from a shell

On this website, as on many others, several unique images share the same filename but have different file paths. For example, the image /templates/logo.gif may represent a different graphic than /templates/affiliate/logo.gif. To solve this problem, the webbot re-creates a local copy of the directory structure that exists on the target website. Figure 8-2 shows the directory structure the webbot created when it saved the images it downloaded.
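The following sketch shows one way a script might mirror the remote directory structure when saving a file. The function names local_path_for() and save_image(), the example.com address, and the saved_images directory are all hypothetical, and the image data is a placeholder string standing in for a real download.

```php
<?php
// Map an image URL to a local path that mirrors the remote directory structure.
function local_path_for($image_url, $save_root)
{
    $path = (string) parse_url($image_url, PHP_URL_PATH);

    // Drop empty, "." and ".." segments so nothing is written outside $save_root
    $segments = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment !== '' && $segment !== '.' && $segment !== '..') {
            $segments[] = $segment;
        }
    }
    return rtrim($save_root, '/') . '/' . implode('/', $segments);
}

// Create any missing directories, then write the image data unchanged.
function save_image($image_data, $local_path)
{
    $dir = dirname($local_path);
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        return false; // the directory could not be created
    }
    // file_put_contents() is binary-safe, so the bytes are stored as downloaded
    return file_put_contents($local_path, $image_data) !== false;
}

$save_root  = sys_get_temp_dir() . '/saved_images';
$image_url  = 'http://www.example.com/templates/affiliate/logo.gif';
$image_data = "GIF89a\x00\x01"; // placeholder bytes, not a real image

$local_path = local_path_for($image_url, $save_root);
if (save_image($image_data, $local_path)) {
    echo "Saved " . strlen($image_data) . " bytes to $local_path\n";
}
```

Because the full remote path is kept, /templates/logo.gif and /templates/affiliate/logo.gif end up in different local directories and cannot overwrite each other.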
