
Webbots, Spiders, and Screen Scrapers - Michael Schrenk [38]

be the target for our price-monitoring webbot. A screenshot of the store is shown in Figure 7-1.

Figure 7-1. The e-commerce website that is monitored by the price-monitoring webbot

This practice store provides a controlled environment that is ideal for this exercise. For example, by targeting the example store you can do the following:

Experiment with price-monitoring webbots without the possibility of interfering with an actual business

Control the content of this target, so you don't run the risk of someone modifying the web page, which could break the example scripts[26]

The prices change on a daily basis, so you can also use it to practice writing webbots that track and graph prices over time.

* * *

[24] Chapter 16 describes how webbots send email and text messages.

[25] The URL for this store is found at http://www.schrenk.com/nostarch/webbots.

[26] The example scripts are resistant to most changes in the target store.

Designing the Parsing Script

Our webbot's objective is to download the target web page, parse the price variables, and place the data into an array for processing. The price-monitoring webbot is largely an exercise in parsing data that appears in tables, since useful online data usually appears as such. When tables aren't used, <div> tags are generally applied and can be parsed in a similar manner.

While we know that the test target for this example won't change, we don't know that about targets in the wild. Therefore, we don't want to be too specific when telling our parsing routines where to look for pricing information. In this example, the parsing script won't look for data in specific locations; instead, it will look for the desired data relative to easy-to-find text that tells us where the desired information is located. If the position of the pricing information on the target web page changes, our parsing script will still find it.
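The idea of reading data relative to a landmark, rather than at a fixed position, can be sketched in a few lines. This is a minimal illustration, not the book's script: the helper value_after_landmark() and both sample layouts are made up for this example, and it uses only built-in PHP string functions rather than LIB_parse.

```php
<?php
// Hypothetical helper (not part of LIB_parse): return the text that sits
// between a landmark and the next occurrence of a stop string.
function value_after_landmark(string $html, string $landmark, string $stop): ?string
{
    // Locate the landmark, ignoring case.
    $start = stripos($html, $landmark);
    if ($start === false) {
        return null; // landmark missing: the page changed more than expected
    }
    $start += strlen($landmark);

    // Read up to the stop string, or to the end of the page if it is absent.
    $end = stripos($html, $stop, $start);
    $raw = ($end === false)
        ? substr($html, $start)
        : substr($html, $start, $end - $start);

    // Drop any markup that sits between the landmark and the value.
    return trim(strip_tags($raw));
}

// Two made-up layouts: the price moves, but the landmark still precedes it.
$layout_a = '<table><tr><td>Price:</td><td>$12.50</td></tr></table>';
$layout_b = '<div><p>On sale now</p><span>Price: <b>$12.50</b></span></div>';

echo value_after_landmark($layout_a, 'Price:', '</tr>'), "\n";
echo value_after_landmark($layout_b, 'Price:', '</span>'), "\n";
```

Both calls find the same price even though its position and surrounding markup differ. A null return signals that the landmark itself has disappeared, which is the case a webbot should report rather than guess at.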

Let's look at a script that downloads the target web page, parses the prices, and displays the data it parsed. This script is available in its entirety from this book's website. The script is broken into sections here; however, iterative loops are simplified for clarity.

Initialization and Downloading the Target

The example script initializes by including the LIB_http and LIB_parse libraries you read about earlier. It also creates an array where the parsed data is stored, and it sets the product counter to zero, as shown in Listing 7-1.

# Initialization
include("LIB_http.php");
include("LIB_parse.php");
$product_array = array();
$product_count = 0;

# Download the target (practice store) web page
$target = "http://www.schrenk.com/webbots/example_store";
$web_page = http_get($target, "");

Listing 7-1: Initializing the price-monitoring webbot

After initialization, the script proceeds to download the target web page with the http_get() function described in Chapter 3.
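For readers without the book's libraries at hand, the download step can be approximated with PHP's built-in file_get_contents(). The sketch below is an assumption-laden stand-in, not LIB_http: the function name fetch_page() and the 'OK' key are invented here, and only the 'FILE' key mirrors what the listings read. A local temporary file stands in for the store page so the example runs offline; an http:// target should also work where allow_url_fopen is enabled.

```php
<?php
// Hypothetical stand-in for a download helper; it is NOT LIB_http's function.
// It returns an array with a 'FILE' key, as the listings expect.
function fetch_page(string $target, int $timeout = 10): array
{
    // Options under 'http' apply only to http:// and https:// targets.
    $context = stream_context_create([
        'http' => ['timeout' => $timeout, 'user_agent' => 'price-monitor-sketch'],
    ]);

    // Suppress the warning on failure and report it through the result instead.
    $body = @file_get_contents($target, false, $context);

    return [
        'FILE' => ($body === false) ? '' : $body,
        'OK'   => ($body !== false), // assumed key, added for this sketch
    ];
}

// Offline demonstration: a local file stands in for the store page.
$sample = tempnam(sys_get_temp_dir(), 'store');
file_put_contents($sample, '<table><tr><td>Products for Sale</td></tr></table>');

$web_page = fetch_page($sample);
if (!$web_page['OK']) {
    exit("Download failed\n");
}
echo strlen($web_page['FILE']), " bytes downloaded\n";

unlink($sample);
```

Checking the result before parsing keeps a failed download from being mistaken for a page that simply has no products.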

After downloading the web page, the script parses all the page's tables into an array, as shown in Listing 7-2.

# Parse all the tables on the web page into an array

$table_array = parse_array($web_page['FILE'], "<table", "</table>");

Listing 7-2: Parsing the tables into an array
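Separating a page into its tables can be approximated with a regular expression. The sketch below is only an illustration of the technique: split_on_tags() is a hypothetical helper, not LIB_parse's parse_array(), and the sample HTML is made up. It also shows why a prefix such as <table is a more forgiving opening delimiter than the complete tag <table>.

```php
<?php
// Hypothetical helper (an approximation, not LIB_parse's parse_array()):
// return every substring that starts with $open and ends with $close.
function split_on_tags(string $html, string $open, string $close): array
{
    // s: let "." span line breaks; i: ignore case; ".*?" stops at the first close.
    $pattern = '/' . preg_quote($open, '/') . '.*?' . preg_quote($close, '/') . '/si';
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

// Made-up page with two tables whose opening tags carry attributes.
$html = '<TABLE border="1"><tr><td>Auction</td></tr></TABLE>'
      . '<p>between tables</p>'
      . '<table class="products"><tr><td>Products for Sale</td></tr></table>';

$with_prefix   = split_on_tags($html, '<table', '</table>');
$with_full_tag = split_on_tags($html, '<table>', '</table>');

echo count($with_prefix), "\n";   // both tables are found
echo count($with_full_tag), "\n"; // neither opening tag is a bare <table>
```

Because the match is non-greedy, this sketch does not handle nested tables correctly; a page that nests tables would need a real HTML parser or a more careful routine.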

The script does this because the product pricing data is in a table. Once we neatly separate all the tables, we can look for the table with the product data. Notice that the script uses <table, not <table>, as the leading indicator for a table. It does this because an opening table tag often carries attributes, and <table matches it regardless of what follows.

Next, the script looks for the first landmark, or text that identifies the table where the product data exists. Since the landmark represents text that identifies the desired data, that text must be exclusive to our task. For example, by examining the page's source code we can see that we cannot use the word origin as a landmark because it appears in both the description of this week's auction and the list of products for sale. The example script uses the words Products for Sale, because that phrase only exists in the heading of
