Webbots, Spiders, and Screen Scrapers - Michael Schrenk [38]
Figure 7-1. The e-commerce website that is monitored by the price-monitoring webbot
This practice store provides a controlled environment that is ideal for this exercise. For example, by targeting the example store you can do the following:
Experiment with price-monitoring webbots without the possibility of interfering with an actual business
Control the content of this target, so you don't run the risk of someone modifying the web page, which could break the example scripts[26]
The prices change on a daily basis, so you can also use it to practice writing webbots that track and graph prices over time.
* * *
[24] Chapter 16 describes how webbots send email and text messages.
[25] The URL for this store is found at http://www.schrenk.com/nostarch/webbots.
[26] The example scripts are resistant to most changes in the target store.
Designing the Parsing Script
Our webbot's objective is to download the target web page, parse the prices, and place the data into an array for processing. The price-monitoring webbot is largely an exercise in parsing data that appears in tables, since that is how useful online data is usually presented. When tables aren't used,
While we know that the test target for this example won't change, we don't know that about targets in the wild. Therefore, we don't want to be too specific when telling our parsing routines where to look for pricing information. In this example, the parsing script won't look for data in specific locations; instead, it will look for the desired data relative to easy-to-find text that tells us where the desired information is located. If the position of the pricing information on the target web page changes, our parsing script will still find it.
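The idea of locating data relative to landmark text, rather than at a fixed position, can be sketched as follows. This is a minimal illustration and not the book's parsing code: the helper name value_after_landmark(), the landmark "Price:", and both sample layouts are hypothetical, and the sketch relies only on PHP's built-in string functions instead of LIB_parse.

```php
<?php
// Hypothetical helper (not part of LIB_parse): return the text found between
// the first occurrence of $landmark and the next occurrence of $stop,
// or null if either marker is missing.
function value_after_landmark(string $page, string $landmark, string $stop): ?string
{
    $start = stripos($page, $landmark);
    if ($start === false) {
        return null;                      // landmark text not on the page
    }
    $start += strlen($landmark);          // begin just after the landmark
    $end = stripos($page, $stop, $start);
    if ($end === false) {
        return null;                      // no stop marker after the landmark
    }
    // Drop any markup between the markers and trim surrounding whitespace
    return trim(strip_tags(substr($page, $start, $end - $start)));
}

// Two made-up layouts for the same product; only the surrounding markup differs.
$layout_a = '<table><tr><td>Widget</td><td>Price:</td><td>$12.50</td></tr></table>';
$layout_b = '<div><p>Intro text</p><span>Widget</span> <b>Price:</b> <i>$12.50</i><br></div>';

$price_a = value_after_landmark($layout_a, 'Price:', '</tr>');
$price_b = value_after_landmark($layout_b, 'Price:', '<br>');
```

Both layouts yield the same price string because the helper keys on the landmark text and not on a particular table cell or character offset.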
Let's look at a script that downloads the target web page, parses the prices, and displays the data it parsed. This script is available in its entirety from this book's website. The script is broken into sections here, and its iterative loops are simplified for clarity.
Initialization and Downloading the Target
The example script initializes by including the LIB_http and LIB_parse libraries you read about earlier. It also creates an array where the parsed data is stored, and it sets the product counter to zero, as shown in Listing 7-1.
# Initialization
include("LIB_http.php");
include("LIB_parse.php");
$product_array=array();
$product_count=0;
# Download the target (practice store) web page
$target = "http://www.schrenk.com/webbots/example_store";
$web_page = http_get($target, "");
Listing 7-1: Initializing the price-monitoring webbot
After initialization, the script downloads the target web page with the http_get() function described in Chapter 3.
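Before parsing, it can be worth confirming that the download actually returned a page. The sketch below is one hedged way to do that. It assumes the download result is an array with a 'FILE' element (the page body, as used in the listings here) plus 'STATUS' and 'ERROR' elements; the latter two, and the helper name download_succeeded(), are assumptions for illustration and should be checked against the library described in Chapter 3.

```php
<?php
// Hypothetical helper, not part of LIB_http.
// Assumed result shape: 'FILE' (page body), 'STATUS' (transfer details,
// including 'http_code'), and 'ERROR' (error text, empty on success).
function download_succeeded(array $web_page): bool
{
    $has_body  = isset($web_page['FILE'])
              && is_string($web_page['FILE'])
              && $web_page['FILE'] !== '';
    $no_error  = empty($web_page['ERROR']);
    $http_code = $web_page['STATUS']['http_code'] ?? 0;   // 0 if status is missing

    return $has_body && $no_error && $http_code >= 200 && $http_code < 300;
}

// Stand-in results, so the sketch runs without network access.
$good = array(
    'FILE'   => '<html><table></table></html>',
    'STATUS' => array('http_code' => 200),
    'ERROR'  => '',
);
$bad = array(
    'FILE'   => '',
    'STATUS' => array('http_code' => 404),
    'ERROR'  => '',
);

$good_ok = download_succeeded($good);
$bad_ok  = download_succeeded($bad);
```

A webbot could skip parsing, or retry later, whenever such a check fails, so that an empty or error page is not mistaken for a store with no products.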
After downloading the web page, the script parses all the page's tables into an array, as shown in Listing 7-2.
# Parse all the tables on the web page into an array
$table_array = parse_array($web_page['FILE'], "