Webbots, Spiders, and Screen Scrapers - Michael Schrenk [105]
Table column headings may also be used as landmarks to identify data in tables. For example, assume you have a table like Table 25-1, which presents statistics for three baseball players.
Table 25-1. Use Table Headers to Identify Data Within Columns
Player    Team          Hits    Home Runs    Average
Zoe       Marsupials    78      15           .327
Cullen    Wombats       56      16           .331
Kade      Wombats       58      17           .324
In this example, you could parse all the tables from the web page and isolate the one containing the landmark Player Statistics. Within that table, your webbot could then use the column names as secondary landmarks to identify players and their statistics.
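The two-step lookup described above can be sketched in plain PHP. This is only an illustration under stated assumptions: the sample page, the caption holding the Player Statistics landmark, and the helpers row_cells() and table_by_landmark() are invented here and are not part of the book's libraries. The regular expressions also assume simple, non-nested tables.

```php
<?php
// Hypothetical sample page; the caption text is the primary landmark.
$html = <<<HTML
<table><tr><td>Navigation</td></tr></table>
<table>
<caption>Player Statistics</caption>
<tr><th>Player</th><th>Team</th><th>Hits</th><th>Home Runs</th><th>Average</th></tr>
<tr><td>Zoe</td><td>Marsupials</td><td>78</td><td>15</td><td>.327</td></tr>
<tr><td>Cullen</td><td>Wombats</td><td>56</td><td>16</td><td>.331</td></tr>
<tr><td>Kade</td><td>Wombats</td><td>58</td><td>17</td><td>.324</td></tr>
</table>
HTML;

// Return the trimmed text of every <th>/<td> cell in one row (hypothetical helper).
function row_cells(string $row_html): array
{
    preg_match_all('/<t[hd]\b[^>]*>(.*?)<\/t[hd]>/is', $row_html, $m);
    return array_map(function ($cell) {
        return trim(strip_tags($cell));
    }, $m[1]);
}

// Find the first table containing the landmark, then key each data row
// by the column headings taken from the table's first row.
function table_by_landmark(string $html, string $landmark): array
{
    preg_match_all('/<table\b[^>]*>.*?<\/table>/is', $html, $tables);
    foreach ($tables[0] as $table) {
        if (stripos($table, $landmark) === false) {
            continue;                       // Primary landmark not in this table
        }
        preg_match_all('/<tr\b[^>]*>.*?<\/tr>/is', $table, $rows);
        $rows = $rows[0];
        if (count($rows) === 0) {
            return [];
        }
        $headings = row_cells(array_shift($rows));  // Secondary landmarks
        $records = [];
        foreach ($rows as $row) {
            $cells = row_cells($row);
            if (count($cells) === count($headings)) {
                $records[] = array_combine($headings, $cells);
            }
        }
        return $records;
    }
    return [];                              // Landmark not found on the page
}

$players = table_by_landmark($html, "Player Statistics");
foreach ($players as $p) {
    echo $p["Player"] . ": " . $p["Home Runs"] . " home runs\n";
}
```

Because each value is looked up by its heading rather than by its position, the sketch should keep working if the site reorders the columns or adds a new one.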
Look for Landmarks That Are Least Likely to Change
You achieve additional fault tolerance when you choose landmarks that are least likely to change. In my experience, the parts of a web page that change least often are those tied to server applications or back-end code. In most cases, the names of form elements and the values of hidden form fields seldom change. For example, in Listing 25-8 it's very easy to find the names and breeds of dogs because the form handler needs to receive them in a well-defined manner. Webbot developers generally don't look for data values in forms because those values aren't visible in rendered HTML. However, if you're lucky enough to find the data values you're looking for within a form definition, that's where you should get them, even if they also appear in other, visible places on the website.
Listing 25-8: Finding data values in form variables
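The idea of reading data from form definitions can be illustrated with a minimal sketch. The form below, its field names (dog_name, breed), its values, and the helper hidden_fields() are invented for this example and are not taken from the book's listing. The sketch also assumes that attribute values are double-quoted.

```php
<?php
// Hypothetical form; field names and values are made up for illustration.
$html = <<<HTML
<form method="POST" action="dog_handler.php">
  <input type="hidden" name="dog_name" value="Rex">
  <input type="hidden" name="breed" value="Border Collie">
  <input type="submit" value="Adopt">
</form>
HTML;

// Return name => value for every hidden input in the markup (hypothetical helper).
function hidden_fields(string $html): array
{
    $fields = [];
    preg_match_all('/<input\b[^>]*>/i', $html, $inputs);
    foreach ($inputs[0] as $tag) {
        // Skip anything that is not type="hidden"
        if (!preg_match('/\btype\s*=\s*["\']?hidden\b/i', $tag)) {
            continue;
        }
        // Keep only inputs that carry both a name and a value
        if (preg_match('/\bname\s*=\s*"([^"]*)"/i', $tag, $n) &&
            preg_match('/\bvalue\s*=\s*"([^"]*)"/i', $tag, $v)) {
            $fields[$n[1]] = html_entity_decode($v[1]);
        }
    }
    return $fields;
}

$dog = hidden_fields($html);
echo $dog["dog_name"] . " is a " . $dog["breed"] . "\n";
```

The field names act as the landmarks here, so cosmetic changes to the visible page should not affect the lookup.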
Similarly, you should avoid landmarks that are subject to frequent change, such as dynamically generated content, HTML comments (which Macromedia Dreamweaver and other page-generation programs automatically insert into HTML pages), and information derived from the time or calendar.
Adapting to Changes in Forms
Form tolerance is your webbot's ability to verify that it is sending the correct form information to the correct form handler. When your webbot detects that a form has changed, it is usually best to terminate the webbot rather than try to adapt to the changes on the fly. Form emulation is complicated, and it's too easy to make embarrassing mistakes, like submitting nonexistent forms. You should also use the form diagnostic page on the book's website (described in Chapter 5) to analyze forms before writing form emulation scripts.
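One way to detect such a change is to compare the live form's method, action, and field names against the values the webbot was written for, and to stop if anything differs. The following is a minimal sketch under stated assumptions: it uses plain PHP rather than the book's libraries, and the sample form, the $expected array, and the helpers tag_attribute() and form_changes() are hypothetical. It also assumes double-quoted attributes and a single form on the page.

```php
<?php
// What the webbot was written against (hypothetical values).
$expected = [
    "method" => "post",
    "action" => "form_handler.php",
    "fields" => ["session_id", "username", "password"],
];

// Hypothetical copy of the page as just downloaded.
$html = <<<HTML
<form method="POST" action="form_handler.php">
  <input type="hidden" name="session_id" value="0123456789">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
HTML;

// Read one double-quoted attribute from a tag; null if absent (hypothetical helper).
function tag_attribute(string $tag, string $attribute): ?string
{
    $pattern = '/\b' . preg_quote($attribute, '/') . '\s*=\s*"([^"]*)"/i';
    return preg_match($pattern, $tag, $m) ? $m[1] : null;
}

// List the differences between the live form and the expected one.
function form_changes(string $html, array $expected): array
{
    if (!preg_match('/<form\b[^>]*>(.*?)<\/form>/is', $html, $form)) {
        return ["form not found"];
    }
    $changes = [];
    $open_tag = substr($form[0], 0, strpos($form[0], ">") + 1);
    if (strtolower((string) tag_attribute($open_tag, "method")) !== $expected["method"]) {
        $changes[] = "method changed";
    }
    if (tag_attribute($open_tag, "action") !== $expected["action"]) {
        $changes[] = "action changed";
    }
    // Collect the names of the data pairs the form would submit
    preg_match_all('/<(?:input|select|textarea)\b[^>]*>/i', $form[1], $tags);
    $names = [];
    foreach ($tags[0] as $tag) {
        $name = tag_attribute($tag, "name");
        if ($name !== null) {
            $names[] = $name;
        }
    }
    foreach (array_diff($expected["fields"], $names) as $missing) {
        $changes[] = "missing field: $missing";
    }
    return $changes;
}

$changes = form_changes($html, $expected);
if ($changes) {
    // Terminate instead of trying to adapt on the fly
    echo "Form changed: " . implode("; ", $changes) . "\n";
    exit(1);
}
echo "Form matches expectations\n";
```

The sketch only reports fields that have disappeared. Whether newly added fields should also stop the webbot is a design choice that depends on how strict the target's form handler is.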
Before emulating a form, a webbot should verify that the form variables it plans to submit are still in use in the target form. This check should cover the names of the data pairs submitted to the form handler as well as the form's method and action. Listing 25-9 parses this information from a test page on the book's website. You can use similar scripts to isolate individual form elements, which can then be compared to the variables in your form emulation scripts.
# Import libraries
include("LIB_http.php");
include("LIB_parse.php");
include("LIB_resolve_addresses.php");
# Identify location of form and page base address
$page_base = "http://www.schrenk.com/nostarch/webbots/";
$target = "http://www.schrenk.com/nostarch/webbots/easy_form.php";
$web_page = http_get($target, "");
# Find the forms in the web page
$form_array = parse_array($web_page['FILE'], $open_tag="