Webbots, Spiders, and Screen Scrapers - Michael Schrenk [39]

the product table and is not likely to exist elsewhere if the web page is updated. The script looks at each table until it finds the one that contains the landmark text, Products for Sale, as shown in Listing 7-3.

# Look for the table that contains the product information
for($xx=0; $xx<count($table_array); $xx++)
    {
    $table_landmark = "Products For Sale";
    if(stristr($table_array[$xx], $table_landmark))    // Process this table
        {
        echo "FOUND: Product table\n";

Listing 7-3: Examining each table for the existence of the landmark text
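The landmark search can also be tried in isolation. The following is a minimal sketch, not the book's code: the helper name find_table_by_landmark() and the sample tables are hypothetical, and only PHP built-ins are used.

```php
<?php
// Hypothetical helper (not part of the book's library): returns the index of
// the first table whose markup contains the landmark text, or -1 if none does.
function find_table_by_landmark(array $tables, string $landmark): int
{
    foreach ($tables as $index => $table_html) {
        // stripos() is case-insensitive, like the stristr() test in Listing 7-3
        if (stripos($table_html, $landmark) !== false) {
            return $index;
        }
    }
    return -1;
}

// Made-up sample data standing in for $table_array
$sample_tables = [
    "<table><tr><td>Site navigation</td></tr></table>",
    "<table><tr><td>Products For Sale</td></tr></table>",
];

echo find_table_by_landmark($sample_tables, "products for sale"), "\n"; // prints 1
```

Returning -1 when no table matches gives the caller an explicit way to notice that the landmark has disappeared, for example after a page redesign.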

Once the table containing the product pricing data is found, that table is parsed into an array of table rows, as shown in Listing 7-4.

# Parse table into an array of table rows
$product_row_array = parse_array($table_array[$xx], "<tr", "</tr>");

Listing 7-4: Parsing the table into an array of table rows
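To experiment with this step without the parsing library, a rough stand-in can be built on a regular expression. This is only a sketch under stated assumptions: extract_elements() is a hypothetical name, it is not how parse_array() is implemented, and it does not handle nested elements of the same type.

```php
<?php
// Hypothetical stand-in for the row-splitting step; NOT the library's
// parse_array(). Returns every <tag ...>...</tag> element found in $html.
function extract_elements(string $html, string $tag): array
{
    $quoted  = preg_quote($tag, '#');
    // Lazy match from the opening tag to the nearest closing tag,
    // case-insensitive (i) and spanning newlines (s)
    $pattern = '#<' . $quoted . '\b[^>]*>.*?</' . $quoted . '>#is';
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

// Made-up sample table
$sample_table = "<table><tr><td>ID#</td><td>Price</td></tr>"
              . "<tr><td>101</td><td>9.99</td></tr></table>";

$rows = extract_elements($sample_table, "tr");
echo count($rows), "\n"; // prints 2
echo $rows[1], "\n";     // prints <tr><td>101</td><td>9.99</td></tr>
```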

Once the array of table rows is available, the script looks for the product table's heading row. The heading row is useful for two reasons: it tells the webbot where the data begins within the table, and it provides the column positions for the desired data. This matters because the order of the data columns could change in the future (as part of a web page update, for example). If the webbot identifies data by column name, it will still parse the data correctly after the columns are reordered, as long as the column names remain the same.

Here again, the script relies on a landmark to find the table heading row. This time, the landmark is the word Condition, as shown in Listing 7-5. Once the landmark identifies the table heading, the positions of the desired table columns are recorded for later use.

for($table_row=0; $table_row<count($product_row_array); $table_row++)
    {
    # Detect the beginning of the desired data (heading row)
    $heading_landmark = "Condition";
    if((stristr($product_row_array[$table_row], $heading_landmark)))
        {
        echo "FOUND: Table heading row\n";

        # Get the position of the desired headings
        $table_cell_array = parse_array($product_row_array[$table_row], "<td", "</td>");
        for($heading_cell=0; $heading_cell<count($table_cell_array); $heading_cell++)
            {
            if(stristr(strip_tags(trim($table_cell_array[$heading_cell])), "ID#"))
                $id_column=$heading_cell;
            if(stristr(strip_tags(trim($table_cell_array[$heading_cell])), "Product name"))
                $name_column=$heading_cell;
            if(stristr(strip_tags(trim($table_cell_array[$heading_cell])), "Price"))
                $price_column=$heading_cell;
            }
        echo "FOUND: id_column=$id_column\n";
        echo "FOUND: price_column=$price_column\n";
        echo "FOUND: name_column=$name_column\n";

        # Save the heading row for later use
        $heading_row = $table_row;
        }

Listing 7-5: Detecting the table heading and recording the positions of desired columns
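The benefit of locating columns by name can be seen in a small standalone sketch. The helper locate_columns(), the array keys, and the sample heading cells below are all hypothetical; the sample deliberately lists Price first to show that column order does not matter.

```php
<?php
// Hypothetical helper: maps each wanted heading label to its column position.
// Headings that are not found are simply absent from the result.
function locate_columns(array $heading_cells, array $wanted): array
{
    $positions = [];
    foreach ($heading_cells as $index => $cell_html) {
        // Remove markup and surrounding whitespace before comparing
        $text = trim(strip_tags($cell_html));
        foreach ($wanted as $key => $label) {
            if (stripos($text, $label) !== false) {
                $positions[$key] = $index;
            }
        }
    }
    return $positions;
}

// Made-up heading row, already split into cells
$heading_cells = [
    "<td><b>Price</b></td>",
    "<td><b>ID#</b></td>",
    "<td><b>Product name</b></td>",
    "<td><b>Condition</b></td>",
];
$wanted  = ['id' => "ID#", 'name' => "Product name", 'price' => "Price"];
$columns = locate_columns($heading_cells, $wanted);

echo "id=", $columns['id'], " name=", $columns['name'], " price=", $columns['price'], "\n";
// prints id=1 name=2 price=0
```

A caller can test isset($columns['price']) before parsing, which makes a renamed or missing heading visible instead of silently reading the wrong column.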

As the script loops through the rows of the table containing the desired data, it must also identify where the pricing data ends. Again it relies on a landmark: the word Calculate, from the form's submit button, marks the end of the data. Once the script finds this landmark, it breaks out of the loop, as shown in Listing 7-6.

    # Detect the end of the desired data table
    $ending_landmark = "Calculate";
    if((stristr($product_row_array[$table_row], $ending_landmark)))
        {
        echo "PARSING COMPLETE!\n";
        break;
        }

Listing 7-6: Detecting the end of the table
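The way the two landmarks bracket the data can be sketched on its own. This is a simplified illustration rather than the book's loop: rows_between_landmarks() and the sample rows are hypothetical, and unlike the listings it only looks for the ending landmark after the heading row has been seen.

```php
<?php
// Hypothetical helper: returns the rows that sit between the row containing
// the heading landmark and the row containing the ending landmark.
function rows_between_landmarks(array $rows, string $heading_landmark, string $ending_landmark): array
{
    $data_rows     = [];
    $heading_found = false;
    foreach ($rows as $row_html) {
        if (!$heading_found) {
            // Skip everything up to and including the heading row
            $heading_found = stripos($row_html, $heading_landmark) !== false;
            continue;
        }
        if (stripos($row_html, $ending_landmark) !== false) {
            break; // End of the desired data
        }
        $data_rows[] = $row_html;
    }
    return $data_rows;
}

// Made-up rows standing in for $product_row_array
$sample_rows = [
    "<tr><td>Welcome</td></tr>",
    "<tr><td>ID#</td><td>Condition</td></tr>",
    "<tr><td>101</td><td>New</td></tr>",
    "<tr><td>102</td><td>Used</td></tr>",
    "<tr><td><input type=\"submit\" value=\"Calculate\"></td></tr>",
];

$data = rows_between_landmarks($sample_rows, "Condition", "Calculate");
echo count($data), "\n"; // prints 2
```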

If the script finds the headers but doesn't find the end of the table, it assumes that the rest of the table rows contain data. It parses these table rows, using the column position data gleaned earlier, as shown in Listing 7-7.

    # Parse product and price data
    if(isset($heading_row) && $heading_row<$table_row)
        {
        $table_cell_array = parse_array($product_row_array[$table_row], "<td", "</td>");
        $product_array[$product_count]['ID'] =
            strip_tags(trim($table_cell_array[$id_column]));
        $product_array[$product_count]['NAME'] =
            strip_tags(trim($table_cell_array[$name_column]));
        $product_array[$product_count]['PRICE'] =
            strip_tags(trim($table_cell_array[$price_column]));
        $product_count++;
