
Webbots, Spiders, and Screen Scrapers - Michael Schrenk [74]

the session is assigned dynamically, the webbot must first make a page request to get the session value before it can submit form values. This actually mimics normal browser use, as the browser first must download the form before submitting it. The webbot captures the session variable with the script described in Listing 17-3.

# Start interface describe_zipcode($zipcode)

function describe_zipcode($zipcode)

{

# Get required libraries and declare the target

include ("LIB_http.php");

include("LIB_parse.php");

$target = "http://www.schrenk.com/nostarch/webbots/zip_code_form.php";

# Download the target

$page = http_get($target, $ref="");

# Parse the session hidden tag from the downloaded page

#

$session_tag = return_between($string = $page['FILE'] ,

$start = "\"session\"",

$end = ">",

$type = EXCL

);

# Remove the "'s and "value=" text to reveal the session value

$session_value = str_replace("\"", "", $session_tag);

$session_value = str_replace("value=", "", $session_value);

Listing 17-3: Downloading the target to get the session variable

The script in Listing 17-3 is a classic screen scraper. It downloads the page and parses the session value from the form's hidden session tag. The str_replace() function then removes the superfluous quotes and the value= text, leaving only the session value. Notice that the webbot uses LIB_http and LIB_parse, described in previous chapters, to download and parse the web page.[58]
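As a self-contained illustration of this parsing step, the sketch below applies the same idea to a hard-coded HTML fragment. It does not use LIB_parse: between_excl() is a hypothetical stand-in for return_between() with the EXCL option, written with plain PHP string functions, and the sample form markup and session value are invented for the example.

```php
<?php
// Hypothetical stand-in for LIB_parse's return_between(..., EXCL):
// returns the text strictly between $start and $end, or "" if either is missing.
function between_excl($string, $start, $end)
{
    $from = stripos($string, $start);
    if ($from === false) {
        return "";
    }
    $from += strlen($start);
    $to = stripos($string, $end, $from);
    if ($to === false) {
        return "";
    }
    return substr($string, $from, $to - $from);
}

// Invented sample markup; the real form may be laid out differently.
$sample_page = '<form method="POST" action="zip_code_form.php">'
             . '<input type="hidden" name="session" value="abc123">'
             . '<input type="text" name="zipcode">'
             . '</form>';

// Everything between "session" and the closing > of the hidden tag
$session_tag = between_excl($sample_page, '"session"', '>');

// Strip the quotes, the value= text, and surrounding whitespace
$session_value = str_replace('"', '', $session_tag);
$session_value = trim(str_replace('value=', '', $session_value));

echo $session_value, "\n";   // prints abc123
```

The trim() call is an addition for the sample markup, where a space separates the name and value attributes; whether the real page needs it depends on its layout.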

Submitting the Form

Once you know the session value, the script in Listing 17-4 may be used to submit the form. Notice the use of http_post_form() to emulate the submission of a form with the POST method. The form fields are conveniently passed to the target webserver in $data_array[].

$data_array['session'] = $session_value;

$data_array['zipcode'] = $zipcode;

$data_array['Submit'] = "Submit";

$form_result = http_post_form($target, $ref=$target, $data_array);

Listing 17-4: Emulating the form
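To make the POST emulation concrete without LIB_http, the following sketch shows roughly what a form submission involves when written with PHP's built-in http_build_query() and cURL functions. post_form() is a hypothetical helper, not the library's http_post_form(), and the field values are placeholders. The helper is defined but not called here, so the example runs without contacting any server.

```php
<?php
// Hypothetical helper (not LIB_http's http_post_form()): submits $fields
// to $target as a URL-encoded POST, sending $referer as the Referer header.
function post_form($target, $referer, array $fields)
{
    $ch = curl_init($target);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
    return curl_exec($ch);                           // false on failure
}

// Placeholder values standing in for the scraped session and the caller's ZIP code
$data_array = array(
    'session' => 'abc123',
    'zipcode' => '89109',
    'Submit'  => 'Submit',
);

// The URL-encoded request body that would be sent to the server
echo http_build_query($data_array), "\n";
// prints session=abc123&zipcode=89109&Submit=Submit
```

Including the Submit field mirrors what a browser sends when the button is clicked, which some form handlers check for.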

Parsing and Returning the Result

The remaining step is to parse the desired city, county, state, and geo coordinates from the web page obtained from the form submission in the previous listing. The script that does this is shown in Listing 17-5.

# Initialize the return array and define the parsing landmark

$ret = array();

$landmark = "Information about ".$zipcode;

$table_array = parse_array($form_result['FILE'], "<table", "</table>");

for($xx=0; $xx<count($table_array); $xx++)

{

# Parse the table containing the parsing landmark

if(stristr($table_array[$xx], $landmark))

{

$ret['CITY'] = return_between($table_array[$xx], "CITY", "</tr>", EXCL);

$ret['CITY'] = strip_tags($ret['CITY']);

$ret['STATE'] = return_between($table_array[$xx], "STATE", "</tr>", EXCL);

$ret['STATE'] = strip_tags($ret['STATE']);

$ret['COUNTY'] = return_between($table_array[$xx], "COUNTY", "</tr>", EXCL);

$ret['COUNTY'] = strip_tags($ret['COUNTY']);

$ret['LATITUDE'] = return_between($table_array[$xx], "LATITUDE", "</tr>", EXCL);

$ret['LATITUDE'] = strip_tags($ret['LATITUDE']);

$ret['LONGITUDE'] = return_between($table_array[$xx], "LONGITUDE", "</tr>", EXCL);

$ret['LONGITUDE'] = strip_tags($ret['LONGITUDE']);

}

}

# Return the parsed data

return $ret;

} # End Interface describe_zipcode($zipcode)

Listing 17-5: Parsing and returning the data

This script first uses parse_array() to create an array containing all the tables in the downloaded web page, which is returned in $form_result['FILE']. The script then looks for the table that contains the parsing landmark, the text "Information about" followed by the requested ZIP code. Once the webbot finds the table that holds the data we're looking for, it parses each value using unique strings that identify the beginning and end of the desired data. The parsed data is then cleaned up with strip_tags(), placed into the array we described earlier, and returned to the calling program.
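The landmark technique can be tried in isolation with the sketch below, which runs against an invented two-table response instead of a live page. It uses preg_match_all() as a stand-in for parse_array() and plain string functions in place of return_between(). The sample HTML, its values, and the choice of the end of the table row as the closing delimiter are all assumptions made for illustration.

```php
<?php
// Invented sample response; the real page layout may differ.
$zipcode = "89109";
$html = '<table><tr><td>Navigation</td></tr></table>'
      . '<table><tr><th colspan="2">Information about 89109</th></tr>'
      . '<tr><td>CITY</td><td>Las Vegas</td></tr>'
      . '<tr><td>STATE</td><td>Nevada</td></tr></table>';

// Collect every <table>...</table> block (a stand-in for parse_array())
preg_match_all('/<table.*?<\/table>/is', $html, $matches);
$table_array = $matches[0];

$landmark = "Information about " . $zipcode;
$ret = array();
foreach ($table_array as $table) {
    // Only the table containing the landmark holds the data
    if (stristr($table, $landmark) === false) {
        continue;
    }
    foreach (array("CITY", "STATE") as $label) {
        // Take the text from just after the label to the end of its row
        $from = stripos($table, $label) + strlen($label);
        $to   = stripos($table, "</tr>", $from);
        $ret[$label] = trim(strip_tags(substr($table, $from, $to - $from)));
    }
}

print_r($ret);
```

The first table is skipped because it lacks the landmark, so $ret ends up holding only the CITY and STATE values from the second table.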

* * *

[57] Traditional methods for executing webbots are described in Chapter 23.

[58] LIB_http and LIB_parse are described in Chapters 3 and 4, respectively.
