Webbots, Spiders, and Screen Scrapers - Michael Schrenk [74]
# Start interface describe_zipcode($zipcode)
function describe_zipcode($zipcode)
{
# Get required libraries and declare the target
include("LIB_http.php");
include("LIB_parse.php");
$target = "http://www.schrenk.com/nostarch/webbots/zip_code_form.php";
# Download the target
$page = http_get($target, $ref="");
# Parse the session hidden tag from the downloaded page
#
$session_tag = return_between($string = $page['FILE'] ,
$start = "\"session\"",
$end = ">",
$type = EXCL
);
# Remove the "'s and "value=" text to reveal the session value
$session_value = str_replace("\"", "", $session_tag);
$session_value = str_replace("value=", "", $session_value);
Listing 17-3: Downloading the target to get the session variable
The script in Listing 17-3 is a classic screen scraper. It downloads the page and parses the session value out of the form's hidden session tag. The str_replace() function is then used to strip the superfluous quotes and the value= text from the tag, leaving only the session value. Notice that the webbot uses LIB_parse and LIB_http, described in previous chapters, to download and parse the web page.[58]
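To see the parsing step from Listing 17-3 in isolation, the following minimal sketch runs the same extraction against a made-up fragment of HTML. It does not use LIB_parse. The function sketch_return_between() is a hypothetical, simplified stand-in for return_between() in exclusive mode, and sketch_session_value(), the sample markup, and the value abc123 are likewise assumptions made for illustration. The real form's markup may differ.

```php
<?php
// Hypothetical stand-in for LIB_parse's return_between() with EXCL:
// returns the text between $start and $end, excluding both delimiters.
// Returns an empty string when either delimiter is missing.
function sketch_return_between(string $string, string $start, string $end): string
{
    $start_pos = strpos($string, $start);
    if ($start_pos === false) {
        return "";
    }
    $start_pos += strlen($start);

    $end_pos = strpos($string, $end, $start_pos);
    if ($end_pos === false) {
        return "";
    }
    return substr($string, $start_pos, $end_pos - $start_pos);
}

// Mirrors the steps of Listing 17-3: isolate the rest of the session tag,
// then strip the quotes and the "value=" text.
// The trim() call is an addition of this sketch; it removes the space
// left over between name="session" and value=.
function sketch_session_value(string $html): string
{
    $session_tag   = sketch_return_between($html, "\"session\"", ">");
    $session_value = str_replace("\"", "", $session_tag);
    $session_value = str_replace("value=", "", $session_value);
    return trim($session_value);
}

// Made-up sample markup (an assumption, not the real target page).
$sample_html = '<form method="POST">'
             . '<input type="hidden" name="session" value="abc123">'
             . '</form>';

echo sketch_session_value($sample_html), "\n"; // prints abc123
```

Because the parse is anchored on the literal text "session" and the next >, it depends on the attribute order in the tag. If the target site changes its markup, this sort of scraper is likely to need adjusting.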
Submitting the Form
Once you know the session value, the script in Listing 17-4 may be used to submit the form. Notice the use of http_post_form() to emulate the submission of a form with the POST method. The form fields are conveniently passed to the target webserver in $data_array[].
$data_array['session'] = $session_value;
$data_array['zipcode'] = $zipcode;
$data_array['Submit'] = "Submit";
$form_result = http_post_form($target, $ref=$target, $data_array);
Listing 17-4: Emulating the form
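It can help to see what a form emulation like the one in Listing 17-4 puts on the wire. The sketch below builds the URL-encoded body that a browser would send for these three fields, using PHP's built-in http_build_query(). It is not a description of how http_post_form() is implemented, and the session value and ZIP code are hypothetical placeholders.

```php
<?php
// Hypothetical placeholders for the scraped session value and the
// ZIP code passed to describe_zipcode().
$session_value = "abc123";
$zipcode       = "89109";

// The same three fields that Listing 17-4 places in $data_array.
$data_array = [];
$data_array['session'] = $session_value;
$data_array['zipcode'] = $zipcode;
$data_array['Submit']  = "Submit";

// Encode the fields as an application/x-www-form-urlencoded body,
// the encoding a browser uses for an ordinary POST form.
// The separator is passed explicitly so the result does not depend
// on the arg_separator.output ini setting.
$post_body = http_build_query($data_array, "", "&");

echo $post_body, "\n"; // prints session=abc123&zipcode=89109&Submit=Submit
```

Note that the Submit button is sent as an ordinary name/value pair alongside the other fields. A webbot that leaves out a field the real form sends may be treated differently by the server, which is why the emulation includes it.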
Parsing and Returning the Result
The remaining step is to parse the desired city, county, state, and geo coordinates from the web page obtained from the form submission in the previous listing. The script that does this is shown in Listing 17-5.
$landmark = "Information about ".$zipcode;
$table_array = parse_array($form_result['FILE'], "