Webbots, Spiders, and Screen Scrapers - Michael Schrenk [73]
Figure 17-1. Target website, which returns information about a ZIP code
The sole purpose of the web page in Figure 17-1 is to be a target for your webbots. (A link to this page is available at this book's website.) This target web page uses a standard form to capture a ZIP code. Once you submit that form, the web page displays a table below the form containing a variety of information about the ZIP code you entered.
Defining the Interface
This example function uses the interface shown in Listing 17-2, where a function named decode_zipcode() accepts a five-digit ZIP code as an input parameter and returns an array that describes the area serviced by the ZIP code.
array $zipcode_array = decode_zipcode(int $zipcode);
input:
$zipcode is a five-digit USPS ZIP code
output:
$zipcode_array['CITY']
$zipcode_array['COUNTY']
$zipcode_array['STATE']
$zipcode_array['LATITUDE']
$zipcode_array['LONGITUDE']
Listing 17-2: decode_zipcode() interface
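One detail the interface leaves open is that it declares $zipcode as an int, while some real ZIP codes begin with a zero (02134, for example) and lose that digit when stored as an integer. The following is a minimal sketch of how a caller might restore the five-character form before the value is placed in a form field. The helper name normalize_zipcode() is hypothetical and is not part of the book's script.

```php
<?php
// Hypothetical helper (not part of the book's script): converts an integer
// or string ZIP code into the five-character form a web form expects.
function normalize_zipcode($zipcode): string
{
    // Work on the string form and strip surrounding whitespace.
    $digits = trim((string) $zipcode);

    // Reject anything that is not one to five decimal digits.
    if (!preg_match('/^\d{1,5}$/', $digits)) {
        throw new InvalidArgumentException("Not a five-digit ZIP code: $digits");
    }

    // Restore leading zeros lost when the ZIP code was held as an integer.
    return str_pad($digits, 5, '0', STR_PAD_LEFT);
}
```

With this sketch, normalize_zipcode(2134) yields the string '02134', while a value such as '123456' raises an exception instead of being silently truncated.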
Analyzing the Target Web Page
Since this webbot needs to submit a ZIP code to a form, you will need to use the techniques you learned in Chapter 5 to emulate someone manually submitting the form. As you learned, you should always pass even simple forms through a form analyzer (similar to the one used in Chapter 5) to ensure that you will submit the form in the manner the server expects. This is important because web pages commonly insert dynamic fields or values into forms that can be hard to detect by just looking at a page.
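To illustrate why fields can be hard to spot by eye, here is a rough sketch that lists every input field, hidden ones included, found in a page's HTML. It is not the form analyzer from Chapter 5 and does not replace it. The helper name list_input_fields() is hypothetical, and the regular expressions assume quoted attribute values, so treat the output as a first look and not a definitive analysis.

```php
<?php
// Hypothetical helper: returns name => [type, value] for every <input> tag
// in a chunk of HTML. Regex-based, so it only handles quoted attributes.
function list_input_fields(string $html): array
{
    $fields = [];

    // Collect each <input ...> tag as raw text.
    preg_match_all('/<input\b[^>]*>/i', $html, $tags);

    foreach ($tags[0] as $tag) {
        // Pull out attribute="value" (or attribute='value') pairs.
        $attrs = [];
        preg_match_all('/([\w-]+)\s*=\s*(["\'])(.*?)\2/s', $tag, $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $attrs[strtolower($pair[1])] = html_entity_decode($pair[3]);
        }

        // Only named fields are sent to the server when a form is submitted.
        if (isset($attrs['name'])) {
            $fields[$attrs['name']] = [
                'type'  => strtolower($attrs['type'] ?? 'text'),
                'value' => $attrs['value'] ?? '',
            ];
        }
    }
    return $fields;
}
```

Any entry whose type is hidden never appears in the rendered page, which is exactly the kind of field that is easy to miss without inspecting the source.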
To use the form analyzer, simply load the web page into a browser and view the source code, as shown in Figure 17-2.
Figure 17-2. Displaying the form's source code
Once you have the target's source code, save the HTML to your hard drive, as done in Figure 17-3.
Figure 17-3. Saving the form's source code
Once the form's HTML is on your hard drive, you must edit it to make the form submit its content to the form analyzer instead of the target server. You do this by changing the form's action attribute to the location of the form analyzer, as shown in Figure 17-4.
Figure 17-4. Changing the form's action attribute to the form analyzer
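The same edit can be made with a short script instead of a text editor. The sketch below rewrites the action attribute of any form in a saved HTML string. The helper name retarget_form_action() and the analyzer address used in the usage comment are placeholders, and the pattern assumes the action attribute is present and quoted.

```php
<?php
// Hypothetical helper: points every <form> in saved HTML at a new action URL.
// Assumes each form has a quoted action attribute.
function retarget_form_action(string $html, string $analyzer_url): string
{
    // Group 1: everything up to the opening quote; group 2: the quote itself;
    // group 3: the original action value, which is discarded.
    $pattern = '/(<form\b[^>]*?\baction\s*=\s*)(["\'])(.*?)\2/is';

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($analyzer_url): string {
            // Keep the original quote style and escape the new URL for HTML.
            return $m[1] . $m[2] . htmlspecialchars($analyzer_url, ENT_QUOTES) . $m[2];
        },
        $html
    );
}

// Possible usage (file name and analyzer address are placeholders):
// $html = file_get_contents('saved_form.html');
// file_put_contents('saved_form.html',
//     retarget_form_action($html, 'http://localhost/form_analyzer.php'));
```

Only the action value changes; the method and all form fields stay as the target server defined them, which is what the analyzer needs to see.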
Now you have a copy of the target form on your hard drive, with the form's original action attribute replaced with the web address of the form analyzer. The final step is to load this local copy of the form into a browser, manually fill in the form, and submit it to the analyzer. Once submitted, you should see the analysis performed by the form analyzer, as shown in Figure 17-5.
Figure 17-5. Analyzing the target form
The analysis tells us that the method is POST and that there are three required data fields. In addition to the zipcode field, there is also a hidden session field (which looks suspiciously like a Unix timestamp) and a Submit field, which is actually the name of the Submit button. To emulate the form submission, it is vitally important to correctly use all the field names (with appropriate values) as well as the same method used by the original form.
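As a rough sketch of what emulating this submission involves, the code below assembles the three fields and wraps them in options for PHP's built-in HTTP stream wrapper. This swaps in a standard PHP facility and is not the library the book's script uses. The field names come from the analysis above; the value 'Submit' for the Submit field, both function names, and the target address in the usage comment are assumptions.

```php
<?php
// Field names come from the form analysis; the Submit value is an assumption.
function build_zipcode_form_data(string $zipcode, string $session): array
{
    return [
        'session' => $session,   // hidden field, assigned by the server
        'zipcode' => $zipcode,   // the value a person would type
        'Submit'  => 'Submit',   // name of the Submit button
    ];
}

// Hypothetical helper: wraps the fields in stream-context options for a POST.
function build_post_context_options(array $fields, string $agent): array
{
    // URL-encode the fields the way a browser encodes a standard form.
    $body = http_build_query($fields);

    return [
        'http' => [
            'method'     => 'POST',
            'header'     => "Content-Type: application/x-www-form-urlencoded\r\n"
                          . 'Content-Length: ' . strlen($body) . "\r\n",
            'user_agent' => $agent,
            'content'    => $body,
        ],
    ];
}

// Possible usage ($target_url and $session are placeholders):
// $options = build_post_context_options(
//     build_zipcode_form_data('89109', $session), 'Test Webbot');
// $page = file_get_contents($target_url, false, stream_context_create($options));
```

Keeping the field list in one function makes it easy to compare against the analyzer's report if the target form later gains or renames a field.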
Once you write your webbot, it's a good idea to test it by using the form analyzer as a target to ensure that the webbot submits the form as the target webserver expects it to. This is also a good time to verify the agent name your webbot uses.
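If the form analyzer is not at hand, a small stand-in target can report the same essentials. The sketch below is not the book's form analyzer; summarize_request() is a hypothetical function that reports the request method, the agent name, and each submitted field so that a webbot's output can be compared with a browser's.

```php
<?php
// Hypothetical stand-in for a form analyzer: summarizes what a client sent.
function summarize_request(array $server, array $post): string
{
    $lines = [];
    $lines[] = 'Method: ' . ($server['REQUEST_METHOD'] ?? 'unknown');
    $lines[] = 'Agent: ' . ($server['HTTP_USER_AGENT'] ?? '(none sent)');

    // List each submitted field; nested values are only flagged, not expanded.
    foreach ($post as $name => $value) {
        $shown = is_array($value) ? '(array)' : $value;
        $lines[] = "Field: $name = $shown";
    }
    return implode("\n", $lines);
}

// When served by a web server, the script could end with:
// header('Content-Type: text/plain');
// echo summarize_request($_SERVER, $_POST);
```

Submitting the form once from a browser and once from the webbot, then comparing the two reports, shows quickly whether a field, the method, or the agent name differs.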
Using describe_zipcode()
The script that interfaces the target web page to a PHP function, called describe_zipcode(), is available in its entirety at this book's website. It is broken into smaller pieces and annotated here for clarity.
Getting the Session Value
It is not uncommon to find dynamically assigned values, like the session value employed by this target, in forms. Since