Webbots, Spiders, and Screen Scrapers - Michael Schrenk [73]

uses a web page that decodes ZIP codes and converts that operation into a function, which is available from a PHP program. This particular web page finds the city, county, state, and geo coordinates for the post office located in a specific ZIP code. Theoretically, you could use this function to validate ZIP codes or use the latitude and longitude information to plot locations on a map. Figure 17-1 shows the target website for this project.

Figure 17-1. Target website, which returns information about a ZIP code

The sole purpose of the web page in Figure 17-1 is to be a target for your webbots. (A link to this page is available at this book's website.) This target web page uses a standard form to capture a ZIP code. Once you submit that form, the web page returns a variety of information about the ZIP code you entered in a table below the form.

Defining the Interface

This example function uses the interface shown in Listing 17-2, where a function named decode_zipcode() accepts a five-digit ZIP code as an input parameter and returns an array that describes the area serviced by the ZIP code.

array $zipcode_array = decode_zipcode(int $zipcode);

input:
    $zipcode is a five-digit USPS ZIP code

output:
    $zipcode_array['CITY']
    $zipcode_array['COUNTY']
    $zipcode_array['STATE']
    $zipcode_array['LATITUDE']
    $zipcode_array['LONGITUDE']

Listing 17-2: decode_zipcode() interface
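One detail of this interface deserves care: it types $zipcode as an integer, and an integer cannot hold a leading zero, so a ZIP code such as 02134 arrives as 2134. The following is a minimal sketch of a pre-check a caller might run before invoking decode_zipcode(). The helper normalize_zipcode() is hypothetical and is not part of the book's library; it checks format only and does not confirm that the ZIP code is actually in use.

```php
<?php
// Hypothetical helper (not from the book's library): return the ZIP code as a
// five-character string, or null if it cannot be a five-digit ZIP code.
function normalize_zipcode($zipcode): ?string
{
    // Integers lose leading zeros (02134 becomes 2134), so pad them back.
    if (is_int($zipcode)) {
        if ($zipcode < 0 || $zipcode > 99999) {
            return null;
        }
        return str_pad((string) $zipcode, 5, '0', STR_PAD_LEFT);
    }

    // Anything that is neither an integer nor a string is rejected outright.
    if (!is_string($zipcode)) {
        return null;
    }

    // Strings must be exactly five digits once surrounding whitespace is removed.
    $zipcode = trim($zipcode);
    return preg_match('/^\d{5}$/', $zipcode) === 1 ? $zipcode : null;
}
```

Passing the normalized string on to the webbot keeps the value identical to what a person would type into the form.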

Analyzing the Target Web Page

Since this webbot needs to submit a ZIP code to a form, you will need to use the techniques you learned in Chapter 5 to emulate someone manually submitting the form. As you learned, you should always pass even simple forms through a form analyzer (similar to the one used in Chapter 5) to ensure that you will submit the form in the manner the server expects. This is important because web pages commonly insert dynamic fields or values into forms that can be hard to detect by just looking at a page.

To use the form analyzer, simply load the web page into a browser and view the source code, as shown in Figure 17-2.

Figure 17-2. Displaying the form's source code

Once you have the target's source code, save the HTML to your hard drive, as shown in Figure 17-3.

Figure 17-3. Saving the form's source code

Once the form's HTML is on your hard drive, you must edit it to make the form submit its content to the form analyzer instead of the target server. You do this by changing the form's action attribute to the location of the form analyzer, as shown in Figure 17-4.

Figure 17-4. Changing the form's action attribute to the form analyzer
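If you would rather not edit the saved file by hand, the same change can be scripted. This is a rough sketch, assuming the saved page contains a single, simply written form with a quoted action attribute. The function retarget_form_action(), the file names, and the analyzer address are all placeholders invented for illustration; a regular expression like this one is not a general HTML parser.

```php
<?php
// Sketch: rewrite the action attribute of the first <form> tag so the form
// submits to a form analyzer instead of the original target.
function retarget_form_action(string $html, string $analyzer_url): string
{
    // Escape the URL so it is safe inside a double-quoted HTML attribute.
    $safe_url = htmlspecialchars($analyzer_url, ENT_QUOTES);

    $result = preg_replace_callback(
        // Group 1: everything from "<form" up to "action=".
        // Group 2: the opening quote. Group 3: the old action value.
        '/(<form\b[^>]*?\baction\s*=\s*)(["\'])(.*?)\2/is',
        function (array $m) use ($safe_url): string {
            return $m[1] . '"' . $safe_url . '"';
        },
        $html,
        1 // only touch the first form
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

// Example use with placeholder file names and a placeholder analyzer address:
// $html = file_get_contents('saved_form.html');
// $html = retarget_form_action($html, 'http://localhost/form_analyzer.php');
// file_put_contents('saved_form_retargeted.html', $html);
```

A callback is used instead of a replacement string so that characters such as $ in the analyzer address are not misread as backreferences.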

Now you have a copy of the target form on your hard drive, with the form's original action attribute replaced with the web address of the form analyzer. The final step is to load this local copy of the form into a browser, manually fill in the form, and submit it to the analyzer. Once submitted, you should see the analysis performed by the form analyzer, as shown in Figure 17-5.

Figure 17-5. Analyzing the target form

The analysis tells us that the method is POST and that there are three required data fields. In addition to the zipcode field, there is a hidden session field (which looks suspiciously like a Unix timestamp) and a Submit field, which is actually the name of the Submit button. To emulate the form submission, your webbot must use every one of these field names (with appropriate values) and the same method as the original form.
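To make the three fields concrete, here is a minimal sketch that reads the hidden session value out of a downloaded copy of the form and assembles the data a webbot would post. It uses PHP's built-in DOMDocument class rather than the book's own parsing library, and the helper names extract_hidden_value() and build_zipcode_post_fields() are hypothetical. The value sent for the Submit field is an assumption; use whatever the form analyzer reports.

```php
<?php
// Sketch: find the value of a named hidden <input> in a page of HTML.
function extract_hidden_value(string $html, string $field_name): ?string
{
    if ($html === '') {
        return null;
    }
    $doc = new DOMDocument();
    // Suppress the warnings that imperfect real-world HTML tends to trigger.
    @$doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('input') as $input) {
        if (strtolower($input->getAttribute('type')) === 'hidden'
            && $input->getAttribute('name') === $field_name) {
            return $input->getAttribute('value');
        }
    }
    return null;
}

// Sketch: assemble the three fields named in the form analysis.
function build_zipcode_post_fields(string $form_html, string $zipcode): array
{
    $session = extract_hidden_value($form_html, 'session');
    if ($session === null) {
        throw new RuntimeException('Hidden session field not found in form');
    }
    return [
        'zipcode' => $zipcode,
        'session' => $session,   // must be read fresh from the form each time
        'Submit'  => 'Submit',   // assumed value; copy what the analyzer shows
    ];
}
```

Because the session value changes from one page load to the next, the form has to be downloaded again before every submission rather than reusing an old value.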

Once you write your webbot, it's a good idea to test it by using the form analyzer as a target to ensure that the webbot submits the form as the target web server expects it to. This is also a good time to verify the agent name your webbot uses.
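One way to make that test easy is to keep the target address and the agent name as parameters, so the same code can be pointed at the form analyzer first and at the real target afterward. The sketch below uses PHP's cURL extension directly instead of the book's HTTP library. The function names build_post_options() and submit_form(), the agent name, and the addresses in the usage comment are placeholders.

```php
<?php
// Sketch: collect the cURL options for a POST submission in one place so
// they can be inspected before anything is sent.
function build_post_options(string $target_url, array $fields, string $agent_name): array
{
    return [
        CURLOPT_URL            => $target_url,
        CURLOPT_POST           => true,                       // same method as the form
        CURLOPT_POSTFIELDS     => http_build_query($fields),  // URL-encoded body
        CURLOPT_USERAGENT      => $agent_name,                // agent name to verify
        CURLOPT_RETURNTRANSFER => true,                       // return the page as a string
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ];
}

// Sketch: send the form and return the response body.
function submit_form(string $target_url, array $fields, string $agent_name): string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_post_options($target_url, $fields, $agent_name));

    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('Form submission failed: ' . curl_error($ch));
    }
    return $response;
}

// Example use (placeholder addresses): test against the analyzer first.
// $fields = ['zipcode' => '89109', 'session' => '1187125949', 'Submit' => 'Submit'];
// echo submit_form('http://localhost/form_analyzer.php', $fields, 'Test Webbot');
```

When the analyzer's report matches what the browser produced, including the agent name, changing only the first argument switches the webbot over to the real target.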

Using describe_zipcode()

The script that interfaces the target web page to a PHP function, called describe_zipcode(), is available in its entirety at this book's website. It is broken into smaller pieces and annotated here for clarity.

Getting the Session Value

It is not uncommon to find dynamically assigned values, like the session value employed by this target, in forms. Since
