Webbots, Spiders, and Screen Scrapers - Michael Schrenk [28]
This simple diagnosis isn't perfect—use it at your own risk. However, it does allow a webbot developer to verify the form method, agent name, and GET and POST variables as they are presented to the actual form handler. For example, in this particular exercise, it is evident that the form handler expects a POST method with the variables sessionid, email, message, status, gender, and vol.
Forms with a session ID point out the importance of downloading and analyzing the form before emulating it. In this typical case, the session ID is assigned by the server and cannot be predicted. The webbot can accurately use session IDs only by first downloading and parsing the web page containing the form.
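The download-then-parse step can be sketched as follows. This is a minimal illustration that uses PHP's built-in DOMDocument class rather than any parsing routine from the book. The helper name get_hidden_fields() and the sample markup are assumptions made for the example; a real webbot would first download the page that contains the form.

```php
<?php
# Collect the name/value pairs of every hidden <input> on a page.
# get_hidden_fields() is a hypothetical helper, not part of LIB_http.
function get_hidden_fields($html)
{
    $fields = array();
    $doc = new DOMDocument();
    # Suppress the warnings that sloppy real-world HTML can trigger
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('input') as $input) {
        if (strtolower($input->getAttribute('type')) == 'hidden') {
            $fields[$input->getAttribute('name')] = $input->getAttribute('value');
        }
    }
    return $fields;
}

# Stand-in for a downloaded page; this markup is an assumption
$page = '<form method="POST" action="form_analyzer.php">
<input type="hidden" name="sessionid" value="sdfg73453845">
<input type="text" name="email">
</form>';

# Copy the server-assigned session ID into the data to be submitted
$hidden = get_hidden_fields($page);
$data_array = array();
$data_array['sessionid'] = $hidden['sessionid'];
```

Because the session ID is read from the page on every run, the webbot never has to guess a value that only the server can assign.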
Figure 5-3. Using a form analyzer
If you were to write a script that emulates the form submitted and analyzed in Figure 5-3, it would look something like Listing 5-9.
include("LIB_http.php");
# Initiate addresses
$action="http://www.schrenk.com/nostarch/webbots/form_analyzer.php";
$ref = "" ;
# Set submission method
$method="POST";
# Set form data and values
$data_array['sessionid'] = "sdfg73453845";
$data_array['email'] = "sales@schrenk.com";
$data_array['message'] = "This is a test message";
$data_array['status'] = "in school";
$data_array['gender'] = "M";
$data_array['vol'] = "on";
$response = http($target=$action, $ref, $method, $data_array, EXCL_HEAD);
Listing 5-9: Using LIB_http to emulate the form analysis in Figure 5-3
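To see what a form handler receives from a submission like this, it can help to look at the URL-encoded body that a POST request carries. The sketch below uses PHP's standard http_build_query() function on the same field values; it illustrates the encoding only and is not a copy of how LIB_http builds its request, which may differ.

```php
<?php
# Same form fields as Listing 5-9
$data_array = array();
$data_array['sessionid'] = "sdfg73453845";
$data_array['email'] = "sales@schrenk.com";
$data_array['message'] = "This is a test message";
$data_array['status'] = "in school";
$data_array['gender'] = "M";
$data_array['vol'] = "on";

# URL-encode each name and value and join the pairs with "&"
$post_body = http_build_query($data_array);
echo $post_body . "\n";
# sessionid=sdfg73453845&email=sales%40schrenk.com&message=This+is+a+test+message&status=in+school&gender=M&vol=on
```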
After you write a form-emulation script, it's a good idea to use the analyzer to verify that the form method and variables match the original form you are attempting to emulate. If you're feeling ambitious, you could improve on this simple form analyzer by designing one that accepts both the submitted and emulated forms and compares them for you.
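One way to start on such a comparison is sketched below. The helper compare_forms() is hypothetical, and it assumes the variables of each submission have already been captured as arrays (for example, from the analyzer's POST output). The sample values are invented for illustration.

```php
<?php
# compare_forms() is a hypothetical helper. It reports how the variables
# of an emulated form differ from those of the original form.
function compare_forms($original, $emulated)
{
    $report = array();
    # Variables the original sends but the emulation omits
    $report['missing'] = array_keys(array_diff_key($original, $emulated));
    # Variables the emulation sends but the original does not
    $report['extra'] = array_keys(array_diff_key($emulated, $original));
    # Variables present in both, but with different values
    $report['changed'] = array();
    foreach (array_intersect_key($original, $emulated) as $name => $value) {
        if ($emulated[$name] !== $value) {
            $report['changed'][] = $name;
        }
    }
    return $report;
}

# Sample data, invented for this example
$original = array('sessionid' => 'sdfg73453845', 'email' => 'sales@schrenk.com', 'vol' => 'on');
$emulated = array('sessionid' => 'sdfg73453845', 'email' => 'Sales@schrenk.com', 'gender' => 'M');

$report = compare_forms($original, $emulated);
print_r($report);
```

In practice, a changed session ID is expected, since the server assigns a new one for each download, so a fuller version would probably ignore that field when comparing values.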
The script in Listing 5-10 is similar to the one running at http://www.schrenk.com/nostarch/webbots/form_analyzer.php. This script is for reference only. You can download the latest copy from this book's website. Note that the PHP sections of this script appear in bold.
<?php
# PHP rejects cookie names that contain spaces, so underscores are used
setcookie("SET_BY_THIS_PAGE", "This is a diagnostic cookie.");
?>
Webbot Diagnostic Page
This web page is a tool to diagnose webbot functionality by
examining what the webbot sends to webservers.
| Variable | Value sent to server |
|---|---|
| HTTP Request Method | |
| Your IP Address | |
| Server Port | |
| Referer | <?php if(isset($_SERVER['HTTP_REFERER'])) echo $_SERVER['HTTP_REFERER']; else echo "Null"; ?> |
| Agent Name | <?php if(isset($_SERVER['HTTP_USER_AGENT'])) echo $_SERVER['HTTP_USER_AGENT']; else echo "Null"; ?> |
| Get Variables | <?php if(count($_GET)>0) var_dump($_GET); else echo "Null"; ?> |
| Post Variables | <?php if(count($_POST)>0) var_dump($_POST); else echo "Null"; ?> |
| Cookies | <?php if(count($_COOKIE)>0) var_dump($_COOKIE); else echo "Null"; ?> |
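The values in this table can also be gathered in a single function, which makes the logic easier to test from the command line. The sketch below is an assumption-laden illustration: collect_diagnostics() is a hypothetical helper, and while REQUEST_METHOD, REMOTE_ADDR, and SERVER_PORT are standard $_SERVER keys, whether the original listing uses exactly these keys for the first three rows is a guess.

```php
<?php
# collect_diagnostics() is a hypothetical helper. It takes the request
# arrays as arguments so it can be exercised without a web server.
function collect_diagnostics($server, $get, $post, $cookie)
{
    # Return the value for a key, or "Null" when the key is absent
    $value = function ($array, $key) {
        return isset($array[$key]) ? $array[$key] : "Null";
    };
    return array(
        # The first three keys are assumed, not taken from the listing
        'HTTP Request Method' => $value($server, 'REQUEST_METHOD'),
        'Your IP Address'     => $value($server, 'REMOTE_ADDR'),
        'Server Port'         => $value($server, 'SERVER_PORT'),
        'Referer'             => $value($server, 'HTTP_REFERER'),
        'Agent Name'          => $value($server, 'HTTP_USER_AGENT'),
        'Get Variables'       => count($get) > 0 ? $get : "Null",
        'Post Variables'      => count($post) > 0 ? $post : "Null",
        'Cookies'             => count($cookie) > 0 ? $cookie : "Null",
    );
}

# On a live page: collect_diagnostics($_SERVER, $_GET, $_POST, $_COOKIE)
# Sample request data, invented for this example
$sample_server = array('REQUEST_METHOD' => 'POST', 'SERVER_PORT' => '80');
$diagnostics = collect_diagnostics($sample_server, array(), array('vol' => 'on'), array());
```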
This web page also sets a diagnostic