
Webbots, Spiders, and Screen Scrapers - Michael Schrenk [106]

// If no action, use this page as action

if(strlen(trim($form_action))==0)

$form_action = $target;

$fully_resolved_form_action = resolve_address($form_action, $page_base);

// Default to GET method if no method specified

if(strtolower(get_attribute($form_beginning_tag, "method"))=="post")

$form_method="POST";

else

$form_method="GET";

$form_element_array = parse_array($form_array[$xx], "<input", ">");

echo "Form Method=$form_method<br>";

echo "Form Action=$fully_resolved_form_action<br>";

# Parse each element in this form

for($yy=0; $yy<count($form_element_array); $yy++) {

$element_name = get_attribute($form_element_array[$yy], "name");

$element_value = get_attribute($form_element_array[$yy], "value");

echo "Element Name=$element_name, value=$element_value<br>";

}

}

?>

Listing 25-9: Parsing form values

Listing 25-9 finds and parses the values of all forms in a web page. When run, it also finds the form's method and creates a fully resolved URL for the form action, as shown in Figure 25-1.

Figure 25-1. Results of running the script in Listing 25-9
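The same idea can be tried without the book's parsing library. The following is a minimal sketch, not the author's method: it swaps in PHP's built-in DOMDocument class for the parse_array() and get_attribute() calls, and the function name summarize_forms(), the sample page, and the example URL are all hypothetical. Unlike Listing 25-9, it does not resolve relative actions into full URLs; it only falls back to the page's own address when a form has no action.

```php
<?php
// Hypothetical sample page standing in for a downloaded document.
$sample_page = <<<HTML
<html><body>
<form action="search.php" method="post">
  <input type="text" name="term" value="webbots">
  <input type="hidden" name="session" value="abc123">
</form>
<form>
  <input type="text" name="q" value="">
</form>
</body></html>
HTML;

// Hypothetical helper: returns one array per form with its method,
// action, and input elements (name => value).
function summarize_forms(string $html, string $target): array
{
    libxml_use_internal_errors(true);   // Tolerate imperfect real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $forms = [];
    foreach ($doc->getElementsByTagName("form") as $form) {
        // If no action, use this page as action
        $action = trim($form->getAttribute("action"));
        if (strlen($action) == 0) {
            $action = $target;
        }

        // Default to GET method if no method specified
        $method = (strtolower($form->getAttribute("method")) == "post") ? "POST" : "GET";

        // Collect each input element in this form
        $elements = [];
        foreach ($form->getElementsByTagName("input") as $input) {
            $elements[$input->getAttribute("name")] = $input->getAttribute("value");
        }

        $forms[] = ["method" => $method, "action" => $action, "elements" => $elements];
    }
    return $forms;
}

foreach (summarize_forms($sample_page, "http://www.example.com/index.php") as $form) {
    echo "Form Method=" . $form["method"] . "\n";
    echo "Form Action=" . $form["action"] . "\n";
    foreach ($form["elements"] as $name => $value) {
        echo "Element Name=$name, value=$value\n";
    }
}
```

A DOM parser handles attribute quoting and letter case for you, at the cost of requiring PHP's DOM extension.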

Adapting to Changes in Cookie Management

Cookie tolerance involves saving the cookies written by websites and making them available when fetching successive pages from the same website. Cookie management should happen automatically if you are using the LIB_http library and have the COOKIE_FILE pointing to a file your webbots can access.

One area of concern is that the LIB_http library (and PHP/CURL, for that matter) will not delete expired cookies or cookies without an expiration date, which are supposed to expire when the browser is closed. In these cases, it's important to manually delete cookies in order to simulate new browser sessions. If you don't delete expired cookies, it will eventually look like you're using a browser that has been open continuously for months or even years, which can look pretty suspicious.
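One simple way to simulate a new browser session is to remove the cookie file before a webbot starts a new run. The sketch below assumes your cookies live in a single file; the helper name start_new_cookie_session() is hypothetical, and a temporary file stands in for whatever path your COOKIE_FILE setting uses.

```php
<?php
// Hypothetical helper: delete the cookie file so the next fetch
// starts with no stored cookies, like a freshly opened browser.
function start_new_cookie_session(string $cookie_file): bool
{
    if (!file_exists($cookie_file)) {
        return true;    // Nothing stored yet, so the session is already clean
    }
    return unlink($cookie_file);
}

// Demonstration: a temporary file stands in for the real cookie file
$cookie_file = tempnam(sys_get_temp_dir(), "cookie");
file_put_contents($cookie_file, "# stale cookies from an earlier run\n");

$cleared = start_new_cookie_session($cookie_file);
echo $cleared ? "Cookie file cleared\n" : "Could not clear cookie file\n";
```

Deleting the whole file discards every cookie, including any that are still valid, so this fits webbots that should begin each run as a new visitor.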

Adapting to Network Outages and Network Congestion

Unless you plan accordingly, your webbots and spiders will hang, or become nonresponsive, when a targeted website suffers from a network outage or an unusually high volume of network traffic. Webbots become nonresponsive when they request and wait for a page that they never receive. While there's nothing you can do about getting data from nonresponsive target websites, there's also no reason your webbot needs to be hung up when it encounters one. You can avoid this problem by inserting the command shown in Listing 25-10 when configuring your PHP/CURL sessions.

curl_setopt($curl_session, CURLOPT_TIMEOUT, $timeout_value);

Listing 25-10: Setting time-out values in PHP/CURL

CURLOPT_TIMEOUT defines the maximum number of seconds PHP/CURL allows a request to run, including the time spent waiting for a targeted website to respond. The LIB_http library featured in this book sets this option automatically. By default, page requests made by LIB_http wait a maximum of 25 seconds for any target website to respond. If there's no response within the allotted time, the PHP/CURL session returns an empty result.
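To show the time-out in context, here is a minimal sketch of a fetch routine that sets it through the option PHP's cURL extension exposes as CURLOPT_TIMEOUT and returns an empty string on failure. The function name fetch_with_timeout(), the 10-second connection limit, and the test address are assumptions made for illustration; this is not the code inside LIB_http.

```php
<?php
// Hypothetical helper: fetch a page, giving up after $timeout_value seconds.
// Returns the page contents, or an empty string if the request failed.
function fetch_with_timeout(string $url, int $timeout_value = 25, ?string &$error = null): string
{
    $curl_session = curl_init($url);
    curl_setopt($curl_session, CURLOPT_RETURNTRANSFER, true);         // Return the page as a string
    curl_setopt($curl_session, CURLOPT_CONNECTTIMEOUT, 10);           // Assumed limit for connecting
    curl_setopt($curl_session, CURLOPT_TIMEOUT, $timeout_value);      // Limit for the whole request

    $page = curl_exec($curl_session);
    if ($page === false) {
        // Record why the fetch failed (time-out, refused connection, etc.)
        $error = curl_error($curl_session);
        return "";
    }

    $error = null;
    return $page;
}

// Demonstration against a local address where nothing is expected to be listening
$page = fetch_with_timeout("http://127.0.0.1:1/", 5, $error);
if ($page === "") {
    echo "No page received: $error\n";
}
```

Because the function returns instead of waiting indefinitely, the calling webbot can log the failure and move on to its next target.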

While on the subject of time-outs, it's important to recognize that PHP, by default, will time out if a script executes longer than 30 seconds. In normal use, PHP's time-out ensures that if a script takes too long to execute, the webserver will return a server error to the browser. The browser, in turn, informs the user that a process has timed out. The default time-out works well for serving web pages, but webbot and spider scripts often need to run longer. You can extend (or eliminate) the default PHP script-execution time with the commands shown in Listing 25-11.

You should exercise extreme caution when eliminating PHP's time-out, as shown in the second example in Listing 25-11. If you eliminate the time-out, your script may hang permanently if it encounters a problem.

set_time_limit(60); // Set PHP time-out to 60 seconds

set_time_limit(0); // Completely remove PHP script time-out

Listing 25-11: Adjusting the default PHP script time-out
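A middle ground between a fixed limit and no limit is to give each target its own time budget. Each call to set_time_limit() restarts PHP's time-out counter from zero, so calling it at the top of a loop lets a long-running webbot continue while still stopping an iteration that hangs. The sketch below assumes a hypothetical process_target() function and target list standing in for real webbot work.

```php
<?php
// Hypothetical stand-in for the work a webbot performs on one target
function process_target(string $target): string
{
    return strtoupper($target);
}

// Hypothetical list of targets
$targets = ["page_one", "page_two", "page_three"];

$results = [];
foreach ($targets as $target) {
    set_time_limit(60);     // Restart the counter: 60 seconds for this target only
    $results[] = process_target($target);
}

echo count($results) . " targets processed\n";
```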

Always try to avoid time-outs by designing webbots
