Webbots, Spiders, and Screen Scrapers - Michael Schrenk [103]

he or she had been successfully authenticated by the website. If the webbot doesn't find the validation point, it assumes there is a problem and it reports the situation with an error handler.

Follow Page Redirections

Page redirections are instructions sent by the server that tell a browser to download a page other than the one originally requested. Web developers use redirection to tell browsers that the page they're looking for has moved and that they should download another page in its place. This lets people reach the correct pages even when obsolete addresses are bookmarked in browsers or listed by search engines. As you'll discover, there are several methods for redirecting browsers. The more redirection techniques your webbot understands, the more fault tolerant it becomes.

Header redirection is the oldest method of page redirection. It occurs when the server places a Location: URL line in the HTTP header, where URL represents the web page the browser should download (in place of the one requested). When a web agent sees a header redirection, it's supposed to download the page defined by the new location. Your webbot could look for redirections in the headers of downloaded pages, but it's easier to configure PHP/CURL to follow header redirections automatically.[71] Listing 25-3 shows the PHP/CURL options you need to make automatic redirection happen.

curl_setopt($curl_session, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
curl_setopt($curl_session, CURLOPT_MAXREDIRS, 4);         // Only follow 4 redirects

Listing 25-3: Configuring PHP/CURL to follow up to four header redirections

The first option in Listing 25-3 tells PHP/CURL to follow page redirections as they are defined by the target server. The second option limits the number of redirections your webbot will follow. Limiting the number of redirections defeats webbot traps in which a server redirects agents back to the page they just downloaded, which would otherwise lock the webbot into an endless loop of requests for the same page.
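To see why the limit matters, here is a minimal sketch of the manual alternative mentioned earlier: looking for Location: lines in the headers yourself and counting the hops. It is not taken from the book's libraries. The function names find_header_redirect() and follow_redirects(), the $fake_server array, and the example.com addresses are all assumptions invented for illustration, and the sketch only handles absolute URLs.

```php
<?php
// Hypothetical stand-in for a web server: maps each URL to the raw headers
// it would return. Pages /a and /b redirect to each other (a redirection trap).
$fake_server = [
    "http://www.example.com/a" => "HTTP/1.1 302 Found\r\nLocation: http://www.example.com/b\r\n\r\n",
    "http://www.example.com/b" => "HTTP/1.1 302 Found\r\nLocation: http://www.example.com/a\r\n\r\n",
    "http://www.example.com/c" => "HTTP/1.1 302 Found\r\nLocation: http://www.example.com/d\r\n\r\n",
    "http://www.example.com/d" => "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n",
];

// Returns the URL named by the first Location: header, or null if there is none.
function find_header_redirect(string $raw_headers): ?string
{
    // Header names are case-insensitive; the m flag lets ^ match at each line start
    if (preg_match('/^Location:\s*(\S+)/mi', $raw_headers, $match)) {
        return $match[1];
    }
    return null;
}

// Follows header redirections by hand, giving up after $max_redirects hops.
// Returns the final URL, or null if the limit was reached (probably a trap).
function follow_redirects(array $server, string $url, int $max_redirects): ?string
{
    $hops = 0;
    while (($next = find_header_redirect($server[$url] ?? "")) !== null) {
        if ($hops >= $max_redirects) {
            return null;
        }
        $url = $next;
        $hops++;
    }
    return $url;
}

$final = follow_redirects($fake_server, "http://www.example.com/c", 4);
echo ($final ?? "gave up: too many redirections"), "\n"; // reaches http://www.example.com/d

$final = follow_redirects($fake_server, "http://www.example.com/a", 4);
echo ($final ?? "gave up: too many redirections"), "\n"; // gives up after four hops
```

In practice the two PHP/CURL options in Listing 25-3 do this work for you; the sketch only makes the loop and its exit condition visible.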

In addition to header redirections, you should also be prepared to identify and accommodate page redirections made between the <head> and </head> tags, as shown in Listing 25-4.

Listing 25-4: Page redirection between the <head> and </head> tags

In Listing 25-4, the web page tells the browser to download http://www.nostarch.com instead of the intended page. You can detect this kind of redirection with a script like the one in Listing 25-5. This script looks for redirections between the <head> and </head> tags in a test page on the book's website.

# Include http, parse, and address resolution libraries
include("LIB_http.php");
include("LIB_parse.php");
include("LIB_resolve_addresses.php");

# Identify the target web page and the page base
$target = "http://www.schrenk.com/nostarch/webbots/head_redirection_test.php";
$page_base = "http://www.schrenk.com/nostarch/webbots/";

# Download the web page
$page = http_get($target, $ref="");

# Parse the <head>
$head_section = return_between($string=$page['FILE'], $start="<head>", $end="</head>", $type=EXCL);

# Create an array of all the meta tags
$meta_tag_array = parse_array($head_section, $beg_tag="<meta", $close_tag=">");

# Examine each meta tag for a redirection command
for($xx=0; $xx<count($meta_tag_array); $xx++)
    {
    # Look for http-equiv attribute
    $meta_attribute = get_attribute($meta_tag_array[$xx], $attribute="http-equiv");
    if(strtolower($meta_attribute)=="refresh")
        {
        $new_page = return_between($meta_tag_array[$xx], $start="URL", $end=">", $type=EXCL);

        # Clean up URL
        $new_page = trim(str_replace("", "", $new_page));
        $new_page = str_replace("=", "", $new_page);
        $new_page = str_replace("\"", "", $new_page);
        $new_page = str_replace("'", "", $new_page);

        # Create fully resolved URL
        $new_page = resolve_address($new_page, $page_base);

        # Stop looking once a redirection is found
        break;
        }
    }

# Echo results of script

echo "HTML
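For readers who want to try the idea without the book's libraries, the same technique can be sketched with PHP's built-in regular expression functions. This is an illustrative approximation, not the author's script: the function name find_meta_redirect() and the sample page are assumptions made up for the example, and unlike the listing above it does not resolve relative addresses against a page base.

```php
<?php
// Hypothetical helper: returns the URL of a <meta http-equiv="refresh"> tag
// found between <head> and </head>, or null if the page has none.
function find_meta_redirect(string $html): ?string
{
    // Isolate the head section; tags outside it are ignored in this sketch
    if (!preg_match('/<head\b[^>]*>(.*?)<\/head>/is', $html, $head)) {
        return null;
    }

    // Collect every meta tag in the head
    preg_match_all('/<meta\b[^>]*>/i', $head[1], $meta_tags);

    foreach ($meta_tags[0] as $tag) {
        // Skip meta tags that are not refresh commands
        if (!preg_match('/http-equiv\s*=\s*["\']?refresh\b/i', $tag)) {
            continue;
        }
        // The target address follows "URL=" inside the content attribute
        if (preg_match('/url\s*=\s*[\'"]?([^\'"\s>]+)/i', $tag, $url)) {
            return $url[1];
        }
    }
    return null;
}

// Sample page, invented for illustration
$page = '<html><head><title>Moved</title>'
      . '<meta http-equiv="refresh" content="0; URL=http://www.nostarch.com">'
      . '</head><body></body></html>';

$new_page = find_meta_redirect($page);
echo $new_page === null ? "No redirection found\n" : "Redirect to: $new_page\n";
```

Regular expressions are a blunt tool for HTML, so treat this as a quick check rather than a full parser; a page with unusual quoting or a refresh tag outside the head section would need more careful handling.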
