Webbots, Spiders, and Screen Scrapers - Michael Schrenk [103]
Follow Page Redirections
Page redirections are instructions sent by the server that tell a browser that it should download a page other than the one originally requested. Web developers use page redirection techniques to tell browsers that the page they're looking for has changed and that they should download another page in its place. This allows people to access correct pages even when obsolete addresses are bookmarked by browsers or listed by search engines. As you'll discover, there are several methods for redirecting browsers. The more redirection techniques your webbot understands, the more fault tolerant it becomes.
Header redirection is the oldest method of page redirection. It occurs when the server places a Location: URL line in the HTTP header, where URL represents the web page the browser should download (in place of the one requested). When a web agent sees a header redirection, it's supposed to download the page defined by the new location. Your webbot could look for redirections in the headers of downloaded pages, but it's easier to configure PHP/CURL to follow header redirections automatically.[71] Listing 25-3 shows the PHP/CURL options you need to make automatic redirection happen.
curl_setopt($curl_session, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
curl_setopt($curl_session, CURLOPT_MAXREDIRS, 4); // Only follow 4 redirects
Listing 25-3: Configuring PHP/CURL to follow up to four header redirections
The first option in Listing 25-3 tells PHP/CURL to follow header redirections as they are defined by the target server. The second option limits the number of redirections your webbot will follow. This limit defeats webbot traps in which a server redirects agents to the page they just downloaded, which would otherwise cause an endless loop of requests for the same page.
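To make the effect of these two options concrete, the following is a minimal sketch of the same idea done by hand: find the Location: line in a header block, follow it, and give up after a fixed number of hops. It is not how PHP/CURL is implemented. The function names find_location_header() and follow_header_redirects(), the example.test addresses, and the simulated server are all hypothetical, and the sketch assumes the Location: value is already an absolute URL.

```php
<?php
// Hypothetical helper: pull the Location value out of a raw HTTP header block.
function find_location_header(string $raw_headers): ?string
{
    // Header names are case-insensitive; match at the start of any line.
    if (preg_match('/^Location:\s*(\S+)/mi', $raw_headers, $match)) {
        return $match[1];
    }
    return null;
}

// Hypothetical helper: follow header redirections by hand and give up after
// $max_redirects hops. $fetch is any callable that returns the raw header
// block for a URL (a stand-in for a real HTTP request).
function follow_header_redirects(callable $fetch, string $url, int $max_redirects = 4): array
{
    $hops = 0;
    while (true) {
        $location = find_location_header($fetch($url));
        if ($location === null) {
            // No redirection: this is the page to keep.
            return ['url' => $url, 'hops' => $hops, 'gave_up' => false];
        }
        if ($hops >= $max_redirects) {
            // Another redirection would exceed the limit: probably a trap.
            return ['url' => $url, 'hops' => $hops, 'gave_up' => true];
        }
        $url = $location;
        $hops++;
    }
}

// Simulated server (assumption: made-up addresses and headers for illustration).
$fake_server = function (string $url): string {
    $pages = [
        'http://example.test/old'  => "HTTP/1.1 301 Moved Permanently\r\nLocation: http://example.test/new\r\n",
        'http://example.test/new'  => "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n",
        'http://example.test/trap' => "HTTP/1.1 302 Found\r\nlocation: http://example.test/trap\r\n",
    ];
    return $pages[$url];
};

// An obsolete address that redirects once to its replacement.
$moved = follow_header_redirects($fake_server, 'http://example.test/old');

// A trap that redirects to itself; the limit stops the loop after 4 hops.
$trap = follow_header_redirects($fake_server, 'http://example.test/trap');
```

In practice, the PHP/CURL options in Listing 25-3 are the simpler choice. A hand-rolled loop like this is mainly useful when your webbot needs to log or inspect each hop.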
In addition to header redirections, you should also be prepared to identify and accommodate page redirections made between the <head> and </head> tags, as shown in Listing 25-4.
Listing 25-4: Page redirection between the <head> and </head> tags
In Listing 25-4, the web page tells the browser to download http://www.nostarch.com instead of the intended page. Detecting these kinds of redirections is accomplished with a script like the one in Listing 25-5. This script looks for redirections between the <head> and </head> tags in a test page on the book's website.
# Include http, parse, and address resolution libraries
include("LIB_http.php");
include("LIB_parse.php");
include("LIB_resolve_addresses.php");
# Identify the target web page and the page base
$target = "http://www.schrenk.com/nostarch/webbots/head_redirection_test.php";
$page_base = "http://www.schrenk.com/nostarch/webbots/";
# Download the web page
$page = http_get($target, $ref="");
# Parse the <head> section
$head_section = return_between($string=$page['FILE'], $start="<head>", $end="</head>", $type=EXCL);
# Create an array of all the meta tags
$meta_tag_array = parse_array($head_section, $beg_tag="<meta", $close_tag=">");
# Examine each meta tag for a redirection command
for($xx=0; $xx<count($meta_tag_array); $xx++)
    {
    # Look for http-equiv attribute
    $meta_attribute = get_attribute($meta_tag_array[$xx], $attribute="http-equiv");
    if(strtolower($meta_attribute)=="refresh")
        {
        $new_page = return_between($meta_tag_array[$xx], $start="URL", $end=">", $type=EXCL);
        # Clean up URL: remove the "=", the quotes, and surrounding whitespace
        $new_page = str_replace("=", "", $new_page);
        $new_page = str_replace("\"", "", $new_page);
        $new_page = str_replace("'", "", $new_page);
        $new_page = trim($new_page);
        # Create fully resolved URL
        $new_page = resolve_address($new_page, $page_base);
        # Stop at the first redirection found (a break outside this block
        # would end the loop after examining only the first meta tag)
        break;
        }
    }
# Echo results of script
echo "HTML