Webbots, Spiders, and Screen Scrapers - Michael Schrenk [102]

is URL tolerance, or a webbot's ability to make valid requests for web pages under changing conditions. URL tolerance ensures that your webbot does the following:

Downloads pages that are available on the target site

Follows header redirections to updated pages

Uses referer values to indicate that you followed a link from a page that is still on the website
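Header redirections can be detected by inspecting the raw response headers for a Location: field. The sketch below assumes you already have the header block as a string; the redirect_location() helper and its regular expression are illustrations, not part of the LIB_http library.

```php
<?php
// Sketch: pull the Location header out of a raw HTTP header block.
// redirect_location() is a hypothetical helper, not part of LIB_http.
function redirect_location($raw_headers)
{
    // /m lets ^ match at the start of each header line; /i ignores case
    if(preg_match('/^Location:\s*(\S+)/mi', $raw_headers, $matches))
        return $matches[1];   // the URL the server redirected to
    return false;             // no redirection header present
}

$headers = "HTTP/1.1 301 Moved Permanently\r\n"
         . "Location: http://www.schrenk.com/new_page.php\r\n";
echo redirect_location($headers);
```

When a Location header is found, a URL-tolerant webbot should fetch the new address rather than treating the original page as its final target.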

Avoid Making Requests for Pages That Don't Exist

Before you determine that your webbot downloaded a valid web page, you should verify that you made a valid request. Your webbot can verify successful page requests by examining the HTTP code, a status code returned in the header of every web page. If the request was successful, the resulting HTTP code will be in the 200 series—meaning that the HTTP code will be a three-digit number beginning with a two. Any other value for the HTTP code may indicate an error. The most common HTTP code is 200, which says that the request was valid and that the requested page was sent to the web agent. The script in Listing 25-1 shows how to use the LIB_http library's http_get() function to validate the returned page by looking at the returned HTTP code. If the webbot doesn't detect the expected HTTP code, an error handler is used to manage the error and the webbot stops.

<?php
include("LIB_http.php");

# Get the web page
$page = http_get($target="www.schrenk.com", $ref="");

# Vector to error handler if error code detected
if($page['STATUS']['http_code']!="200")
    error_handler("BAD RESULT", $page['STATUS']['http_code']);
?>

Listing 25-1: Detecting a bad page request

Before using the method described in Listing 25-1, review a list of HTTP codes and decide which codes apply to your situation.[70]
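One way to act on that review is to map ranges of HTTP codes to webbot actions. The function below is a sketch only; http_code_action() is my own name and the action strings are arbitrary, so tailor both to the codes that matter on your target site.

```php
<?php
// Sketch: classify an HTTP code into a webbot action.
// http_code_action() is a hypothetical helper, not part of LIB_http.
function http_code_action($code)
{
    $code = (int)$code;                                  // LIB_http reports codes as strings
    if($code >= 200 && $code < 300) return "PROCEED";    // request succeeded
    if($code >= 300 && $code < 400) return "REDIRECT";   // follow the new location
    if($code == 404)                return "STOP";       // page no longer exists
    if($code >= 500)                return "RETRY";      // server-side trouble
    return "STOP";                                       // anything else: investigate first
}

echo http_code_action(200);   // "PROCEED"
```

A webbot's main loop can then branch on the returned action instead of testing raw codes in several places.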

If the page no longer exists, the fetch will return a 404 Not Found error. When this happens, it's imperative that the webbot stop and not download any more pages until you find the cause of the error. Not proceeding after detecting an error is a far better strategy than continuing as if nothing is wrong.

Web developers don't always remove obsolete web pages from their websites—sometimes they just link to an updated page without removing the old one. Therefore, webbots should start at the web page's home page and verify the existence of each page between the home page and the actual targeted web page. This process does two things. It helps your webbot maintain stealth, as it simulates the browsing habits of a person using a browser. Moreover, by validating that there are links to subsequent pages, you verify that the pages you are targeting are still in use. In contrast, if your webbot targets a page within a site without verifying that other pages still link to it, you risk targeting an obsolete web page.
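The verification step described above can be sketched as a simple link check: after downloading a parent page, confirm that it still contains a link to the page you intend to fetch next. page_links_to() is a hypothetical helper, not part of LIB_http, and its naive string match assumes double-quoted href attributes.

```php
<?php
// Sketch: confirm that a downloaded page still links to the target URL.
// page_links_to() is a hypothetical helper, not part of LIB_http.
function page_links_to($html, $target_url)
{
    // stristr() returns false when the href is absent (case-insensitive search)
    return stristr($html, 'href="'.$target_url.'"') !== false;
}

// In practice $home_page would come from http_get(); a literal is used here
$home_page = '<a href="/catalog/index.php">Catalog</a>';
if(!page_links_to($home_page, "/catalog/index.php"))
    echo "target page may be obsolete";
```

A production webbot would parse the anchor tags properly, but even this crude check catches the common case where a target page has quietly been unlinked.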

The fact that your webbot made a valid page request does not indicate that the page you've downloaded is the one you intended to download or that it contains the information you expected to receive. For that reason, it is useful to find a validation point, or text that serves as an indication that the newly downloaded web page contains the expected information. Every situation is different, but there should always be some text on every page that validates that the page contains the content you're expecting. For example, suppose your webbot submits a form to authenticate itself to a website. If the next web page contains a message that welcomes the member to the website, you may wish to use the member's name as a validation point to verify that your webbot successfully authenticated, as shown in Listing 25-2.

$username = "GClasemann";
$page = http_get($target, $ref="");

if(!stristr($page['FILE'], $username))
    {
    echo "authentication error";
    error_handler("BAD AUTHENTICATION for ".$username, $target);
    }

Listing 25-2: Using a username as a validation point to confirm the result of submitting a form

The script in Listing 25-2 verifies that a validation point, in this case a username, exists as anticipated on the fetched page. This strategy works because the only way that the user's name would appear on the web page is if
