Webbots, Spiders, and Screen Scrapers - Michael Schrenk [107]

that execute quickly, even if that means your webbot needs to run more than once to accomplish a task. For example, if a webbot needs to download and parse 50 web pages, it's usually best to write the bot in such a way that it can process pages one at a time and know where it left off; then you can schedule the webbot to execute every minute or so for an hour. Webbot scripts that execute quickly are easier to test, resemble normal network traffic more closely, and use fewer system resources.
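The one-page-per-run pattern described above might be sketched as follows. This is a minimal illustration, not the book's code: the page URLs, the progress file name, and the parse_page() helper are assumptions, while http_get() stands in for the LIB_http function used throughout the book.

```php
<?php
# Resumable webbot sketch: process one page per execution and record where
# the bot left off, so a scheduler can run it repeatedly until done.
# The URLs, progress file, and parse_page() helper are illustrative only.

$progress_file = "progress.txt";

# Build the list of 50 pages to process (hypothetical URLs)
$pages = array();
for($i = 1; $i <= 50; $i++)
    $pages[] = "http://www.somedomain.com/page".$i.".html";

# Read where the previous run left off (0 on the first run)
$next = file_exists($progress_file) ? (int)file_get_contents($progress_file) : 0;

if($next < count($pages))
{
    # Download and parse exactly one page, then record the new position
    # $downloaded_page = http_get($pages[$next], $ref="");  # LIB_http
    # parse_page($downloaded_page['FILE']);                 # hypothetical parser
    file_put_contents($progress_file, $next + 1);
}
```

Because each execution does a small, bounded amount of work, the script finishes quickly and can simply be scheduled to run every minute until the progress counter reaches the end of the list.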

* * *

[70] A full list of HTTP codes is available in Appendix B.

[71] LIB_http does this for you.

Error Handlers

When a webbot cannot adjust to changes, the only safe thing to do is to stop it. A webbot that keeps running after a failure may behave erratically and leave suspicious entries in the target server's access and error log files. It's a good idea to write a routine that handles all errors in a prescribed manner. Such an error handler should send you an email that indicates the following:

Which webbot failed

Why it failed

The date and time it failed

A simple script like the one in Listing 25-12 works well for this purpose.

function webbot_error_handler($failure_mode)
    {
    # Initialization
    $email_address = "your.account@someserver.com";
    $email_subject = "Webbot Failure Notification";

    # Build the failure message
    $email_message = "Webbot T-Rex encountered a fatal error\n";
    $email_message = $email_message . $failure_mode . "\n";
    $email_message = $email_message . "at " . date("r") . "\n";

    # Send the failure message via email
    mail($email_address, $email_subject, $email_message);

    # Don't return; force the webbot script to stop
    exit;
    }

Listing 25-12: Simple error-reporting script

The trick to effectively using error handlers is to anticipate cases in which things may go wrong and then test for those conditions. For example, the script in Listing 25-13 checks the size of a downloaded web page and calls the function in the previous listing if the web page is smaller than expected.

# Download web page
$target = "http://www.somedomain.com/somepage.html";
$downloaded_page = http_get($target, $ref="");
$web_page_size = strlen($downloaded_page['FILE']);

# Report an error if the page is smaller than expected
if($web_page_size < 1500)
    webbot_error_handler($target." smaller than expected, actual size="
        .$web_page_size);

Listing 25-13: Anticipating and reporting errors

In addition to reporting the error, it's important to turn off the scheduler when an error is found if the webbot is scheduled to run again in the future. Otherwise, your webbot will keep bumping up against the same problem, which may leave odd records in server logs. The easiest way to disable a scheduler is to write error handlers that record the webbot's status in a database. Before a scheduled webbot runs, it can first query the database to determine if an unaddressed error occurred earlier. If the query reveals that an error has occurred, the webbot can ignore the requests of the scheduler and simply terminate its execution until the problem is addressed.
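One way to implement that status check is sketched below using SQLite through PHP's PDO extension (the pdo_sqlite extension must be enabled). The table layout and the webbot name "T-Rex" are assumptions for illustration; any database your error handler can write to would serve.

```php
<?php
# Scheduler kill-switch sketch: refuse to run while an earlier error is
# unaddressed. Table and webbot names are assumptions for illustration.

$db = new PDO("sqlite:webbot_status.db");
$db->exec("CREATE TABLE IF NOT EXISTS webbot_status
           (name TEXT PRIMARY KEY, has_error INTEGER DEFAULT 0)");

# Before doing any real work, look for an unaddressed error from an earlier run
$stmt = $db->prepare("SELECT has_error FROM webbot_status WHERE name = ?");
$stmt->execute(array("T-Rex"));
$has_error = (int)$stmt->fetchColumn();

if($has_error)
    exit;           # Ignore the scheduler until the problem is addressed

# ... normal webbot tasks run here ...

# An error handler would set the flag like this before stopping the webbot:
# $db->prepare("REPLACE INTO webbot_status (name, has_error) VALUES (?, 1)")
#    ->execute(array("T-Rex"));
```

Once the flag is set, every scheduled run terminates immediately until someone clears it, so the webbot never hammers the target server with the same failing request.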

Chapter 26. DESIGNING WEBBOT-FRIENDLY WEBSITES

I'll start this chapter with suggestions that help make web pages accessible to the most widely used webbots—the spiders that download, analyze, and rank web pages for search engines, a process often called search engine optimization (SEO).

I'll conclude the chapter by explaining the occasional importance of special-purpose web pages, formatted to send data directly to webbots instead of browsers.

Optimizing Web Pages for Search Engine Spiders

The most important thing to remember when designing a web page for SEO is that spiders rely on you, the developer, to provide context for the information they find. This is important because HTML web pages mix content with display-formatting commands. To complicate the spider's task further, it must examine the words in a page's content to determine how relevant they are to the page's main topic. You can improve a spider's ability to index and rank your web pages, as well as improve your search ranking, by
