Webbots, Spiders, and Screen Scrapers - Michael Schrenk [107]
* * *
[70] A full list of HTTP codes is available in Appendix B.
[71] LIB_http does this for you.
Error Handlers
When a webbot cannot adjust to changes, the only safe thing to do is to stop it. Not stopping your webbot may otherwise result in odd performance and suspicious entries in the target server's access and error log files. It's a good idea to write a routine that handles all errors in a prescribed manner. Such an error handler should send you an email that indicates the following:
Which webbot failed
Why it failed
The date and time it failed
A simple script like the one in Listing 25-12 works well for this purpose.
function webbot_error_handler($failure_mode)
    {
    # Initialization
    $email_address = "your.account@someserver.com";
    $email_subject = "Webbot Failure Notification";

    # Build the failure message
    $email_message = "Webbot T-Rex encountered a fatal error\n";
    $email_message = $email_message . $failure_mode . "\n";
    $email_message = $email_message . "at " . date("r") . "\n";

    # Send the failure message via email
    mail($email_address, $email_subject, $email_message);

    # Don't return; force the webbot script to stop
    exit;
    }
Listing 25-12: Simple error-reporting script
The trick to effectively using error handlers is to anticipate cases in which things may go wrong and then test for those conditions. For example, the script in Listing 25-13 checks the size of a downloaded web page and calls the function in the previous listing if the web page is smaller than expected.
# Download web page
$target = "http://www.somedomain.com/somepage.html";
$downloaded_page = http_get($target, $ref="");
$web_page_size = strlen($downloaded_page['FILE']);

# Report an error if the page is smaller than expected
if($web_page_size < 1500)
    webbot_error_handler($target." smaller than expected, actual size="
                         .$web_page_size);
Listing 25-13: Anticipating and reporting errors
In addition to reporting the error, if the webbot is scheduled to run again in the future, it's important to turn off the scheduler when an error is found. Otherwise, your webbot will keep bumping up against the same problem, which may leave odd records in server logs. The easiest way to disable a scheduler is to write error handlers that record the webbot's status in a database. Before a scheduled webbot runs, it can first query the database to determine whether an unaddressed error occurred earlier. If the query reveals that an error has occurred, the webbot can ignore the scheduler's request and simply terminate until the problem is addressed.
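The database-backed status check described above might be sketched as follows. This is only an illustration, not code from the book: the table name `webbot_status`, its columns, and the use of SQLite via PDO are all assumptions made for the sake of a self-contained example.

```php
<?php
// Sketch only: table and column names are assumed, not from the book.
// SQLite (via PDO) is used here so the example runs without a server.

// Returns true if no unaddressed error is on record for this webbot
function webbot_should_run(PDO $db, string $bot_name): bool
{
    $stmt = $db->prepare(
        "SELECT COUNT(*) FROM webbot_status
         WHERE bot_name = ? AND status = 'FATAL_ERROR'");
    $stmt->execute([$bot_name]);
    return $stmt->fetchColumn() == 0;
}

// Called by the error handler to record a fatal error
function record_webbot_error(PDO $db, string $bot_name, string $failure_mode): void
{
    $stmt = $db->prepare(
        "INSERT INTO webbot_status (bot_name, status, detail, logged_at)
         VALUES (?, 'FATAL_ERROR', ?, datetime('now'))");
    $stmt->execute([$bot_name, $failure_mode]);
}

// Example setup (in-memory database for demonstration)
$db = new PDO('sqlite::memory:');
$db->exec("CREATE TABLE webbot_status (
    bot_name TEXT, status TEXT, detail TEXT, logged_at TEXT)");

// Before a scheduled run, check for earlier unaddressed errors
if (webbot_should_run($db, 'T-Rex')) {
    // ...proceed with the scheduled download...
} else {
    exit; // refuse to run until the problem is addressed
}
```

Once the problem is fixed, clearing or updating the error row re-enables the webbot on its next scheduled run.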
Chapter 26. DESIGNING WEBBOT-FRIENDLY WEBSITES
I'll start this chapter with suggestions that help make web pages accessible to the most widely used webbots—the spiders that download, analyze, and rank web pages for search engines, a process often called search engine optimization (SEO).
I'll conclude the chapter by explaining the occasional importance of special-purpose web pages, formatted to send data directly to webbots instead of browsers.
Optimizing Web Pages for Search Engine Spiders
The most important thing to remember when designing a web page for SEO is that spiders rely on you, the developer, to provide context for the information they find. This matters because HTML web pages mix content with display formatting commands. To complicate the spider's task further, it must examine the words in a page's content to determine how relevant they are to the page's main topic. You can improve a spider's ability to index and rank your web pages, as well as improve your search ranking, by