The best-known spiders are those used by the major search engine companies (Google, Yahoo!, and MSN) to identify online content. And while spiders are synonymous with search engines for many people, their potential utility is much greater. You can write a spider that does anything any other webbot does, with the advantage of targeting the entire Internet. This creates a niche for developers who design specialized spiders to do very specific work. Here are some potential ideas for spider projects:
Discover sales of original copies of 1963 Spider-Man comics. Design your spider to email you with links to new findings or price reductions.
Periodically create an archive of your competitors' websites.
Invite every MySpace member living in Cleveland, Ohio, to be your friend.[59]
Send a text message when your spider finds jobs for Miami-based fashion photographers who speak Portuguese.
Maintain an updated version of your local newspaper on your PDA.
Validate that all the links on your website point to active web pages.
Perform a statistical analysis of noun usage across the Internet.
Search the Internet for musicians that recorded new versions of your favorite songs.
Purchase collectible Bibles when your spider detects one with a price substantially below the collectible price listed on Amazon.com.
This list could go on, but you get the idea. To a business, a well-purposed spider is like additional staff, easily justifying the one-time development cost.
How Spiders Work
Spiders begin harvesting links at the seed URL, the address of the initial target web page. The spider uses these links as references to the next set of pages to process, and as it downloads each of those pages, it harvests still more links. The first page the spider downloads is known as the first penetration level. In each successive penetration level, additional web pages are downloaded as directed by the links harvested in the previous level. The spider repeats this process until it reaches the maximum penetration level. Figure 18-1 shows a typical spider process.
Figure 18-1. A simple spider
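In code, the process in Figure 18-1 boils down to a loop over penetration levels: download every page queued for the current level, harvest its links, and let those links define the next level. The fragment below is only a conceptual sketch of that loop; download_page() and harvest_links() are hypothetical stand-ins, not functions from the libraries used later in this chapter.

# Conceptual sketch of the penetration-level loop (hypothetical helper functions)
$current_level = array($SEED_URL);               // Penetration level 1 starts at the seed URL
for($level=1; $level<=$MAX_PENETRATION; $level++)
    {
    $next_level = array();                       // Links harvested at this level
    foreach($current_level as $url)
        {
        $page  = download_page($url);            // Hypothetical: fetch the web page
        $links = harvest_links($page, $url);     // Hypothetical: collect the page's links
        $next_level = array_merge($next_level, $links);
        }
    $current_level = $next_level;                // Harvested links feed the next level
    }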
* * *
[59] This is only listed here to show the potential for what spiders can do. Please don't actually do this! Automated agents like this violate MySpace's terms of use. Develop webbots responsibly.
Example Spider
Our example spider will reuse the image harvester (described in Chapter 8) that downloads images for an entire website. The image harvester is this spider's payload, the task it will perform on every web page it visits. While this spider performs a useful task, its primary purpose is to demonstrate how spiders work, so design compromises were made that limit its scalability for larger tasks. After we explore this example spider, I'll conclude with recommendations for making a scalable spider suitable for larger projects.
Listings 18-1 and 18-2 are the main scripts for the example spider. Initially, the spider is limited to collecting links. Since the payload adds complexity, we'll include it after you've had an opportunity to understand how the basic spider works.
# Initialization
include("LIB_http.php"); // http library
include("LIB_parse.php"); // parse library
include("LIB_resolve_addresses.php"); // Address resolution library
include("LIB_exclusion_list.php"); // List of excluded keywords
include("LIB_simple_spider.php"); // Spider routines used by this app
set_time_limit(3600); // Give the spider up to an hour before PHP times out
$SEED_URL = "http://www.YourSiteHere.com";
$MAX_PENETRATION = 1; // Set spider penetration depth
$FETCH_DELAY = 1; // Wait 1 second between page fetches
$ALLOW_OFFSITE = false; // Don't let the spider roam from the seed domain
$spider_array = array(); // Initialize
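Before any harvested link goes into $spider_array, the spider should decide whether the link is worth visiting at all. The fragment below is a minimal sketch of that filtering step, driven by the $ALLOW_OFFSITE switch and an exclusion list like the one loaded from LIB_exclusion_list.php; keep_link() and its parameters are illustrative placeholders, not routines from the book's libraries.

# Sketch: deciding whether a harvested link belongs in $spider_array (placeholder helpers)
function keep_link($link, $seed_domain, $allow_offsite, $exclusion_array)
    {
    // Reject links containing excluded keywords (javascript:, mailto:, image files, etc.)
    foreach($exclusion_array as $excluded_keyword)
        if(stristr($link, $excluded_keyword))
            return false;
    // Unless offsite roaming is allowed, reject links that leave the seed domain
    if(!$allow_offsite && stristr($link, $seed_domain)===false)
        return false;
    return true;
    }

A link that survives both tests is appended to the next penetration level; everything else is discarded before it can waste a fetch.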