Webbots, Spiders, and Screen Scrapers - Michael Schrenk [77]
Listing 18-1: Main spider script, initialization
The script in Listing 18-1 loads the required libraries and initializes settings that tell the spider how to operate. This project introduces two new libraries: an exclusion list (LIB_exclusion_list.php) and the spider library used for this exercise (LIB_simple_spider.php). We'll explain both of these new libraries as we use them.
PHP's default script time-out of 30 seconds is far too short for a spider, whose execution may take minutes or even hours. The script in Listing 18-1 therefore raises the time-out to one hour (3,600 seconds) with the set_time_limit(3600) command.
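The same wall-clock budget can be sketched in Python, which has no direct equivalent of set_time_limit(). A rough analogue (assuming a POSIX system, since signal.alarm is not available on Windows) is to arm an alarm signal before the crawl starts:

```python
# A minimal Python analogue of PHP's set_time_limit(3600): abort the
# crawl after one hour of wall-clock time. POSIX-only (signal.alarm).
import signal

def on_timeout(signum, frame):
    raise TimeoutError("crawl exceeded the one-hour budget")

signal.signal(signal.SIGALRM, on_timeout)
signal.alarm(3600)   # deliver SIGALRM after 3,600 seconds
# ... run the spider here ...
signal.alarm(0)      # cancel the alarm once the crawl finishes
```

Unlike PHP's time-out, which the interpreter enforces, this sketch relies on the handler raising an exception that the crawl loop must not swallow.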
The example spider is configured to collect enough information to demonstrate how spiders work but not so much that the sheer volume of data distracts from the demonstration. You can adjust these settings once you understand the effects they have on the operation of your spider. For now, the maximum penetration level is set to 1. This means that the spider will harvest links from the seed URL and from the pages that the links on the seed URL reference, but it will not download any pages that are more than one link away from the seed URL. Even when you tie the spider's hands—as we've done here—it still collects a ridiculously large amount of data. When limited to one penetration level, the spider still harvested 583 links when pointed at http://www.schrenk.com. This number excludes redundant links, which would otherwise raise the number of harvested links to 1,930. For demonstration purposes, the spider also rejects links that are not on the parent domain.
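The penetration-level idea is just a breadth-first crawl cut off at a fixed depth. A minimal Python sketch of the same rules (stop at the maximum penetration level, drop redundant links, reject links off the parent domain), assuming a hypothetical fetch_links(url) callable that returns a page's links as absolute URLs:

```python
# Depth-limited link harvesting: links are grouped by penetration level,
# duplicates are skipped, and off-domain links are rejected.
# fetch_links(url) is an assumed helper, not part of the book's library.
from urllib.parse import urlparse

MAX_PENETRATION = 1

def crawl(seed_url, fetch_links):
    domain = urlparse(seed_url).netloc
    levels = {0: [seed_url]}          # links keyed by penetration level
    seen = {seed_url}                 # excludes redundant links
    for level in range(1, MAX_PENETRATION + 1):
        levels[level] = []
        for url in levels[level - 1]:
            for link in fetch_links(url):
                if urlparse(link).netloc != domain:
                    continue          # reject links off the parent domain
                if link in seen:
                    continue          # reject redundant links
                seen.add(link)
                levels[level].append(link)
    return levels
```

Because the loop only ever fetches pages from the previous level, raising MAX_PENETRATION deepens the crawl one level at a time, which is why the harvested-link count grows so quickly.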
The main spider script, shown in Listing 18-2, is quite simple. Much of this simplicity, however, comes from storing links in an array instead of in a more scalable (and more complicated) database. As you can see, the functions in the libraries make it easy to download web pages, harvest links, exclude unwanted links, and fully resolve addresses.
# Get links from $SEED_URL
echo "Harvesting Seed URL\n";
$temp_link_array = harvest_links($SEED_URL);
$spider_array = archive_links($spider_array, 0, $temp_link_array);
# Spider links from remaining penetration levels
for($penetration_level=1; $penetration_level<=$MAX_PENETRATION; $penetration_level++)
{
$previous_level = $penetration_level - 1;
for($xx=0; $xx<count($spider_array[$previous_level]); $xx++)
    {
    unset($temp_link_array);
    $temp_link_array = harvest_links($spider_array[$previous_level][$xx]);
    echo "Level=$penetration_level, xx=$xx of ".count($spider_array[$previous_level])." \n";
    $spider_array = archive_links($spider_array, $penetration_level, $temp_link_array);
    }
}
Listing 18-2: Main spider script, harvesting links

When the spider uses www.schrenk.com as a seed URL, it harvests and rejects links, as shown in Figure 18-2. Now that you've seen the main spider script, an exploration of the routines in LIB_simple_spider will provide insight into how it really works.

LIB_simple_spider

Special spider functions are found in the LIB_simple_spider library. This library provides functions that parse links from a web page when given a URL, archive harvested links in an array, identify the root domain for a URL, and identify links that should be excluded from the archive. This library, as well as the other scripts featured in this chapter, is available for download at this book's website.

Figure 18-2. Running the simple spider from Listings 18-1 and 18-2

harvest_links()

The harvest_links() function downloads the specified web page and returns all the links in an array. This function, shown in Listing 18-3, uses the $DELAY setting to keep the spider from sending too many requests to the server over too short a period.[60]

function harvest_links($url)
    {
    # Initialize
    global $DELAY;
    $link_array = array();

    # Get page base for $url (used to create fully resolved URLs for the links)
    $page_base = get_base_page_address($url);
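For comparison, the essentials of a harvest_links()-style routine can be sketched in Python: pause for the request delay, download the page, pull out every anchor's href, and resolve each one against the page's address. The names here are illustrative, not the book library's own, and the fetch uses the standard library rather than cURL.

```python
# A rough Python sketch of link harvesting with a request delay:
# throttle, fetch, parse out <a href> values, and return fully
# resolved URLs. DELAY plays the role of the $DELAY setting.
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

DELAY = 1.0  # seconds to wait before each request

class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def harvest_links(url):
    time.sleep(DELAY)  # keep the spider from hammering the server
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links against the page address, so the caller
    # always receives fully resolved URLs.
    return [urljoin(url, href) for href in parser.hrefs]
```

Resolving against the page base is what lets the spider treat relative links like "news/" and absolute paths like "/about.html" uniformly in its archive.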