Webbots, Spiders, and Screen Scrapers - Michael Schrenk [52]
# Initialization
// Include libraries
include("LIB_http.php");
include("LIB_parse.php");
// Identify the search term and URL combination
$desired_site = "www.loremianam.com";
$search_term = "webbots";
// Initialize other miscellaneous variables
$page_index = 0;
$url_found = false;
$previous_target = "";
// Define the target website and the query string for the search term
$target = "http://www.schrenk.com/nostarch/webbots/search";
$target = $target."?q=".urlencode(trim($search_term));
# End: Initialization
Listing 11-1: Initializing the search-ranking webbot
The target is the page we're going to download, which in this case is a demonstration search page on this book's website. That URL also includes the search term in the query string. The webbot URL encodes the search term to guarantee that none of the characters in the search term conflict with reserved URL character combinations. For example, the PHP built-in function urlencode() changes Karen Susan Terri to Karen+Susan+Terri. If the search term contains characters that are illegal in a URL—for example, the comma or ampersand in Karen, Susan & Terri—it would be safely encoded to Karen%2C+Susan+%26+Terri.
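As a quick stand-alone illustration (using the same urlencode() call as the initialization code above), here is how the query string is built for the comma-and-ampersand example:

```php
<?php
// Build the query string exactly as the initialization code does:
// trim whitespace, then URL-encode the search term.
$search_term = "Karen, Susan & Terri";
$query = "?q=" . urlencode(trim($search_term));
echo $query . "\n"; // prints ?q=Karen%2C+Susan+%26+Terri
```

Note that urlencode() follows the application/x-www-form-urlencoded convention, encoding spaces as plus signs rather than %20.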
Starting the Loop
The webbot loops through the main section of the code, which requests pages of search results and searches within those pages for the desired site, as shown in Listing 11-2.
# Initialize loop
while($url_found==false)
{
$page_index++;
echo "Searching for ranking on page #$page_index\n";
Listing 11-2: Starting the main loop
Within the loop, the script decodes any HTML entities in the target URL to ensure that the values passed to the target web page include only legal characters, as shown in Listing 11-3. In particular, this step replaces &amp;amp; with the plain & character.
// Verify that there are no illegal characters in the URLs
$target = html_entity_decode($target);
$previous_target = html_entity_decode($previous_target);
Listing 11-3: Formatting characters to create properly formatted URLs
This particular step should not be confused with URL encoding. Although &amp;amp; is a legal character sequence in a URL, the web server splits the query string on the & character, so any parameter that follows &amp;amp; arrives with a corrupted name (for example, $_GET['amp;page'] instead of $_GET['page']) and the request returns invalid results.
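To see the problem the decoding step avoids, consider this sketch (the query string and its page parameter are hypothetical examples, not values from the book's code):

```php
<?php
// Hypothetical query string containing an HTML-entity-encoded ampersand.
$raw = "q=webbots&amp;page=2";

// Parsed as-is, the string splits on & and the second parameter's
// name is corrupted to "amp;page", so "page" is missing.
parse_str($raw, $bad);
var_dump(isset($bad['page'])); // bool(false)

// After html_entity_decode(), &amp; becomes & and parsing works.
parse_str(html_entity_decode($raw), $good);
echo $good['page'] . "\n"; // prints 2
```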
Fetching the Search Results
The webbot tries to simulate the action of a person who is manually looking for a website in a set of search results. The webbot uses two techniques to accomplish this trick. The first is the use of a random delay of three to six seconds between fetches, as shown in Listing 11-4.
sleep(rand(3, 6));
Listing 11-4: Implementing a random delay
Taking this precaution will make it less obvious that a webbot is parsing the search results. This is a good practice for all webbots you design.
The second technique simulates a person manually clicking the Next button at the bottom of the search result page to see the next page of search results. Our webbot "clicks" on the link by specifying a referer variable, which in our case is always the target used in the previous iteration of the loop, as shown in Listing 11-5. On the initial fetch, this value is an empty string.
$result = http_get($target, $ref=$previous_target);
$page = $result['FILE'];
Listing 11-5: Downloading the next page of search results from the target and specifying a referer variable
The actual contents of the fetch are returned in the FILE element of the returned $result array.
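As a rough sketch of how that array might be examined (the FILE, STATUS, and ERROR keys come from LIB_http; the literal values below are stand-ins, not the product of a real fetch):

```php
<?php
// Simulated LIB_http-style result array; a real call to http_get()
// would populate these keys from the cURL session.
$result = [
    'FILE'   => '<html><body>search results...</body></html>', // page contents
    'STATUS' => ['http_code' => 200],                          // curl_getinfo() data
    'ERROR'  => '',                                            // cURL error text, if any
];

// Proceed only when the fetch succeeded.
if ($result['ERROR'] === '' && $result['STATUS']['http_code'] == 200) {
    $page = $result['FILE'];
    echo "Fetched " . strlen($page) . " bytes\n";
}
```

Checking the ERROR and STATUS elements before parsing FILE is a useful habit, since parsing an error page produces misleading results.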
Parsing the Search Results
This webbot uses a parsing technique referred to as an insertion parse because it inserts special parsing tags into the fetched web page to facilitate an easy parse (and easy debug). Consider using the insertion parse technique when you need to parse multiple blocks of data that share common separators. The insertion parse is particularly useful when web pages change frequently or when the information you need is buried deep within a complicated HTML table structure. The insertion technique also makes your code much easier to debug, because by viewing