Webbots, Spiders, and Screen Scrapers - Michael Schrenk [52]
# Initialization
// Include libraries
include("LIB_http.php");
include("LIB_parse.php");
// Identify the search term and URL combination
$desired_site = "www.loremianam.com";
$search_term = "webbots";
// Initialize other miscellaneous variables
$page_index = 0;
$url_found = false;
$previous_target = "";
// Define the target website and the query string for the search term
$target = "http://www.schrenk.com/nostarch/webbots/search";
$target = $target."?q=".urlencode(trim($search_term));
# End: Initialization
Listing 11-1: Initializing the search-ranking webbot
The target is the page we're going to download, which in this case is a demonstration search page on this book's website. That URL also includes the search term in the query string. The webbot URL encodes the search term to guarantee that none of the characters in the search term conflict with reserved URL character combinations. For example, the PHP built-in function urlencode() changes Karen Susan Terri to Karen+Susan+Terri. If the search term contains characters that are illegal in a URL—for example, the comma or ampersand in Karen, Susan & Terri—it would be safely encoded to Karen%2C+Susan+%26+Terri.
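As a quick stand-alone illustration (using the same urlencode() call as the initialization code above), here is how the query string is built for the comma-and-ampersand example:

```php
<?php
// Build the query string exactly as the initialization code does:
// trim whitespace, then URL-encode the search term.
$search_term = "Karen, Susan & Terri";
$query = "?q=" . urlencode(trim($search_term));
echo $query . "\n"; // prints ?q=Karen%2C+Susan+%26+Terri
```

Note that urlencode() follows the application/x-www-form-urlencoded convention, encoding spaces as plus signs rather than %20.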
Starting the Loop
The webbot loops through the main section of the code, which requests pages of search results and searches within those pages for the desired site, as shown in Listing 11-2.
# Initialize loop
while($url_found==false)
{
$page_index++;
echo "Searching for ranking on page #$page_index\n";
Listing 11-2: Starting the main loop
Within the loop, the script decodes any HTML entities in the target URL to ensure that the values passed to the target web page include only legal characters, as shown in Listing 11-3. In particular, this step replaces &amp;amp; with the plain & character.
// Verify that there are no illegal characters in the URLs
$target = html_entity_decode($target);
$previous_target = html_entity_decode($previous_target);
Listing 11-3: Formatting characters to create properly formatted URLs
This particular step should not be confused with URL encoding. Although &amp;amp; is a legal character sequence in a URL, the web server splits the query string on the & character, so any parameter that follows &amp;amp; arrives with a corrupted name (for example, $_GET['amp;page'] instead of $_GET['page']) and the request returns invalid results.
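To see the problem the decoding step avoids, consider this sketch (the query string and its page parameter are hypothetical examples, not values from the book's code):

```php
<?php
// Hypothetical query string containing an HTML-entity-encoded ampersand.
$raw = "q=webbots&amp;page=2";

// Parsed as-is, the string splits on & and the second parameter's
// name is corrupted to "amp;page", so "page" is missing.
parse_str($raw, $bad);
var_dump(isset($bad['page'])); // bool(false)

// After html_entity_decode(), &amp; becomes & and parsing works.
parse_str(html_entity_decode($raw), $good);
echo $good['page'] . "\n"; // prints 2
```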
Fetching the Search Results
The webbot tries to simulate the action of a person who is manually looking for a website in a set of search results. The webbot uses two techniques to accomplish this trick. The first is the use of a random delay of three to six seconds between fetches, as shown in Listing 11-4.
sleep(rand(3, 6));
Listing 11-4: Implementing a random delay
Taking this precaution will make it less obvious that a webbot is parsing the search results. This is a good practice for all webbots you design.
The second technique simulates a person manually clicking the Next button at the bottom of the search result page to see the next page of search results. Our webbot "clicks" on the link by specifying a referer variable, which in our case is always the target used in the previous iteration of the loop, as shown in Listing 11-5. On the initial fetch, this value is an empty string.
$result = http_get($target, $ref=$previous_target);
$page = $result['FILE'];
Listing 11-5: Downloading the next page of search results from the target and specifying a referer variable
The actual contents of the fetch are returned in the FILE element of the returned $result array.
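As a rough sketch of how that array might be examined (the FILE, STATUS, and ERROR keys come from LIB_http; the literal values below are stand-ins, not the product of a real fetch):

```php
<?php
// Simulated LIB_http-style result array; a real call to http_get()
// would populate these keys from the cURL session.
$result = [
    'FILE'   => '<html><body>search results...</body></html>', // page contents
    'STATUS' => ['http_code' => 200],                          // curl_getinfo() data
    'ERROR'  => '',                                            // cURL error text, if any
];

// Proceed only when the fetch succeeded.
if ($result['ERROR'] === '' && $result['STATUS']['http_code'] == 200) {
    $page = $result['FILE'];
    echo "Fetched " . strlen($page) . " bytes\n";
}
```

Checking the ERROR and STATUS elements before parsing FILE is a useful habit, since parsing an error page produces misleading results.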
Parsing the Search Results
This webbot uses a parsing technique referred to as an insertion parse because it inserts special parsing tags into the fetched web page to facilitate an easy parse (and easy debug). Consider using the insertion parse technique when you need to parse multiple blocks of data that share common separators. The insertion parse is particularly useful when web pages change frequently or when the information you need is buried deep within a complicated HTML table structure. The insertion technique also makes your code much easier to debug, because by viewing