Webbots, Spiders, and Screen Scrapers - Michael Schrenk
the web page in the search results.

If the webbot doesn't find the website in this page, it finds the URL for the next page of search results. This URL is the link that contains the string Next. The webbot finds this URL by placing all the links into an array, as shown in Listing 11-9.

// Create an array of links on this page

$search_links = parse_array($result['FILE'], "<a", "</a>", EXCL);

Listing 11-9: Parsing the page's links into an array
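The job parse_array() does here, collecting every substring between a start tag and a stop tag, can be sketched in Python with a regular expression. The function name and sample page below are illustrative, not part of the book's LIB_parse library:

```python
import re

def parse_links(page_html):
    """Return every <a ...>...</a> element in page_html, roughly
    mimicking parse_array($page, "<a", "</a>", EXCL)."""
    return re.findall(r"<a\b.*?</a>", page_html, re.IGNORECASE | re.DOTALL)

page = '<p><a href="page2.php">Next</a> <a href="page1.php">Prev</a></p>'
search_links = parse_links(page)
# search_links -> ['<a href="page2.php">Next</a>', '<a href="page1.php">Prev</a>']
```

A regex is enough for a sketch like this; a production webbot would normally use a real HTML parser, since anchor tags split across attributes or comments can trip up simple patterns.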

The webbot then looks at each link until it finds the hyperlink containing the word Next. Once found, it sets the referer variable to the current target and uses the new link as the next target. It also inserts a random three- to six-second delay to simulate human interaction, as shown in Listing 11-10.

for($xx=0; $xx<count($search_links); $xx++)
    {
    if(strstr($search_links[$xx], "Next"))
        {
        $previous_target = $target;
        $target = get_attribute($search_links[$xx], "href");
        // Remember that this path is relative to the target page, so add
        // protocol and domain
        $target = "http://www.schrenk.com/nostarch/webbots/search/".$target;
        }
    }

Listing 11-10: Looking for the URL for the next page of search results
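The same scan-and-rewrite logic, plus the random delay the text mentions, can be sketched in Python. The helper name next_target() and the sample link are illustrative; "Next" matching and href extraction stand in for the book's strstr() and get_attribute():

```python
import random
import re
import time

BASE = "http://www.schrenk.com/nostarch/webbots/search/"

def next_target(search_links, target):
    """Scan the link array for the 'Next' hyperlink and return the
    (previous_target, new_target) pair, mirroring Listing 11-10."""
    previous_target = target
    for link in search_links:
        if "Next" in link:                              # strstr() equivalent
            previous_target = target
            href = re.search(r'href="([^"]*)"', link)   # get_attribute() equivalent
            # The href is relative to the search page, so prepend
            # protocol and domain
            target = BASE + href.group(1)
    return previous_target, target

links = ['<a href="result.php?page=2">Next</a>']
prev, target = next_target(links, BASE + "result.php?page=1")
# Pause three to six seconds to simulate human interaction
time.sleep(random.uniform(3, 6))
```

Keeping the previous target around lets the webbot send it as the HTTP referer on the next fetch, which makes the page requests look more like a person clicking through results.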

* * *

[37] Comments are common parsing landmarks, especially when web pages are created with an HTML generator like Adobe Dreamweaver.

Final Thoughts

Now that you know how to write a webbot that determines search rankings and how to perform an insertion parse, here are a few other things to think about.

Be Kind to Your Sources

Remember that search engines do not make money by displaying search results. The search-ranking webbot is a concept study and not a suggestion for a product that you should develop and place in a production environment, where the public uses it. Also—and this is important—you should not violate any search website's Terms of Service agreement when deploying a webbot like this one.

Search Sites May Treat Webbots Differently Than Browsers

Experience has taught me that some search sites serve pages differently if they think they're dealing with an automated web agent. If you leave the default setting for the agent name (in LIB_http) set to Test Webbot, your programs will definitely look like webbots instead of browsers.
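In LIB_http the agent name is a configurable constant; the same idea in Python is simply to send a browser-like User-Agent header instead of an obvious default. This is a minimal sketch, and the UA string below is an illustrative placeholder, not a recommendation:

```python
import urllib.request

# Illustrative browser-like agent string (an assumption, not from the book)
BROWSER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

def build_request(url):
    """Build a request that identifies itself like a browser rather than
    with a default agent name such as 'Test Webbot'."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_AGENT})

req = build_request("http://www.schrenk.com/nostarch/webbots/search/")
```

Changing the agent name only affects how the webbot self-identifies; it does not excuse ignoring a site's Terms of Service or robots.txt.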

Spidering Search Engines Is a Bad Idea

It is not a good idea to spider Google or any other search engine. I once heard (at a hacking conference) that Google limits individual IP addresses to 250 page requests a day, but I have not verified this. Others have told me that if you make the page requests too quickly, Google will stop replying after sending three result pages. Again, this is unverified, but it won't be an issue if you obey Google's Terms of Service agreement.

What I can verify is that I have, in other circumstances, written spiders for clients where websites did limit the number of daily page fetches from a particular IP address to 250. After the 251st fetch within a 24-hour period, the service ignored all subsequent requests coming from that IP address. For one such project, I put a spider on my laptop and ran it in every Wi-Fi-enabled coffee house I could find in South Minneapolis. This tactic involved drinking a lot of coffee, but it also produced a good number of unique IP addresses for my spider, and I was able to complete the job more quickly than if I had run the spider (in a limited capacity) over a period of many days in my office.

Despite Google's best attempts to thwart automated use of its search results, there are rumors indicating that MSN has been spidering Google to collect records for its own search engine.[38]

If you're interested in these issues, you should read Chapter 28, which describes how to respectfully treat your target websites.

Familiarize Yourself with the Google API

If you are interested in pursuing projects that use Google's data, you should investigate the Google developer API (Application Programming Interface), a service that makes it easier for developers to use Google in noncommercial applications. At the time of this writing, Google provided information about
