Online Book Reader


Webbots, Spiders, and Screen Scrapers - Michael Schrenk [51]


Most search engines return two sets of results for any given search term, as shown in Figure 11-2. The most prominent search results are paid placements, which are purchased advertisements made to look something like search results. The other set of search results is made up of organic placements (or just organics), which are non-sponsored search results.

This chapter's project focuses on organics because they're the links that people are most likely to follow. Organics are also the search results whose visibility is improved through Search Engine Optimization.

Figure 11-2. Parts of a search results page

The other part of the search result page we'll focus on is the Next link. This is important because it tells our webbot where to find the next page of search results.
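As a rough illustration of locating the Next link, here is a minimal Python sketch. It is not the book's script. It assumes the link is an ordinary anchor whose visible text is exactly "Next"; real search pages may mark it up differently. The names NextLinkFinder and find_next_link are hypothetical.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Remember the href of the first anchor whose visible text is 'Next'."""

    def __init__(self):
        super().__init__()
        self._current_href = None  # href of the <a> tag being read, if any
        self._text_parts = []      # text collected inside that <a> tag
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip().lower()
            # Assumption: the link text is exactly "Next" (case-insensitive).
            if text == "next" and self.next_url is None:
                self.next_url = self._current_href
            self._current_href = None


def find_next_link(html):
    """Return the Next link's href, or None when the page has no Next link."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_url


sample = '<a href="/page1">Previous</a> <a href="/page3">Next</a>'
print(find_next_link(sample))  # /page3
```

A return value of None can serve as the signal that the last page of results has been reached.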

For our purposes, the search ranking is determined by counting the number of pages in the search results until the subject web page is first found. The page number is then combined with the position of the subject web page within the organic placements on that page. For example, if a web page is the sixth organic on the first result page, it has a search ranking of 1.6. If a web page is the third organic on the second page, its search ranking is 2.3.
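The ranking arithmetic above can be sketched in a few lines of Python. This is an illustration rather than the book's script, and the function name search_ranking is hypothetical. It returns the ranking as text because the page.position notation is a label, not a true decimal number: stored as a float, position 10 on page 1 would collapse to 1.1.

```python
def search_ranking(page_number, position):
    """Combine a result page number and an organic position into 'page.position'."""
    if page_number < 1 or position < 1:
        raise ValueError("page_number and position are both counted from 1")
    return f"{page_number}.{position}"


print(search_ranking(1, 6))  # sixth organic on the first page: 1.6
print(search_ranking(2, 3))  # third organic on the second page: 2.3
```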

* * *

[35] If you modify this webbot to work on other search services, make sure you are not violating their respective Terms of Service agreements.

What the Search-Ranking Webbot Does

This webbot (actually a specialized spider) submits a search term to a search web page and looks for the subject web page in the search results. If the webbot finds the subject web page within the organic search results, it reports the web page's ranking. If, however, the webbot doesn't find the subject in the organics on that page, it downloads the next page of search results and tries again. The webbot continues searching deeper into the pages of search results until it finds a link to the subject web page. If the webbot can't find the subject web page within a specified number of pages, it will stop looking and report that it could not find the web page within the number of result pages searched.

Running the Search-Ranking Webbot

Figure 11-3 shows the output of our search-ranking webbot. Every run needs both a test web page (the page we're looking for in the search results) and a search term. In our test case, the webbot is looking for the ranking of http://www.loremianam.com, with a search term of webbots.[36] Once the webbot is run, it takes only a few seconds to determine the search ranking for this combination of web page and search term.

Figure 11-3. Running the search-ranking webbot

* * *

[36] Unlike a real search service, the demonstration search pages on the book's website return the same page set regardless of the search term used.

How the Search-Ranking Webbot Works

Our search-ranking webbot uses the process detailed in Figure 11-4 to determine the ranking of a website using a specific search term. These are the steps:

Initialize variables for use, including the search criteria and the subject web page.

Fetch a page of search results from the search engine using the search term.

Parse the organic search results out of the page, separating them from the advertisements and navigation text.

Determine whether or not the desired website appears in this page's search results. If the desired website is not found, keep looking deeper into the search results until the desired web page is found or the maximum number of attempts has been used.

If the desired website is found, record the ranking.

Report the results.
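The steps above can be sketched as a single loop. This is a minimal Python sketch under stated assumptions, not the book's script: find_ranking and fetch_results are hypothetical names, fetch_results stands in for the download-and-parse work and is supplied by the caller, and a link counts as a match when it contains the subject URL as a substring. The fake pages below exist only so the example runs without a network connection.

```python
def find_ranking(fetch_results, subject_url, search_term, max_pages=5):
    """Return 'page.position' for subject_url, or None if not found within max_pages.

    fetch_results(search_term, page_url) is a caller-supplied (hypothetical)
    function returning (organic_links, next_page_url) for one result page.
    """
    page_url = None  # None means "the first page of results"
    for page_number in range(1, max_pages + 1):
        organics, next_url = fetch_results(search_term, page_url)
        for position, link in enumerate(organics, start=1):
            # Assumption: a simple substring test identifies the subject page.
            if subject_url in link:
                return f"{page_number}.{position}"
        if next_url is None:
            break  # no Next link, so there are no more result pages
        page_url = next_url
    return None


# Fake result pages (illustrative data only), keyed by page URL.
FAKE_PAGES = {
    None: (["http://www.example.com/a", "http://www.example.com/b"], "page2"),
    "page2": (["http://www.example.com/c", "http://www.example.org/"], None),
}


def fake_fetch(search_term, page_url):
    return FAKE_PAGES[page_url]


ranking = find_ranking(fake_fetch, "http://www.example.org", "webbots")
print(ranking if ranking else "Not found in the pages searched")  # 2.2
```

Passing the fetcher in as a parameter keeps the paging logic separate from the page-specific parsing, which is the part most likely to change from one search service to another.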

Figure 11-4. Search-ranking webbot at work

The Search-Ranking Webbot Script

The following section describes key aspects of the webbot's script. The latest version of this script is available for download at this book's website.

Note

If you want to experiment with the code, you should download the webbot's script. I have simplified the scripts shown here for demonstration purposes.

Initializing Variables

Initialization consists of including libraries and identifying the subject

