Webbots, Spiders, and Screen Scrapers - Michael Schrenk [44]

and LIB_http_codes. I'll explain these additions as I use them.

Figure 9-1. Link-verification bot flow chart

The webbot downloads the target web page with the http_get() function, which was described in Chapter 3.

# Include libraries

include("LIB_http.php");

include("LIB_parse.php");

include("LIB_resolve_addresses.php");

include("LIB_http_codes.php");

# Identify the target web page and the page base

$target = "http://www.schrenk.com/nostarch/webbots/page_with_broken_links.php";

$page_base = "http://www.schrenk.com/nostarch/webbots/";

# Download the web page

$downloaded_page = http_get($target, $ref="");

Listing 9-1: Initializing the bot and downloading the target web page

Setting the Page Base

In addition to defining the $target, which points to a diagnostic page on the book's website, Listing 9-1 also defines a variable called $page_base. A page base defines the domain and server directory of the target page, which tells the webbot where to find web pages referenced with relative links.
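
As a rough illustration, a page base like the one in Listing 9-1 could be derived from the target URL instead of being typed by hand. The sketch below is not part of the book's libraries: derive_page_base() is a hypothetical helper, and it assumes the URL path ends in either a file name or a trailing slash. It also ignores ports and query strings.

```php
<?php
# Hypothetical helper (an assumption, not from LIB_http or LIB_parse):
# derive a page base by keeping the scheme, the host, and the directory
# portion of the path, up to and including the last slash.
function derive_page_base($url)
{
    $parts  = parse_url($url);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host'])   ? $parts['host']   : '';
    $path   = isset($parts['path'])   ? $parts['path']   : '/';

    # Keep the directory portion of the path, including the trailing slash
    $directory = substr($path, 0, strrpos($path, '/') + 1);

    return $scheme . '://' . $host . $directory;
}

$target    = "http://www.schrenk.com/nostarch/webbots/page_with_broken_links.php";
$page_base = derive_page_base($target);
echo $page_base . "\n";
```

For the $target in Listing 9-1, this yields the same value that the listing assigns to $page_base by hand. A URL whose last path segment is a directory without a trailing slash would be cut one level too short, so setting the page base explicitly, as the listing does, is the safer choice when the target is known in advance.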

Relative links are references to other files—relative to where the reference is made. For example, consider the relative links in Table 9-1.

Table 9-1. Examples of Relative Links

Link

References a File Located In . . .

Same directory as web page

The page's parent directory (up one level)

The page's parent's parent directory (up 2 levels)

The server's root directory

Your webbot would fail if it tried to download any of these links as is, since your webbot's reference point is the computer it runs on, not the computer where the links were found. The page base, however, gives your webbot the same reference point as the target page. You might think of it this way: The page base is to a webbot as the <base> tag is to a browser. The page base sets the reference for everything referred to on the target web page.
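
To make the four cases in Table 9-1 concrete, here is a minimal sketch of how a relative link might be combined with a page base. It is an illustration only: resolve_relative_link() and the file name linked_page.html are hypothetical, and the book's own LIB_resolve_addresses library may work differently. The sketch assumes the page base ends with a slash and that links carry no query string or fragment.

```php
<?php
# Hypothetical helper for illustration; not the book's LIB_resolve_addresses.
function resolve_relative_link($link, $page_base)
{
    # Fully qualified links (http://, https://, ...) need no resolution
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }

    $parts = parse_url($page_base);
    $root  = $parts['scheme'] . '://' . $parts['host'];

    if (substr($link, 0, 1) === '/') {
        # A leading slash refers to the server's root directory
        $path = $link;
    } else {
        # Otherwise the link is relative to the page base's directory
        $base_path = isset($parts['path']) ? $parts['path'] : '/';
        $path = $base_path . $link;
    }

    # Collapse "." and ".." segments, one directory level at a time
    $resolved = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($resolved);
        } elseif ($segment !== '.' && $segment !== '') {
            $resolved[] = $segment;
        }
    }

    return $root . '/' . implode('/', $resolved);
}

$page_base = "http://www.schrenk.com/nostarch/webbots/";

# One hypothetical link for each row of Table 9-1
$links = array(
    "linked_page.html",        # same directory as the web page
    "../linked_page.html",     # parent directory (up one level)
    "../../linked_page.html",  # parent's parent directory (up two levels)
    "/linked_page.html"        # server's root directory
);

foreach ($links as $link) {
    echo $link . " => " . resolve_relative_link($link, $page_base) . "\n";
}
```

With this page base, the last two links resolve to the same address, because going up two levels from /nostarch/webbots/ reaches the server's root directory.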

Parsing the Links

You can easily parse all the links and place them into an array with the script in Listing 9-2.

# Parse the links