Webbots, Spiders, and Screen Scrapers - Michael Schrenk [44]

and LIB_http_codes. I'll explain these additions as I use them.

Figure 9-1. Link-verification bot flow chart

The webbot downloads the target web page with the http_get() function, which was described in Chapter 3.

# Include libraries

include("LIB_http.php");

include("LIB_parse.php");

include("LIB_resolve_addresses.php");

include("LIB_http_codes.php");

# Identify the target web page and the page base

$target = "http://www.schrenk.com/nostarch/webbots/page_with_broken_links.php";

$page_base = "http://www.schrenk.com/nostarch/webbots/";

# Download the web page

$downloaded_page = http_get($target, $ref="");

Listing 9-1: Initializing the bot and downloading the target web page

Setting the Page Base

In addition to defining the $target, which points to a diagnostic page on the book's website, Listing 9-1 also defines a variable called $page_base. A page base defines the domain and server directory of the target page, which tells the webbot where to find web pages referenced with relative links.
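The relationship between $target and $page_base can be sketched in a few lines. This is only an illustration using PHP's built-in parse_url(); the helper name derive_page_base() is hypothetical and is not part of the book's libraries, and Listing 9-1 simply sets the page base by hand.

```php
<?php
// Hypothetical helper (not from the book's libraries): derive a page base
// by keeping the scheme, host, and directory portion of a target URL.
function derive_page_base(string $target): string
{
    $parts = parse_url($target);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return "";   // not an absolute URL, so no base can be derived
    }

    $path = $parts['path'] ?? "/";
    // Keep everything up to and including the last slash (the directory).
    $directory = substr($path, 0, strrpos($path, "/") + 1);

    $port = isset($parts['port']) ? ":" . $parts['port'] : "";
    return $parts['scheme'] . "://" . $parts['host'] . $port . $directory;
}

echo derive_page_base("http://www.schrenk.com/nostarch/webbots/page_with_broken_links.php"), "\n";
```

For the target in Listing 9-1, this yields the same string that the listing assigns to $page_base.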

Relative links are references to other files—relative to where the reference is made. For example, consider the relative links in Table 9-1.

Table 9-1. Examples of Relative Links

Link                                 References a File Located In . . .

File name only, no path prefix       Same directory as web page

Path beginning with ../              The page's parent directory (up one level)

Path beginning with ../../           The page's parent's parent directory (up 2 levels)

Path beginning with /                The server's root directory

Your webbot would fail if it tried to download any of these links as is, since your webbot's reference point is the computer it runs on, not the computer where the links were found. The page base, however, gives your webbot the same reference as the target page. You might think of it this way: The page base is to a webbot as the <base> tag is to a browser. The page base sets the reference for everything referred to on the target web page.
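To make the idea concrete, here is a minimal sketch of how a page base might be combined with the four kinds of relative links described in Table 9-1. The function resolve_link() and the file name linked_page.html are hypothetical, chosen only for illustration; the book's LIB_resolve_addresses library may work differently. The sketch handles only these simple forms plus already-absolute URLs, and it does not treat query strings or protocol-relative links specially.

```php
<?php
// Hypothetical sketch, not the book's LIB_resolve_addresses implementation.
function resolve_link(string $page_base, string $link): string
{
    // Already fully resolved (starts with a scheme such as http://)
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }

    $base = parse_url($page_base);
    $root = $base['scheme'] . "://" . $base['host']
          . (isset($base['port']) ? ":" . $base['port'] : "");

    // A leading slash means "start at the server's root directory";
    // anything else is appended to the page base's directory.
    $path = ($link !== "" && $link[0] === "/")
        ? $link
        : ($base['path'] ?? "/") . $link;

    // Collapse "." and ".." segments.
    $resolved = [];
    foreach (explode("/", $path) as $segment) {
        if ($segment === "..") {
            array_pop($resolved);
        } elseif ($segment !== "." && $segment !== "") {
            $resolved[] = $segment;
        }
    }

    // Keep a trailing slash when the combined path ended with one.
    $trailing = ($resolved && substr($path, -1) === "/") ? "/" : "";
    return $root . "/" . implode("/", $resolved) . $trailing;
}

$page_base = "http://www.schrenk.com/nostarch/webbots/";
$examples  = ["linked_page.html", "../linked_page.html", "../../linked_page.html", "/linked_page.html"];
foreach ($examples as $link) {
    echo str_pad($link, 26), resolve_link($page_base, $link), "\n";
}
```

Each example resolves against the server that hosts the target page, not against the machine running the webbot, which is the point of defining a page base.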

Parsing the Links

You can easily parse all the links and place them into an array with the script in Listing 9-2.

# Parse the links

$link_array = parse_array($downloaded_page['FILE'], $beg_tag="<a", $close_tag=">");

Listing 9-2: Parsing the links from the downloaded page

The code in Listing 9-2 uses parse_array() to put everything between every occurrence of <a and > into an array.[28] The function parse_array() is not case sensitive, so it doesn't matter if the target web page uses <a>, <A>, or a combination of both tags to define links.
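If you don't have the book's LIB_parse library at hand, the same idea can be approximated with PHP's built-in preg_match_all(). The sketch below is an assumption-laden stand-in, not the book's parse_array(): the function name find_anchor_tags() and the sample HTML are hypothetical, and the pattern would be confused by an attribute value that itself contains a > character.

```php
<?php
// Hypothetical stand-in for parse_array(): collect every opening anchor tag,
// matching <a ...> and <A ...> alike.
function find_anchor_tags(string $html): array
{
    // The "i" modifier makes the match case-insensitive; \b keeps tags
    // such as <abbr> or <area> from being mistaken for anchors.
    preg_match_all('/<a\b[^>]*>/i', $html, $matches);
    return $matches[0];
}

$sample = '<p><a href="one.html">One</a> and <A HREF="../two.html">Two</A> <abbr>x</abbr></p>';
print_r(find_anchor_tags($sample));
```

Each array element is a complete opening anchor tag, so the href value still has to be extracted from it before the link can be tested.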

Running a Verification Loop

You gain a great deal of convenience when the parsed links are available in an array. The array allows your script to verify the links iteratively through one set of verification instructions, as shown in Listing 9-3. The PHP sections of this script appear in bold.

Listing 9-3 also includes some HTML formatting to create a nice-looking report, which you'll see later. Notice that the contents of the verification loop have been removed for clarity. I'll explain what happens in this loop next.

Status of links on

for($xx=0; $xx<count($link_array); $xx++) {

// Verification and display go here

}

Listing 9-3: The verification loop
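The shape of such a loop can be sketched independently of what happens inside it. In the example below, verify_links() and the stub check are hypothetical names introduced for illustration only; the real verification steps are the ones the chapter goes on to describe. The timing field is an assumption about how a per-link duration might be recorded for a report.

```php
<?php
// Hypothetical sketch: $check is any callable that takes one anchor tag
// and returns a status string.
function verify_links(array $link_array, callable $check): array
{
    $report = [];
    for ($xx = 0; $xx < count($link_array); $xx++) {
        $start  = microtime(true);
        $status = $check($link_array[$xx]);
        $report[] = [
            'link'    => $link_array[$xx],
            'status'  => $status,
            'seconds' => microtime(true) - $start,   // time spent on this link
        ];
    }
    return $report;
}

// Stub check, for illustration only: flags tags that lack an href attribute.
$stub = fn(string $tag): string => stripos($tag, "href=") !== false ? "has href" : "no href";

foreach (verify_links(['<a href="a.html">', '<a name="top">'], $stub) as $row) {
    printf("%-22s %s\n", $row['link'], $row['status']);
}
```

Because every link passes through the same callable, changing the verification logic means changing one function rather than one block per link.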

Generating Fully Resolved URLs

Since the contents of the $link_array elements are actually complete anchor tags, we need to parse the value of the href attribute out of the tags before we can download and test the pages they reference.

The value of the href attribute is extracted from the anchor tag with the function get_attribute(), as shown in Listing 9-4.

// Parse the href attribute from the link

$link = get_attribute($tag=$link_array[$xx], $attribute="href");

Listing 9-4: Parsing the referenced address from the anchor tag
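For readers without LIB_parse, the following sketch shows one way an attribute value might be pulled from a single tag using a regular expression. The function attribute_value() is a hypothetical stand-in and makes no claim about how the book's get_attribute() is implemented; it assumes the input is one tag, as each element of $link_array is.

```php
<?php
// Hypothetical stand-in for get_attribute(): return the value of one
// attribute from a single tag, or "" when the attribute is absent.
function attribute_value(string $tag, string $attribute): string
{
    $name = preg_quote($attribute, '/');
    // Accept double-quoted, single-quoted, or unquoted values.
    $pattern = '/\s' . $name . '\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';
    if (!preg_match($pattern, $tag, $m)) {
        return "";
    }
    // At most one of the three alternatives captures text.
    return ($m[1] ?? "") . ($m[2] ?? "") . ($m[3] ?? "");
}

echo attribute_value('<A HREF="../two.html">', "href"), "\n";
```

The match is case-insensitive, so HREF and href are treated alike, in keeping with the case-insensitive parsing used to collect the tags.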

Once you have the href address, you need to combine the previously defined $page_base with the relative address to create a fully resolved


URL    HTTP CODE    MESSAGE    DOWNLOAD TIME (seconds)