Webbots, Spiders, and Screen Scrapers - Michael Schrenk [44]
Figure 9-1. Link-verification bot flow chart
The webbot downloads the target web page with the http_get() function, which was described in Chapter 3.
# Include libraries
include("LIB_http.php");
include("LIB_parse.php");
include("LIB_resolve_addresses.php");
include("LIB_http_codes.php");
# Identify the target web page and the page base
$target = "http://www.schrenk.com/nostarch/webbots/page_with_broken_links.php";
$page_base = "http://www.schrenk.com/nostarch/webbots/";
# Download the web page
$downloaded_page = http_get($target, $ref="");
Listing 9-1: Initializing the bot and downloading the target web page
Setting the Page Base
In addition to defining the $target, which points to a diagnostic page on the book's website, Listing 9-1 also defines a variable called $page_base. A page base defines the domain and server directory of the target page, which tells the webbot where to find web pages referenced with relative links.
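Listing 9-1 sets $page_base by hand. As a minimal sketch only, a page base can also be derived from the target URL by keeping everything up to and including the final slash. The function name derive_page_base() is hypothetical and is not part of the book's libraries. The sketch assumes the URL contains a path after the domain.

```php
<?php
# Hypothetical helper, not part of LIB_http or LIB_resolve_addresses.
# Assumes the URL has a path component after the domain name.
function derive_page_base($url)
{
    # Drop any query string or fragment so slashes inside them are ignored
    $url = preg_replace('/[?#].*$/', '', $url);

    # Keep everything up to and including the last slash
    return substr($url, 0, strrpos($url, "/") + 1);
}

$target    = "http://www.schrenk.com/nostarch/webbots/page_with_broken_links.php";
$page_base = derive_page_base($target);

echo $page_base . "\n";
```

For the target in Listing 9-1, this yields the same domain and directory that the listing assigns to $page_base manually.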
Relative links are references to other files—relative to where the reference is made. For example, consider the relative links in Table 9-1.
Table 9-1. Examples of Relative Links
| Link | References a File Located In . . . |
|---|---|
| <a href="linked_page.html"> | Same directory as web page |
| <a href="../linked_page.html"> | The page's parent directory (up one level) |
| <a href="../../linked_page.html"> | The page's parent's parent directory (up 2 levels) |
| <a href="/linked_page.html"> | The server's root directory |
Your webbot would fail if it tried to download any of these links as is, since your webbot's reference point is the computer it runs on, and not the computer where the links were found. The page base, however, gives your webbot the same reference as the target page. You might think of it this way: The page base is to a webbot as the <base> tag is to a browser.
Parsing the Links
You can easily parse all the links and place them into an array with the script in Listing 9-2.
# Parse the links
$link_array = parse_array($downloaded_page['FILE'], $beg_tag="<a", $end_tag=">");
Listing 9-2: Parsing the links from the downloaded page
The code in Listing 9-2 uses parse_array() to put everything between every occurrence of <a and > into an array.[28] The function parse_array() is not case sensitive, so it doesn't matter if the target web page uses <a>, <A>, or a combination of both tags to define links.
Running a Verification Loop
You gain a great deal of convenience when the parsed links are available in an array. The array allows your script to verify the links iteratively through one set of verification instructions, as shown in Listing 9-3. The PHP sections of this script appear in bold. Listing 9-3 also includes some HTML formatting to create a nice-looking report, which you'll see later. Notice that the contents of the verification loop have been removed for clarity. I'll explain what happens in this loop next.
Status of links on
| URL | HTTP CODE | MESSAGE | DOWNLOAD TIME (seconds) |
|---|---|---|---|