Webbots, Spiders, and Screen Scrapers - Michael Schrenk [45]

URL, which your webbot can use to download pages. A fully resolved URL is any URL that describes not only the file to download, but also the server and directory where that file is located and the protocol required to access it. Table 9-2 shows the fully resolved addresses for the links in Table 9-1, assuming the links are on a page in the directory, http://www.schrenk.com/nostarch/webbots.

Table 9-2. Examples of Fully Resolved URLs (for links on http://www.schrenk.com/nostarch/webbots)

Link

Fully Resolved URL

http://www.schrenk.com/nostarch/webbots/linked_page.html

http://www.schrenk.com/nostarch/linked_page.html

http://www.schrenk.com/linked_page.html

http://www.schrenk.com/linked_page.html

Fully resolved URLs are made with the resolve_address() function (see Listing 9-5), which is in the LIB_resolve_addresses library. This library is a set of routines that converts all possible methods of referencing web pages in HTML into fully resolved URLs.

// Create a fully resolved URL
$fully_resolved_link_address = resolve_address($link, $page_base);

Listing 9-5: Creating fully resolved addresses with resolve_address()
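To make the idea of address resolution concrete, here is a minimal sketch of how a relative link might be joined to the directory of the page it was found on. It is not the code in LIB_resolve_addresses: the function name sketch_resolve_url() is hypothetical, the link strings in the usage lines are illustrative assumptions, and the sketch ignores ports, query strings, fragments, and trailing slashes.

```php
<?php
// Hypothetical helper, NOT the book's resolve_address(): joins a link to the
// URL of the directory that holds the page the link appears on.
function sketch_resolve_url(string $link, string $page_base): string
{
    // A link that already names a protocol is treated as fully resolved.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }

    $parts  = parse_url($page_base);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // A leading slash is relative to the server root; anything else is
    // relative to the directory given in $page_base.
    if (substr($link, 0, 1) === '/') {
        $path = $link;
    } else {
        $path = rtrim($parts['path'] ?? '', '/') . '/' . $link;
    }

    // Collapse "." and ".." segments.
    $resolved = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;
        }
        if ($segment === '..') {
            array_pop($resolved);
            continue;
        }
        $resolved[] = $segment;
    }

    return $origin . '/' . implode('/', $resolved);
}

// Usage (illustrative links, base directory taken from the text):
$base = 'http://www.schrenk.com/nostarch/webbots';
$same_directory = sketch_resolve_url('linked_page.html', $base);
// http://www.schrenk.com/nostarch/webbots/linked_page.html
$parent_directory = sketch_resolve_url('../linked_page.html', $base);
// http://www.schrenk.com/nostarch/linked_page.html
$server_root = sketch_resolve_url('/linked_page.html', $base);
// http://www.schrenk.com/linked_page.html
```

A production resolver also has to cope with base tags, protocol-relative links, and malformed input, which is one reason to rely on a tested library routine rather than a sketch like this one.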

Downloading the Linked Page

The webbot verifies the status of each page referenced by the links on the target page by downloading each page and examining its status. It downloads the pages with http_get(), just as you downloaded the target web page earlier (see Listing 9-6).

// Download the page referenced by the link and evaluate
$downloaded_link = http_get($fully_resolved_link_address, $target);

Listing 9-6: Downloading a page referenced by a link

Notice that the second parameter in http_get() is set to the address of the target web page. This sets the page's referer variable to the target page. When executed, the effect is the same as telling the server that someone requested the page by clicking a link from the target web page.
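For readers curious about what a referer-aware download might look like at the PHP/CURL level, the following is a rough sketch. It is not the library's http_get(): the name sketch_http_get() is hypothetical, the FILE and ERROR keys, the timeout, and the decision not to follow redirects are all assumptions, and only the STATUS element mirrors what the text describes.

```php
<?php
// Hypothetical stand-in for http_get(); requires the PHP cURL extension.
function sketch_http_get(string $url, string $referer): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_REFERER, $referer);      // sent as the Referer request header
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);  // assumption: report redirects rather than follow them
    curl_setopt($ch, CURLOPT_TIMEOUT, 25);            // assumption: arbitrary limit, in seconds

    $body   = curl_exec($ch);     // string on success, false on failure
    $status = curl_getinfo($ch);  // array that includes 'http_code' and 'total_time'
    $error  = curl_error($ch);    // empty string when no transport error occurred

    // 'FILE' and 'ERROR' are assumed key names; 'STATUS' follows the text.
    return ['FILE' => $body, 'STATUS' => $status, 'ERROR' => $error];
}

// Usage (not executed here because it needs network access):
// $downloaded_link = sketch_http_get($fully_resolved_link_address, $target);
// $http_code = $downloaded_link['STATUS']['http_code'];
```

Because the referer is just a request header, the server has no way to confirm that a click actually happened; it only sees the address the client chose to report.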

Displaying the Page Status

Once the linked page is downloaded, the webbot relies on the STATUS element of the downloaded array to analyze the HTTP code, which is provided by PHP/CURL. (For your future projects, keep in mind that PHP/CURL also provides total download time and other diagnostics that we're not using in this project.)

HTTP status codes are standardized, three-digit numbers that indicate the status of a page fetch.[29] This webbot uses these codes to determine whether a link is broken or valid. The codes are divided into ranges, each of which indicates a type of error or status, as shown in Table 9-3.

Table 9-3. HTTP Code Ranges and Related Categories

HTTP Code Range   Category        Meaning
100-199           Informational   Not generally used
200-299           Successful      Your page request was successful
300-399           Redirection     The page you're looking for has moved to a different address
400-499           Client error    Your web client made an incorrect or illogical page request
500-599           Server error    A server error occurred, generally associated with a bad form submission
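The ranges in Table 9-3 translate naturally into a small classification routine. The sketch below is one possible way to do this; both function names are hypothetical, and the rule that only codes outside 200-399 count as broken is an assumption about policy, not something the table prescribes.

```php
<?php
// Hypothetical helper: map an HTTP code to its Table 9-3 category.
function sketch_status_category(int $http_code): string
{
    $categories = [
        1 => 'Informational',
        2 => 'Successful',
        3 => 'Redirection',
        4 => 'Client error',
        5 => 'Server error',
    ];

    // The hundreds digit selects the range; anything else is unknown.
    return $categories[intdiv($http_code, 100)] ?? 'Unknown';
}

// Assumption: successful and redirected fetches count as valid links,
// everything else (including codes outside 100-599) counts as broken.
function sketch_link_is_broken(int $http_code): bool
{
    return $http_code < 200 || $http_code > 399;
}

// Usage:
$category = sketch_status_category(404);  // 'Client error'
$broken   = sketch_link_is_broken(404);   // true
```

A stricter link checker might flag redirects as well, since a link that redirects still points at an address that is no longer current.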

The $status_code_array was created when the LIB_http_codes library was imported. When you use the HTTP code as an index into $status_code_array, it returns a human-readable status message, as shown in Listing 9-7. (PHP script is in bold.)

<?php echo $status_code_array[$downloaded_link['STATUS']['http_code']]; ?>

Listing 9-7: Displaying the status of linked web pages
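Indexing an array directly with the HTTP code raises a notice or warning if the code is not in the array, so a lookup with a fallback can be safer. The sketch below illustrates the idea with a deliberately tiny, hypothetical message table; the real $status_code_array in LIB_http_codes may be shaped differently and covers far more codes.

```php
<?php
// Hypothetical miniature code-to-message table, for illustration only.
$sketch_status_messages = [
    200 => '200 OK',
    301 => '301 Moved Permanently',
    302 => '302 Found',
    404 => '404 Not Found',
    500 => '500 Internal Server Error',
];

// Hypothetical helper: look up a message without failing on unknown codes.
function sketch_status_message(array $messages, int $http_code): string
{
    // cURL reports an http_code of 0 when no HTTP response was received.
    if ($http_code === 0) {
        return 'No response received';
    }

    return $messages[$http_code] ?? "Unrecognized status ($http_code)";
}

// Usage:
$message = sketch_status_message($sketch_status_messages, 404);  // '404 Not Found'
```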

As an added feature, the webbot displays the amount of time (in seconds) required to download pages referenced by the links on the target web page. This period is automatically measured and recorded by PHP/CURL when the page is downloaded. The period required to download the page is available in the array element: $downloaded_link['STATUS']['total_time'].
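As a small illustration of how the two STATUS values might be combined into one line of a report, consider the sketch below. The function name and the output layout are hypothetical; only the http_code and total_time keys come from the text.

```php
<?php
// Hypothetical helper: summarize one checked link as a single text line.
function sketch_report_line(string $url, array $status): string
{
    // Fall back to zero when a key is missing (for example, a failed fetch).
    $code    = $status['http_code'] ?? 0;
    $seconds = $status['total_time'] ?? 0.0;

    // %F formats the float without regard to the current locale.
    return sprintf('%s  code=%d  time=%.3F s', $url, $code, $seconds);
}

// Usage with a hand-built STATUS array:
$line = sketch_report_line(
    'http://www.schrenk.com/',
    ['http_code' => 200, 'total_time' => 0.25]
);
// 'http://www.schrenk.com/  code=200  time=0.250 s'
```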
