http://www.schrenk.com/linked_page.html
Fully resolved URLs are made with the resolve_address() function (see Listing 9-5), which is in the LIB_resolve_addresses library. This library is a set of routines that converts all possible methods of referencing web pages in HTML into fully resolved URLs.
// Create a fully resolved URL
$fully_resolved_link_address = resolve_address($link, $page_base);
Listing 9-5: Creating fully resolved addresses with resolve_address()
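To make the idea of a fully resolved URL concrete, here is a minimal sketch of one way a relative link could be combined with the address of the page it came from. It is not the code in LIB_resolve_addresses: the function name resolve_link_sketch() is hypothetical, and it assumes the second argument is the full URL of the page that contains the link, which may differ from what $page_base holds in the book's library.

```php
<?php
// Illustrative sketch only; this is NOT the book's resolve_address().
// Assumption: $page_url is the full URL of the page containing the link.
function resolve_link_sketch(string $link, string $page_url): string
{
    // A link that already carries a scheme (http:, https:, mailto:) is returned as-is.
    if (is_string(parse_url($link, PHP_URL_SCHEME))) {
        return $link;
    }

    $base   = parse_url($page_url);
    $origin = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative link (//host/path): borrow the page's scheme.
    if (strpos($link, '//') === 0) {
        return $base['scheme'] . ':' . $link;
    }

    // Separate the path from any query string or fragment.
    $cut    = strcspn($link, '?#');
    $path   = substr($link, 0, $cut);
    $suffix = substr($link, $cut);

    $base_path = $base['path'] ?? '/';
    if ($path === '') {
        // Query-only or fragment-only link: it refers to the page itself.
        $path = $base_path;
    } elseif ($path[0] !== '/') {
        // Relative path: prepend the directory of the containing page.
        $path = substr($base_path, 0, strrpos($base_path, '/') + 1) . $path;
    }

    // Collapse "." and ".." segments.
    $raw      = explode('/', $path);
    $segments = [];
    foreach ($raw as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    $resolved = '/' . implode('/', $segments);

    // Keep a trailing slash when the path pointed at a directory.
    if ($resolved !== '/' && in_array(end($raw), ['', '.', '..'], true)) {
        $resolved .= '/';
    }

    return $origin . $resolved . $suffix;
}
```

Under these assumptions, resolve_link_sketch('linked_page.html', 'http://www.schrenk.com/index.html') yields the fully resolved address shown above. The sketch ignores the HTML base tag and other cases that a complete library would have to handle.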
Downloading the Linked Page
The webbot verifies the status of each page referenced by the links on the target page by downloading each page and examining its status. It downloads the pages with http_get(), just as you downloaded the target web page earlier (see Listing 9-6).
// Download the page referenced by the link and evaluate
$downloaded_link = http_get($fully_resolved_link_address, $target);
Listing 9-6: Downloading a page referenced by a link
Notice that the second parameter in http_get() is set to the address of the target web page. This sets the request's referer (the HTTP Referer header) to the target page. To the server, the request looks as if someone had clicked a link on the target web page.
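For readers curious about what a download with a referer might look like underneath, the following is a rough sketch written directly against PHP/CURL. It is not the book's http_get(): the function name fetch_with_referer_sketch(), the timeout parameter, and the FILE and ERROR keys are assumptions made for illustration. Only the STATUS element, with its http_code and total_time entries, corresponds to what this chapter describes.

```php
<?php
// Illustrative sketch only; this is NOT the book's http_get().
// The 'FILE' and 'ERROR' keys and the $timeout parameter are assumptions.
function fetch_with_referer_sketch(string $url, string $referer, int $timeout = 30): array
{
    $ch = curl_init($url);

    // Tell the server which page the request claims to come from.
    curl_setopt($ch, CURLOPT_REFERER, $referer);

    // Return the page as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // Give up on servers that do not answer in time.
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);

    $body = curl_exec($ch);

    return [
        'FILE'   => $body,              // page contents, or false on failure
        'STATUS' => curl_getinfo($ch),  // includes 'http_code' and 'total_time'
        'ERROR'  => curl_error($ch),    // empty string when no error occurred
    ];
}
```

Because redirects are not followed in this sketch, a moved page reports its 3xx code rather than the code of the page it redirects to. When no HTTP response is received at all, PHP/CURL reports an http_code of 0.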
Displaying the Page Status
Once the linked page is downloaded, the webbot relies on the STATUS element of the downloaded array to analyze the HTTP code, which is provided by PHP/CURL. (For your future projects, keep in mind that PHP/CURL also provides total download time and other diagnostics that we're not using in this project.)
HTTP status codes are standardized, three-digit numbers that indicate the status of a page fetch.[29] This webbot uses these codes to determine if a link is broken or valid. These codes are divided into ranges that define the type of errors or status, as shown in Table 9-3.
Table 9-3. HTTP Code Ranges and Related Categories
| HTTP Code Range | Category | Meaning |
| --- | --- | --- |
| 100-199 | Informational | Not generally used |
| 200-299 | Successful | Your page request was successful |
| 300-399 | Redirection | The page you're looking for has moved or has been removed |
| 400-499 | Client error | Your web client made an incorrect or illogical page request |
| 500-599 | Server error | A server error occurred, generally associated with a bad form submission |
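The ranges in Table 9-3 translate naturally into a small lookup function. The sketch below is not part of LIB_http_codes; the name http_status_category() and the 'Unknown' fallback are assumptions added for illustration.

```php
<?php
// Illustrative sketch: map an HTTP status code to its category in Table 9-3.
// The function name and the 'Unknown' fallback are hypothetical.
function http_status_category(int $code): string
{
    if ($code >= 100 && $code <= 199) {
        return 'Informational';
    }
    if ($code >= 200 && $code <= 299) {
        return 'Successful';
    }
    if ($code >= 300 && $code <= 399) {
        return 'Redirection';
    }
    if ($code >= 400 && $code <= 499) {
        return 'Client error';
    }
    if ($code >= 500 && $code <= 599) {
        return 'Server error';
    }

    // Anything outside 100-599, including 0 when no response arrived.
    return 'Unknown';
}
```

For example, http_status_category(404) returns 'Client error', which a link checker would typically report as a broken link.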
The $status_code_array was created when the LIB_http_codes library was imported. When you use the HTTP code as an index into $status_code_array, it returns a human-readable status message, as shown in Listing 9-7. (PHP script is in bold.)
<?php echo $status_code_array[$downloaded_link['STATUS']['http_code']]; ?>
Listing 9-7: Displaying the status of linked web pages
As an added feature, the webbot displays the amount of time (in seconds) required to download pages referenced by the links on the target web page. This period is automatically measured and recorded by PHP/CURL when the page is downloaded. The period required to download the page is available in the array element: $downloaded_link['STATUS']['total_time'].
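The status lookup and the download time can be combined into a single line of output, roughly as sketched here. The function describe_link_status(), the 'Unknown status' fallback, and the four-entry stand-in array are hypothetical; the real $status_code_array comes from LIB_http_codes and its messages may be worded differently.

```php
<?php
// Stand-in for the array created by LIB_http_codes (assumed contents).
$status_code_array_sketch = [
    200 => 'OK',
    301 => 'Moved Permanently',
    404 => 'Not Found',
    500 => 'Internal Server Error',
];

// Illustrative sketch: summarize one downloaded link as "code message (time)".
function describe_link_status(array $downloaded_link, array $status_codes): string
{
    $code    = (int) $downloaded_link['STATUS']['http_code'];
    $seconds = (float) $downloaded_link['STATUS']['total_time'];

    // Fall back to a generic label for codes missing from the lookup array.
    $message = $status_codes[$code] ?? 'Unknown status';

    return sprintf('%d %s (%.2f s)', $code, $message, $seconds);
}

// Example with hand-written values in place of a real download.
$example = ['STATUS' => ['http_code' => 404, 'total_time' => 0.4213]];
echo describe_link_status($example, $status_code_array_sketch), "\n";
```

The values in $example are invented for the demonstration; in the webbot they would come from the array returned for each downloaded link.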
* * *