Webbots, Spiders, and Screen Scrapers - Michael Schrenk [45]

URL, which your webbot can use to download pages. A fully resolved URL is any URL that describes not only the file to download, but also the server and directory where that file is located and the protocol required to access it. Table 9-2 shows the fully resolved addresses for the links in Table 9-1, assuming the links are on a page in the directory http://www.schrenk.com/nostarch/webbots.

Table 9-2. Examples of Fully Resolved URLs (for links on http://www.schrenk.com/nostarch/webbots)

Link

Fully Resolved URL

http://www.schrenk.com/nostarch/webbots/linked_page.html

http://www.schrenk.com/nostarch/linked_page.html

http://www.schrenk.com/linked_page.html

Fully resolved URLs are made with the resolve_address() function (see Listing 9-5), which is in the LIB_resolve_addresses library. This library is a set of routines that converts all possible methods of referencing web pages in HTML into fully resolved URLs.

// Create a fully resolved URL
$fully_resolved_link_address = resolve_address($link, $page_base);

Listing 9-5: Creating fully resolved addresses with resolve_address()
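
To make the idea of resolution concrete, here is a minimal sketch of the work such a routine has to do. It is not the code in LIB_resolve_addresses: the function name sketch_resolve() is hypothetical, and the three sample links are assumed examples of a same-directory link, a parent-directory link, and a root-relative link. The sketch also ignores harder cases, such as protocol-relative links, trailing slashes, and pages that declare a base tag.

```php
<?php
// Hypothetical helper (NOT the library's resolve_address()): resolve a link
// against the directory of the page on which the link was found.
function sketch_resolve(string $link, string $page_base): string
{
    // A link that already names a protocol and server is fully resolved.
    if (preg_match('#^[a-z][a-z0-9+.\-]*://#i', $link)) {
        return $link;
    }

    // Rebuild "protocol://server[:port]" from the page base.
    $base   = parse_url($page_base);
    $origin = $base['scheme'] . '://' . $base['host'];
    if (isset($base['port'])) {
        $origin .= ':' . $base['port'];
    }

    // A leading slash means "start at the server root";
    // anything else starts in the base directory.
    if (strncmp($link, '/', 1) === 0) {
        $path = $link;
    } else {
        $path = rtrim($base['path'] ?? '', '/') . '/' . $link;
    }

    // Collapse "." and ".." path segments.
    $resolved = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;
        }
        if ($segment === '..') {
            array_pop($resolved);
            continue;
        }
        $resolved[] = $segment;
    }

    return $origin . '/' . implode('/', $resolved);
}

// Assumed sample links, resolved against the directory used in Table 9-2.
$page_base = 'http://www.schrenk.com/nostarch/webbots';
foreach (['linked_page.html', '../linked_page.html', '/linked_page.html'] as $link) {
    echo sketch_resolve($link, $page_base), "\n";
}
```

With these assumed inputs, the three lines printed are the three fully resolved URLs listed in Table 9-2.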

Downloading the Linked Page

The webbot verifies the status of each page referenced by the links on the target page by downloading each page and examining its status. It downloads the pages with http_get(), just as you downloaded the target web page earlier (see Listing 9-6).
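
For readers without the book's library at hand, the same download-and-check step can be sketched directly with PHP's cURL extension. This is a stand-in under stated assumptions, not the book's http_get(): the names sketch_fetch_status() and sketch_is_broken() are hypothetical, and the rule that any 4xx or 5xx code (or no response at all) marks a broken link is an assumption made for illustration.

```php
<?php
// Hypothetical stand-in for the book's http_get(): download a page with
// cURL and return only its HTTP status code.
function sketch_fetch_status(string $url): int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects to the final page
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds
    curl_exec($ch);

    // cURL reports 0 when no HTTP response was received at all.
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

// Assumed rule: no response, or any 4xx/5xx status, counts as a broken link.
function sketch_is_broken(int $http_code): bool
{
    return $http_code === 0 || $http_code >= 400;
}

// Example use (requires network access):
// $code = sketch_fetch_status('http://www.schrenk.com/nostarch/webbots/linked_page.html');
// echo sketch_is_broken($code) ? "broken ($code)\n" : "ok ($code)\n";
```

Keeping the status test in its own small function makes the pass/fail rule easy to change, for example to treat redirects separately, without touching the download code.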