
Webbots, Spiders, and Screen Scrapers - Michael Schrenk [42]


something they want to share with the PHP community. In this case, it's a function that expands on mkdir() by creating complete directory structures with multiple directories at once. I modified the function slightly for our purposes. This function, shown in Listing 8-3, creates any file path that doesn't already exist on the hard drive and, if needed, it will create multiple directories for a single file path. For example, if the image's file path is images/templates/November, this function will create all three directories—images, templates, and November—to satisfy the entire file path.

function mkpath($path)
{
    # Make sure that the slashes are all single and lean the correct way
    $path=preg_replace('/(\/){2,}|(\\\){1,}/','/',$path);

    # Make an array of all the directories in path
    $dirs=explode("/",$path);

    # Verify that each directory in path exists and create if necessary
    $path="";
    foreach ($dirs as $element)
    {
        $path.=$element."/";
        if(!is_dir($path))   // Directory verified here
            mkdir($path);    // Created if it doesn't exist
    }
}

Listing 8-3: Re-creating file paths for downloaded images

The script in Listing 8-3 splits the file path into an array of directory names and re-creates that path on the local filesystem, one directory at a time. Only nonexistent directories are created.
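For comparison, PHP's built-in mkdir() accepts a third argument that asks it to create nested directories in a single call. The following is a minimal sketch of that alternative, not the function from the library; the name ensure_path_sketch() and the temporary demo directory are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not part of LIB_download_images): create every
// missing directory in $path with one recursive mkdir() call.
function ensure_path_sketch(string $path): bool
{
    // Collapse backslashes and repeated slashes into single forward slashes
    $path = preg_replace('#[\\\\/]+#', '/', $path);

    // Nothing to do if the path exists; otherwise the third argument (true)
    // tells mkdir() to create all intermediate directories as well
    return is_dir($path) || mkdir($path, 0777, true);
}

// Usage: build images/templates/November under the system temp directory
$root = sys_get_temp_dir() . '/mkpath_demo_' . uniqid();
ensure_path_sketch($root . '/images/templates/November');
```

Both approaches leave existing directories untouched. The loop in Listing 8-3 makes each step visible, while the recursive flag hands the same work to PHP.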

The Main Script

The main function for this webbot, download_images_for_page(), is broken down into highlights and explained below. As mentioned earlier, this function—and the entire LIB_download_images library—is available at this book's website.

Initialization and Target Validation

Since $target is used later to resolve the web addresses of the images, its value must be validated after the web page is downloaded. This is important because the server may redirect the webbot to an updated web page. That updated URL is the actual URL of the target page, and it is the one that all relative files are referenced from in the next step. The script in Listing 8-4 updates $target to the URL that was actually downloaded, which differs from the requested URL if a redirection occurred.

function download_images_for_page($target)
{
    echo "target = $target\n";

    # Download the web page
    $web_page = http_get($target, $referer="");

    # Update the target in case there was a redirection
    $target = $web_page['STATUS']['url'];

Listing 8-4: Downloading the target web page and responding to page redirection
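To see where a post-redirection URL can come from, here is a rough sketch built directly on PHP's cURL extension. It assumes that http_get() returns an array whose STATUS element holds transfer details with a url key, as Listing 8-4 suggests. The names fetch_page_sketch() and resolved_target_sketch() are hypothetical and are not part of the book's libraries.

```php
<?php
// Hypothetical fetch routine shaped like the result used in Listing 8-4.
// Requires the cURL extension; it is defined here but not called.
function fetch_page_sketch(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // Return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // Follow Location: headers
    curl_setopt($ch, CURLOPT_MAXREDIRS, 4);          // Give up after four hops
    $file   = curl_exec($ch);
    $status = curl_getinfo($ch);                     // Array that includes the final 'url'
    return array('FILE' => $file, 'STATUS' => $status);
}

// Hypothetical guard: use the reported final URL, or fall back to the
// requested address if the status array does not contain one
function resolved_target_sketch(array $web_page, string $requested): string
{
    $final = $web_page['STATUS']['url'] ?? '';
    return $final !== '' ? $final : $requested;
}
```

The fallback is a defensive choice for this sketch. Listing 8-4 assigns the status URL directly.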

Defining the Page Base

Much like the HTML <base> tag, the webbot uses $page_base to define the directory address of the target web page. This address becomes the reference for all images with relative addresses. For example, if $target is http://www.schrenk.com/april/index.php, then $page_base becomes http://www.schrenk.com/april.

This task, shown in Listing 8-5, is performed by the function get_base_page_address(), which is defined in LIB_resolve_address and included by LIB_download_images.

# Strip filename off target for use as page base
$page_base=get_base_page_address($target);

Listing 8-5: Creating the page base for the target web page

As an example, if the webbot finds an image with the relative address 14/logo.gif, and the page base is http://www.schrenk.com/april, the webbot will use the page base to derive the fully resolved address for the image. In this case, the fully resolved address is http://www.schrenk.com/april/14/logo.gif. In contrast, an image path that begins with a slash, such as /march/14/logo.gif, is relative to the site root rather than the page base, so it resolves to http://www.schrenk.com/march/14/logo.gif.
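The two resolution cases above can be sketched with PHP's parse_url(). This is only an illustration of the idea, not the code in LIB_resolve_address. The function names page_base_sketch() and resolve_image_sketch() are hypothetical, and the sketch ignores ports, query strings, and ../ segments.

```php
<?php
// Hypothetical stand-in for get_base_page_address(): strip the filename
// from a URL and return the page's directory without a trailing slash.
function page_base_sketch(string $url): string
{
    $parts = parse_url($url);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    $path  = $parts['path'] ?? '/';

    // Keep everything before the last slash, i.e. the page's directory
    return $root . substr($path, 0, strrpos($path, '/'));
}

// Hypothetical resolver covering the three simplest address forms
function resolve_image_sketch(string $src, string $page_base): string
{
    // Already fully resolved: return unchanged
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    // Leading slash: relative to the site root, so the page directory is ignored
    if (strncmp($src, '/', 1) === 0) {
        $parts = parse_url($page_base);
        return $parts['scheme'] . '://' . $parts['host'] . $src;
    }
    // Otherwise: relative to the page base
    return rtrim($page_base, '/') . '/' . $src;
}

$page_base = page_base_sketch('http://www.schrenk.com/april/index.php');
echo $page_base, "\n";
// prints http://www.schrenk.com/april
echo resolve_image_sketch('14/logo.gif', $page_base), "\n";
// prints http://www.schrenk.com/april/14/logo.gif
echo resolve_image_sketch('/march/14/logo.gif', $page_base), "\n";
// prints http://www.schrenk.com/march/14/logo.gif
```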

Creating a Root Directory for Imported File Structure

Since this webbot may download images from a number of domains, it creates a directory structure for each (see Listing 8-6). The root directory of each imported file structure is based on the page base.

# Identify the directory where images are to be saved
$save_image_directory = "saved_images_".str_replace("http://", "", $page_base);

Listing 8-6: Creating a root directory for the imported file structure
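As a quick illustration of the resulting directory name, the sketch below applies the same idea to the page base from the earlier example. It is a variant, not the library's code: the function name save_directory_sketch() is hypothetical, and stripping https:// as well as http:// is an assumption added here.

```php
<?php
// Hypothetical variant of Listing 8-6 that also strips an https:// prefix
function save_directory_sketch(string $page_base): string
{
    // Remove the leading scheme so the result is a plain relative path
    return 'saved_images_' . preg_replace('#^https?://#i', '', $page_base);
}

echo save_directory_sketch('http://www.schrenk.com/april'), "\n";
// prints saved_images_www.schrenk.com/april
```

Because the result contains a slash, it names a nested path, which is why a function like mkpath() is needed to create it.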

Parsing Image Tags from the Downloaded Web Page

The webbot uses techniques described in Chapter
