Webbots, Spiders, and Screen Scrapers - Michael Schrenk [42]
function mkpath($path)
    {
    # Make sure that the slashes are all single and lean the correct way
    $path=preg_replace('/(\/){2,}|(\\\){1,}/','/',$path);

    # Make an array of all the directories in path
    $dirs=explode("/",$path);

    # Verify that each directory in path exists and create if necessary
    $path="";
    foreach ($dirs as $element)
        {
        $path.=$element."/";
        if(!is_dir($path))      // Directory verified here
            {
            mkdir($path);       // Created if it doesn't exist
            }
        }
    }
Listing 8-3: Re-creating file paths for downloaded images
The script in Listing 8-3 splits the path into an array of directory names and attempts to re-create that path, one directory at a time, on the local filesystem. Only directories that do not already exist are created.
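As a point of comparison, the same result can be sketched with the recursive flag of PHP's built-in mkdir(), which creates missing parent directories in one call. This is an illustrative alternative, not the book's method: the function name mkpath_recursive() and the demo path are assumptions made for this example.

```php
<?php
// Hypothetical alternative to mkpath(); relies on mkdir()'s recursive flag.
function mkpath_recursive(string $path): bool
{
    // Collapse runs of forward or backward slashes into a single "/"
    $path = preg_replace('#[/\\\\]+#', '/', $path);

    // Nothing to do if the directory is already there
    if (is_dir($path)) {
        return true;
    }

    // Passing true as the third argument asks mkdir() to create parents too
    return mkdir($path, 0777, true);
}

// Usage: build a nested image directory under the system temp directory
$demo = sys_get_temp_dir() . '/mkpath_demo_' . uniqid() . '/images\\2024//logos';
var_dump(mkpath_recursive($demo));
```

The loop in Listing 8-3 makes each step visible, which suits a walkthrough; the recursive flag is simply shorter when that visibility is not needed.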
The Main Script
The main function for this webbot, download_images_for_page(), is broken down into highlights and explained below. As mentioned earlier, this function—and the entire LIB_download_images library—is available at this book's website.
Initialization and Target Validation
Since $target is used later to resolve the web addresses of the images, its value must be validated after the web page is downloaded. This is important because the server may redirect the webbot to an updated web page. That updated URL is the actual URL of the target page, and it is the one from which all relative file references are resolved in the next step. The script in Listing 8-4 replaces $target with the URL that was actually downloaded, so the value stays correct even when the request was redirected.
function download_images_for_page($target)
    {
    echo "target = $target\n";

    # Download the web page
    $web_page = http_get($target, $referer="");

    # Update the target in case there was a redirection
    $target = $web_page['STATUS']['url'];
Listing 8-4: Downloading the target web page and responding to page redirection
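To make the redirection step concrete, here is a minimal sketch of how the final URL might be recovered, assuming the page is fetched with PHP's cURL extension. The function names fetch_with_final_url() and resolve_final_target() are hypothetical and are not part of the book's libraries; only the 'STATUS' and 'url' array keys mirror Listing 8-4.

```php
<?php
// Hypothetical helper: download a page and keep cURL's transfer details.
function fetch_with_final_url(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);          // stop after 5 redirects
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch);  // includes 'url', the last effective URL
    return ['FILE' => $body, 'STATUS' => $status];
}

// Hypothetical helper: prefer the URL actually downloaded, if one was reported.
function resolve_final_target(string $requested, array $web_page): string
{
    $final = $web_page['STATUS']['url'] ?? '';
    return $final !== '' ? $final : $requested;
}

// Usage (performs a network request, so it is left commented out):
// $web_page = fetch_with_final_url('http://www.example.com/');
// $target   = resolve_final_target('http://www.example.com/', $web_page);
```

Falling back to the requested URL when no effective URL is reported is a design choice of this sketch; Listing 8-4 assigns the reported URL unconditionally.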
Defining the Page Base
Much like the
This task, shown in Listing 8-5, is performed by the function get_base_page_address(), which is defined in LIB_resolve_address and included by LIB_download_images.
# Strip filename off target for use as page base
$page_base=get_base_page_address($target);
Listing 8-5: Creating the page base for the target web page
As an example, if the webbot finds an image with the relative address 14/logo.gif, and the page base is http://www.schrenk.com/april, the webbot will use the page base to derive the fully resolved address for the image. In this case, the fully resolved address is http://www.schrenk.com/april/14/logo.gif. In contrast, if the image's file path is /march/14/logo.gif, the address will resolve to http://www.schrenk.com/march/14/logo.gif.
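The two cases above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in LIB_resolve_address: the names sketch_page_base() and sketch_resolve_address() are hypothetical, the page URL http://www.schrenk.com/april/index.html is assumed for the example, and details such as ports, query strings, and "../" segments are deliberately ignored.

```php
<?php
// Hypothetical stand-in for deriving a page base: drop the filename.
function sketch_page_base(string $url): string
{
    $parts = parse_url($url);
    $path  = $parts['path'] ?? '/';
    // Keep the path up to, but not including, its last slash
    $dir   = substr($path, 0, (int) strrpos($path, '/'));
    return $parts['scheme'] . '://' . $parts['host'] . $dir;
}

// Hypothetical resolver covering the two cases described in the text.
function sketch_resolve_address(string $link, string $page_base): string
{
    if (preg_match('#^https?://#i', $link)) {
        return $link;                      // already fully resolved
    }
    if ($link !== '' && $link[0] === '/') {
        // Root-relative path: attach it to the scheme and host only
        $parts = parse_url($page_base);
        return $parts['scheme'] . '://' . $parts['host'] . $link;
    }
    // Relative path: append it to the page base
    return rtrim($page_base, '/') . '/' . $link;
}

$base = sketch_page_base('http://www.schrenk.com/april/index.html');
echo $base, "\n";                                               // http://www.schrenk.com/april
echo sketch_resolve_address('14/logo.gif', $base), "\n";        // http://www.schrenk.com/april/14/logo.gif
echo sketch_resolve_address('/march/14/logo.gif', $base), "\n"; // http://www.schrenk.com/march/14/logo.gif
```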
Creating a Root Directory for Imported File Structure
Since this webbot may download images from a number of domains, it creates a directory structure for each (see Listing 8-6). The root directory of each imported file structure is based on the page base.
# Identify the directory where images are to be saved
$save_image_directory = "saved_images_".str_replace("http://", "", $page_base);
Listing 8-6: Creating a root directory for the imported file structure
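For the page base used in the earlier example, the line in Listing 8-6 yields saved_images_www.schrenk.com/april. The sketch below reproduces that result and, as an assumption beyond the original listing, also strips an https:// prefix; the function name sketch_save_directory() is hypothetical.

```php
<?php
// Hypothetical variant of the line in Listing 8-6 that strips either scheme.
function sketch_save_directory(string $page_base): string
{
    // Remove a leading "http://" or "https://" (case-insensitive)
    $without_scheme = preg_replace('#^https?://#i', '', $page_base);
    return 'saved_images_' . $without_scheme;
}

echo sketch_save_directory('http://www.schrenk.com/april'), "\n";
// prints saved_images_www.schrenk.com/april
```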
Parsing Image Tags from the Downloaded Web Page
The webbot uses techniques described in Chapter