Online Book Reader


Webbots, Spiders, and Screen Scrapers - Michael Schrenk [41]


from the NASA example.

Creating the Image-Capturing Webbot

This example webbot relies on a library called LIB_download_images, which is available from this book's website. This library contains the following functions:

download_binary_file(), which safely downloads image files

mkpath(), which makes directory structures on your hard drive

download_images_for_page(), which downloads all the images on a page

Figure 8-2. Re-creating a file structure for stored images

For clarity, I will break down this library into highlights and accompanying explanations.

The first script (Listing 8-1) shows the main webbot used in Figure 8-1 and Figure 8-2.

include("LIB_download_images.php");

$target="http://www.nasa.gov/mission_pages/viking/index.html";

download_images_for_page($target);

Listing 8-1: Executing the image-capturing webbot

This short webbot script loads the LIB_download_images library, defines a target web page, and calls the download_images_for_page() function, which gets the images and stores them in a complementary directory structure on the local drive.
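To illustrate what a complementary directory structure means, here is a minimal sketch; it is not the library's actual code. The helper names local_path_for_url() and ensure_directory_for(), the saved_images base directory, and the image URL in the usage note are all assumptions made for this illustration.

```php
<?php
// Hypothetical helper (not part of LIB_download_images): map an image URL
// to a local path that mirrors the host and directories of the source site.
function local_path_for_url($url, $base_dir = "saved_images")
{
    $parts = parse_url($url);
    $host  = isset($parts['host']) ? $parts['host'] : "unknown_host";
    $path  = isset($parts['path']) ? $parts['path'] : "/";
    return $base_dir . "/" . $host . $path;
}

// Hypothetical helper: create the directory portion of a file path,
// including any missing parent directories, if it does not exist yet.
function ensure_directory_for($file_path)
{
    $dir = dirname($file_path);
    return is_dir($dir) || mkdir($dir, 0755, true);
}
```

With these assumptions, an image at http://www.nasa.gov/images/content/viking.jpg (a made-up address) would map to saved_images/www.nasa.gov/images/content/viking.jpg, so the local folders echo the layout of the website.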

Note

Please be aware that the scripts in this chapter, which are available at http://www.schrenk.com/nostarch/webbots, are created for demonstration purposes only. Although they should work in most cases, they aren't production ready. You may find long or complicated directory structures, odd filenames, or unusual file formats that will cause these scripts to crash.

Binary-Safe Download Routine

Our image-grabbing webbot uses the function download_binary_file(), which is designed to download binary files such as images. Other binary files you may encounter include executable files, compressed files, and system files. Up to this point, the only file downloads discussed have been ASCII (text) files, like web pages. The distinction between downloading binary and ASCII files is important because a binary file can contain any byte value, while a routine written for text may give certain bytes special meaning. For example, random byte combinations in binary files may be misinterpreted as end-of-file markers in ASCII file downloads. If you download a binary file with a script designed for ASCII files, you stand a good chance of downloading a partial or corrupt file.
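The following sketch shows the kind of bytes that cause trouble, and one way to check that a write-and-read cycle leaves them untouched. It is an illustration only; the sample data (the eight-byte PNG signature followed by four more bytes) and the variable names are my own choices, not part of the library.

```php
<?php
// Sample binary data: the PNG signature contains a CR/LF pair and a
// Ctrl-Z byte (0x1A, a legacy DOS end-of-file marker), followed here
// by NUL bytes. Text-oriented code can mishandle all of these.
$binary = "\x89PNG\r\n\x1A\n" . "\x00\x00\x00\x0D";

$path = tempnam(sys_get_temp_dir(), "bin");

// The "b" flag makes binary mode explicit, so no newline translation occurs
$fh = fopen($path, "wb");
fwrite($fh, $binary);
fclose($fh);

// Read the file back and compare it byte for byte with the original
$read_back     = file_get_contents($path);
$round_trip_ok = ($read_back === $binary);

unlink($path);
```

If $round_trip_ok is ever false for data handled by your own routines, some step in between is treating the file as text.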

Even though PHP has its own, built-in binary-safe download functions, this webbot uses a custom download script that leverages PHP/cURL functionality to download images from SSL sites (when the protocol is HTTPS), follow HTTP file redirections, and send referer information to the server.

Sending proper referer information is crucial because many websites check the referer to stop other websites from "borrowing" their images. Borrowing images, which means displaying images from another website without hosting them on your own server, is bad etiquette and is commonly called hijacking. If your webbot doesn't include a proper referer value, its activity could be mistaken for that of a website that is hijacking images. Listing 8-2 shows the file download script used by this webbot.

function download_binary_file($file, $referer)

{

# Open a PHP/CURL session

$s = curl_init();

# Configure the cURL command

curl_setopt($s, CURLOPT_URL, $file); // Define target site

curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE); // Return file contents in a string

curl_setopt($s, CURLOPT_BINARYTRANSFER, TRUE); // Indicate binary transfer (no effect as of PHP 5.1.3)

curl_setopt($s, CURLOPT_REFERER, $referer); // Referer value

curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE); // Don't verify the server's SSL certificate

curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects

curl_setopt($s, CURLOPT_MAXREDIRS, 4); // Limit redirections to four

# Execute the cURL command (send contents of target web page to string)

$downloaded_page = curl_exec($s);

# Close PHP/CURL session and return the file

curl_close($s);

return $downloaded_page;

}

Listing 8-2: A binary-safe file download routine, optimized for webbot use
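Because curl_exec() returns FALSE when a transfer fails, the value returned by download_binary_file() is worth checking before you write it to disk. Here is a minimal sketch of such a check. The helper name save_downloaded_file() is hypothetical and is not part of LIB_download_images, and the image URL in the commented usage is made up for illustration.

```php
<?php
// Hypothetical helper (not part of LIB_download_images): write downloaded
// data to disk only if the download appears to have succeeded.
function save_downloaded_file($data, $local_path)
{
    // curl_exec() returns FALSE on failure; also skip empty responses
    if ($data === false || $data === '') {
        return false;
    }

    // file_put_contents() is binary safe and returns the byte count written
    $bytes_written = file_put_contents($local_path, $data);
    return $bytes_written === strlen($data);
}

// Usage, assuming download_binary_file() from Listing 8-2 is loaded.
// The referer is the page on which the image was found.
// $page  = "http://www.nasa.gov/mission_pages/viking/index.html";
// $image = "http://www.nasa.gov/images/example.jpg"; // hypothetical URL
// $data  = download_binary_file($image, $page);
// save_downloaded_file($data, "example.jpg");
```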

Directory Structure

The script that creates directories (shown in Figure 8-2) is derived from a user-contributed routine found on the PHP website (http://www.php.net). Users commonly submit scripts like this one when they find
