Webbots, Spiders, and Screen Scrapers - Michael Schrenk [41]
Creating the Image-Capturing Webbot
This example webbot relies on a library called LIB_download_images, which is available from this book's website. This library contains the following functions:
download_binary_file(), which safely downloads image files
mkpath(), which makes directory structures on your hard drive
download_images_for_page(), which downloads all the images on a page
Figure 8-2. Re-creating a file structure for stored images
For clarity, I will break down this library into highlights and accompanying explanations.
The first script (Listing 8-1) shows the main webbot used in Figure 8-1 and Figure 8-2.
include("LIB_download_images.php");
$target="http://www.nasa.gov/mission_pages/viking/index.html";
download_images_for_page($target);
Listing 8-1: Executing the image-capturing webbot
This short webbot script loads the LIB_download_images library, defines a target web page, and calls the download_images_for_page() function, which gets the images and stores them in a complementary directory structure on the local drive.
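To make the idea of a complementary directory structure concrete, here is a minimal sketch of how an image URL might be mapped to a local path that mirrors the server's layout. The function name local_path_for_image(), the saved_images base directory, the sample image URL, and the choice to include the host name in the path are all assumptions made for illustration; they are not taken from LIB_download_images.

```php
<?php
// Hypothetical helper (not part of LIB_download_images): map an image URL
// to a local path that mirrors the directory layout on the server.
function local_path_for_image(string $image_url, string $base_dir = "saved_images"): string
{
    $parts = parse_url($image_url);
    $host  = $parts["host"] ?? "unknown_host";
    $path  = $parts["path"] ?? "/";

    // Drop empty, "." and ".." segments so the result stays under $base_dir
    $segments = array_filter(
        explode("/", $path),
        fn($segment) => $segment !== "" && $segment !== "." && $segment !== ".."
    );

    return $base_dir . "/" . $host . "/" . implode("/", $segments);
}

// The image URL below is an assumed example, not one listed in the book
echo local_path_for_image("http://www.nasa.gov/images/content/viking.jpg"), "\n";
```

A real implementation would also have to create the directories along that path before writing the file, which is the job the library assigns to mkpath().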
Note
Please be aware that the scripts in this chapter, which are available at http://www.schrenk.com/nostarch/webbots, are created for demonstration purposes only. Although they should work in most cases, they aren't production ready. You may find long or complicated directory structures, odd filenames, or unusual file formats that will cause these scripts to crash.
Binary-Safe Download Routine
Our image-grabbing webbot uses the function download_binary_file(), which is designed to download binary files, like images. Other binary files you may encounter include executable files, compressed files, and system files. Up to this point, the only file downloads discussed have been ASCII (text) files, like web pages. The distinction between binary and ASCII downloads is important because a routine written for text may attach special meaning to certain byte values, while binary data must arrive byte for byte unchanged. For example, random byte combinations in binary files may be misinterpreted as end-of-file markers in ASCII file downloads. If you download a binary file with a script designed for ASCII files, you stand a good chance of downloading a partial or corrupt file.
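The following sketch illustrates what binary-safe handling means in practice. It writes a short byte string containing a NUL byte (0x00) and a Ctrl-Z byte (0x1A), values that text-oriented tools sometimes treat as terminators, and then reads it back to confirm that nothing was lost. The sample bytes and the temporary file are assumptions chosen for the demonstration; no download takes place.

```php
<?php
// Sample bytes that include NUL (0x00) and Ctrl-Z (0x1A), which some
// text-oriented tools treat as end-of-string or end-of-file markers.
$bytes = "\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR";

$path = tempnam(sys_get_temp_dir(), "bin");

// The "b" flag requests binary mode, so no newline translation occurs
$out = fopen($path, "wb");
fwrite($out, $bytes);
fclose($out);

$in = fopen($path, "rb");
$read_back = fread($in, filesize($path));
fclose($in);
unlink($path);

// strlen() counts bytes in PHP, NUL bytes included
echo strlen($bytes), " bytes written, ", strlen($read_back), " bytes read\n";
echo ($bytes === $read_back) ? "identical\n" : "corrupted\n";
```

If either fopen() call omitted the "b" flag on a platform that translates line endings, the two byte counts could differ, which is the kind of silent corruption a binary-safe routine avoids.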
Even though PHP has its own, built-in binary-safe download functions, this webbot uses a custom download script that leverages PHP/cURL functionality to download images from SSL sites (when the protocol is HTTPS), follow HTTP file redirections, and send referer information to the server.
Sending proper referer information is crucial because many websites will stop other websites from "borrowing" images. Borrowing images from other websites (without hosting the images on your server) is bad etiquette and is commonly called hijacking. If your webbot doesn't include a proper referer value, its activity could be confused with a website that is hijacking images. Listing 8-2 shows the file download script used by this webbot.
function download_binary_file($file, $referer)
{
    # Open a PHP/CURL session
    $s = curl_init();

    # Configure the cURL command
    curl_setopt($s, CURLOPT_URL, $file);                // Define target site
    curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);      // Return file contents in a string
    curl_setopt($s, CURLOPT_BINARYTRANSFER, TRUE);      // Indicate binary transfer (no effect as of PHP 5.1.3)
    curl_setopt($s, CURLOPT_REFERER, $referer);         // Referer value
    curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);     // Don't verify the SSL certificate
    curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE);      // Follow redirects
    curl_setopt($s, CURLOPT_MAXREDIRS, 4);              // Limit redirections to four

    # Execute the cURL command (send contents of target file to string)
    $downloaded_page = curl_exec($s);

    # Close PHP/CURL session and return the file
    curl_close($s);
    return $downloaded_page;
}
Listing 8-2: A binary-safe file download routine, optimized for webbot use
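Because curl_exec() returns FALSE when a transfer fails, a caller may want to check the result before writing anything to disk. The sketch below shows one way to do that. The helper save_downloaded_file() is hypothetical and is not part of LIB_download_images; the variables in the commented-out usage lines are likewise assumed.

```php
<?php
// Hypothetical helper (not part of LIB_download_images): write downloaded
// bytes to disk only if the download actually produced data.
function save_downloaded_file($contents, string $local_path): bool
{
    // curl_exec() returns FALSE on failure; an empty string is also suspect
    if ($contents === false || $contents === "") {
        return false;
    }

    // file_put_contents() is binary safe and returns the byte count or FALSE
    return file_put_contents($local_path, $contents) === strlen($contents);
}

// Usage (commented out because it needs Listing 8-2 and network access):
// $image = download_binary_file($image_url, $referer);
// if (!save_downloaded_file($image, "viking.jpg")) {
//     echo "Download failed\n";
// }
```

Comparing the number of bytes written with strlen($contents) also catches a partial write, for example when the disk is full.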
Directory Structure
The script that creates directories (shown in Figure 8-2) is derived from a user-contributed routine found on the PHP website (http://www.php.net). Users commonly submit scripts like this one when they find