Webbots, Spiders, and Screen Scrapers - Michael Schrenk [43]
# Parse the image tags
$img_tag_array = parse_array($web_page['FILE'], "<img", ">");
if(count($img_tag_array)==0)
{
echo "No images found at $target\n";
exit;
}
Listing 8-7: Parsing the image tags
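To see what this step produces without loading the book's libraries, here is a minimal sketch that collects image tags from an HTML string using only PHP built-ins. The function name find_img_tags() is hypothetical and is a stand-in for parse_array(), not its actual implementation. It assumes no attribute value contains a ">" character.

```php
<?php
// Hypothetical stand-in for the book's parse_array(); not the LIB_parse code.
function find_img_tags(string $html): array
{
    // Match "<img" only when followed by whitespace, "/" or ">",
    // so that unrelated tags such as <imgfoo> are not caught.
    preg_match_all('/<img(?=[\s\/>])[^>]*>/i', $html, $matches);
    return $matches[0];   // an empty array when the page has no image tags
}

$html = '<p><img src="a.gif"><IMG alt="b" src="/images/b.jpg" /></p>';
$tags = find_img_tags($html);
echo count($tags) . " image tag(s) found\n";
```

As in Listing 8-7, an empty result is the cue to report that no images were found and stop.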
The Image-Processing Loop
The webbot loops through the image tags and processes each one individually. For each tag, it parses the image's file source and creates a fully resolved URL (see Listing 8-8). A fully resolved URL is essential because the webbot cannot download an image without its complete URL: the HTTP protocol identifier, the domain, the image's file path, and the image's filename.
$image_path = get_attribute($img_tag_array[$xx], $attribute="src");
echo " image: ".$image_path;
$image_url = resolve_address($image_path, $page_base);
Listing 8-8: Parsing the image source and creating a fully resolved URL
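The following sketch shows one way the loop, the src extraction, and the address resolution could fit together using only PHP built-ins. The helpers get_src() and resolve_url() are hypothetical names. They are not the LIB_parse or LIB_resolve_addresses implementations, and they cover only the common cases (absolute, protocol-relative, root-relative, and path-relative references). The example.com addresses are placeholders.

```php
<?php
// Hypothetical helpers for illustration only; simplified on purpose.

function get_src(string $img_tag): ?string
{
    // Accept double-quoted, single-quoted, or unquoted src values.
    // The (?| ... ) branch-reset group puts the value in $m[1] in every case.
    if (preg_match('/\ssrc\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i', $img_tag, $m)) {
        return $m[1];
    }
    return null;   // the tag has no src attribute
}

function resolve_url(string $path, string $base): string
{
    // Already a complete URL: return it unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $path)) {
        return $path;
    }
    $b = parse_url($base);
    $scheme = $b['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($b['host'] ?? '')
            . (isset($b['port']) ? ':' . $b['port'] : '');

    // Protocol-relative reference: reuse the scheme of the base page.
    if (strpos($path, '//') === 0) {
        return $scheme . ':' . $path;
    }
    if (strpos($path, '/') === 0) {
        $full = $path;                                   // root-relative
    } else {
        $base_path = $b['path'] ?? '/';
        // Keep everything up to and including the last "/" of the base path.
        $full = substr($base_path, 0, strrpos($base_path, '/') + 1) . $path;
    }
    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $full) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    return $origin . '/' . implode('/', $out);
}

$page_base = 'http://www.example.com/dir/page.html';    // placeholder address
$img_tag_array = ['<img src="../images/logo.gif">', "<img alt='x' src='/a/b.png'>"];
foreach ($img_tag_array as $img_tag) {
    $image_path = get_src($img_tag);
    if ($image_path === null) {
        continue;                                        // skip tags without a src
    }
    echo " image: " . $image_path . " -> " . resolve_url($image_path, $page_base) . "\n";
}
```

This sketch drops trailing slashes and does not treat query strings specially, so a production webbot would want a more complete resolver.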
Creating the Local Directory Structure
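Before saving anything, the webbot confirms that the image is hosted on the same domain as the target page. It then verifies that the image's file path exists in the local file structure. If the directory doesn't exist, the webbot creates the directory path, as shown in Listing 8-9.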
if(get_base_domain_address($page_base) == get_base_domain_address($image_url))
{
# Make image storage directory for image, if one doesn't exist
$directory = substr($image_path, 0, strrpos($image_path, "/"));
clearstatcache(); // Clear cache to get accurate directory status
if(!is_dir($save_image_directory."/".$directory)) // See if dir exists
mkpath($save_image_directory."/".$directory); // Create if it doesn't
Listing 8-9: Creating the local directory structure for each image file
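If you want to try the directory step in isolation, the sketch below does roughly the same job with PHP's built-in mkdir(), whose third argument creates any missing parent directories. The function name ensure_image_directory() is hypothetical. It is a stand-in for mkpath(), not the book's implementation, and the temporary directory is only a demo location.

```php
<?php
// Hypothetical helper; a stand-in for the book's mkpath(), built on PHP built-ins.
function ensure_image_directory(string $save_root, string $image_path): string
{
    // Directory portion of the image path, e.g. "images/icons" for "/images/icons/a.gif".
    $slash = strrpos($image_path, '/');
    $directory = ($slash === false) ? '' : trim(substr($image_path, 0, $slash), '/');
    $full = rtrim($save_root, '/') . ($directory === '' ? '' : '/' . $directory);

    clearstatcache();   // avoid a stale is_dir() result
    // Passing true as the third argument creates missing parent directories too.
    if (!is_dir($full) && !mkdir($full, 0755, true) && !is_dir($full)) {
        throw new RuntimeException("Could not create directory: $full");
    }
    return $full;
}

// Demo location (an assumption): a fresh folder under the system temp directory.
$root = sys_get_temp_dir() . '/saved_images_' . uniqid();
echo ensure_image_directory($root, '/images/icons/a.gif') . "\n";
```

Note that this sketch does not sanitize paths containing "..", so a careful webbot would reject or normalize such paths before creating directories.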
Downloading and Saving the File
Once the path is verified or created, the image is downloaded (using its fully resolved URL) and stored in the local file structure (see Listing 8-10).
# Download the image and report image size
$this_image_file = download_binary_file($image_url, $referer=$target);
echo " size: ".strlen($this_image_file);
# Save the image
$fp = fopen($save_image_directory."/".$image_path, "w");
fputs($fp, $this_image_file);
fclose($fp);
echo "\n";
Listing 8-10: Downloading and storing images
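Listing 8-10 assumes that every file operation succeeds. As a hedged variation, the sketch below writes already-downloaded bytes to disk and checks each step. The function name save_binary() is hypothetical, and a few stand-in bytes take the place of a real download so the example runs without network access.

```php
<?php
// Hypothetical helper: writes already-downloaded bytes to disk with error checks.
function save_binary(string $file_path, string $data): int
{
    $fp = fopen($file_path, 'wb');   // "b" requests binary mode explicitly
    if ($fp === false) {
        throw new RuntimeException("Cannot open $file_path for writing");
    }
    $written = fwrite($fp, $data);
    fclose($fp);
    // fwrite() returns false on failure, or the number of bytes written.
    if ($written !== strlen($data)) {
        throw new RuntimeException("Incomplete write to $file_path");
    }
    return $written;
}

// Stand-in bytes (the 8-byte PNG signature) instead of a real downloaded image.
$bytes = "\x89PNG\r\n\x1a\n";
$file = tempnam(sys_get_temp_dir(), 'img');
echo " size: " . save_binary($file, $bytes) . "\n";
```

Checking the byte count catches a full disk or a permissions problem at the point of failure, rather than leaving a truncated image behind.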
The webbot repeats this process for each image parsed from the target web page.
Further Exploration
You can point this webbot at any web page, and it will generate a copy of each image that page uses, arranged in a directory structure that resembles the original. You can also develop other useful webbots based on this design. If you want to test your skills, consider the following challenges.
Write a similar webbot that detects hijacked images.
Improve the efficiency of the script by reworking it so that it doesn't download an image it has downloaded previously.
Modify this webbot to create local backup copies of web pages.
Adjust the webbot to cache movies or audio files instead of images.
Modify the bot to monitor when images change on a web page.
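As a starting point for the second challenge, here is one possible sketch of the check a webbot could run before each download. The function name needs_download() is hypothetical, and treating a zero-length file as "not yet downloaded" is an assumption on my part, not a rule from this chapter.

```php
<?php
// Hypothetical helper: decide whether an image still needs to be downloaded.
function needs_download(string $local_file): bool
{
    clearstatcache();   // make sure the file status is current
    // A missing or zero-length file counts as not yet downloaded.
    return !is_file($local_file) || filesize($local_file) === 0;
}

$local = tempnam(sys_get_temp_dir(), 'img');   // tempnam() creates an empty file
var_dump(needs_download($local));              // empty file: download needed
file_put_contents($local, 'fake image bytes');
var_dump(needs_download($local));              // file has content: skip it
unlink($local);
```

This check only asks whether a local copy exists. It cannot tell whether the remote image has changed since, which is the subject of the last challenge.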
Final Thoughts
If you attempt to run this webbot on a remote server, remember that your webbot must have write privileges on that server, or it won't be able to create file structures or download images.
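One way to catch a permissions problem early is to test the storage directory before downloading anything. The sketch below is a minimal pre-flight check. The function name can_save_to() is hypothetical, and the system temporary directory stands in for whatever location you assign to $save_image_directory.

```php
<?php
// Hypothetical pre-flight check, meant to run before any downloading starts.
function can_save_to(string $directory): bool
{
    clearstatcache();   // make sure the directory status is current
    return is_dir($directory) && is_writable($directory);
}

$save_image_directory = sys_get_temp_dir();   // assumed location for this demo
if (can_save_to($save_image_directory)) {
    echo "OK to save images in $save_image_directory\n";
} else {
    echo "Cannot write to $save_image_directory\n";
}
```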
Chapter 9. LINK-VERIFICATION WEBBOTS
This webbot project solves a problem shared by all web developers—detecting broken links on web pages. Verifying links on a web page isn't a difficult thing to do, and the associated script is short.
Figure 9-1 shows the simplicity of this webbot.
Creating the Link-Verification Webbot
For clarity, I'll break down the creation of the link-verification webbot into manageable sections, which I'll explain along the way. The code and libraries used in this chapter are available for download at this book's website.
Initializing the Webbot and Downloading the Target
Before validating links on a web page, your webbot needs to load the required libraries and initialize a few key variables. In addition to LIB_http and LIB_parse, this webbot introduces two new libraries: LIB_resolve_addresses