Webbots, Spiders, and Screen Scrapers - Michael Schrenk [43]

4 to parse the image tags from the target web page and put them into an array for easy processing. This is shown in Listing 8-7. The webbot stops if the target web page contains no images.

# Parse the image tags
$img_tag_array = parse_array($web_page['FILE'], "<img", ">");
if(count($img_tag_array)==0)
{
    echo "No images found at $target\n";
    exit;
}

Listing 8-7: Parsing the image tags
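To make the parsing step concrete without the book's libraries, here is a minimal, self-contained sketch. It assumes a hypothetical helper, find_img_tags(), that is not part of LIB_parse and uses PHP's built-in preg_match_all() instead of parse_array(). The sample HTML is invented for illustration.

```php
<?php
// Hypothetical helper (not part of LIB_parse): collect every <img ...> tag
// from an HTML string with a case-insensitive regular expression.
function find_img_tags(string $html): array
{
    // Match "<img", a word boundary, then everything up to the next ">"
    preg_match_all('/<img\b[^>]*>/i', $html, $matches);
    return $matches[0];
}

// Invented sample page fragment
$html = '<p>Logo: <img src="/images/logo.gif" alt="Logo"><IMG SRC="photo.jpg"></p>';
$img_tags = find_img_tags($html);

if (count($img_tags) == 0) {
    echo "No images found\n";
} else {
    foreach ($img_tags as $tag) {
        echo $tag . "\n";
    }
}
```

A regular expression like this stops at the first ">" it sees, so an attribute value that contains ">" would be cut short. For untrusted or messy HTML, a real parser is likely the safer choice.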

The Image-Processing Loop

The webbot processes each image tag individually in a loop. For each image tag, the webbot parses the image file source and creates a fully resolved URL (see Listing 8-8). Creating a fully resolved URL is important because the webbot cannot download an image without its complete URL: the HTTP protocol identifier, the domain, the image's file path, and the image's filename.

$image_path = get_attribute($img_tag_array[$xx], $attribute="src");
echo " image: ".$image_path;
$image_url = resolve_address($image_path, $page_base);

Listing 8-8: Parsing the image source and creating a fully resolved URL
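The idea of a fully resolved URL can be sketched independently of the book's resolve_address() function. The helper below, resolve_image_url(), is hypothetical and deliberately simplified: it handles absolute, protocol-relative, root-relative, and document-relative paths, but it does not collapse "../" segments. The example.com addresses are placeholders.

```php
<?php
// Hypothetical, simplified resolver (not the book's resolve_address()).
function resolve_image_url(string $path, string $page_base): string
{
    // Already fully resolved: the path carries its own scheme
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $path)) {
        return $path;
    }

    $base = parse_url($page_base);
    $origin = $base['scheme'] . '://' . $base['host'];
    if (isset($base['port'])) {
        $origin .= ':' . $base['port'];
    }

    // Protocol-relative path ("//host/file"): borrow the base page's scheme
    if (substr($path, 0, 2) === '//') {
        return $base['scheme'] . ':' . $path;
    }

    // Root-relative path ("/dir/file"): attach directly to the origin
    if (substr($path, 0, 1) === '/') {
        return $origin . $path;
    }

    // Document-relative path: attach to the directory of the base page
    $base_path = $base['path'] ?? '/';
    $base_dir = substr($base_path, 0, strrpos($base_path, '/') + 1);
    return $origin . $base_dir . $path;
}

$page_base = 'http://www.example.com/gallery/index.html';
echo resolve_image_url('/images/logo.gif', $page_base) . "\n";
// prints http://www.example.com/images/logo.gif
echo resolve_image_url('thumbs/a.jpg', $page_base) . "\n";
// prints http://www.example.com/gallery/thumbs/a.jpg
```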

Creating the Local Directory Structure

The webbot first checks that the image comes from the same base domain as the target page. It then verifies that the image's file path exists in the local file structure. If the directory doesn't exist, the webbot creates the directory path, as shown in Listing 8-9.

if(get_base_domain_address($page_base) == get_base_domain_address($image_url))
{
    # Make image storage directory for image, if one doesn't exist
    $directory = substr($image_path, 0, strrpos($image_path, "/"));
    clearstatcache(); // Clear cache to get accurate directory status
    if(!is_dir($save_image_directory."/".$directory)) // See if dir exists
        mkpath($save_image_directory."/".$directory); // Create if it doesn't

Listing 8-9: Creating the local directory structure for each image file
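The same check-then-create pattern can be sketched with PHP built-ins alone. The helper below, ensure_image_directory(), is a hypothetical stand-in for the listing's mkpath() call. It assumes a PHP version where mkdir() accepts a third argument that creates nested directories in one call. The temporary directory and image path are invented for the example.

```php
<?php
// Hypothetical helper: make sure the directory portion of an image path
// exists beneath a local root, creating intermediate directories as needed.
function ensure_image_directory(string $save_root, string $image_path): string
{
    // Everything before the last "/" is the directory portion
    $slash = strrpos($image_path, '/');
    $directory = ($slash === false) ? '' : trim(substr($image_path, 0, $slash), '/');
    $full_path = rtrim($save_root, '/') . ($directory === '' ? '' : '/' . $directory);

    clearstatcache(); // Clear cache to get accurate directory status
    if (!is_dir($full_path)) {
        // Third argument true asks mkdir() to create nested directories
        if (!mkdir($full_path, 0755, true) && !is_dir($full_path)) {
            throw new RuntimeException("Could not create $full_path");
        }
    }
    return $full_path;
}

// Invented example: store images under a fresh temporary directory
$root = sys_get_temp_dir() . '/saved_images_' . uniqid();
echo ensure_image_directory($root, '/images/icons/logo.gif') . "\n";
```

The second is_dir() test after a failed mkdir() covers the case where another process created the directory in the meantime.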

Downloading and Saving the File

Once the path is verified or created, the image is downloaded (using its fully resolved URL) and stored in the local file structure (see Listing 8-10).

# Download the image and report image size
$this_image_file = download_binary_file($image_url, $referer=$target);
echo " size: ".strlen($this_image_file);

# Save the image
$fp = fopen($save_image_directory."/".$image_path, "w");
fputs($fp, $this_image_file);
fclose($fp);
echo "\n";

Listing 8-10: Downloading and storing images

The webbot repeats this process for each image parsed from the target web page.
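The save step can be tried in isolation, without a network connection. The sketch below assumes a hypothetical helper, save_image_bytes(), and uses a short invented byte string in place of the data that download_binary_file() would return. It opens the file in "wb" mode, which requests binary-safe writing on platforms that distinguish text and binary files, and it checks that the file could be opened before writing.

```php
<?php
// Hypothetical helper: write already-downloaded image bytes to disk.
// Returns the number of bytes written, or false on failure.
function save_image_bytes(string $file_path, string $bytes)
{
    $fp = fopen($file_path, 'wb'); // "b" requests binary mode
    if ($fp === false) {
        return false;
    }
    $written = fwrite($fp, $bytes);
    fclose($fp);
    return $written;
}

// Invented stand-in for downloaded data: 10 bytes resembling a GIF header
$fake_image = "GIF89a\x01\x00\x01\x00";
$file = sys_get_temp_dir() . '/example_' . uniqid() . '.gif';
$size = save_image_bytes($file, $fake_image);
echo " size: " . $size . "\n";
// prints " size: 10"
```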

Further Exploration

You can point this webbot at any web page, and it will generate a copy of each image that page uses, arranged in a directory structure that resembles the original. You can also develop other useful webbots based on this design. If you want to test your skills, consider the following challenges.

Write a similar webbot that detects hijacked images.

Improve the efficiency of the script by reworking it so that it doesn't download an image it has downloaded previously.

Modify this webbot to create local backup copies of web pages.

Adjust the webbot to cache movies or audio files instead of images.

Modify the bot to monitor when images change on a web page.

Final Thoughts

If you attempt to run this webbot on a remote server, remember that your webbot must have write privileges on that server, or it won't be able to create file structures or download images.

Chapter 9. LINK-VERIFICATION WEBBOTS

This webbot project solves a problem shared by all web developers—detecting broken links on web pages. Verifying links on a web page isn't a difficult thing to do, and the associated script is short.

Figure 9-1 shows the simplicity of this webbot.

Creating the Link-Verification Webbot

For clarity, I'll break down the creation of the link-verification webbot into manageable sections, which I'll explain along the way. The code and libraries used in this chapter are available for download at this book's website.

Initializing the Webbot and Downloading the Target

Before validating links on a web page, your webbot needs to load the required libraries and initialize a few key variables. In addition to LIB_http and LIB_parse, this webbot introduces two new libraries: LIB_resolve_addresses
