Webbots, Spiders, and Screen Scrapers - Michael Schrenk

# Create a random delay period between 1 second and the full $DELAY period
$random_delay = rand(1, rand(1, $DELAY));

# Download webpage
sleep($random_delay);
$downloaded_page = http_get($url, "");

# Parse links
$anchor_tags = parse_array($downloaded_page['FILE'], "<a", "</a>", EXCL);

# Get the href attribute of each tag into an array
for($xx=0; $xx<count($anchor_tags); $xx++)
    {
    $href = get_attribute($anchor_tags[$xx], "href");
    $resolved_address = resolve_address($href, $page_base);
    $link_array[] = $resolved_address;
    echo "Harvested: ".$resolved_address." \n";
    }
return $link_array;
}

Listing 18-3: Harvesting links from a web page with the harvest_links() function
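The nested rand() call in Listing 18-3 compounds two uniform draws: the inner call picks a random ceiling, and the outer call picks a delay up to that ceiling. The result skews toward short waits while still allowing delays up to $DELAY seconds. A minimal sketch of that bias (the trial count and delay cap are arbitrary for the demonstration):

```php
<?php
// Compare the average compounded delay against a flat rand(1, $DELAY)
$DELAY  = 10;      // maximum delay in seconds (arbitrary for this demo)
$trials = 10000;

$sum_nested = 0;
$sum_flat   = 0;
for($xx=0; $xx<$trials; $xx++)
    {
    $sum_nested += rand(1, rand(1, $DELAY));  // compounded draw, skews low
    $sum_flat   += rand(1, $DELAY);           // single uniform draw
    }

echo "Average compounded delay: ".($sum_nested/$trials)."\n";
echo "Average flat delay:       ".($sum_flat/$trials)."\n";
?>
```

The compounded average comes out noticeably lower, so the spider stays polite on most requests without ever exceeding the configured maximum.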

archive_links()

The script in Listing 18-4 uses the link array collected by the previous function to create an archival array. The first element of the archival array identifies the penetration level where the link was found, while the second contains the actual link.

function archive_links($spider_array, $penetration_level, $temp_link_array)
{
for($xx=0; $xx<count($temp_link_array); $xx++)
    {
    # Don't add excluded links to $spider_array
    if(!excluded_link($spider_array, $temp_link_array[$xx]))
        {
        $spider_array[$penetration_level][] = $temp_link_array[$xx];
        }
    }
return $spider_array;
}

Listing 18-4: Archiving links in $spider_array
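To see the shape of the archive that archive_links() builds, the sketch below feeds it two harvests at successive penetration levels. A stub excluded_link() that accepts every link stands in for the real function from Listing 18-6, so the example is self-contained; the sample URLs are illustrative:

```php
<?php
// Stub standing in for the real excluded_link() in Listing 18-6;
// it accepts every link so the archive structure is easy to see
function excluded_link($spider_array, $link)
{
return false;
}

function archive_links($spider_array, $penetration_level, $temp_link_array)
{
for($xx=0; $xx<count($temp_link_array); $xx++)
    {
    # Don't add excluded links to $spider_array
    if(!excluded_link($spider_array, $temp_link_array[$xx]))
        {
        $spider_array[$penetration_level][] = $temp_link_array[$xx];
        }
    }
return $spider_array;
}

$spider_array = array();
$spider_array = archive_links($spider_array, 1,
                array("http://www.schrenk.com/"));
$spider_array = archive_links($spider_array, 2,
                array("http://www.schrenk.com/store/", "http://www.schrenk.com/about/"));

print_r($spider_array);   // first index = penetration level, second = the links found there
?>
```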

get_domain()

The function get_domain() parses the root domain from the target URL. For example, given a target URL like https://www.schrenk.com/store/product_list.php, the root domain is www.schrenk.com.

The function get_domain() compares the root domains of the links to the root domain of the seed URL to determine if the link is for a URL that is not in the seed URL's domain, as shown in Listing 18-5.

function get_domain($url)
{
// Remove protocol from $url
$url = str_replace("http://", "", $url);
$url = str_replace("https://", "", $url);

// Remove page and directory references
if(stristr($url, "/"))
    $url = substr($url, 0, strpos($url, "/"));
return $url;
}

Listing 18-5: Parsing the root domain from a fully resolved URL

This function is used only when the $ALLOW_OFFSITE configuration is set to false.
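A quick check of get_domain()'s behavior, repeating the function from Listing 18-5 so the sketch runs on its own (the sample URLs are illustrative):

```php
<?php
// get_domain() as defined in Listing 18-5, repeated so this sketch is self-contained
function get_domain($url)
{
// Remove protocol from $url
$url = str_replace("http://", "", $url);
$url = str_replace("https://", "", $url);

// Remove page and directory references
if(stristr($url, "/"))
    $url = substr($url, 0, strpos($url, "/"));
return $url;
}

echo get_domain("https://www.schrenk.com/store/product_list.php")."\n"; // www.schrenk.com
echo get_domain("http://www.schrenk.com")."\n";                         // www.schrenk.com
?>
```

Note that the function keeps any www. prefix, so the offsite test in Listing 18-6 compares full hostnames rather than registrable domains; links to other subdomains of the seed site would be treated as offsite.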

excluded_link()

This function examines each link and determines if it should be included in the archive of harvested links. Reasons for excluding a link may include the following:

The link is contained within JavaScript

The link already appears in the archive

The link contains excluded keywords listed in the exclusion array

The link is to a different domain

function excluded_link($spider_array, $link)
{
# Initialization
global $exclusion_array, $ALLOW_OFFSITE, $SEED_URL;
$exclude = false;

// Exclude links that are JavaScript commands
if(stristr($link, "javascript"))
    {
    echo "Ignored JavaScript function: $link\n";
    $exclude=true;
    }

// Exclude redundant links
for($xx=0; $xx<count($spider_array); $xx++)
    {
    $saved_link="";
    while(isset($saved_link))
        {
        $saved_link = array_pop($spider_array[$xx]);
        if($link == $saved_link)
            {
            echo "Ignored redundant link: $link\n";
            $exclude=true;
            break;
            }
        }
    }

// Exclude links found in $exclusion_array
for($xx=0; $xx<count($exclusion_array); $xx++)
    {
    if(stristr($link, $exclusion_array[$xx]))
        {
        echo "Ignored excluded link: $link\n";
        $exclude=true;
        break;
        }
    }

// Exclude offsite links if requested
if($ALLOW_OFFSITE==false)
    {
    if(get_domain($link)!=get_domain($SEED_URL))
        {
        echo "Ignored offsite link: $link\n";
        $exclude=true;
        }
    }
return $exclude;
}

Listing 18-6: Excluding unwanted links

There are several reasons to exclude links. For example, it's best to ignore any links referenced within JavaScript because—without a proper JavaScript interpreter—those links may yield unpredictable results. Removing redundant links makes the spider run faster and reduces the amount of data the spider needs to manage. The exclusion list allows the spider to ignore undesirable links to places like Google AdSense, banner ads, or other places you don't want the spider to go.
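The exclusion list itself is just an array of substrings matched case-insensitively with stristr(). A minimal, self-contained sketch of that keyword test (the entries in the array and the helper name matches_exclusion() are illustrative, not from the book's configuration):

```php
<?php
// Hypothetical exclusion array; each entry is a substring that disqualifies a link
$exclusion_array = array("javascript", "googlesyndication", "banner");

// Simplified stand-in for the keyword test inside excluded_link()
function matches_exclusion($link, $exclusion_array)
{
for($xx=0; $xx<count($exclusion_array); $xx++)
    {
    if(stristr($link, $exclusion_array[$xx]))
        return true;
    }
return false;
}

var_dump(matches_exclusion("javascript:openWindow('x')", $exclusion_array));    // bool(true)
var_dump(matches_exclusion("http://www.schrenk.com/store/", $exclusion_array)); // bool(false)
?>
```

Because the match is a substring test rather than a full-URL comparison, a single entry like "banner" filters every ad-server path that contains it, at the cost of occasionally excluding a legitimate page whose URL happens to contain the keyword.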
