    {
        # Don't add excluded links to $spider_array
        if(!excluded_link($spider_array, $temp_link_array[$xx]))
        {
            $spider_array[$penetration_level][] = $temp_link_array[$xx];
        }
    }
    return $spider_array;
}
Listing 18-4: Archiving links in $spider_array
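Listing 18-4 stores links in a two-dimensional array: the first index is the penetration level and the second is the position of the link within that level. The following is a minimal sketch of that structure, not the book's code. The sample links, the level numbers, and the helper already_archived() are assumptions made for illustration.

```php
<?php
// Sample archive: first index is the penetration level, second is the link
$spider_array = array();
$spider_array[1][] = "http://www.schrenk.com/";
$spider_array[1][] = "http://www.schrenk.com/store/";
$spider_array[2][] = "http://www.schrenk.com/store/product_list.php";

// Hypothetical helper: true if $link is already archived at any level
function already_archived($spider_array, $link)
{
    foreach($spider_array as $level_links)
    {
        if(in_array($link, $level_links, true))
            return true;
    }
    return false;
}

// Hypothetical links harvested from a level-2 page
$penetration_level = 3;
$temp_link_array = array(
    "http://www.schrenk.com/store/",            // already archived at level 1
    "http://www.schrenk.com/store/cart.php",    // new
);

// Same append pattern as Listing 18-4, with a simplified exclusion test
for($xx=0; $xx<count($temp_link_array); $xx++)
{
    if(!already_archived($spider_array, $temp_link_array[$xx]))
        $spider_array[$penetration_level][] = $temp_link_array[$xx];
}

print_r($spider_array);
```

Only the new link is appended at level 3; the link already stored at level 1 is skipped.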
get_domain()
The function get_domain() parses the domain from the target URL. For example, given a target URL like https://www.schrenk.com/store/product_list.php, the root domain is schrenk.com; the version in Listing 18-5 strips only the protocol and the path, so it returns the host name, www.schrenk.com.
The spider compares the domain of each link with the domain of the seed URL to determine whether the link points outside the seed URL's domain. Listing 18-5 shows get_domain().
function get_domain($url)
{
    // Remove protocol from $url
    $url = str_replace("http://", "", $url);
    $url = str_replace("https://", "", $url);
    // Remove page and directory references
    if(stristr($url, "/"))
        $url = substr($url, 0, strpos($url, "/"));
    return $url;
}
Listing 18-5: Parsing the root domain from a fully resolved URL
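To make the behavior of Listing 18-5 concrete, the sketch below repeats the function and runs it on a few sample URLs. Only the first URL comes from the text; the others are made-up examples.

```php
<?php
// get_domain() reproduced from Listing 18-5 so this sketch runs on its own
function get_domain($url)
{
    // Remove protocol from $url
    $url = str_replace("http://", "", $url);
    $url = str_replace("https://", "", $url);
    // Remove page and directory references
    if(stristr($url, "/"))
        $url = substr($url, 0, strpos($url, "/"));
    return $url;
}

// Sample links (only the first one appears in the text)
$samples = array(
    "https://www.schrenk.com/store/product_list.php",
    "http://www.schrenk.com",
    "http://example.com/index.html",
);

foreach($samples as $sample)
    echo $sample . " => " . get_domain($sample) . "\n";
```

Because the function removes only the protocol and the path, www.schrenk.com and schrenk.com are treated as different domains. The comparison still works as long as the seed URL and the harvested links use the same form of the host name.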
The spider calls get_domain() only when the configuration for $ALLOW_OFFSITE is set to false.
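The offsite test can also be written with PHP's built-in parse_url(), which is a different technique from the string replacement in Listing 18-5. The sketch below is one possible variant, not the book's code. The helper names get_host() and is_offsite() and the sample values are hypothetical.

```php
<?php
// Hypothetical alternative to get_domain(): let parse_url() extract the host
function get_host($url)
{
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() returns null or false when no host can be found
    return is_string($host) ? strtolower($host) : "";
}

// Hypothetical helper: true when $link points outside the seed URL's host
function is_offsite($link, $seed_url)
{
    return get_host($link) != get_host($seed_url);
}

$SEED_URL      = "http://www.schrenk.com";   // sample seed URL
$ALLOW_OFFSITE = false;                      // same meaning as in the spider

$links = array(
    "http://WWW.Schrenk.com/store/product_list.php",
    "http://example.com/index.html",
);

foreach($links as $link)
{
    if($ALLOW_OFFSITE == false && is_offsite($link, $SEED_URL))
        echo "Ignored offsite link: $link\n";
    else
        echo "Kept: $link\n";
}
```

Lowercasing the host means that links differing only in the case of the host name are still recognized as onsite, which the case-sensitive str_replace() calls in Listing 18-5 do not guarantee.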
excluded_link()
This function examines each link and determines if it should be included in the archive of harvested links. Reasons for excluding a link may include the following:
The link is contained within JavaScript
The link already appears in the archive
The link contains keywords listed in the exclusion array
The link is to a different domain
function excluded_link($spider_array, $link)
{
    # Initialization
    global $exclusion_array, $ALLOW_OFFSITE, $SEED_URL;
    $exclude = false;

    // Exclude links that are JavaScript commands
    if(stristr($link, "javascript"))
    {
        echo "Ignored JavaScript function: $link\n";
        $exclude = true;
    }

    // Exclude redundant links (search every penetration level)
    foreach($spider_array as $level_links)
    {
        if(in_array($link, $level_links, true))
        {
            echo "Ignored redundant link: $link\n";
            $exclude = true;
            break;
        }
    }

    // Exclude links found in $exclusion_array
    foreach($exclusion_array as $keyword)
    {
        if(stristr($link, $keyword))
        {
            echo "Ignored excluded link: $link\n";
            $exclude = true;
            break;
        }
    }

    // Exclude offsite links if requested
    if($ALLOW_OFFSITE == false)
    {
        if(get_domain($link) != get_domain($SEED_URL))
        {
            echo "Ignored offsite link: $link\n";
            $exclude = true;
        }
    }
    return $exclude;
}
Listing 18-6: Excluding unwanted links
There are several reasons to exclude links. For example, it's best to ignore any links referenced within JavaScript because—without a proper JavaScript interpreter—those links may yield unpredictable results. Removing redundant links makes the spider run faster and reduces the amount of data the spider needs to manage. The exclusion list allows the spider to ignore undesirable links to places like Google AdSense, banner ads, or other places you don't want the spider to go.
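As an illustration of the exclusion list, the sketch below sets up a sample $exclusion_array and applies the same case-insensitive stristr() test used in Listing 18-6. The keywords, the sample links, and the helper find_excluded_keyword() are assumptions chosen for the example; pick keywords that suit your own targets.

```php
<?php
// Illustrative exclusion keywords (assumed values, not from the book)
$exclusion_array = array("googlesyndication", "doubleclick", "/banners/");

// Hypothetical helper: returns the first keyword found in $link, or false
function find_excluded_keyword($link, $exclusion_array)
{
    foreach($exclusion_array as $keyword)
    {
        // stristr() is case-insensitive, as in Listing 18-6
        if(stristr($link, $keyword))
            return $keyword;
    }
    return false;
}

$links = array(
    "http://pagead2.googlesyndication.com/pagead/show_ads.js",
    "http://www.schrenk.com/banners/promo.gif",
    "http://www.schrenk.com/store/product_list.php",
);

foreach($links as $link)
{
    $keyword = find_excluded_keyword($link, $exclusion_array);
    if($keyword !== false)
        echo "Excluded ($keyword): $link\n";
    else
        echo "Kept: $link\n";
}
```

Because the match is a plain substring test, short keywords can exclude more links than intended, so longer and more specific keywords are the safer choice.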