Webbots, Spiders, and Screen Scrapers - Michael Schrenk [49]

design. The complete script for the anonymizer project is available on this book's website.[33] For clarity, only script highlights are described in detail here.

Downloading and Preparing the Target Web Page

After initializing libraries and variables, which is done in Listing 10-1, the anonymizer downloads and prepares the target web page for later processing. Note that the anonymizer makes use of the parsing and HTTP libraries described in Part I.

# Download the target web page

$page_array = http_get($target_webpage, $ref="", GET, $data_array="", EXCL_HEAD);

# Clean up the HTML formatting with Tidy

$web_page = tidy_html($page_array['FILE']);

# Get the base page address so we can create fully resolved addresses later

$page_base = get_base_page_address($page_array['STATUS']['url']);

# Remove JavaScript and HTML comments from web page

$web_page = remove($web_page, "<script", "</script>");

$web_page = remove($web_page, "<!--", "-->");

Listing 10-1: Downloading and prepping the target web page
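
The cleanup step in Listing 10-1 relies on the book's parsing library. For readers without that library at hand, here is a minimal sketch of the same idea using PHP's built-in PCRE functions. The function name strip_scripts_and_comments() is hypothetical and is not part of the book's libraries; regular expressions are a stand-in here and will not cope with every malformed page.

```php
<?php
// Hypothetical helper (not from the book's libraries): removes <script>
// blocks and HTML comments, roughly what the remove() calls above do.
function strip_scripts_and_comments(string $html): string
{
    // Drop <script ...>...</script>, case-insensitive, spanning newlines
    $html = preg_replace('#<script\b.*?</script>#is', '', $html) ?? $html;

    // Drop <!-- ... --> comments, spanning newlines
    $html = preg_replace('#<!--.*?-->#s', '', $html) ?? $html;

    return $html;
}

$sample = "<p>Hi</p><script>alert(1);</script><!-- note --><p>Bye</p>";
echo strip_scripts_and_comments($sample), "\n";
```

Stripping scripts before rewriting links keeps client-side code from generating addresses that bypass the anonymizer.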

Modifying the <base> Tag

After prepping the target web page, the <base> tag is either inserted or modified so that all relative page addresses resolve correctly from the anonymizer's URL. This is shown in Listing 10-2.

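$new_base_value = "<base href=\"".$page_base."\">";

# Only add a <base> tag if the page does not already have one

if(!stristr($web_page, "<base"))

{

# If there is a <head>, insert <base> at beginning of <head>

if(stristr($web_page, "<head"))

{

# str_ireplace() stands in for eregi_replace(), which was removed in PHP 7

$web_page = str_ireplace("<head>", "<head>\n".$new_base_value, $web_page);

}

# Else insert a <head> containing the <base> at beginning of web page

else

{

$web_page = "<head>\n".$new_base_value."\n</head>" . $web_page;

}

}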

Listing 10-2: Adjusting the target page's <base> tag
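
A literal search for "<head>" misses a head tag that carries attributes. The sketch below is one way to handle that case, assuming PHP 7.4 or later. ensure_base_tag() is a hypothetical name introduced for illustration; it is not the book's code.

```php
<?php
// Hypothetical helper: make sure the page has a <base> tag pointing at
// $page_base, so relative links resolve against the original site.
function ensure_base_tag(string $html, string $page_base): string
{
    // Leave the page alone if it already declares a <base> tag
    if (preg_match('/<base\b/i', $html)) {
        return $html;
    }

    $base_tag = '<base href="' . htmlspecialchars($page_base, ENT_QUOTES) . '">';

    // Insert right after the first opening <head ...> tag, attributes included.
    // A callback is used so "$" or "\" in the URL is never read as a backreference.
    $count = 0;
    $result = preg_replace_callback(
        '/<head\b[^>]*>/i',
        fn(array $m): string => $m[0] . "\n" . $base_tag,
        $html,
        1,
        $count
    );
    if ($result !== null && $count > 0) {
        return $result;
    }

    // No <head> found: prepend a minimal one
    return "<head>\n" . $base_tag . "\n</head>\n" . $html;
}

echo ensure_base_tag('<html><head lang="en"><title>t</title></head></html>',
                     'http://www.example.com/dir/'), "\n";
```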

Parsing the Links

The next step is to create an array of all the links on the page, which is done with the script in Listing 10-3.

$a_tag_array = parse_array($web_page, "<a", "</a>");

Listing 10-3: Creating an array of all the links (anchor tags)
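
To see what such an array of anchor tags looks like without the book's parsing library, a rough equivalent can be sketched with preg_match_all(). The name extract_anchor_tags() is hypothetical, and the pattern assumes reasonably well-formed, non-nested anchor tags.

```php
<?php
// Hypothetical stand-in for the parsing step: return every complete
// <a ...>...</a> element found in the page, in document order.
function extract_anchor_tags(string $html): array
{
    // \b keeps tags such as <abbr> from matching; "s" lets "." span newlines
    preg_match_all('#<a\b[^>]*>.*?</a>#is', $html, $matches);
    return $matches[0];
}

$page = '<p><a href="one.html">One</a> and <abbr>x</abbr> '
      . '<A HREF="/two.html">Two</A></p>';
print_r(extract_anchor_tags($page));
```

Each array element holds the whole tag, including its link text, which is what the substitution loop needs in order to swap the old tag for the rewritten one.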

Substituting the Links

After parsing links into an array, the code loops through each link. This loop, shown in Listing 10-4, performs the following steps:

1. Parse the hyper-reference attribute for each link.

2. Convert the hyper-reference into a fully resolved URL.

3. Convert the fully resolved URL into the following format:

anonymizer_address?v=base64_encoded_hyper_reference

4. Substitute the original hyper-reference with the URL created in the previous step, which combines the anonymizer's address with the original link passed as a variable.

for($xx=0; $xx<count($a_tag_array); $xx++)

{

// Get the original href value

$original_href = get_attribute($a_tag_array[$xx], "href");

// Convert href to a fully resolved address

$fully_resolved_href = get_fully_resolved_address($original_href, $page_base);

// Substitute the original href with "this_page?v=fully resolved address"

$substitution_tag = str_replace($original_href,

trim($this_page."?v=".base64_encode($fully_resolved_href)),

$a_tag_array[$xx]);

// Substitute the original tag with the new one

$web_page = str_replace($a_tag_array[$xx], $substitution_tag, $web_page);

}

Listing 10-4: Substituting links with coded links that re-reference the anonymizer
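
The encoding step can be tried in isolation. The sketch below builds one proxied link and then recovers the original address from it. build_proxied_href() and the example addresses are hypothetical. Unlike Listing 10-4, the sketch also passes the Base64 text through rawurlencode(), because Base64 output may contain "+", "/", and "=", and a "+" in a query string is decoded as a space.

```php
<?php
// Hypothetical helper (not from the book's libraries): build the link that
// routes a target URL back through the anonymizer.
function build_proxied_href(string $anonymizer_address, string $target_url): string
{
    $encoded = base64_encode($target_url);

    // rawurlencode() protects "+", "/" and "=" inside the query string
    return $anonymizer_address . '?v=' . rawurlencode($encoded);
}

$href = build_proxied_href('anonymizer.php',
                           'http://www.example.com/news/index.html?id=7');
echo $href, "\n";

// Round trip: pull the "v" variable back out and decode it
parse_str(parse_url($href, PHP_URL_QUERY), $query);
echo base64_decode($query['v']), "\n";
```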

Displaying the Proxied Web Page

Once all the links are processed, the anonymizer sends the newly processed web page to the requesting web surfer's browser, as shown in Listing 10-5.

# Display the processed target web page

echo $web_page;

Listing 10-5: Displaying the proxied web page
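
When a surfer follows one of the rewritten links, the request returns to the anonymizer with the target address carried in the v variable. This excerpt does not show how the book's script reads that variable, so the following is only a sketch of one cautious way to do it. decode_target() is a hypothetical name, and the restriction to http and https is an added assumption.

```php
<?php
// Hypothetical helper: recover the target URL from the request's query
// variables, returning null when "v" is missing or not acceptable.
function decode_target(array $query): ?string
{
    if (!isset($query['v']) || !is_string($query['v'])) {
        return null;
    }

    // Strict mode rejects characters outside the Base64 alphabet
    $decoded = base64_decode($query['v'], true);
    if ($decoded === false) {
        return null;
    }

    // Only fetch web addresses, never local files or other schemes
    $scheme = strtolower((string) parse_url($decoded, PHP_URL_SCHEME));
    if (!in_array($scheme, ['http', 'https'], true)) {
        return null;
    }

    return $decoded;
}

// In the anonymizer itself, $_GET would be passed in place of this array
$target_webpage = decode_target(['v' => base64_encode('http://www.example.com/')]);
echo $target_webpage ?? 'no valid target', "\n";
```

Validating the decoded value matters because anyone can craft a link to the anonymizer with an arbitrary v value.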

That's all there is to it. The important thing is to design the anonymizer so all links displayed in the anonymizer's window re-reference the anonymizer with a $_GET variable that identifies the actual page to download. This is really not that hard to do, but as mentioned earlier, this anonymizer does not handle forms, cookies, JavaScript, frames, or more advanced web design techniques. That being said, it's a good place to start, and you should use this script to further explore the concept of anonymizing. With a few modifications, you could write web proxies that modify web content
