Webbots, Spiders, and Screen Scrapers - Michael Schrenk [49]
design. The complete script for the anonymizer project is available on this book's website.[33] For clarity, only script highlights are described in detail here.
Downloading and Preparing the Target Web Page
After initializing libraries and variables (Listing 10-1), the anonymizer downloads the target web page and prepares it for later processing. Note that the anonymizer uses the parsing and HTTP libraries described in Part I.
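One preparation step in the listing that follows is a call to get_base_page_address(), which makes it possible to create fully resolved addresses later. The library's own implementation is not shown in this section, so the following is only a minimal sketch of the idea, built on PHP's parse_url(). The function names sketch_base_page_address() and sketch_resolve_address() and the example.com addresses are hypothetical and are not part of the book's libraries.

```php
<?php
// Hypothetical helper, NOT the library's get_base_page_address().
// Returns scheme://host[:port]/directory/ for an absolute URL,
// dropping any file name, query string, and fragment.
function sketch_base_page_address(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return '';   // Not an absolute URL, so there is no base to report
    }
    $base = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $base .= ':' . $parts['port'];
    }
    // A URL with no path is treated as the site root
    $path = $parts['path'] ?? '/';
    // Keep everything up to and including the last slash; a final
    // segment without a trailing slash is assumed to be a file name
    return $base . substr($path, 0, strrpos($path, '/') + 1);
}

// Hypothetical helper: turn a link found on the page into a fully
// resolved address. Handles absolute, root-relative, and simple
// relative links only; "../" segments are not collapsed.
function sketch_resolve_address(string $link, string $page_base): string
{
    // Links that already carry a scheme are left unchanged
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }
    // Root-relative links attach to the scheme and host only
    if (substr($link, 0, 1) === '/') {
        $parts  = parse_url($page_base);
        $origin = $parts['scheme'] . '://' . $parts['host'];
        if (isset($parts['port'])) {
            $origin .= ':' . $parts['port'];
        }
        return $origin . $link;
    }
    // Everything else is taken as relative to the page's directory
    return $page_base . $link;
}
```

Under these assumptions, a page downloaded from http://www.example.com/a/b/page.html?x=1 would have the base http://www.example.com/a/b/, and a link written as pic.jpg on that page would resolve to http://www.example.com/a/b/pic.jpg.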
# Download the target web page
$page_array = http_get($target_webpage, $ref="", GET, $data_array="", EXCL_HEAD);
# Clean up the HTML formatting with Tidy
$web_page = tidy_html($page_array['FILE']);
# Get the base page address so we can create fully resolved addresses later
$page_base = get_base_page_address($page_array['STATUS']['url']);
# Remove JavaScript and HTML comments from web page
$web_page = remove($web_page, " -->