Webbots, Spiders, and Screen Scrapers - Michael Schrenk [46]
[28] Parsing functions are explained in Chapter 4.
[29] The official reference for HTTP codes is available on the World Wide Web Consortium's website (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).
Running the Webbot
Since the output of this webbot contains formatted HTML, it is appropriate to run this webbot within a browser, as shown in Figure 9-2.
Figure 9-2. Running the link-verification webbot
This webbot counts and identifies all the links on the target website. It also indicates the HTTP code and diagnostic message describing the status of the fetch used to download the page and displays the actual amount of time it took the page to load.
Let's take this time to look at some of the libraries used by this webbot.
LIB_http_codes
The following script creates an array of HTTP status codes and their definitions, indexed by code. To use the array, simply include the library, use your HTTP code value as the array index, and echo the message, as shown in Listing 9-8.
include("LIB_http_codes.php");
echo $status_code_array[$YOUR_HTTP_CODE]['MSG'];
Listing 9-8: Decoding an HTTP code with LIB_http_codes
LIB_http_codes is essentially a group of array declarations, with the first index being the HTTP code and the second index, ['MSG'], selecting the status message text. Like the other libraries, this one is available for download from this book's website.
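To make that structure concrete, here is a minimal sketch of what a few such declarations might look like. It is an illustration rather than an excerpt from the library: only the $status_code_array name and the ['MSG'] index come from the text above, the four codes shown are a small sample, and describe_http_code() is a hypothetical helper added to guard against codes that are missing from the array.

```php
<?php
// Illustrative sample only; the real LIB_http_codes may differ in detail.
$status_code_array = array();
$status_code_array[200]['MSG'] = "OK";
$status_code_array[301]['MSG'] = "Moved Permanently";
$status_code_array[404]['MSG'] = "Not Found";
$status_code_array[500]['MSG'] = "Internal Server Error";

// Hypothetical helper (not part of the library): look up a code and fall
// back to a placeholder when the code has no entry in the array.
function describe_http_code($status_code_array, $code)
{
    return isset($status_code_array[$code]['MSG'])
        ? $status_code_array[$code]['MSG']
        : "Unknown status code";
}

echo describe_http_code($status_code_array, 404) . "\n";   // prints "Not Found"
```

Indexing the array directly with a code that has no entry raises a PHP notice or warning, which is why the sketch checks with isset() first.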
LIB_resolve_addresses
The library that creates fully resolved addresses, LIB_resolve_addresses, is also available for download at the book's website.
Note
Before you download and examine this library, be warned that creating fully resolved URLs is a lot like making sausage—while you might enjoy how sausage tastes, you probably wouldn't like watching those lips and ears go into the grinder. Simply put, the act of converting relative links into fully resolved URLs involves awkward, asymmetrical code with numerous exceptions to rules and many special cases. This library is extraordinarily useful, but it isn't made up of pretty code.
If you don't need to see how this conversion is done, there's no reason to look. If, on the other hand, you're intrigued by this description, feel free to download the library from the book's website and see for yourself. More importantly, if you find a cleaner solution, please upload it to the book's website to share it with the community.
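For readers who want a feel for the problem without opening the library, the following is a deliberately simplified sketch of resolving a link against the address of the page it was found on. It is not the code in LIB_resolve_addresses: sketch_resolve_address() is a hypothetical name, the function assumes the page address is itself a fully resolved http or https URL, and it ignores complications such as a base tag in the page. Even at this size, most of the function is special cases.

```php
<?php
// Simplified sketch; NOT the implementation in LIB_resolve_addresses.
// Assumes $page_url is a fully resolved URL such as http://host/dir/page.html.
function sketch_resolve_address($link, $page_url)
{
    // Case 1: the link already has a scheme (http:, https:, mailto:, ...).
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $link)) {
        return $link;
    }

    $parts = parse_url($page_url);
    $root  = $parts['scheme'] . '://' . $parts['host']
           . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path  = isset($parts['path']) ? $parts['path'] : '/';

    // Case 2: protocol-relative link (//host/path) borrows the page's scheme.
    if (substr($link, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $link;
    }
    // Case 3: empty or fragment-only link refers to the current page.
    if ($link === '' || $link[0] === '#') {
        $query = isset($parts['query']) ? '?' . $parts['query'] : '';
        return $root . $path . $query . $link;
    }
    // Case 4: query-only link keeps the current page's path.
    if ($link[0] === '?') {
        return $root . $path . $link;
    }

    // Case 5: root-relative (/x) or relative to the current page's directory.
    if ($link[0] === '/') {
        $combined = $link;
    } else {
        $combined = substr($path, 0, strrpos($path, '/') + 1) . $link;
    }

    // Set any query string or fragment aside before touching the path.
    preg_match('/^([^?#]*)(.*)$/s', $combined, $m);
    $segments = explode('/', $m[1]);
    $last     = end($segments);

    // Remove "." segments and let each ".." cancel the segment before it.
    // (Empty segments are dropped too, so "a//b" collapses to "a/b".)
    $out = array();
    foreach ($segments as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    $resolved = '/' . implode('/', $out);

    // Keep the trailing slash when the link pointed at a directory.
    if (($last === '' || $last === '.' || $last === '..') && $resolved !== '/') {
        $resolved .= '/';
    }
    return $root . $resolved . $m[2];
}

echo sketch_resolve_address("../up.html", "http://www.example.com/dir/page.html") . "\n";
// prints "http://www.example.com/up.html"
```

The example.com addresses are placeholders. A production resolver has more cases to cover than this sketch does, which is exactly the point of the warning above.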
Further Exploration
You can expand this basic webbot to do a variety of very useful things. Here is a short list of ideas to get you started on advanced designs.
Create a web page with a form that allows people to enter and test the links of any web page.
Schedule a link-verification bot to run periodically to ensure that links on web pages remain current. (For information on scheduling webbots, read Chapter 23.)
Modify the webbot to send email notifications when it finds dead links. (More information on webbots that send email is available in Chapter 16.)
Encase the webbot in a spider to check the links on an entire website.
Convert this webbot into a function that is called directly from PHP. (This idea is explored in Chapter 17.)
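As a starting point for the email-notification idea in the list above, here is a minimal sketch of a function that turns link-check results into the body of a notification message. Everything in it is an assumption made for illustration: build_dead_link_report() is a hypothetical name, the results are assumed to be an array of entries with 'URL' and 'CODE' keys, and a link is treated as dead when its code is 400 or higher, or 0 (no response at all).

```php
<?php
// Hypothetical helper: builds the text of a dead-link notification.
// $results is assumed to be a list of arrays with 'URL' and 'CODE' keys.
function build_dead_link_report($results)
{
    $lines = array();
    foreach ($results as $result) {
        // Treat 4xx and 5xx responses, and 0 (no response), as dead links.
        if ($result['CODE'] == 0 || $result['CODE'] >= 400) {
            $lines[] = $result['CODE'] . "  " . $result['URL'];
        }
    }

    // An empty string tells the caller there is nothing to report.
    if (count($lines) == 0) {
        return "";
    }
    return "Dead links found: " . count($lines) . "\n"
         . implode("\n", $lines) . "\n";
}

// Placeholder data standing in for the webbot's real results.
$sample_results = array(
    array('URL' => "http://www.example.com/",        'CODE' => 200),
    array('URL' => "http://www.example.com/missing", 'CODE' => 404),
);
echo build_dead_link_report($sample_results);
```

When the returned string is not empty, it could be passed to PHP's built-in mail() function or to whichever mail routine your webbot already uses.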
Chapter 10. ANONYMOUS BROWSING WEBBOTS
The Internet is a public place, and as in any other community, web surfers leave telltale clues of where they've been and what they've done. While many people feel anonymous online, the fact is that server logs, cookies, and browser caches leave little doubt as to what happens on the Internet. While total online anonymity is nearly impossible, you can cloak your activity through a specialized webbot called a proxybot, or simply a proxy. This chapter investigates applications for proxies and later explores a webbot proxy project that provides anonymous web browsing.
Anonymity with Proxies
A proxy is a special type of webbot that serves as an intermediary between webservers and clients. Proxies have many uses, including banning people from browsing prohibited websites, blocking banner advertisements, and inhibiting suspect scripts from running on browsers.
One of the more popular proxies is Squid, a web proxy that, among other