Webbots, Spiders, and Screen Scrapers - Michael Schrenk [91]
Figure 21-7. Session variable visible in the query string (URL)
Like the cookie session example, the query session example first emulates the login form. Then it parses the session value from the authenticated result and includes the session value in the query string of each page it requests. A script capable of downloading pages from the practice pages for query session authentication is shown in Listing 21-4.
<?php
# Include libraries
include("LIB_http.php");
include("LIB_parse.php");
# Request the login page
$domain = "http://www.schrenk.com/";
$target = $domain."nostarch/webbots/query_authentication";
$page_array = http_get($target, $ref="");
echo $page_array['FILE']; // Display the login page
sleep(2); // Include small delay between page fetches
echo "\n";
# Send the query authentication form
$login = $domain."nostarch/webbots/query_authentication/index.php";
$data_array['enter'] = "Enter";
$data_array['username'] = "webbot";
$data_array['password'] = "sp1der3";
$page_array = http_post_form($login, $ref=$target, $data_array);
echo $page_array['FILE']; // Display first page after login page
sleep(2); // Include small delay between page fetches
echo "\n";
# Parse session variable
$session = return_between($page_array['FILE'], "session=", "\"", EXCL);
# Request subsequent pages using the session variable
$page2 = $target . "/index2.php?session=".$session;
$page_array = http_get($page2, $ref="");
echo $page_array['FILE']; // Display page two
?>
Listing 21-4: Authenticating a webbot on a page using query sessions
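The parsing step hinges on return_between() from LIB_parse. As a rough sketch of what that helper does (the function name and the EXCL flag come from the book's library, but this standalone version is my own reconstruction, not the library's actual source), it could look like this:

```php
<?php
// Hypothetical stand-in for LIB_parse's return_between(); the real
// library's implementation may differ. EXCL means the start and stop
// delimiters are excluded from the returned value.
define("EXCL", true);

function return_between_sketch($string, $start, $stop, $type)
{
    $start_pos = strpos($string, $start);
    if ($start_pos === false)
        return "";                              // Start delimiter not found
    $value_pos = $start_pos + strlen($start);   // First character after $start
    $stop_pos = strpos($string, $stop, $value_pos);
    if ($stop_pos === false)
        return "";                              // Stop delimiter not found
    if ($type == EXCL)                          // Exclude the delimiters
        return substr($string, $value_pos, $stop_pos - $value_pos);
    return substr($string, $start_pos, $stop_pos + strlen($stop) - $start_pos);
}

// Example: pull a session value out of a link in fetched HTML
$html = '<a href="index2.php?session=a1b2c3">Next page</a>';
echo return_between_sketch($html, "session=", "\"", EXCL);  // a1b2c3
?>
```

Calling it with "session=" as the start delimiter and the closing quotation mark as the stop delimiter, as Listing 21-4 does, captures just the session value itself.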
Figure 21-8. Output of Listing 21-4
Final Thoughts
Here are a few additional things to remember when writing webbots that access password-protected websites.
For clarity, the examples in this chapter use a minimal amount of code to perform a task. In actual use, you'll want to follow the comprehensive practices mentioned elsewhere in this book for downloading pages, parsing results, emulating forms, using cURL, and writing fault-tolerant webbots.
It's important to note that no form of online authentication is effective unless it is accompanied by encryption. After all, it does little good to authenticate users if sensitive information is sent across the network in cleartext, which can be read by anyone with a packet sniffer.[66] In most cases, authentication will be combined with encryption. For more information about webbots and encryption, revisit Chapter 20.
If your webbot communicates with more than one domain, you need to be careful not to broadcast your login credentials when writing webbots that use basic authentication. For example, if you hard-code your username and password into a PHP/CURL routine, make sure that you don't use the same function when fetching pages from other domains. This sounds silly, but I've seen it happen, resulting in cleartext login credentials in server log files.
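One defensive pattern is to attach credentials only when the request actually goes to the host they belong to. The helper below is a sketch of that idea, not part of the book's LIB_http library; the function and variable names are my own, and it assumes PHP's standard cURL extension:

```php
<?php
// Sketch: only send basic authentication credentials to the one host
// they were issued for. Names here are illustrative, not from LIB_http.
function credentials_for($url, $auth_host, $username, $password)
{
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === $auth_host)
        return $username . ":" . $password;  // Safe: request targets our host
    return null;                             // Any other domain: send nothing
}

function fetch_with_scoped_auth($url, $auth_host, $username, $password)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $userpwd = credentials_for($url, $auth_host, $username, $password);
    if ($userpwd !== null)
        curl_setopt($ch, CURLOPT_USERPWD, $userpwd);  // Adds the basic auth header
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
?>
```

Because the credential check keys on the parsed hostname rather than on which function happened to build the request, a redirect or a copied-and-pasted fetch routine can't silently leak the username and password to a third-party server.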
Websites may use a combination of two or more authentication types. For example, an authenticated site might use both query and cookie sessions. Make sure that you account for all potential authentication schemes before releasing your webbots.
The latest versions of all the scripts used in this chapter are available for download at this book's website.
* * *
[66] A packet sniffer is a special type of agent that lets people read raw network traffic.
Chapter 22. ADVANCED COOKIE MANAGEMENT
In the previous chapter, you learned how to use cookies to authenticate webbots to access password-protected websites. This chapter further explores cookies and the challenges they present to webbot developers.
How Cookies Work
Cookies are small pieces of ASCII data that websites store on your computer. Without cookies, websites cannot distinguish new visitors from those who return every day. Cookies add persistence, the ability to recognize people who have previously visited the site, to an otherwise stateless environment. Through the magic of cookies,