Webbots, Spiders, and Screen Scrapers - Michael Schrenk [88]
Example Scripts and Practice Pages
We'll explore three types of online authentication. For each case, you'll receive examples of authentication scripts designed specifically to work with password-protected sections of this book's website. You can experiment (and make mistakes) on these practice pages before writing authenticating webbots that work on real websites. The location of the practice pages is shown in Table 21-2.
Table 21-2. Location of Authentication Practice Pages on the Book's Website
Authentication Method    Location of Practice Pages
Basic authentication     http://www.schrenk.com/nostarch/webbots/basic_authentication
Cookie sessions          http://www.schrenk.com/nostarch/webbots/cookie_authentication
Query sessions           http://www.schrenk.com/nostarch/webbots/query_authentication
For simplicity, all of the authentication examples on the book's website use the login criteria shown in Table 21-3.
Table 21-3. Login Criteria Used for All Authentication Practice Pages
Username    Password
webbot      sp1der3
Basic Authentication
The most common form of online authentication is basic authentication. Basic authentication is a dialogue between the webserver and the browsing agent in which the login credentials are requested and processed, as shown in Figure 21-1.
Web pages subject to basic authentication exist in what's called a realm. Generally, realms refer to all web pages in the current server directory as well as the web pages in sub-directories. Fortunately, browsers shield people from many of the details defined in Figure 21-1. Once you authenticate yourself with a browser, it appears that you don't re-authenticate yourself when accessing other pages within the realm. In reality, the dialogue from Figure 21-1 happens for each page downloaded within the realm. Your browser automatically resubmits your authentication credentials without asking you again for your username and password. When accessing a basic authenticated website with a webbot, you will need to send your login credentials every time the webbot requests a page within the authenticated realm, as shown later in the example script.
Figure 21-1. Basic authentication dialogue
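The credentials that are resubmitted with every request travel in an Authorization request header. The following is a minimal sketch, assuming the standard basic authentication encoding (the word Basic followed by the base64 encoding of the username and password joined by a colon). The helper name basic_auth_header is hypothetical and is not part of the book's code.

```php
<?php
# Hypothetical helper (not from the book): build the request header
# that carries basic authentication credentials.
function basic_auth_header($username, $password)
{
    // The username and password are joined with a colon, then base64 encoded.
    // base64 is an encoding, not encryption; anyone who sees it can decode it.
    return "Authorization: Basic " . base64_encode($username . ":" . $password);
}

# Login criteria for the practice pages (Table 21-3)
$header = basic_auth_header("webbot", "sp1der3");
echo $header . "\n";
```

PHP/CURL builds this header on its own when CURLOPT_USERPWD is set, so a webbot rarely needs to construct it by hand. The sketch is only meant to show what is sent with each page request inside the realm.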
Before you write an auto-authenticating webbot, you should first visit the target website and manually authenticate yourself into the site with a browser. This way you can validate your login credentials and learn about the target site before you design your webbot. When you request a web page from the book's basic authentication test area, your browser will initially present a login form for entering usernames and passwords, as shown in Figure 21-2.
Figure 21-2. Basic authentication login form
After entering your username and password, you will gain access to a simple set of practice pages (shown in Figure 21-3) for testing auto-authenticating webbots and basic authentication. You should familiarize yourself with these simple pages before reading further.
Figure 21-3. Basic authentication test pages
The commands required to download a web page with basic authentication are very similar to those required to download a page without authentication. The only change is that you need to configure the CURLOPT_USERPWD option to pass the login credentials to PHP/CURL. The format for login credentials is the username and password separated by a colon, as shown in Listing 21-1.
<?php
# Define target page
$target = "http://www.schrenk.com/nostarch/webbots/basic_authentication/index.php";
# Define login credentials for this page
$credentials = "webbot:sp1der3";
# Create the cURL session
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target); // Define target site
curl_setopt($ch, CURLOPT_USERPWD, $credentials); // Send credentials
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
# Echo page
$page = curl_exec($ch); // Place web page into a string
echo $page; // Echo downloaded page
# Close the cURL session
curl_close($ch);
?>
Listing 21-1: The minimal code required to access the basic authentication test pages
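Because the credentials must accompany every request, it may be convenient to wrap the commands from Listing 21-1 in a function and to check whether the server accepted the login. The following is a sketch under stated assumptions, not the book's own code: the function names fetch_with_basic_auth and login_was_accepted are hypothetical, and the check assumes that a successful download returns HTTP status 200, while a 401 status means the credentials were missing or rejected.

```php
<?php
# Hypothetical helper: download one page with basic authentication and
# return both the HTTP status code and the page contents.
function fetch_with_basic_auth($target, $credentials)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target);               // Define target page
    curl_setopt($ch, CURLOPT_USERPWD, $credentials);      // Send credentials (username:password)
    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);   // Use basic authentication explicitly
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);       // Return page in string
    $page   = curl_exec($ch);                             // FALSE if the transfer failed
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);      // 0 if no response was received
    curl_close($ch);
    return array('status' => $status, 'page' => $page);
}

# Hypothetical helper: TRUE only when a page arrived with status 200.
# A 401 status means the credentials were missing or rejected.
function login_was_accepted($result)
{
    return $result['page'] !== FALSE && $result['status'] == 200;
}
```

A webbot could call fetch_with_basic_auth() with the target and credentials from Listing 21-1 for each page it needs within the realm, and stop if login_was_accepted() returns FALSE for the result instead of parsing an error page.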