Webbots, Spiders, and Screen Scrapers - Michael Schrenk [90]
Figure 21-5. The login page for the cookie authentication example
Regardless of the authentication method used by your target web page, it's vitally important to explore your target screens with a browser before writing self-authenticating webbots. This is especially true in this example, because your webbot must emulate the login form. You should take this time to explore the cookie authentication pages on this book's website. View the source code for each page, and see how the code works. Use your browser to monitor the values of the cookies the web pages use. Now is also a good time to preview Chapter 22.
Figure 21-6 shows an example of the screens that lay beyond the login screen.
Figure 21-6. The example cookie session page from the book's website
Cookie Session Example
A webbot must do the following to authenticate itself to a website that uses cookie sessions:
Download the web page with the login form
Emulate the form that gathers the login credentials
Capture the cookie written by the server
Provide the session cookie to the server on each page request
The script in Listing 21-3 first downloads the login page as a normal user would with a browser. As it emulates the form that sends the login credentials, it uses the CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR options to tell cURL where the cookies should be written and where to find the cookies that are read by the server. To most people (myself included), it seems redundant to have one set of outbound cookies and another set of inbound cookies. In every case I've seen, webbots use the same file to write and read cookies. It's important to note that PHP/CURL will always save cookies to a file, even when the cookie has no expiration date. This presents some interesting problems, which are explained in Chapter 22.
# Define target page
$target = "http://www.schrenk.com/nostarch/webbots/cookie_authentication/index.php";
# Define the login form data
$form_data="enter=Enter&username=webbot&password=sp1der3";
# Create the cURL session
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target); // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt"); // Tell cURL where to write
cookies
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt"); // Tell cURL which cookies
to send
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $form_data);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
# Execute the PHP/CURL session and echo the downloaded page
$page = curl_exec($ch);
echo $page;
# Close the cURL session
curl_close($ch);
?>
Listing 21-3: Auto-authentication with cookie sessions
Once the session cookie is written, your webbot should be able to download any authenticated page, as long as the cookie is presented to the website by your cURL session. Just one word of caution: Depending on your version of cURL, you may need to use a complete path when defining your cookie file.
Authentication with Query Sessions
Query string sessions are nearly identical to cookie sessions, the difference being that instead of storing the session value in a cookie, the session value is stored in the query string. Other than this difference, the process is identical to the protocol describing cookie session authentication (outlined in Figure 21-4). Query sessions create additional work for website developers, as the session value must be tacked on to all links and included in all form submissions. Yet some web developers (myself included) prefer query sessions, as some browsers and proxies restrict the use of cookies and make cookie sessions difficult to implement.
This is a good time to manually explore the test pages for query authentication on the website. Once you enter your username and password, you'll notice that the authentication session value is visible in the URL as a GET value, as shown in Figure 21-7. However, this may not be the case in all