Webbots, Spiders, and Screen Scrapers - Michael Schrenk [88]

Example Scripts and Practice Pages

We'll explore three types of online authentication. For each case, you'll receive examples of authentication scripts designed specifically to work with password-protected sections of this book's website. You can experiment (and make mistakes) on these practice pages before writing authenticating webbots that work on real websites. The location of the practice pages is shown in Table 21-2.

Table 21-2. Location of Authentication Practice Pages on the Book's Website

Authentication Method    Location of Practice Pages
Basic authentication     http://www.schrenk.com/nostarch/webbots/basic_authentication
Cookie sessions          http://www.schrenk.com/nostarch/webbots/cookie_authentication
Query sessions           http://www.schrenk.com/nostarch/webbots/query_authentication

For simplicity, all of the authentication examples on the book's website use the login criteria shown in Table 21-3.

Table 21-3. Login Criteria Used for All Authentication Practice Pages

Username    Password
webbot      sp1der3

Basic Authentication

The most common form of online authentication is basic authentication. Basic authentication is a dialogue between the webserver and browsing agent in which the login credentials are requested and processed, as shown in Figure 21-1.

Web pages subject to basic authentication exist in what's called a realm. Generally, realms refer to all web pages in the current server directory as well as the web pages in sub-directories. Fortunately, browsers shield people from many of the details defined in Figure 21-1. Once you authenticate yourself with a browser, it appears that you don't re-authenticate yourself when accessing other pages within the realm. In reality, the dialogue from Figure 21-1 happens for each page downloaded within the realm. Your browser automatically resubmits your authentication credentials without asking you again for your username and password. When accessing a basic authenticated website with a webbot, you will need to send your login credentials every time the webbot requests a page within the authenticated realm, as shown later in the example script.
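To make the per-request dialogue more concrete, the following sketch shows roughly what a browsing agent does behind the scenes: it reads the realm from the server's challenge and builds the Authorization header that accompanies every later request within that realm. The function names and the realm string "Practice Pages" are hypothetical and used only for illustration; the header format (the word Basic followed by the Base64 encoding of username:password) comes from the HTTP Basic scheme. The realm parser is deliberately simplified and assumes the realm is the first parameter in the challenge.

```php
<?php
# Illustrative helpers only; these names are assumptions, not part of the book's libraries.

# Extract the realm name from the value of a WWW-Authenticate challenge header.
# Returns null if the value is not a Basic challenge that starts with a quoted realm.
function parse_basic_realm($challenge)
{
    if (preg_match('/^\s*Basic\s+realm="([^"]*)"/i', $challenge, $matches)) {
        return $matches[1];
    }
    return null;
}

# Build the Authorization header that is sent with every request in the realm.
function build_basic_auth_header($username, $password)
{
    # The receiver splits on the first colon, so a username may not contain one
    if (strpos($username, ':') !== false) {
        throw new InvalidArgumentException('Username must not contain a colon');
    }
    return 'Authorization: Basic ' . base64_encode($username . ':' . $password);
}

# Hypothetical challenge a server might send along with its 401 response
$challenge = 'Basic realm="Practice Pages"';
echo parse_basic_realm($challenge), "\n";

# The header a browser would resend for each page in the realm
echo build_basic_auth_header('webbot', 'sp1der3'), "\n";
```

Because Base64 is an encoding and not encryption, anyone who can read the request can recover the username and password. PHP/CURL builds this header for you, so a webbot normally does not need code like this; the sketch only illustrates what travels over the wire.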

Figure 21-1. Basic authentication dialogue

Before you write an auto-authenticating webbot, you should first visit the target website and manually authenticate yourself into the site with a browser. This way you can validate your login credentials and learn about the target site before you design your webbot. When you request a web page from the book's basic authentication test area, your browser will initially present a login form for entering usernames and passwords, as shown in Figure 21-2.

Figure 21-2. Basic authentication login form

After entering your username and password, you will gain access to a simple set of practice pages (shown in Figure 21-3) for testing auto-authenticating webbots and basic authentication. You should familiarize yourself with these simple pages before reading further.

Figure 21-3. Basic authentication test pages

The commands required to download a web page with basic authentication are very similar to those required to download a page without authentication. The only change is that you need to configure the CURLOPT_USERPWD option to pass the login credentials to PHP/CURL. The format for login credentials is the username and password separated by a colon, as shown in Listing 21-1.

# Define target page
$target = "http://www.schrenk.com/nostarch/webbots/basic_authentication/index.php";

# Define login credentials for this page
$credentials = "webbot:sp1der3";

# Create the cURL session
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target);             // Define target site
curl_setopt($ch, CURLOPT_USERPWD, $credentials);    // Send credentials
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);     // Return page in string

# Echo page
$page = curl_exec($ch);                             // Place web page into a string
echo $page;                                         // Echo downloaded page

# Close the cURL session
curl_close($ch);

Listing 21-1: The minimal code required to access the basic authentication test pages
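Listing 21-1 echoes whatever the server returns, including the error page sent when the credentials are rejected. One possible refinement, sketched below, wraps the same PHP/CURL calls in a function and reports the HTTP status code, since a server answers with status 401 when basic authentication fails. The names fetch_with_basic_auth and is_auth_failure are hypothetical and not part of the book's libraries; treat this as a starting point, not a finished implementation.

```php
<?php
# Hypothetical helpers for illustration; not from the book's libraries.

# A 401 status means the server rejected (or never received) valid credentials.
function is_auth_failure($http_code)
{
    return (int) $http_code === 401;
}

# Download $target using basic authentication.
# Returns an array holding the HTTP status code and the page
# (the page is FALSE if the transfer itself failed).
function fetch_with_basic_auth($target, $username, $password)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target);                          // Define target site
    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);              // Allow basic authentication only
    curl_setopt($ch, CURLOPT_USERPWD, $username . ":" . $password);  // Send credentials
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);                  // Return page in string

    $page = curl_exec($ch);                                 // Place web page into a string
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);     // Status code of the last response

    # The cURL handle is released when $ch goes out of scope at the end of the function
    return array('http_code' => $http_code, 'page' => $page);
}
```

Calling fetch_with_basic_auth() with the target and credentials from Listing 21-1, then passing the returned http_code to is_auth_failure(), lets a webbot stop early instead of parsing an error page as if it were the protected content.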
