Webbots, Spiders, and Screen Scrapers - Michael Schrenk [92]
There are two types of cookies. Temporary cookies are stored in RAM and expire when the client closes his or her browser; permanent cookies live on the client's hard drive and exist until they reach their expiration date (which may be so far into the future that they'll outlive the computer they're on). For example, consider the script in Listing 22-1, which writes one temporary cookie and one permanent cookie that expires in one hour.
# Set cookie that expires when browser closes
setcookie ("TemporaryCookie", "66");
# Set cookie that expires in one hour
setcookie ("PermanentCookie", "88", time() + 3600);
Listing 22-1: Setting permanent and temporary cookies with PHP
Listing 22-1 shows the cookies' names, values, and expiration dates, if required. Figure 22-1 and Figure 22-2 show how the cookies written by the script in Listing 22-1 appear in the privacy settings of a browser.
Figure 22-1. A temporary cookie written from http://www.schrenk.com, with a value of 66
Figure 22-2. A permanent cookie written from http://www.schrenk.com, with a value of 88
Browsers and webservers exchange cookies in HTTP headers. When a browser requests a web page from a webserver, it checks whether it holds any cookies previously stored by that web page's domain. If it finds any, it sends those cookies to the webserver in the HTTP header of the page request. When you execute the cURL command in Figure 22-3, you can see the cookies as they appear in the header returned by the server.
Figure 22-3. Cookies as they appear in the HTTP header sent by the server
A browser will never modify a cookie unless it expires or unless the user erases it using the browser's privacy settings. Servers, however, may write new information to cookies every time they deliver a web page. These new cookie values are then passed to the web browser in the HTTP header, along with the requested web page. According to the specification, a browser will only expose cookies to the domain that wrote them. Webbots, however, are not bound by these rules and can manipulate cookies as needed.
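To make the header exchange concrete, here is a minimal sketch of both directions: pulling name/value pairs out of the Set-Cookie lines a server returns, and formatting them as the Cookie line a client sends back. The function names parse_set_cookie_headers() and build_cookie_header() are hypothetical (they are not part of LIB_http), and the sample header is modeled on the cookies from Listing 22-1 rather than captured from a real server.

```php
<?php
// Hypothetical helper (not part of LIB_http): collect name/value pairs
// from the Set-Cookie lines of a raw HTTP response header.
function parse_set_cookie_headers(string $rawHeader): array
{
    $cookies = [];
    foreach (preg_split('/\r\n|\n|\r/', $rawHeader) as $line) {
        // Header names are case-insensitive
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Keep only the leading "name=value" pair; attributes such as
        // expires= and path= follow the first semicolon
        $pair = trim(explode(';', substr($line, strlen('Set-Cookie:')))[0]);
        if (strpos($pair, '=') === false) {
            continue;
        }
        [$name, $value] = explode('=', $pair, 2);
        $cookies[trim($name)] = trim($value);
    }
    return $cookies;
}

// Hypothetical helper: format stored cookies as the Cookie header
// that a client sends with its next page request.
function build_cookie_header(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Illustrative header only; the values mirror Listing 22-1
$sampleHeader = "HTTP/1.1 200 OK\r\n"
    . "Set-Cookie: TemporaryCookie=66\r\n"
    . "Set-Cookie: PermanentCookie=88; expires=Sun, 24-Sep-2006 17:59:09 GMT\r\n"
    . "Content-Type: text/html\r\n";

$cookies = parse_set_cookie_headers($sampleHeader);
echo build_cookie_header($cookies), "\n";
// prints "Cookie: TemporaryCookie=66; PermanentCookie=88"
```

Because a webbot builds the Cookie line itself, nothing stops it from changing a value or sending a cookie to a different domain, which is exactly the freedom a browser does not have.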
PHP/CURL and Cookies
You can write webbots that support cookies without using PHP/CURL, but doing so adds to the complexity of your designs. Without PHP/CURL, you'll have to read each returned HTTP header, parse the cookies, and store them for later use. You will also have to decide which cookies to send to which domains, manage expiration dates, and return everything correctly in headers of page requests. PHP/CURL does all this for you, automatically. Even with PHP/CURL, however, cookies pose challenges to webbot designers.
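To give a sense of the bookkeeping involved, the following sketch shows only one of those chores: deciding which stored cookies belong in a given page request. The function cookies_for_request() and the array layout of the cookie store are assumptions made for this illustration, and the matching rules are deliberately simplified (exact host match, simple path prefix). A complete implementation would also handle subdomains and secure-only cookies.

```php
<?php
// Hypothetical helper: decide which stored cookies belong in a request.
// Each stored cookie is an array with the keys
// name, value, domain, path, and expires (0 = temporary cookie).
function cookies_for_request(array $jar, string $url, int $now): array
{
    $host = (string) parse_url($url, PHP_URL_HOST);
    $path = parse_url($url, PHP_URL_PATH) ?: '/';
    $send = [];
    foreach ($jar as $cookie) {
        // Skip permanent cookies whose expiration date has passed
        if ($cookie['expires'] !== 0 && $cookie['expires'] <= $now) {
            continue;
        }
        // Simplification: require an exact (case-insensitive) host match
        if (strcasecmp($host, $cookie['domain']) !== 0) {
            continue;
        }
        // Only send cookies whose path is a prefix of the requested path
        if (strpos($path, $cookie['path']) !== 0) {
            continue;
        }
        $send[$cookie['name']] = $cookie['value'];
    }
    return $send;
}

// A cookie store holding the two cookies from Listing 22-1
$jar = [
    ['name' => 'TemporaryCookie', 'value' => '66', 'domain' => 'www.schrenk.com',
     'path' => '/nostarch/webbots/', 'expires' => 0],
    ['name' => 'PermanentCookie', 'value' => '88', 'domain' => 'www.schrenk.com',
     'path' => '/nostarch/webbots/', 'expires' => 1159120749],
];

$url = 'http://www.schrenk.com/nostarch/webbots/EXAMPLE_writing_cookies.php';

// One second before the permanent cookie expires: both cookies are sent
print_r(cookies_for_request($jar, $url, 1159120748));

// One hour later: only the temporary cookie is sent
print_r(cookies_for_request($jar, $url, 1159120748 + 3600));
```

Even this reduced version needs three separate checks per cookie, which is why letting PHP/CURL manage cookies is usually the simpler design.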
Fortunately, PHP/CURL's cookie support is easy to use, and we can use it to capture the cookies from the previous example, as shown in Listing 22-2.
include("LIB_http.php");
$target="http://www.schrenk.com/nostarch/webbots/EXAMPLE_writing_cookies.php";
http_get($target, "");
Listing 22-2: Reading cookies with PHP/CURL and the LIB_http library
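The internals of http_get() are not shown here, so the following is only a sketch of how the same fetch might look in plain PHP/CURL, assuming the library enables cookie handling through the standard CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR options. The names cookie_options() and fetch_with_cookies(), and the temporary-directory cookie path, are hypothetical choices made for this example.

```php
<?php
// Assumption: a cookie file location chosen for this sketch
$cookieFile = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'cookie.txt';
$target = "http://www.schrenk.com/nostarch/webbots/EXAMPLE_writing_cookies.php";

// Hypothetical helper: the cURL options needed for automatic cookie handling
function cookie_options(string $target, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $target,
        CURLOPT_RETURNTRANSFER => true,        // return the page as a string
        CURLOPT_COOKIEFILE     => $cookieFile, // read previously stored cookies
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies received from the server
    ];
}

// Hypothetical stand-in for http_get(): fetch a page with cookie support
function fetch_with_cookies(string $target, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, cookie_options($target, $cookieFile));
    $page = curl_exec($ch); // false on failure
    unset($ch);             // libcurl writes the cookie jar when the handle is released
    return $page;
}

// Usage (requires network access):
// $page = fetch_with_cookies($target, $cookieFile);
```

Pointing both options at the same file means cookies received on one fetch are sent automatically on the next.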
LIB_http defines the file where PHP/CURL stores cookies. This declaration is done near the beginning of the file, as shown in Listing 22-3.
# Location of your cookie file (must be a fully resolved address)
define("COOKIE_FILE", "c:\cookie.txt");
Listing 22-3: Cookie file declaration, as made in LIB_http
As noted in Listing 22-3, the address for a cookie file should be a fully resolved local one. Relative addresses sometimes work, but not in all PHP/CURL distributions. When you execute the script in Listing 22-2 (available at this book's website), PHP/CURL writes the cookies (in Netscape Cookie Format) to the file defined in the LIB_http configuration, as shown in Listing 22-4.
# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
www.schrenk.com FALSE /nostarch/webbots/ FALSE 1159120749 PermanentCookie 88
www.schrenk.com FALSE /nostarch/webbots/ FALSE 0 TemporaryCookie 66
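Each cookie record in this file holds seven tab-separated fields: domain, a flag telling whether subdomains may read the cookie, path, a secure-connection flag, the expiration time as a Unix timestamp (0 for a temporary cookie), name, and value. If your webbot needs to inspect the cookies PHP/CURL has stored, a sketch like the following can read them back. The function parse_cookie_file() is hypothetical, and the sample text is modeled on the file above, using the value from Listing 22-1 for the temporary cookie.

```php
<?php
// Hypothetical helper: parse the text of a Netscape-format cookie file
function parse_cookie_file(string $contents): array
{
    $cookies = [];
    foreach (preg_split('/\r\n|\n|\r/', $contents) as $line) {
        if (strpos($line, '#HttpOnly_') === 0) {
            // libcurl marks HttpOnly cookies with this prefix; strip it
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue; // blank line or comment
        }
        $fields = explode("\t", $line);
        if (count($fields) !== 7) {
            continue; // not a cookie record
        }
        [$domain, $subdomains, $path, $secure, $expires, $name, $value] = $fields;
        $cookies[] = [
            'domain'     => $domain,
            'subdomains' => ($subdomains === 'TRUE'),
            'path'       => $path,
            'secure'     => ($secure === 'TRUE'),
            'expires'    => (int) $expires, // 0 means a temporary cookie
            'name'       => $name,
            'value'      => $value,
        ];
    }
    return $cookies;
}

// Sample text modeled on the cookie file shown above
$sample = "# Netscape HTTP Cookie File\n"
    . "www.schrenk.com\tFALSE\t/nostarch/webbots/\tFALSE\t1159120749\tPermanentCookie\t88\n"
    . "www.schrenk.com\tFALSE\t/nostarch/webbots/\tFALSE\t0\tTemporaryCookie\t66\n";

foreach (parse_cookie_file($sample) as $cookie) {
    $kind = ($cookie['expires'] === 0) ? 'temporary' : 'permanent';
    printf("%s=%s (%s)\n", $cookie['name'], $cookie['value'], $kind);
}
```

To read a real cookie file, pass the result of file_get_contents(COOKIE_FILE) to the function instead of the sample string.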