Webbots, Spiders, and Screen Scrapers - Michael Schrenk [124]
Listing A-5: Redirects that cURL can and cannot follow
Any time you use CURLOPT_FOLLOWLOCATION, set CURLOPT_MAXREDIRS to the maximum number of redirections you care to follow. Limiting the number of redirections keeps your webbot out of infinite loops, where redirections point repeatedly to the same URL. My introduction to CURLOPT_MAXREDIRS came while trying to solve a problem brought to my attention by a network administrator, who initially thought that someone (using a webbot I wrote) launched a DoS attack on his server. In reality, the server misinterpreted the webbot's header request as a hacking exploit and redirected the webbot to an error page. There was a bug on the error page that caused it to repeatedly redirect the webbot to the error page, causing an infinite loop (and near-infinite bandwidth usage). The addition of CURLOPT_MAXREDIRS solved the problem, as demonstrated in Listing A-6.
curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE); // Follow header redirections
curl_setopt($s, CURLOPT_MAXREDIRS, 4); // Limit redirections to 4
Listing A-6: Using the CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS options
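The options in Listing A-6 stop the loop, but a webbot may also want to know that the limit was reached. The following is a minimal sketch of one way to check, not a definitive implementation. The function name fetch_with_redirect_limit() and the keys of the returned array are hypothetical, and the sketch assumes the cURL extension is loaded.

```php
<?php
// Hypothetical helper: fetch $url while following at most $max_redirs redirections.
function fetch_with_redirect_limit($url, $max_redirs = 4)
{
    $s = curl_init($url);
    curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);    // Return the page as a string
    curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE);    // Follow header redirections
    curl_setopt($s, CURLOPT_MAXREDIRS, $max_redirs);  // Limit redirections
    curl_setopt($s, CURLOPT_TIMEOUT, 30);             // Don't wait longer than 30 seconds

    $body  = curl_exec($s);
    $errno = curl_errno($s);

    return array(
        'body'               => $body,
        // TRUE when cURL gave up because the redirection limit was exceeded
        'too_many_redirects' => ($errno === CURLE_TOO_MANY_REDIRECTS),
        // Number of redirections cURL actually followed
        'redirect_count'     => curl_getinfo($s, CURLINFO_REDIRECT_COUNT),
        // Last URL cURL requested
        'final_url'          => curl_getinfo($s, CURLINFO_EFFECTIVE_URL),
    );
}
```

A webbot could log final_url whenever too_many_redirects is TRUE, which would point directly at a looping page like the one described above.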
CURLOPT_USERAGENT
Use this option to define the name of your user agent, as shown in Listing A-7. The user agent name is recorded in server access log files and is available to server-side scripts in the $_SERVER['HTTP_USER_AGENT'] variable.
$agent_name = "test_webbot";
curl_setopt($s, CURLOPT_USERAGENT, $agent_name);
Listing A-7: Setting the user agent name
Keep in mind that many websites will not serve pages correctly if your user agent name is something other than a standard web browser.
CURLOPT_NOBODY and CURLOPT_HEADER
These options tell PHP/CURL to return either the web page's header or body. By default, PHP/CURL will always return the body, but not the header. This explains why setting CURLOPT_NOBODY to TRUE excludes the body, and setting CURLOPT_HEADER to TRUE includes the header, as shown in Listing A-8.
curl_setopt($s, CURLOPT_HEADER, TRUE); // Include the header
curl_setopt($s, CURLOPT_NOBODY, TRUE); // Exclude the body
Listing A-8: Using the CURLOPT_HEADER and CURLOPT_NOBODY options
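When CURLOPT_HEADER is TRUE and the body is not excluded, the header and body arrive in a single string. The sketch below shows one way to separate them, assuming the header length is obtained from curl_getinfo() with CURLINFO_HEADER_SIZE. The function name split_header_and_body() is hypothetical.

```php
<?php
// Hypothetical helper: split a raw response (header followed by body)
// at $header_size bytes.
function split_header_and_body($raw, $header_size)
{
    return array(
        'header' => substr($raw, 0, $header_size),
        // The cast keeps the body a string even when nothing follows the header
        'body'   => (string) substr($raw, $header_size),
    );
}

// Typical use with a configured session $s (not executed here):
// curl_setopt($s, CURLOPT_HEADER, TRUE);          // Include the header
// curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);  // Return the page as a string
// $raw   = curl_exec($s);
// $parts = split_header_and_body($raw, curl_getinfo($s, CURLINFO_HEADER_SIZE));
```

Splitting on the reported header size, rather than searching for the first blank line, is a design choice intended to keep the split correct when redirections add more than one header block to the response.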
CURLOPT_TIMEOUT
If you don't limit how long PHP/CURL waits for a response from a server, it may wait forever—especially if the file you're fetching is on a busy server or you're trying to connect to a nonexistent or inactive IP address. (The latter happens frequently when a spider follows dead links on a website.) Setting a time-out value, as shown in Listing A-9, causes PHP/CURL to end the session if the download takes longer than the time-out value (in seconds).
curl_setopt($s, CURLOPT_TIMEOUT, 30); // Don't wait longer than 30 seconds
Listing A-9: Setting a socket time-out value
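CURLOPT_TIMEOUT limits the whole transfer. A webbot that follows dead links may also want a shorter limit on the connection attempt itself, and a way to tell a time-out apart from other failures. The following is a rough sketch under those assumptions. The function names set_timeouts() and timed_out() and the default values are hypothetical.

```php
<?php
// Hypothetical helper: apply separate connection and total time limits (in seconds).
function set_timeouts($s, $connect_seconds = 10, $total_seconds = 30)
{
    curl_setopt($s, CURLOPT_CONNECTTIMEOUT, $connect_seconds); // Limit time spent connecting
    curl_setopt($s, CURLOPT_TIMEOUT, $total_seconds);          // Limit time for the whole transfer
}

// Hypothetical helper: TRUE if the last transfer on $s ended
// because one of the time limits was reached.
function timed_out($s)
{
    return curl_errno($s) === CURLE_OPERATION_TIMEDOUT;
}

// Typical use (not executed here):
// $s = curl_init("http://www.example.com");
// set_timeouts($s);
// $page = curl_exec($s);
// if (timed_out($s)) { /* mark the link as dead or retry later */ }
```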
CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR
One of the slickest features of PHP/CURL is the ability to manage cookies sent to and received from a website. Use the CURLOPT_COOKIEFILE option to define the file where previously stored cookies exist. At the end of the session, PHP/CURL writes new cookies to the file indicated by CURLOPT_COOKIEJAR. An example is in Listing A-10; I have never seen an application where these two options don't reference the same file.
curl_setopt($s, CURLOPT_COOKIEFILE, 'c:\bots\cookies.txt'); // Read cookie file
curl_setopt($s, CURLOPT_COOKIEJAR, 'c:\bots\cookies.txt'); // Write cookie file
Listing A-10: Telling PHP/CURL where to read and write cookies
When specifying the location of a cookie file, always use the file's complete (absolute) path, and do not use relative paths. More information about managing cookies is available in Chapter 22.
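One way to get a complete path without hard-coding a drive and directory is to build it from the location of the running script. This is only a sketch: the function name cookie_file_path() and the default file name are hypothetical, and it assumes the cookie file should live beside the script.

```php
<?php
// Hypothetical helper: build an absolute path for a cookie file
// stored in the same directory as this script.
function cookie_file_path($file_name = 'cookies.txt')
{
    return __DIR__ . DIRECTORY_SEPARATOR . $file_name;
}

$cookie_file = cookie_file_path();

// With a configured session $s (not executed here):
// curl_setopt($s, CURLOPT_COOKIEFILE, $cookie_file); // Read cookie file
// curl_setopt($s, CURLOPT_COOKIEJAR, $cookie_file);  // Write cookie file
```

Because both options receive the same variable, the read and write locations cannot drift apart.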
CURLOPT_HTTPHEADER
The CURLOPT_HTTPHEADER configuration allows a cURL session to send an outgoing header message to the server. The script in Listing A-11 uses this option to tell the target server the MIME type it accepts, the content type it expects, and that the user agent is capable of decompressing compressed web responses.
Note that