Webbots, Spiders, and Screen Scrapers - Michael Schrenk [123]
<?php
# Open a PHP/CURL session
$s = curl_init();
# Configure the cURL command
curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com"); // Define target site
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE); // Return in string
# Execute the cURL command (send contents of target web page to string)
$downloaded_page = curl_exec($s);
# Close PHP/CURL session
curl_close($s);
?>
Listing A-1: A minimal PHP/CURL session
The rest of this section details how to initiate sessions, set options, execute commands, and close sessions in PHP/CURL. We'll also look at how PHP/CURL provides transfer status and error messages.
* * *
[93] See http://us2.php.net/manual/en/ref.curl.php.
Initiating PHP/CURL Sessions
Before you use cURL, you must initiate a session with the curl_init() function. Initialization creates a session variable, which identifies configurations and data belonging to a specific session. Notice how the session variable $s, created in Listing A-1, is used to configure, execute, and close the entire PHP/CURL session. Once you create a session, you may use it as many times as you need to.
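The following sketch shows one way to reuse a single session for several downloads. The function name fetch_pages() is hypothetical (it does not appear in the listings above), and the example assumes that only the target URL changes between requests.

```php
<?php
# Hypothetical helper: download several pages with a single PHP/CURL session
function fetch_pages(array $urls)
{
    $pages = array();
    if (count($urls) == 0) {
        return $pages;                                  // Nothing to download
    }

    $s = curl_init();                                   // Open one session
    curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);      // Return in string

    foreach ($urls as $url) {
        curl_setopt($s, CURLOPT_URL, $url);             // Only the target changes
        $pages[$url] = curl_exec($s);                   // FALSE if the transfer fails
    }

    curl_close($s);                                     // Close the session once
    return $pages;
}
```

A call such as fetch_pages(array("http://www.schrenk.com", "http://www.schrenk.com/index.php")) would return an array of downloaded pages, indexed by URL.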
Setting PHP/CURL Options
The PHP/CURL session is configured with the curl_setopt() function. Each individual configuration option is set with a separate call to this function. The script in Listing A-1 is unusual in its brevity. In normal use, there are many calls to curl_setopt(). There are over 90 separate configuration options available within PHP/CURL, making the interface very versatile.[94] The average PHP/CURL user, however, uses only a small subset of the available options. The following sections describe the PHP/CURL options you are most apt to use. While these options are listed here in order of relative importance, you may declare them in any order. If the session is left open, the configuration may be reused many times within the same session.
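As a rough illustration of setting several options in one place, the sketch below keeps the options in an array and applies them with one curl_setopt() call each. The helper name configure_session() is an assumption made for this example. No page is downloaded, because curl_exec() is never called.

```php
<?php
# Hypothetical helper: apply a list of options, one curl_setopt() call each
function configure_session($s, array $options)
{
    foreach ($options as $option => $value) {
        if (!curl_setopt($s, $option, $value)) {
            return FALSE;                   // Stop at the first rejected option
        }
    }
    return TRUE;
}

$s = curl_init();
# The order of the options does not matter
$configured = configure_session($s, array(
    CURLOPT_RETURNTRANSFER => TRUE,                      // Return in string
    CURLOPT_URL            => "http://www.schrenk.com",  // Define target site
));
curl_close($s);
```

PHP also provides curl_setopt_array(), which accepts the same kind of array and sets every option in a single call.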
CURLOPT_URL
Use the CURLOPT_URL option to define the target URL for your PHP/CURL session, as shown in Listing A-2.
curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com/index.php");
Listing A-2: Defining the target URL
You should use a fully formed URL describing the protocol, domain, and file in every PHP/CURL file request.
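If your webbot builds its target URLs from other data, a quick check like the one sketched below can catch URLs that are missing a protocol or domain before they reach PHP/CURL. The function name is_fully_formed_url() is hypothetical, and the check is deliberately loose: it only confirms that PHP's parse_url() finds a protocol and a domain, not that the file exists.

```php
<?php
# Hypothetical helper: does the URL describe both a protocol and a domain?
function is_fully_formed_url($url)
{
    $parts = parse_url($url);               // FALSE on seriously malformed URLs
    return is_array($parts)
        && isset($parts['scheme'])          // Protocol, e.g. http
        && isset($parts['host']);           // Domain, e.g. www.schrenk.com
}
```

With this check, "http://www.schrenk.com/index.php" passes, while "www.schrenk.com/index.php" and "/index.php" are both rejected.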
CURLOPT_RETURNTRANSFER
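The CURLOPT_RETURNTRANSFER option must be set to TRUE, as in Listing A-3, if you want the result to be returned in a string. If you don't set this option to TRUE, PHP/CURL echoes the result directly to the output (the terminal or the browser), and curl_exec() returns only TRUE or FALSE to indicate success or failure.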
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE); // Return in string
Listing A-3: Telling PHP/CURL that you want the result to be returned in a string
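Because curl_exec() returns FALSE when a transfer fails, it is worth testing the returned value before treating it as a web page. The sketch below wraps the steps from Listing A-1 in a hypothetical function, download_page(), which is an assumption made for this example.

```php
<?php
# Hypothetical helper: return the page as a string, or FALSE on failure
function download_page($url)
{
    $s = curl_init();
    curl_setopt($s, CURLOPT_URL, $url);               // Define target site
    curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);    // Return in string
    $result = curl_exec($s);                          // String, or FALSE on failure
    curl_close($s);
    return $result;
}
```

Compare the result with the strict operator (=== FALSE), since an empty page is returned as an empty string, which a loose comparison would also treat as a failure.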
CURLOPT_REFERER
The CURLOPT_REFERER option sets the HTTP Referer header, which allows your webbot to spoof the hyper-reference (link) that was clicked to initiate the request for the target file. The example in Listing A-4 tells the target server that someone clicked a link on http://www.a_domain.com/index.php to request the target web page.
curl_setopt($s, CURLOPT_REFERER, "http://www.a_domain.com/index.php");
Listing A-4: Spoofing a hyper-reference
CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS
The CURLOPT_FOLLOWLOCATION option tells cURL that you want it to follow every page redirection it finds. It's important to understand that PHP/CURL only honors header redirections and not redirections set with a refresh meta tag or with JavaScript, as shown in Listing A-5.
<?php
# Example of redirection that cURL will follow
header("Location: http://www.schrenk.com");
?>