Webbots, Spiders, and Screen Scrapers - Michael Schrenk [126]
* * *
[94] You can find a complete set of PHP/CURL options at http://www.php.net/manual/en/function.curl-setopt.php.
[95] Well-known and standard port numbers are defined at http://www.iana.org/assignments/port-numbers.
Executing the PHP/CURL Command
Executing the PHP/CURL command puts into effect all the options defined with the curl_setopt() function. The command in Listing A-18 executes the previously configured session (referenced by $s).
$downloaded_page = curl_exec($s);
Listing A-18: Executing a PHP/CURL command for session $s
You can execute the same command multiple times or use curl_setopt() to change configurations between calls of curl_exec(), as long as the session is defined and hasn't been closed. Typically, I create a new PHP/CURL session for every page I access.
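As a minimal sketch of reusing one session for several downloads, the example below changes only CURLOPT_URL between calls to curl_exec(). The helper name fetch_pages() is hypothetical, and the CURLOPT_RETURNTRANSFER and CURLOPT_TIMEOUT settings are assumptions chosen for illustration, not requirements.

```php
<?php
// Hypothetical helper: download several URLs through a single PHP/CURL session.
function fetch_pages(array $urls): array
{
    $pages = [];
    $s = curl_init();                               // open one session
    curl_setopt($s, CURLOPT_RETURNTRANSFER, true);  // return the page as a string
    curl_setopt($s, CURLOPT_TIMEOUT, 30);           // assumed limit, in seconds

    foreach ($urls as $url) {
        curl_setopt($s, CURLOPT_URL, $url);         // reconfigure between calls
        $pages[$url] = curl_exec($s);               // string on success, false on failure
    }

    curl_close($s);                                 // done with the session
    return $pages;
}
```

Calling fetch_pages(array("http://www.schrenk.com")) would return an array keyed by URL. Each value is either the downloaded page or false if that transfer failed.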
Retrieving PHP/CURL Session Information
Additional information about the current PHP/CURL session is available once curl_exec() has been executed. Listing A-19 shows how to retrieve it with the curl_getinfo() command.
$info_array = curl_getinfo($s);
Listing A-19: Getting additional information about the current PHP/CURL session
The curl_getinfo() command returns an array of information, including connect and transfer times, as shown in Listing A-20.
array(20)
{
["url"]=> string(22) "http://www.schrenk.com"
["content_type"]=> string(29) "text/html; charset=ISO-8859-1"
["http_code"]=> int(200)
["header_size"]=> int(247)
["request_size"]=> int(125)
["filetime"]=> int(-1)
["ssl_verify_result"]=> int(0)
["redirect_count"]=> int(0)
["total_time"]=> float(0.884)
["namelookup_time"]=> float(0)
["connect_time"]=> float(0.079)
["pretransfer_time"]=> float(0.079)
["size_upload"]=> float(0)
["size_download"]=> float(19892)
["speed_download"]=> float(22502.2624434)
["speed_upload"]=> float(0)
["download_content_length"]=> float(0)
["upload_content_length"]=> float(0)
["starttransfer_time"]=> float(0.608)
["redirect_time"]=> float(0)
}
Listing A-20: Data made available by the curl_getinfo() command
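One way a webbot might use this array is to boil it down to a few numbers worth logging. The sketch below is illustrative only: summarize_transfer() is a hypothetical helper, and treating total_time minus starttransfer_time as the time spent receiving the response is an approximation. The sample values are copied from Listing A-20.

```php
<?php
// Hypothetical helper: reduce a curl_getinfo() array to a short summary.
function summarize_transfer(array $info): array
{
    $total = $info['total_time'];
    return [
        // 2xx codes indicate a successful HTTP request
        'ok' => $info['http_code'] >= 200 && $info['http_code'] < 300,
        // seconds until the first byte of the response arrived
        'first_byte_seconds' => $info['starttransfer_time'],
        // approximate seconds spent receiving the rest of the response
        'body_seconds' => round($total - $info['starttransfer_time'], 3),
        // average download rate over the whole transfer
        'bytes_per_second' => $total > 0 ? $info['size_download'] / $total : 0.0,
    ];
}

// Values copied from Listing A-20.
$info_array = [
    'http_code'          => 200,
    'total_time'         => 0.884,
    'starttransfer_time' => 0.608,
    'size_download'      => 19892.0,
];
$summary = summarize_transfer($info_array);
```

With these values, bytes_per_second matches the speed_download entry in Listing A-20, which is size_download divided by total_time. If you only need one value, curl_getinfo() also accepts a second argument such as CURLINFO_HTTP_CODE and returns just that item.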
Viewing PHP/CURL Errors
The curl_error() function returns a text description of the most recent error that occurred during a PHP/CURL session, or an empty string if no error occurred. The usage for this function is shown in Listing A-21.
$errors = curl_error($s);
Listing A-21: Accessing PHP/CURL session errors
A typical error response is shown in Listing A-22.
Couldn't resolve host 'www.webbotworld.com'
Listing A-22: Typical PHP/CURL session error
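In practice, you would usually check for errors right after curl_exec() and before closing the session. The following is a minimal sketch under assumptions: fetch_or_error() is a hypothetical helper, and the ten-second connection timeout is arbitrary. It relies on curl_exec() returning false when a transfer fails, and on curl_errno(), which returns the numeric code that accompanies the curl_error() message.

```php
<?php
// Hypothetical helper: fetch a page and report any PHP/CURL error with it.
function fetch_or_error(string $url): array
{
    $s = curl_init($url);
    curl_setopt($s, CURLOPT_RETURNTRANSFER, true);  // return the page as a string
    curl_setopt($s, CURLOPT_CONNECTTIMEOUT, 10);    // assumed limit, in seconds

    $page  = curl_exec($s);
    $errno = curl_errno($s);   // 0 when no error occurred
    $error = curl_error($s);   // empty string when no error occurred
    curl_close($s);

    if ($page === false) {
        return ['page' => null, 'errno' => $errno, 'error' => $error];
    }
    return ['page' => $page, 'errno' => 0, 'error' => ''];
}
```

Read the error before calling curl_close(), because the error message belongs to the session and is not available once the session is gone.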
Closing PHP/CURL Sessions
You should close a PHP/CURL session immediately after you are done using it, as shown in Listing A-23. Closing the PHP/CURL session frees up server resources, primarily memory.
curl_close($s);
Listing A-23: Closing a PHP/CURL session
In normal use, PHP performs garbage collection, freeing resources like variables, socket connections, and memory when the script completes. This works fine for scripts that control web pages and execute quickly. However, webbots and spiders may require that PHP scripts run for extended periods without garbage collection. (I've written webbot scripts that run for months without stopping.) Closing each PHP/CURL session is imperative if you're writing webbot and spider scripts that make many PHP/CURL connections and run for extended periods of time.
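For a long-running script, one way to make sure every session is closed is to wrap the open and close steps in a single function. This is a sketch, not the only approach: with_session() is a hypothetical helper, and the finally block is there so that the session is closed even if the work in between throws an exception. Note that on PHP 8.0 and later, PHP/CURL handles are objects and curl_close() has no effect. On those versions the handle is released when the last reference to it goes away, which this function's scope also takes care of.

```php
<?php
// Hypothetical helper: open a session, run some work with it, always close it.
function with_session(string $url, callable $work)
{
    $s = curl_init($url);
    if ($s === false) {
        throw new RuntimeException("Could not initialize a PHP/CURL session");
    }
    curl_setopt($s, CURLOPT_RETURNTRANSFER, true);  // return the page as a string

    try {
        return $work($s);   // e.g. call curl_exec($s) and curl_getinfo($s) here
    } finally {
        curl_close($s);     // runs whether $work returns or throws
    }
}
```

A webbot's main loop could then call with_session() once per page, passing a function that executes the session and processes the result, so that no session outlives the page it was created for.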
Appendix B. STATUS CODES
This appendix contains status codes returned by web (HTTP) and news (NNTP) servers. Your webbots and spiders should use these status codes to determine the success or failure of communication with servers. When debugging your scripts, status codes also provide hints as to what's wrong.
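For HTTP, the first digit of a status code identifies its general class, so a webbot can often decide what to do without listing every code. The sketch below shows one way to do that. The function names http_status_class() and request_succeeded() are hypothetical, and the example covers HTTP codes only, not NNTP.

```php
<?php
// Hypothetical helper: map an HTTP status code to its general class.
function http_status_class(int $code): string
{
    if ($code >= 100 && $code < 200) return 'informational';
    if ($code >= 200 && $code < 300) return 'success';
    if ($code >= 300 && $code < 400) return 'redirection';
    if ($code >= 400 && $code < 500) return 'client error';
    if ($code >= 500 && $code < 600) return 'server error';
    return 'unknown';   // not a standard HTTP status code
}

// Hypothetical helper: true only for 2xx responses.
function request_succeeded(int $code): bool
{
    return http_status_class($code) === 'success';
}
```

A script could pass the http_code value from its download result to request_succeeded() and retry, log, or skip the page when the function returns false.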
HTTP Codes
The following is a representative sample of HTTP codes. These codes reflect the status of an HTTP (web page) request. You'll see these codes returned in $returned_web_page['STATUS']['http_code'] if you're using the LIB_http library.
100 Continue
101 Switching Protocols
200 OK
201 Created
202 Accepted
203 Non-Authoritative Information
204 No Content