
Webbots, Spiders, and Screen Scrapers - Michael Schrenk [126]

because the option names are predefined PHP constants. Therefore, your code will fail if you specify an option as curlopt_port instead of CURLOPT_PORT.
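The case sensitivity can be checked directly. The following is a minimal sketch that assumes the PHP/CURL extension is loaded; the variable names are made up for the illustration.

```php
<?php
// Assumes the PHP/CURL extension is loaded; variable names are illustrative.
$uppercase_defined = defined('CURLOPT_PORT');  // the predefined constant
$lowercase_defined = defined('curlopt_port');  // not a constant at all

var_dump($uppercase_defined);  // bool(true)
var_dump($lowercase_defined);  // bool(false)
```

In PHP 8, referring to an undefined constant such as curlopt_port throws an Error, so the mistake surfaces as soon as that line runs.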

* * *

[94] You can find a complete set of PHP/CURL options at http://www.php.net/manual/en/function.curl-setopt.php.

[95] Well-known and standard port numbers are defined at http://www.iana.org/assignments/port-numbers.

Executing the PHP/CURL Command

Calling curl_exec() puts into action all the options defined with the curl_setopt() function. It executes the previously configured session (referenced by $s in Listing A-18).

$downloaded_page = curl_exec($s);

Listing A-18: Executing a PHP/CURL command for session $s

You can execute the same command multiple times or use curl_setopt() to change configurations between calls of curl_exec(), as long as the session is defined and hasn't been closed. Typically, I create a new PHP/CURL session for every page I access.
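As a rough sketch of reusing one session, the example below changes CURLOPT_URL between calls to curl_exec(). The helper name fetch_with_one_session() is hypothetical, and the demonstration assumes a Unix-like system where libcurl's file:// protocol is enabled, so that temporary files can stand in for web pages.

```php
<?php
// Hypothetical helper (not from the book): fetches several URLs with ONE session.
function fetch_with_one_session(array $urls): array
{
    $s = curl_init();
    curl_setopt($s, CURLOPT_RETURNTRANSFER, true);  // return the page as a string

    $pages = [];
    foreach ($urls as $url) {
        curl_setopt($s, CURLOPT_URL, $url);  // change the configuration between calls
        $pages[$url] = curl_exec($s);        // string on success, false on failure
    }

    if (PHP_VERSION_ID < 80000) {
        curl_close($s);  // has no effect from PHP 8.0 on, so it is skipped there
    }
    return $pages;
}

// Demonstration without network access: two temporary files stand in for pages.
// Assumption: Unix-like paths and a libcurl build with file:// enabled.
$first  = tempnam(sys_get_temp_dir(), 'bot');
$second = tempnam(sys_get_temp_dir(), 'bot');
file_put_contents($first, 'page one');
file_put_contents($second, 'page two');

$pages = fetch_with_one_session(['file://' . $first, 'file://' . $second]);
var_dump($pages);

unlink($first);
unlink($second);
```

Reusing the session saves the setup cost of curl_init(), but any option set for one request stays in force for the next unless it is explicitly changed.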

Retrieving PHP/CURL Session Information

Additional information about the current PHP/CURL session is available once a curl_exec() command is executed. Listing A-19 shows how to retrieve it with curl_getinfo().

$info_array = curl_getinfo($s);

Listing A-19: Getting additional information about the current PHP/CURL session

The curl_getinfo() command returns an array of information, including connect and transfer times, as shown in Listing A-20.

array(20)
{
  ["url"]=> string(22) "http://www.schrenk.com"
  ["content_type"]=> string(29) "text/html; charset=ISO-8859-1"
  ["http_code"]=> int(200)
  ["header_size"]=> int(247)
  ["request_size"]=> int(125)
  ["filetime"]=> int(-1)
  ["ssl_verify_result"]=> int(0)
  ["redirect_count"]=> int(0)
  ["total_time"]=> float(0.884)
  ["namelookup_time"]=> float(0)
  ["connect_time"]=> float(0.079)
  ["pretransfer_time"]=> float(0.079)
  ["size_upload"]=> float(0)
  ["size_download"]=> float(19892)
  ["speed_download"]=> float(22502.2624434)
  ["speed_upload"]=> float(0)
  ["download_content_length"]=> float(0)
  ["upload_content_length"]=> float(0)
  ["starttransfer_time"]=> float(0.608)
  ["redirect_time"]=> float(0)
}

Listing A-20: Data made available by the curl_getinfo() command
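One possible use of this array is a one-line transfer summary for a log. The sketch below is illustrative only: summarize_transfer() is a hypothetical helper, and the sample values are copied from Listing A-20 rather than taken from a live request.

```php
<?php
// Hypothetical helper: condenses the array returned by curl_getinfo().
function summarize_transfer(array $info): string
{
    // Treat any 2xx status as success.
    $ok = $info['http_code'] >= 200 && $info['http_code'] < 300;

    // Time between the request being ready to send and the first byte arriving.
    $wait = $info['starttransfer_time'] - $info['pretransfer_time'];

    return sprintf(
        '%s %d, %d bytes in %.3f s (%.3f s waiting for first byte)',
        $ok ? 'OK' : 'FAILED',
        $info['http_code'],
        $info['size_download'],
        $info['total_time'],
        $wait
    );
}

// Subset of the values shown in Listing A-20.
$info_array = [
    'http_code'          => 200,
    'total_time'         => 0.884,
    'pretransfer_time'   => 0.079,
    'starttransfer_time' => 0.608,
    'size_download'      => 19892.0,
];

echo summarize_transfer($info_array), PHP_EOL;
// prints: OK 200, 19892 bytes in 0.884 s (0.529 s waiting for first byte)
```

When only one value is needed, curl_getinfo() also accepts a second argument such as CURLINFO_HTTP_CODE and returns just that value.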

Viewing PHP/CURL Errors

The curl_error() function returns any errors that may have occurred during a PHP/CURL session. The usage for this function is shown in Listing A-21.

$errors = curl_error($s);

Listing A-21: Accessing PHP/CURL session errors

A typical error response is shown in Listing A-22.

Couldn't resolve host 'www.webbotworld.com'

Listing A-22: Typical PHP/CURL session error
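A common pattern, sketched here under assumptions, is to check the return value of curl_exec() before trusting the download and to read curl_error() only when the call failed. The helper name fetch_or_report() is hypothetical. The demonstration URL deliberately uses a made-up scheme so that the failure occurs without any network access.

```php
<?php
// Hypothetical helper: returns the page, or the PHP/CURL error that prevented it.
function fetch_or_report(string $url): array
{
    $s = curl_init();
    curl_setopt($s, CURLOPT_URL, $url);
    curl_setopt($s, CURLOPT_RETURNTRANSFER, true);

    $body = curl_exec($s);

    // Compare with === because an empty page ("") is a valid, successful result.
    if ($body === false) {
        $result = [
            'ok'    => false,
            'errno' => curl_errno($s),  // numeric error code
            'error' => curl_error($s),  // human-readable message
        ];
    } else {
        $result = ['ok' => true, 'errno' => 0, 'error' => '', 'body' => $body];
    }

    if (PHP_VERSION_ID < 80000) {
        curl_close($s);  // has no effect from PHP 8.0 on, so it is skipped there
    }
    return $result;
}

// "nosuchscheme" is a made-up protocol, so the request fails immediately.
$result = fetch_or_report('nosuchscheme://www.example.com');
if (!$result['ok']) {
    echo 'Download failed: ', $result['error'], PHP_EOL;
}
```

The exact wording of the error message depends on the installed libcurl version, so scripts that branch on the failure are safer comparing the value of curl_errno() than matching the text.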

Closing PHP/CURL Sessions

You should close a PHP/CURL session immediately after you are done using it, as shown in Listing A-23. Closing the PHP/CURL session frees up server resources, primarily memory.

curl_close($s);

Listing A-23: Closing a PHP/CURL session

In normal use, PHP performs garbage collection, freeing resources like variables, socket connections, and memory when the script completes. This works fine for scripts that control web pages and execute quickly. However, webbots and spiders may require that PHP scripts run for extended periods without garbage collection. (I've written webbot scripts that run for months without stopping.) Closing each PHP/CURL session is imperative if you're writing webbot and spider scripts that make many PHP/CURL connections and run for extended periods of time.
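The one-session-per-page habit described above might look like the following sketch. The function name fetch_page() is hypothetical, and the script reads its URLs from the command line purely for illustration. Note that as of PHP 8.0, PHP/CURL handles are objects and curl_close() no longer has any effect, so the sketch calls it only on older versions and otherwise releases the handle by unsetting it.

```php
<?php
// Hypothetical helper: one short-lived PHP/CURL session per page.
function fetch_page(string $url)
{
    $s = curl_init();
    try {
        curl_setopt($s, CURLOPT_URL, $url);
        curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
        return curl_exec($s);  // string on success, false on failure
    } finally {
        // Runs whether the download succeeded, failed, or threw an exception.
        if (PHP_VERSION_ID < 80000) {
            curl_close($s);  // frees the session on PHP 7
        }
        unset($s);  // on PHP 8+, dropping the last reference frees the handle
    }
}

// URLs come from the command line; with no arguments the loop does nothing.
foreach (array_slice($argv ?? [], 1) as $url) {
    $page = fetch_page($url);
    echo $url, ': ', $page === false ? 'failed' : strlen($page) . ' bytes', PHP_EOL;
}
```

Because the session lives only inside fetch_page(), a script that loops for months never holds more than one open session at a time.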

Appendix B. STATUS CODES

This appendix contains status codes returned by web (HTTP) and news (NNTP) servers. Your webbots and spiders should use these status codes to determine whether communication with a server succeeded or failed. When debugging your scripts, status codes also provide hints as to what's wrong.
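For HTTP, the first digit of a status code identifies its class, which is often all a webbot needs to decide what to do next. The sketch below illustrates that idea; http_status_class() is a hypothetical helper, and it covers HTTP codes only, not NNTP.

```php
<?php
// Hypothetical helper: maps an HTTP status code to its class by first digit.
function http_status_class(int $code): string
{
    switch (intdiv($code, 100)) {
        case 1:
            return 'informational';
        case 2:
            return 'success';
        case 3:
            return 'redirection';
        case 4:
            return 'client error';
        case 5:
            return 'server error';
        default:
            return 'unknown';
    }
}

foreach ([100, 204, 404] as $code) {
    echo $code, ' ', http_status_class($code), PHP_EOL;
}
// prints:
// 100 informational
// 204 success
// 404 client error
```

A webbot might, for example, parse a page only when the class is 'success' and log everything else for later review.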

HTTP Codes

The following is a representative sample of HTTP codes. These codes reflect the status of an HTTP (web page) request. You'll see these codes returned in $returned_web_page['STATUS']['http_code'] if you're using the LIB_http library.

100 Continue

101 Switching Protocols

200 OK

201 Created

202 Accepted

203 Non-Authoritative Information

204 No Content
