Webbots, Spiders, and Screen Scrapers - Michael Schrenk [18]
FILE CONTENTS
string(215) "HTTP/1.1 200 OK
Date: Sat, 08 Oct 2008 16:38:51 GMT
Server: Apache/2.0.53 (FreeBSD) mod_ssl/2.0.53 OpenSSL/0.9.7g PHP/4.4.0
X-Powered-By: PHP/4.4.0
Content-Type: text/html; charset=ISO-8859-1
"
ERRORS
string(0) ""
STATUS
array(20) {
["url"]=>
string(39) "http://www.schrenk.com/publications.php"
["content_type"]=>
string(29) "text/html; charset=ISO-8859-1"
["http_code"]=>
int(200)
["header_size"]=>
int(215)
["request_size"]=>
int(200)
["filetime"]=>
int(-1)
["ssl_verify_result"]=>
int(0)
["redirect_count"]=>
int(0)
["total_time"]=>
float(0.683)
["namelookup_time"]=>
float(0.005)
["connect_time"]=>
float(0.101)
["pretransfer_time"]=>
float(0.101)
["size_upload"]=>
float(0)
["size_download"]=>
float(0)
["speed_download"]=>
float(0)
["speed_upload"]=>
float(0)
["download_content_length"]=>
float(0)
["upload_content_length"]=>
float(0)
["starttransfer_time"]=>
float(0.683)
["redirect_time"]=>
float(0)
}
Listing 3-7: File contents, errors, and the download status array returned by LIB_http
The information returned in $array['STATUS'] is extraordinarily useful for learning how the fetch was conducted. Included in this array are values for download speed, access times, and file sizes—all valuable when writing diagnostic webbots that monitor the performance of a website.
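As a minimal sketch of how a diagnostic webbot might use these values, the function below condenses a status array like the one in Listing 3-7 into a one-line report. The function name summarize_status() is hypothetical and is not part of LIB_http; the array keys it reads are the ones shown in the listing.

```php
<?php
// Hypothetical helper (not part of LIB_http): condense a status array
// into a single line suitable for a log file.
function summarize_status(array $status)
{
    // Seconds that elapsed after the connection was established.
    $after_connect = $status['total_time'] - $status['connect_time'];

    return sprintf(
        "%s -> HTTP %d, total %.3fs (lookup %.3fs, connect %.3fs, after connect %.3fs)",
        $status['url'],
        $status['http_code'],
        $status['total_time'],
        $status['namelookup_time'],
        $status['connect_time'],
        $after_connect
    );
}

// Values copied from Listing 3-7.
$status = array(
    'url'             => 'http://www.schrenk.com/publications.php',
    'http_code'       => 200,
    'total_time'      => 0.683,
    'namelookup_time' => 0.005,
    'connect_time'    => 0.101,
);

echo summarize_status($status), "\n";
```

Logging a line like this on every fetch would give a simple record of how a site's response times change over time.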
Learning More About HTTP Headers
When a Content-Type line appears in an HTTP header, it defines the MIME type, or media type, of the file sent from the server. The MIME type tells the web agent what to do with the file. For example, the Content-Type in the previous example was text/html, which indicates that the file is a web page. Knowing whether a downloaded file is an image or an HTML file tells a browser whether to display it as text or render it as an image. The HTTP header for a JPEG image, for instance, is shown in Listing 3-8.
HTTP/1.1 200 OK
Date: Mon, 23 Mar 2009 00:06:13 GMT
Server: Apache/1.3.12 (Unix) mod_throttle/3.1.2 tomcat/1.0 PHP/4.0.3pl1
Last-Modified: Wed, 23 Jul 2008 18:03:29 GMT
ETag: "74db-9063-3d3eebf1"
Accept-Ranges: bytes
Content-Length: 36963
Content-Type: image/jpeg
Listing 3-8: An HTTP header for an image file request
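A webbot can make the same decision a browser does by reading the Content-Type line out of the header. The following is a minimal sketch, assuming the header arrives as a single string; parse_header_block() and media_type() are hypothetical helper names, not functions from LIB_http.

```php
<?php
// Hypothetical helper: turn a raw HTTP header block into an array
// keyed by lowercase header name. The status line is skipped.
function parse_header_block($raw)
{
    $headers = array();
    foreach (preg_split('/\r\n|\n|\r/', trim($raw)) as $line) {
        if (strpos($line, 'HTTP/') === 0) {
            continue;                       // Status line, e.g. "HTTP/1.1 200 OK"
        }
        $parts = explode(':', $line, 2);    // Split on the first colon only
        if (count($parts) === 2) {
            $headers[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $headers;
}

// Hypothetical helper: return the media type without parameters such
// as "; charset=...", or null if no Content-Type header is present.
function media_type(array $headers)
{
    if (!isset($headers['content-type'])) {
        return null;
    }
    $pieces = explode(';', $headers['content-type']);
    return strtolower(trim($pieces[0]));
}

// Header lines taken from Listing 3-8.
$raw = "HTTP/1.1 200 OK\r\n"
     . "Date: Mon, 23 Mar 2009 00:06:13 GMT\r\n"
     . "Content-Length: 36963\r\n"
     . "Content-Type: image/jpeg\r\n";

$headers = parse_header_block($raw);
$type    = media_type($headers);

if (strpos((string) $type, 'image/') === 0) {
    echo "Image file ($type)\n";
} else {
    echo "Not an image ($type)\n";
}
```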
Examining LIB_http's Source Code
Most webbots in this book will use the library LIB_http to download pages from the Internet. If you plan to explore any of the webbot examples that appear later in this book, you should obtain a copy of this library; the latest version is available for download at this book's website. We'll explore some of the defaults and functions of LIB_http here.
LIB_http Defaults
At the very beginning of the library is a set of defaults, as shown in Listing 3-9.
define("WEBBOT_NAME", "Test Webbot");     # How your webbot will appear in server logs
define("CURL_TIMEOUT", 25);               # Time (seconds) to wait for network response
define("COOKIE_FILE", "c:\cookie.txt");   # Location of cookie file
Listing 3-9: LIB_http defaults
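One plausible way to apply constants like these is as cURL options. The sketch below assumes that WEBBOT_NAME supplies the user agent string, CURL_TIMEOUT the transfer timeout, and COOKIE_FILE the file used for both reading and writing cookies. The function default_curl_options() is hypothetical, and the mapping is an assumption for illustration rather than a quotation of the library's source.

```php
<?php
define("WEBBOT_NAME", "Test Webbot");     // Name reported to the server
define("CURL_TIMEOUT", 25);               // Seconds to wait for a response
define("COOKIE_FILE", "c:\cookie.txt");   // Location of cookie file

// Hypothetical helper: collect the defaults as cURL options.
// Requires the cURL extension, which defines the CURLOPT_* constants.
function default_curl_options()
{
    return array(
        CURLOPT_USERAGENT  => WEBBOT_NAME,   // Assumed: name shown in server logs
        CURLOPT_TIMEOUT    => CURL_TIMEOUT,  // Assumed: overall transfer timeout
        CURLOPT_COOKIEFILE => COOKIE_FILE,   // Assumed: file cookies are read from
        CURLOPT_COOKIEJAR  => COOKIE_FILE,   // Assumed: file cookies are written to
    );
}

if (extension_loaded('curl')) {
    $ch = curl_init();
    $ok = curl_setopt_array($ch, default_curl_options());
    echo $ok ? "Defaults applied\n" : "Could not apply defaults\n";
} else {
    echo "cURL extension not loaded\n";
}
```

Keeping the defaults in one place means a single edit changes how every webbot built on the library identifies itself and how long it waits.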
LIB_http Functions
The functions shown in Listing 3-10 are available within LIB_http. All of these functions return the array defined earlier, containing downloaded files, error messages, and the status of the file transfer.
http_get($target, $ref)                                # Simple get request (w/o header)
http_get_withheader($target, $ref)                     # Simple get request (w/ header)
http_get_form($target, $ref, $data_array)              # Form (method="GET", w/o header)
http_get_form_withheader($target, $ref, $data_array)   # Form (method="GET", w/ header)
http_post_form($target, $ref, $data_array)             # Form (method="POST", w/o header)
http_post_withheader($target, $ref, $data_array)       # Form (method="POST", w/ header)
http_header($target, $ref)                             # Only returns header
Listing 3-10: LIB_http functions
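To make the shape of that returned array concrete, here is a rough sketch of a fetch function that fills in the same three parts using PHP's cURL extension, along with a small function that consumes the result. This is not LIB_http's source: fetch_sketch() and describe_result() are hypothetical names, and the 'FILE' and 'ERROR' keys are assumed by analogy with the 'STATUS' key used earlier.

```php
<?php
// Sketch only, not LIB_http's implementation. Requires the cURL extension.
// Returns an array holding the downloaded file, the transfer status,
// and any error message.
function fetch_sketch($target, $ref)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target);
    curl_setopt($ch, CURLOPT_REFERER, $ref);          // Referer header to send
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // Return the body as a string

    $result = array();
    $result['FILE']   = curl_exec($ch);      // Body, or false on failure
    $result['STATUS'] = curl_getinfo($ch);   // Array like the one in Listing 3-7
    $result['ERROR']  = curl_error($ch);     // Empty string if no error
    return $result;
}

// Hypothetical helper: summarize a result array of the shape above.
function describe_result(array $page)
{
    if ($page['ERROR'] !== '') {
        return "Transfer failed: " . $page['ERROR'];
    }
    return sprintf(
        "HTTP %d, %d bytes received",
        $page['STATUS']['http_code'],
        strlen($page['FILE'])
    );
}

// A hand-built result, so the example runs without network access.
$page = array(
    'FILE'   => "<html></html>",
    'STATUS' => array('http_code' => 200),
    'ERROR'  => '',
);

echo describe_result($page), "\n";
```

With network access, the hand-built array could be replaced by a call such as fetch_sketch("http://www.schrenk.com/publications.php", ""), passing an empty referer.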
* * *
[13] A complete list of HTTP codes can be found in Appendix B.
Final Thoughts