Webbots, Spiders, and Screen Scrapers - Michael Schrenk [18]

in an HTTP header is the HTTP code, which indicates the status of the request. An HTTP code of 200 tells us that the request was successful. The HTTP code also appears in the status array element.[13]

FILE CONTENTS
string(215) "HTTP/1.1 200 OK
Date: Sat, 08 Oct 2008 16:38:51 GMT
Server: Apache/2.0.53 (FreeBSD) mod_ssl/2.0.53 OpenSSL/0.9.7g PHP/4.4.0
X-Powered-By: PHP/4.4.0
Content-Type: text/html; charset=ISO-8859-1

ERRORS
string(0) ""

STATUS
array(20) {
  ["url"]=>
  string(39) "http://www.schrenk.com/publications.php"
  ["content_type"]=>
  string(29) "text/html; charset=ISO-8859-1"
  ["http_code"]=>
  int(200)
  ["header_size"]=>
  int(215)
  ["request_size"]=>
  int(200)
  ["filetime"]=>
  int(-1)
  ["ssl_verify_result"]=>
  int(0)
  ["redirect_count"]=>
  int(0)
  ["total_time"]=>
  float(0.683)
  ["namelookup_time"]=>
  float(0.005)
  ["connect_time"]=>
  float(0.101)
  ["pretransfer_time"]=>
  float(0.101)
  ["size_upload"]=>
  float(0)
  ["size_download"]=>
  float(0)
  ["speed_download"]=>
  float(0)
  ["speed_upload"]=>
  float(0)
  ["download_content_length"]=>
  float(0)
  ["upload_content_length"]=>
  float(0)
  ["starttransfer_time"]=>
  float(0.683)
  ["redirect_time"]=>
  float(0)
}

Listing 3-7: File contents, errors, and the download status array returned by LIB_http

The information returned in $array['STATUS'] is extraordinarily useful for learning how the fetch was conducted. The array includes values for download speed, access times, and file sizes, all of which are valuable when writing diagnostic webbots that monitor the performance of a website.
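As a rough illustration of that diagnostic use, the sketch below reduces a status array to a few timing figures. It is not part of LIB_http: summarize_status() is a hypothetical helper, and the sample values are copied from Listing 3-7. It relies on the fact that the cURL timing fields are cumulative, each measured from the start of the request.

```php
<?php
// Hypothetical helper (not part of LIB_http): reduce a cURL status array
// to a few diagnostic figures, with times converted to milliseconds.
function summarize_status(array $status): array
{
    $ms = function (float $seconds): float {
        return round($seconds * 1000, 1);
    };

    return [
        // Any 2xx code counts as a successful request.
        'ok'         => $status['http_code'] >= 200 && $status['http_code'] < 300,
        // cURL timings are cumulative, so differences give per-phase durations.
        'dns_ms'     => $ms($status['namelookup_time']),
        'connect_ms' => $ms($status['connect_time'] - $status['namelookup_time']),
        'wait_ms'    => $ms($status['starttransfer_time'] - $status['pretransfer_time']),
        'total_ms'   => $ms($status['total_time']),
    ];
}

// Sample values copied from the STATUS array in Listing 3-7.
$status = [
    'http_code'          => 200,
    'total_time'         => 0.683,
    'namelookup_time'    => 0.005,
    'connect_time'       => 0.101,
    'pretransfer_time'   => 0.101,
    'starttransfer_time' => 0.683,
];

$summary = summarize_status($status);

printf(
    "ok=%s dns=%.1f ms connect=%.1f ms wait=%.1f ms total=%.1f ms\n",
    $summary['ok'] ? 'yes' : 'no',
    $summary['dns_ms'],
    $summary['connect_ms'],
    $summary['wait_ms'],
    $summary['total_ms']
);
```

A monitoring webbot could log these figures on each fetch and flag a site whose wait or total time drifts upward.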

Learning More About HTTP Headers

When a Content-Type line appears in an HTTP header, it defines the MIME type, or media type, of the file sent from the server. The MIME type tells the web agent what to do with the file. For example, the Content-Type in the previous example was text/html, which indicates that the file is a web page. Knowing whether a downloaded file is an image or an HTML file tells a browser whether to display the file as text or render it as an image. The HTTP header information for a JPEG image is shown in Listing 3-8.

HTTP/1.1 200 OK
Date: Mon, 23 Mar 2009 00:06:13 GMT
Server: Apache/1.3.12 (Unix) mod_throttle/3.1.2 tomcat/1.0 PHP/4.0.3pl1
Last-Modified: Wed, 23 Jul 2008 18:03:29 GMT
ETag: "74db-9063-3d3eebf1"
Accept-Ranges: bytes
Content-Length: 36963
Content-Type: image/jpeg

Listing 3-8: An HTTP header for an image file request
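A webbot can make the same decision a browser makes by reading the Content-Type field itself. The following is a minimal sketch rather than LIB_http code: parse_http_header() and media_type() are hypothetical helpers, and the sample header is the one shown in Listing 3-8.

```php
<?php
// Hypothetical helper (not part of LIB_http): split a raw HTTP response
// header into a status code and a map of lowercase field names to values.
function parse_http_header(string $raw): array
{
    $lines  = preg_split('/\r\n|\n|\r/', trim($raw));
    $code   = null;
    $fields = [];

    foreach ($lines as $i => $line) {
        // The first line is the status line, e.g. "HTTP/1.1 200 OK".
        if ($i === 0 && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
            continue;
        }
        // Every other line is "Name: value"; split on the first colon only.
        $pos = strpos($line, ':');
        if ($pos === false) {
            continue;
        }
        $name          = strtolower(trim(substr($line, 0, $pos)));
        $fields[$name] = trim(substr($line, $pos + 1));
    }

    return ['code' => $code, 'fields' => $fields];
}

// Return the bare media type, dropping parameters such as "; charset=...".
function media_type(array $fields): string
{
    $value = $fields['content-type'] ?? '';
    return strtolower(trim(explode(';', $value)[0]));
}

// Header text copied from Listing 3-8.
$raw = <<<'EOT'
HTTP/1.1 200 OK
Date: Mon, 23 Mar 2009 00:06:13 GMT
Server: Apache/1.3.12 (Unix) mod_throttle/3.1.2 tomcat/1.0 PHP/4.0.3pl1
Last-Modified: Wed, 23 Jul 2008 18:03:29 GMT
ETag: "74db-9063-3d3eebf1"
Accept-Ranges: bytes
Content-Length: 36963
Content-Type: image/jpeg
EOT;

$header = parse_http_header($raw);
$type   = media_type($header['fields']);

if (strpos($type, 'image/') === 0) {
    echo "Treat the download as an image ($type)\n";
} elseif ($type === 'text/html') {
    echo "Treat the download as a web page\n";
} else {
    echo "Unhandled media type: $type\n";
}
```

Field names are lowercased before lookup because HTTP header names are case-insensitive.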

Examining LIB_http's Source Code

Most webbots in this book will use the library LIB_http to download pages from the Internet. If you plan to explore any of the webbot examples that appear later in this book, you should obtain a copy of this library; the latest version is available for download at this book's website. We'll explore some of the defaults and functions of LIB_http here.

LIB_http Defaults

At the very beginning of the library is a set of defaults, as shown in Listing 3-9.

define("WEBBOT_NAME", "Test Webbot");    # How your webbot will appear in server logs
define("CURL_TIMEOUT", 25);              # Time (seconds) to wait for network response
define("COOKIE_FILE", "c:\cookie.txt");  # Location of cookie file

Listing 3-9: LIB_http defaults
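To make the role of these defaults concrete, here is one plausible way constants like these could be handed to PHP/CURL. This is an assumption for illustration, not a copy of LIB_http's source: default_curl_options() is a hypothetical function, and the sketch assumes the cURL extension is loaded so that the CURLOPT_ constants exist.

```php
<?php
// Same values as Listing 3-9, repeated so the sketch is self-contained.
define('WEBBOT_NAME', 'Test Webbot');
define('CURL_TIMEOUT', 25);
define('COOKIE_FILE', 'c:\cookie.txt');

// Hypothetical helper (not LIB_http's code): build a cURL option array
// from the defaults above for a given target URL and referer.
function default_curl_options(string $target, string $ref = ''): array
{
    return [
        CURLOPT_URL            => $target,
        CURLOPT_REFERER        => $ref,
        CURLOPT_USERAGENT      => WEBBOT_NAME,   // name recorded in server logs
        CURLOPT_TIMEOUT        => CURL_TIMEOUT,  // seconds before giving up
        CURLOPT_COOKIEJAR      => COOKIE_FILE,   // where cookies are written
        CURLOPT_COOKIEFILE     => COOKIE_FILE,   // where cookies are read from
        CURLOPT_RETURNTRANSFER => true,          // return the page as a string
    ];
}

$options = default_curl_options('http://www.schrenk.com/publications.php');

// To perform the request, the options would be applied to a handle:
//   $ch = curl_init();
//   curl_setopt_array($ch, $options);
//   $page = curl_exec($ch);
```

Keeping the agent name, timeout, and cookie location in one place means every request made through the library behaves consistently, and a single edit changes them all.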

LIB_http Functions

The functions shown in Listing 3-10 are available within LIB_http. All of these functions return the array defined earlier, containing downloaded files, error messages, and the status of the file transfer.

http_get($target, $ref)                               # Simple get request (w/o header)
http_get_withheader($target, $ref)                    # Simple get request (w/ header)
http_get_form($target, $ref, $data_array)             # Form (method="GET", w/o header)
http_get_form_withheader($target, $ref, $data_array)  # Form (method="GET", w/ header)
http_post_form($target, $ref, $data_array)            # Form (method="POST", w/o header)
http_post_withheader($target, $ref, $data_array)      # Form (method="POST", w/ header)
http_header($target, $ref)                            # Only returns header

Listing 3-10: LIB_http functions
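The form functions differ in where $data_array ends up: a GET form carries its fields in the URL's query string, while a POST form carries them in the request body. The sketch below shows the GET case only. It illustrates the encoding and is not LIB_http's implementation; build_get_url() is a hypothetical helper, and the example.com URL and field names are made up.

```php
<?php
// Hypothetical helper (not part of LIB_http): append form fields to a
// target URL the way a method="GET" form submission does.
function build_get_url(string $target, array $data_array): string
{
    if (count($data_array) === 0) {
        return $target;
    }

    // URL-encode each name/value pair and join the pairs with "&".
    $query = http_build_query($data_array, '', '&');

    // Use "?" for the first parameter, "&" if the URL already has a query.
    $separator = (strpos($target, '?') === false) ? '?' : '&';

    return $target . $separator . $query;
}

// Made-up form data for illustration.
$data_array = ['term' => 'web bots', 'page' => 2];
$url        = build_get_url('http://www.example.com/search.php', $data_array);

echo $url, "\n";
```

Because GET data is part of the URL, it shows up in server logs and browser history, which is one reason forms that send sensitive data normally use POST.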

* * *

[13] A complete list of HTTP codes can be found in Appendix B.

Final Thoughts
