Webbots, Spiders, and Screen Scrapers - Michael Schrenk [17]
Using LIB_http
The LIB_http library provides a set of wrapper functions that simplify complicated PHP/CURL interfaces. Each wrapper calls a common routine, http(), which performs the specified task using the values the wrapper passes to it. All functions in LIB_http share a similar format: a target URL and a referring URL are passed in, and an array is returned that contains the contents of the requested file, the transfer status, and any error conditions.
While LIB_http has many functions, we'll restrict our discussion to simply fetching files from the Internet using HTTP. The remaining features are described as needed throughout the book.
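The wrapper-and-common-routine arrangement described above can be sketched roughly as follows. This is only an illustration of the pattern, not the code in LIB_http: the function names sketch_http(), sketch_http_get(), and sketch_http_get_withheader() are hypothetical, and the choice of PHP/CURL options is an assumption about what such a routine would need.

```php
<?php
# Common routine: performs the transfer described by its arguments.
# Hypothetical sketch; the real http() in LIB_http may differ.
function sketch_http($target, $ref, $include_header)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target);             # Target URL
    curl_setopt($ch, CURLOPT_REFERER, $ref);            # Referer sent to the server
    curl_setopt($ch, CURLOPT_HEADER, $include_header);  # Prepend the HTTP header to the result?
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     # Return the result as a string

    $file = curl_exec($ch);                             # false if the transfer fails

    # Package the result in the three-key format shared by the wrappers
    # (the cURL handle is released when $ch goes out of scope)
    return array(
        'FILE'   => ($file === false) ? "" : $file,
        'STATUS' => curl_getinfo($ch),
        'ERROR'  => curl_error($ch),
    );
}

# Wrappers: each one only chooses the settings and defers to the common routine
function sketch_http_get($target, $ref)
{
    return sketch_http($target, $ref, false);
}

function sketch_http_get_withheader($target, $ref)
{
    return sketch_http($target, $ref, true);
}
```

Keeping the PHP/CURL calls in one routine means a change to the transfer settings is made in one place, while each wrapper stays a one-line call.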
http_get()
The function http_get() downloads files with the GET method; it has many advantages over PHP's built-in functions for downloading files from the Internet. Not only is the interface simple, but this function offers all the previously described advantages of using PHP/CURL. The script in Listing 3-4 shows how files are downloaded with http_get().
# Usage: http_get()
array http_get (string target_url, string referring_url)
Listing 3-4: Using http_get()
These are the inputs for the script in Listing 3-4:
target_url is the fully formed URL of the desired file
referring_url is the fully formed URL of the referer
These are the outputs for the script in Listing 3-4:
$array['FILE'] contains the contents of the requested file
$array['STATUS'] contains status information regarding the file transfer
$array['ERROR'] contains a textual description of any errors
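As a minimal sketch of how a script might act on the array that http_get() returns, the helper below inspects the three keys described above. The function name describe_fetch() and the sample array are hypothetical, and the 'http_code' entry assumes that STATUS carries the kind of transfer information PHP/CURL reports through curl_getinfo(); check your copy of LIB_http before relying on it.

```php
<?php
# Hypothetical helper (not part of LIB_http): summarize a result array
# that has the FILE, STATUS, and ERROR keys described above.
function describe_fetch(array $result)
{
    # A non-empty ERROR string means the transfer itself failed
    if (!empty($result['ERROR'])) {
        return "Transfer failed: " . $result['ERROR'];
    }

    # Assumption: STATUS is an array that includes an 'http_code' entry
    $code = isset($result['STATUS']['http_code']) ? $result['STATUS']['http_code'] : 0;
    if ($code != 200) {
        return "Server answered with HTTP status " . $code;
    }

    return "Fetched " . strlen($result['FILE']) . " bytes";
}

# Hand-built sample standing in for: $result = http_get($target, $ref);
$sample = array(
    'FILE'   => "<html><body>Hello</body></html>",
    'STATUS' => array('http_code' => 200),
    'ERROR'  => "",
);
echo describe_fetch($sample) . "\n";
```

Checking ERROR before STATUS separates transfers that never completed from transfers that completed but returned an unwanted response.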
http_get_withheader()
When a web agent requests a file from the Web, the server returns the file contents, as discussed in the previous section, along with the HTTP header, which describes various properties related to a web page. Browsers and webbots rely on the HTTP header to determine what to do with the contents of the downloaded file.
The data included in the HTTP header varies from application to application, but it may define cookies, the size of the downloaded file, redirections, encryption details, or authentication directives. Since the information in the HTTP header is critical to properly using a network file, LIB_http configures cURL to automatically handle the more common header directives. When you need to examine the header yourself, use http_get_withheader(); Listing 3-5 shows how this function is used.
# Usage: http_get_withheader()
array http_get_withheader (string target_url, string referring_url)
Listing 3-5: Using http_get_withheader()
These are the inputs for the script in Listing 3-5:
target_url is the fully formed URL of the desired file
referring_url is the fully formed URL of the referer
These are the outputs for the script in Listing 3-5:
$array['FILE'] contains the contents of the requested file, including the HTTP header
$array['STATUS'] contains status information about the file transfer
$array['ERROR'] contains a textual description of any errors
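Because the header arrives in the same string as the file contents, a script that wants to read individual header fields has to separate the two first. The sketch below shows one way to do that. The function split_header_and_body() and the sample string are hypothetical and not part of LIB_http, and the code assumes a single header block terminated by a blank line.

```php
<?php
# Hypothetical helper (not part of LIB_http): separate the HTTP header
# from the body in a string that contains both.
function split_header_and_body($file)
{
    # The header ends at the first blank line (CRLF CRLF)
    $parts = explode("\r\n\r\n", $file, 2);
    $body  = isset($parts[1]) ? $parts[1] : "";

    # The first header line is the status line; the rest are "Name: value" fields
    $lines  = explode("\r\n", $parts[0]);
    $status = array_shift($lines);
    $fields = array();
    foreach ($lines as $line) {
        $pair = explode(":", $line, 2);
        if (count($pair) == 2) {
            # Field names are case-insensitive, so store them in lowercase
            $fields[strtolower(trim($pair[0]))] = trim($pair[1]);
        }
    }

    return array('STATUS_LINE' => $status, 'FIELDS' => $fields, 'BODY' => $body);
}

# Hand-built sample standing in for $return_array['FILE']
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nHello";
print_r(split_header_and_body($sample));
```

If the transfer follows redirects, the returned string may hold more than one header block; this sketch reads only the first and leaves the rest in BODY.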
The example in Listing 3-6 uses the http_get_withheader() function to download a file and display the contents of the returned array.
# Include http library
include("LIB_http.php");
# Define the target and referer web pages
$target = "http://www.schrenk.com/publications.php";
$ref = "http://www.schrenk.com";
# Request the header
$return_array = http_get_withheader($target, $ref);
# Display the contents of the returned array
echo "FILE CONTENTS \n";
var_dump($return_array['FILE']);
echo "ERRORS \n";
var_dump($return_array['ERROR']);
echo "STATUS \n";
var_dump($return_array['STATUS']);
Listing 3-6: Using http_get_withheader()
The script in Listing 3-6 downloads the page and displays its contents, any errors, and a variety of status information about the fetch and download.
Listing 3-7 shows the output produced when the script in Listing 3-6 is executed: the returned array, which includes the page header, error conditions, and status. Notice that the contents of the returned file are limited to only the HTTP header, because we requested only the header and not the entire page. Also, notice that the first line