Webbots, Spiders, and Screen Scrapers - Michael Schrenk [15]
?>
Listing 3-2: Downloading files with file()
The file() function is particularly useful for downloading comma-separated value (CSV) files, in which each line of text represents a row of data with columnar formatting (as in an Excel spreadsheet). Loading files line-by-line into an array is less useful for HTML files, because the data in a web page is not organized into rows and columns the way it is in a CSV file.
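To make the CSV case concrete, here is a minimal sketch that loads a file line-by-line with file() and splits each line into columns with str_getcsv(). The sample data and the temporary file are made up for illustration so the script runs without a network connection; with allow_url_fopen enabled, file() should accept a URL in place of the local path.

```php
<?php
// Hypothetical sample data, written to a temp file so the sketch runs offline.
$path = tempnam(sys_get_temp_dir(), "csv");
file_put_contents($path, "symbol,price\nABC,10.50\nXYZ,7.25\n");

// file() returns an array with one element per line of the file.
$lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// Split each line into its columns (separator, enclosure, escape given explicitly).
$rows = array();
foreach ($lines as $line) {
    $rows[] = str_getcsv($line, ",", "\"", "\\");
}
unlink($path);

// Each row is now an array of columns: $rows[1][0] is row 2, column 1.
echo $rows[1][0] . " costs " . $rows[1][1] . "\n";
```

Because every element of $rows is itself an array of columns, the row-and-column meaning of the CSV file carries over directly into the array indexes.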
* * *
[12] See Chapter 23 for more information on executing webbots as scheduled events.
Introducing PHP/CURL
While PHP is capable when it comes to simple file downloads, most real-life applications require additional functionality to handle advanced issues such as form submission, authentication, redirection, and so on. These tasks are difficult to accomplish with PHP's built-in functions alone. Therefore, most of this book's examples use PHP/CURL to download files.
The open source cURL project is the product of Swedish developer Daniel Stenberg and a team of developers. The cURL library is available for use with nearly any computer language you can think of. When cURL is used with PHP, it's known as PHP/CURL.
The name cURL is either a blend of the words client and URL or an acronym for the words client URL Request Library—you decide. cURL does everything that PHP's built-in networking functions do and a lot more. Appendix A expands on cURL's features, but here's a quick overview of the things PHP/CURL can do for you, a webbot developer.
Multiple Transfer Protocols
Unlike the built-in PHP network functions, cURL supports multiple transfer protocols, including FTP, FTPS, HTTP, HTTPS, Gopher, Telnet, and LDAP. Of these protocols, the most important is probably HTTPS, which allows webbots to download from encrypted websites that employ the Secure Sockets Layer (SSL) protocol.
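As a rough sketch of what this looks like in practice, the helper below downloads a file with PHP/CURL and leaves the choice of protocol to the scheme in the URL. The function name fetch_url() and the example addresses in the comments are assumptions made for illustration, and the sketch assumes PHP was built with the cURL extension.

```php
<?php
// Hypothetical helper: fetch any URL whose scheme the installed libcurl supports.
function fetch_url($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);             // target, scheme included
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the data as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // give up after 30 seconds
    return curl_exec($ch);                           // false if the transfer fails
}

// Only the scheme changes from one protocol to the next, for example:
//   fetch_url("https://www.example.com/");
//   fetch_url("ftp://ftp.example.com/readme.txt");
```

Which schemes actually work depends on how the local copy of libcurl was compiled, so it is worth checking the return value for false before using the data.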
Form Submission
cURL provides easy ways for a webbot to emulate browser form submission to a server. cURL supports all of the standard form submission methods (such as GET and POST), as you'll learn in Chapter 5.
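As a preview, the sketch below shows one way a webbot might prepare a POST submission with PHP/CURL. The helper name form_post_options(), the form address, and the field names are all hypothetical; a real form's action URL and field names come from the HTML of the page being emulated.

```php
<?php
// Hypothetical helper: build the cURL options for submitting a form with POST.
function form_post_options($action, array $fields)
{
    return array(
        CURLOPT_URL            => $action,                    // the form's action URL
        CURLOPT_POST           => true,                       // use the POST method
        CURLOPT_POSTFIELDS     => http_build_query($fields),  // URL-encoded name=value pairs
        CURLOPT_RETURNTRANSFER => true,                       // return the response as a string
    );
}

// Made-up form address and fields, for illustration only.
$options = form_post_options(
    "http://www.example.com/search.php",
    array("term" => "web bots", "page" => 1)
);

$ch = curl_init();
curl_setopt_array($ch, $options);
// $response = curl_exec($ch);   // uncomment to actually submit the form
```

Encoding the fields with http_build_query() sends them the way a browser sends an ordinary form; passing an array straight to CURLOPT_POSTFIELDS instead produces a multipart/form-data submission.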
Basic Authentication
cURL allows webbots to enter password-protected websites that use basic authentication. You've encountered basic authentication if you've seen the familiar gray box shown in Figure 3-5, which asks for your username and password. PHP/CURL makes it easy to write webbots that enter and use password-protected websites.
Figure 3-5. A basic authentication prompt
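In PHP/CURL, answering that prompt comes down to two options. The sketch below is a minimal illustration; the helper name basic_auth_options(), the address, and the credentials are invented for the example.

```php
<?php
// Hypothetical helper: cURL options for a page protected by basic authentication.
function basic_auth_options($url, $username, $password)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_HTTPAUTH       => CURLAUTH_BASIC,              // use the basic scheme
        CURLOPT_USERPWD        => $username . ":" . $password, // "name:password" pair
        CURLOPT_RETURNTRANSFER => true,                        // return the page as a string
    );
}

// Made-up address and credentials, for illustration only.
$options = basic_auth_options("http://www.example.com/private/", "webbot", "secret");

$ch = curl_init();
curl_setopt_array($ch, $options);
// $page = curl_exec($ch);   // uncomment to request the protected page
```

Keep in mind that basic authentication only encodes the username and password; it does not encrypt them, so credentials sent this way are best paired with HTTPS.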
Cookies
Without cURL, it is difficult for webbots to read and write cookies, those small bits of data that websites use to create session variables that track your movement. Websites also use cookies to manage shopping carts and authenticate users. cURL makes it easy for your webbot to interpret the cookies that webservers send it; it also simplifies the process of showing webservers all the cookies your webbot has written. Chapter 21 and Chapter 22 have much more to say on the subject of webbots and cookies.
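A minimal sketch of cookie handling in PHP/CURL follows. It assumes a single cookie file is used both for storing the cookies a server sends and for presenting them on later requests; the helper name cookie_options(), the file name, and the address are hypothetical.

```php
<?php
// Hypothetical helper: cURL options that read and write cookies in one file.
function cookie_options($url, $cookie_file)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_COOKIEJAR      => $cookie_file,  // where received cookies are saved
        CURLOPT_COOKIEFILE     => $cookie_file,  // where cookies to send are read from
        CURLOPT_RETURNTRANSFER => true,          // return the page as a string
    );
}

// Made-up cookie file location and address, for illustration only.
$cookie_file = sys_get_temp_dir() . "/webbot_cookies.txt";
$options = cookie_options("http://www.example.com/cart.php", $cookie_file);

$ch = curl_init();
curl_setopt_array($ch, $options);
// $page = curl_exec($ch);   // uncomment to fetch the page and collect its cookies
```

Reusing the same cookie file across several requests is what lets a webbot keep a session alive, much as a browser does while you move from page to page.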
Redirection
Redirection occurs when a web browser looks for a file in one place, but the server tells it that the file has moved and that it should download it from another location. For example, the website www.company.com may use redirection to force browsers to go to www.company.com/spring_sale when a seasonal promotion is in place. Browsers handle redirections automatically, and cURL allows webbots to have the same functionality.
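The sketch below shows one way to switch on this behavior in PHP/CURL. The helper names redirect_options() and fetch_following_redirects() and the limit of four redirections are assumptions chosen for the example, not values taken from the book's library.

```php
<?php
// Hypothetical helper: cURL options that follow redirections automatically.
function redirect_options($url, $max_redirects = 4)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_FOLLOWLOCATION => true,            // obey Location: headers
        CURLOPT_MAXREDIRS      => $max_redirects,  // stop after this many hops
        CURLOPT_RETURNTRANSFER => true,            // return the page as a string
    );
}

// Hypothetical helper: download a page and report where it finally came from.
function fetch_following_redirects($url)
{
    $ch = curl_init();
    curl_setopt_array($ch, redirect_options($url));
    $page = curl_exec($ch);
    // After the transfer, cURL can report the last URL it actually fetched.
    $final_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    return array("page" => $page, "final_url" => $final_url);
}

// Example call (requires a network connection):
//   $result = fetch_following_redirects("http://www.company.com/");
```

Capping the number of redirections is a sensible precaution, since a misconfigured server can otherwise send a webbot around in circles.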
Agent Name Spoofing
Every time a webserver receives a file request, it stores the requesting agent's name in a log file called an access log file. This log file stores the time of access, the IP address of the requester, and the agent name, which identifies the type of program that requested the file. Generally, agent names identify the browser that the web surfer was using to view the website.
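With PHP/CURL, the agent name a webbot reports is simply another option, so the webbot can present any name it chooses. The following is a minimal sketch; the helper name agent_options() and the address are hypothetical, and the agent string is a Firefox-style name used purely as an example.

```php
<?php
// Hypothetical helper: cURL options that set the agent name sent with a request.
function agent_options($url, $agent_name)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $agent_name,  // value of the User-Agent header
        CURLOPT_RETURNTRANSFER => true,         // return the page as a string
    );
}

// Example agent name that makes the webbot look like a Firefox browser.
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.6) Gecko/20050225 Firefox/1.0.1";
$options = agent_options("http://www.example.com/", $agent);

$ch = curl_init();
curl_setopt_array($ch, $options);
// $page = curl_exec($ch);   // uncomment to send the request under that agent name
```

Whatever string is passed here is what the webserver records in its access log for the request.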
Some agent names that a server log file may record are shown in Listing 3-3. The first four names are browsers; the last is the Google spider.
Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.6) Gecko/20050225 Firefox/1.0.1
Mozilla/4.0 (compatible;