Webbots, Spiders, and Screen Scrapers - Michael Schrenk

Some of these functions use an additional input parameter, $data_array, when form data is passed from the webbot to the webserver. These functions are listed below:

http_get_form()

http_get_form_withheader()

http_post_form()

http_post_form_withheader()
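
As a rough illustration of the kind of form data such a parameter might carry, the sketch below assumes that $data_array is an associative array of form field names and values. The field names are hypothetical, and the use of PHP's built-in http_build_query() is an assumption made for illustration, not a description of how the library encodes its data.

```php
<?php
// Hypothetical form fields; the names and values are placeholders.
$data_array = array(
    'username' => 'webbot',
    'query'    => 'screen scrapers',
);

// http_build_query() is a PHP built-in that URL-encodes an associative
// array into the name=value&name=value string used in form submissions.
$encoded = http_build_query($data_array);

echo $encoded . "\n";
```

A string in this form is what a webserver typically receives, either appended to the URL for a GET request or sent in the request body for a POST request.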

If you don't understand what all these functions do now, don't worry. Their use will become familiar to you as you go through the examples that appear later in this book. Now might be a good time to thumb through Appendix A, which details the features of cURL that webbot developers are most apt to need.

Chapter 4. PARSING TECHNIQUES

Parsing is the process of segregating what's desired or useful from what is not. In the case of webbots, parsing involves detecting and separating image names and addresses, key phrases, hyper-references, and other information of interest to your webbot. For example, if you are writing a spider that follows links on web pages, you will have to separate these links from the rest of the HTML. Similarly, if you write a webbot to download all the images from a web page, you will have to write parsing routines that identify all the references to image files.
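
To make the idea concrete, here is a minimal sketch that separates link targets and image references from the rest of a page using only PHP's built-in string functions. The function name find_attribute_values() and the sample markup are hypothetical, and the sketch assumes double-quoted attribute values, which real pages do not always provide.

```php
<?php
// Return the value of every occurrence of $attribute="..." in $html.
// Naive on purpose: it assumes double-quoted values and does not check
// which tag the attribute belongs to.
function find_attribute_values($html, $attribute)
{
    $values = array();
    $needle = $attribute . '="';
    $offset = 0;

    // stripos() is case-insensitive, so SRC= and src= both match.
    while (($start = stripos($html, $needle, $offset)) !== false) {
        $start += strlen($needle);
        $end = strpos($html, '"', $start);
        if ($end === false) {
            break; // Unterminated quote: stop rather than guess.
        }
        $values[] = substr($html, $start, $end - $start);
        $offset = $end + 1;
    }
    return $values;
}

$page = '<p><a href="/about.html">About</a> <img src="logo.gif" alt="Logo"> '
      . '<a href="http://www.example.com/">Example</a></p>';

print_r(find_attribute_values($page, 'href')); // link targets
print_r(find_attribute_values($page, 'src'));  // image references
```

A spider would feed the first list back into its download queue, while an image harvester would use the second.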

Parsing Poorly Written HTML

One of the problems you'll encounter when parsing web pages is poorly written HTML. A large amount of HTML is machine generated and shows little regard for human readability, and hand-written HTML often disregards standards by omitting closing tags or misusing quotes around attribute values. Browsers may correctly render web pages that have substandard HTML, but poorly written HTML interferes with your webbot's ability to parse web pages.

Fortunately, a software library known as HTMLTidy[14] will clean up poorly written web pages. PHP includes HTMLTidy in its standard distributions, so you should have no problem getting it running on your computer. Installing HTMLTidy (also known as just Tidy) should be similar to installing cURL. Complete installation instructions are available at the PHP website.[15]

The parse functions (described next) rely on Tidy to put unparsed source code into a known state, with known delimiters and known closing tags of known case.
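
The following sketch shows one way to put markup into a known state with PHP's tidy extension. The sample markup and the choice of configuration options are assumptions made for illustration. The call is guarded because the extension may not be loaded on every installation.

```php
<?php
// Sloppy markup: mixed-case tags, an unquoted attribute, missing closing tags.
$messy = '<P>First paragraph<p>Second <B>bold<IMG SRC=logo.gif>';

if (function_exists('tidy_repair_string')) {
    // Ask Tidy for XHTML output so that tags are lower case, attribute
    // values are quoted, and every element is closed. A wrap value of 0
    // turns off line wrapping.
    $config = array('output-xhtml' => true, 'wrap' => 0);
    $clean = tidy_repair_string($messy, $config, 'utf8');
} else {
    // The tidy extension is not loaded; fall back to the raw markup.
    $clean = $messy;
}

echo $clean . "\n";
```

When the extension is available, the output is a complete document in which the paragraphs are explicitly closed, which is the kind of predictable structure a delimiter-based parser can rely on.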

Note

If you do not have HTMLTidy installed on your computer, the parsing described in this book may not work correctly.

* * *

[14] See http://tidy.sourceforge.net.

[15] See http://www.php.net.

Standard Parse Routines

I have simplified parsing by identifying a few useful functions and placing them into a library called LIB_parse. These functions (or a combination of them) provide everything needed for 99 percent of your parsing tasks. Whether or not you use the functions in LIB_parse, I strongly suggest that you standardize your parsing routines. Standardized parse functions make your scripts easier to read and faster to write. Perhaps just as importantly, when you limit your parsing options to a few simple solutions, you're forced to consider simpler approaches to parsing problems. The latest version of LIB_parse is available from this book's website.

Using LIB_parse

The parsing library used in this book, LIB_parse, provides easy-to-read parsing functions that should handle most of the parsing tasks your webbots will encounter. Primarily, LIB_parse contains wrapper functions that provide simple interfaces to otherwise complicated routines. To use the examples in this book, you should download the latest version of this library from the book's website.

One of the things you may notice about LIB_parse is the lack of regular expressions. Although regular expressions are the mainstay for parsing text, you won't find many of them here. Regular expressions can be difficult to read and understand, especially for beginners. The built-in PHP string manipulation functions are easier to understand and usually more efficient than regular expressions.
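
For comparison, the sketch below extracts a page title twice, once with built-in string functions and once with a regular expression. The sample markup and variable names are made up for illustration, and neither approach is taken from LIB_parse.

```php
<?php
$html = '<html><head><title>Parsing Techniques</title></head><body></body></html>';

// Approach 1: built-in string functions.
$open  = '<title>';
$close = '</title>';
$title_by_string = '';
$start = strpos($html, $open);
if ($start !== false) {
    $start += strlen($open);              // Skip past the opening delimiter.
    $end = strpos($html, $close, $start); // Find the closing delimiter.
    if ($end !== false) {
        $title_by_string = substr($html, $start, $end - $start);
    }
}

// Approach 2: the same extraction with a regular expression.
// (.*?) is a non-greedy capture; the s modifier lets . match newlines.
$title_by_regex = '';
if (preg_match('/<title>(.*?)<\/title>/s', $html, $matches)) {
    $title_by_regex = $matches[1];
}

echo $title_by_string . "\n";
echo $title_by_regex . "\n";
```

Both approaches return the same text here. The string-function version is longer, but each step names what it is doing, which is the readability trade-off described above.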

The following is a description of the functions in LIB_parse and the parsing problems they solve. These functions are also described completely within the comments of LIB_parse.

Splitting a String at a Delimiter: split_string()

The simplest parsing function
