Webbots, Spiders, and Screen Scrapers - Michael Schrenk [23]

include("LIB_parse.php"); # Include parse library
include("LIB_http.php"); # Include cURL library

// Download the page
$web_page = http_get($target="http://www.cnn.com", $referer="");

// Remove all JavaScript
$noformat = remove($web_page['FILE'], "<script", "</script>");

// Strip out all HTML formatting
$noformat = strip_tags($noformat);

// Remove unwanted white space
$noformat = str_replace("\t", "", $noformat); // Remove tabs
$noformat = str_replace("&nbsp;", "", $noformat); // Remove non-breaking spaces
$noformat = str_replace("\n", "", $noformat); // Remove line feeds

echo $noformat;

Listing 4-14: Parsing the content from the HTML used on http://www.cnn.com

Measuring the Similarity of Strings

Sometimes it is convenient to calculate the similarity of two strings without necessarily parsing them. PHP's similar_text() function returns the number of characters that two strings have in common and, when you pass a variable as its optional third argument, stores the percentage of similarity in that variable. The syntax for using similar_text() is shown in Listing 4-15.

$matching_chars = similar_text($string1, $string2, $similarity_percentage);

Listing 4-15: Example of using PHP's similar_text() function

You can use similar_text(), for example, to determine whether a new version of a web page is significantly different from a cached version.
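The cached-page comparison might look something like the following sketch. The page_changed() helper, its 10 percent default threshold, and the sample strings are assumptions made for illustration; only similar_text() itself is PHP's. Note that similar_text() writes the percentage into its third argument, which is passed by reference.

```php
<?php
// Hypothetical helper: reports whether two versions of a page differ by
// more than $threshold percent. The default threshold is an arbitrary choice.
function page_changed(string $cached, string $fresh, float $threshold = 10.0): bool
{
    // Identical strings (including two empty strings) have not changed.
    if ($cached === $fresh) {
        return false;
    }

    // similar_text() returns the number of matching characters and stores
    // the similarity percentage in its third, by-reference argument.
    similar_text($cached, $fresh, $percent);

    return (100.0 - $percent) > $threshold;
}

// Sample data standing in for a cached page and a fresh download.
$cached = "Top story: markets rally on strong earnings";
$fresh  = "Weather alert: heavy snow expected tonight";

var_dump(page_changed($cached, $cached));
var_dump(page_changed($cached, $fresh));
```

Because similar_text() can be slow on long strings, it may be worth comparing only the parsed portion of a page rather than the entire download.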

Final Thoughts

As demonstrated, a wide variety of parsing tasks can be performed with the standardized parsing routines in LIB_parse, along with a few of PHP's built-in functions. Here are a few more suggestions that may help you in your parsing projects.

Note

You'll get plenty of parsing experience as you explore the projects in this book. The projects also introduce a few advanced parsing techniques. In Chapter 7, we'll cover advanced methods for parsing data in tables. In Chapter 11, you'll learn about the insertion parse, which makes it easier to parse and debug difficult-to-parse web pages.

Don't Trust a Poorly Coded Web Page

While the scripts in LIB_parse attempt to handle most situations, there is no guarantee that you will be able to parse poorly coded or nonsensical web pages. Even the use of Tidy will not always provide proper results. For example, code like this:

may drive your parsing routines crazy. If you're having trouble debugging a parsing routine, check to see if the page has errors. If you don't check for errors, you may waste many hours trying to parse unparseable web pages.
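One quick, rough way to check a page for errors is to compare how many times a tag is opened and closed. The sketch below uses only PHP's built-in string functions; the tag_balance() helper and the sample markup are hypothetical, and the check is a crude heuristic rather than a validator (it will miss, for instance, an opening tag whose name is followed by a line break).

```php
<?php
// Hypothetical helper: returns (opening tags - closing tags) for one
// element name. A nonzero result hints that the markup is unbalanced.
function tag_balance(string $html, string $tag): int
{
    $lower = strtolower($html);
    $tag   = strtolower($tag);

    // Count "<tag>" and "<tag " separately so that "<td" does not also
    // match a longer tag name that merely starts with "td".
    $opens  = substr_count($lower, "<$tag>") + substr_count($lower, "<$tag ");
    $closes = substr_count($lower, "</$tag>");

    return $opens - $closes;
}

// Sample markup with one unclosed <td> (an assumption for illustration).
$html = '<table><tr><td>a</td><td>b</tr></table>';

foreach (['table', 'tr', 'td'] as $tag) {
    echo $tag, ": ", tag_balance($html, $tag), "\n";
}
```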

Parse in Small Steps

When you are writing a script that depends on several levels of parsing, avoid the temptation to write your parsing script in one pass. Since succeeding sections of your code will depend on earlier parses, write and debug your scripts one parse at a time.
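A staged parse might be organized along the lines of this sketch, where each step produces a value you can inspect before writing the next step. The sample markup and the between() helper are assumptions made so the example runs on its own; in practice the routines in LIB_parse would fill the same role.

```php
<?php
// Sample input standing in for a downloaded page (hypothetical markup).
$page = '<html><body><h1>Prices</h1>'
      . '<table id="prices"><tr><td>Widget</td><td>4.50</td></tr>'
      . '<tr><td>Gadget</td><td>7.25</td></tr></table></body></html>';

// Hypothetical helper: text between the first $open and the next $close.
function between(string $text, string $open, string $close): string
{
    $start = strpos($text, $open);
    if ($start === false) {
        return "";
    }
    $start += strlen($open);
    $end = strpos($text, $close, $start);
    return $end === false ? "" : substr($text, $start, $end - $start);
}

// Step 1: isolate the table, and stop here if that parse fails.
$table = between($page, '<table id="prices">', '</table>');
if ($table === "") {
    exit("Step 1 failed: table not found\n");
}

// Step 2: split the table into rows; array_filter() drops the empty
// piece that follows the final </tr>.
$rows = array_filter(explode('</tr>', $table));

// Step 3: reduce each row to the text of its cells.
$records = [];
foreach ($rows as $row) {
    $cells = explode('</td>', $row);
    array_pop($cells); // Drop the empty piece after the final </td>
    $records[] = array_map('strip_tags', $cells);
}

print_r($records);
```

If step 3 returns something unexpected, printing $table and $rows shows which earlier parse to revisit.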

Don't Render Parsed Text While Debugging

If you are viewing the results of your parse in a browser, remember that the browser will attempt to render your output as a web page. If the results of your parse contain tags, display your parses within <xmp> and </xmp> tags. These tags will tell the browser not to render the results of your parse as HTML. Failure to analyze the unformatted results of your parse may cause you to miss things that are inside tags.[16]
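Another way to keep the browser from rendering parse results is to escape them before output. The sketch below uses PHP's htmlspecialchars() inside a pre element; the show_raw() helper name and the sample string are assumptions for illustration.

```php
<?php
// Hypothetical helper: wraps raw parse output so that a browser displays
// the markup literally instead of rendering it.
function show_raw(string $parsed): string
{
    // htmlspecialchars() converts <, >, &, and quotes to HTML entities.
    return "<pre>" . htmlspecialchars($parsed, ENT_QUOTES, 'UTF-8') . "</pre>";
}

// Sample parse result that still contains tags.
$parsed = '<a href="/news">News</a>';

echo show_raw($parsed), "\n";
```

Viewed in a browser, the link appears as its literal source text rather than as a clickable link, so anything hidden inside the tags stays visible.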

Use Regular Expressions Sparingly

Regular expressions are a parsing language in themselves, and most modern programming languages support some form of them. In the right hands, regular expressions are useful for parsing and substituting text; however, they are famous for their steep learning curve and cryptic syntax. I avoid regular expressions whenever possible.

The regular expression engine used by PHP is not as efficient as engines used in other languages, and it is certainly less efficient than PHP's built-in functions for parsing HTML. For those reasons, my preference is to limit regular expression use to instances in which there are few alternatives; in those cases, I use wrapper functions to take advantage of the functionality of regular expressions while shielding the developer from their complexities.
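A wrapper of this kind might look like the following sketch. The regex_between_all() name and signature are hypothetical and are not taken from LIB_parse; the point is only that the caller passes plain delimiter strings and never sees the pattern.

```php
<?php
// Hypothetical wrapper: returns every substring found between $open and
// $close, keeping the regular expression out of the caller's code.
function regex_between_all(string $text, string $open, string $close): array
{
    // preg_quote() escapes the delimiters so they are matched literally.
    // (.*?) is a non-greedy capture; the s modifier lets it span line breaks.
    $pattern = '/' . preg_quote($open, '/') . '(.*?)' . preg_quote($close, '/') . '/s';

    preg_match_all($pattern, $text, $matches);

    return $matches[1];
}

// Sample markup (an assumption for illustration).
$html = '<li>One</li><li>Two</li>';

print_r(regex_between_all($html, '<li>', '</li>'));
```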

* * *

[16] Chapter 3 describes additional methods for viewing text downloaded from websites.
