Webbots, Spiders, and Screen Scrapers - Michael Schrenk [22]

Up to this point, parsing meant extracting desired text from a larger string. Sometimes, however, parsing means manipulating text. For example, since webbots usually lack JavaScript interpreters, it's often best to delete JavaScript from downloaded files. In other cases, your webbots may need to remove all images or email addresses from a web page. For these reasons, LIB_parse includes the remove() function, an easy-to-use interface for removing unwanted text from a web page. Its usage is shown in Listing 4-9.

string remove(string web_page, string open_tag, string close_tag)

Where

web_page is the contents of the web page you want to affect
open_tag defines the beginning of the text that you want to remove
close_tag defines the end of the text you want to remove

Listing 4-9: Using remove()

By adjusting the input parameters, the remove() function can remove a variety of text from web pages, as shown in Listing 4-10.

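$uncommented_page = remove($web_page, "<!--", "-->");              // Remove HTML comments
$links_removed = remove($web_page, "<a", "</a>");                  // Remove links
$images_removed = remove($web_page, "<img", ">");                  // Remove images
$javascript_removed = remove($web_page, "<script", "</script>");   // Remove JavaScript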

Listing 4-10: Using remove()
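To make the behavior of a remove()-style function concrete, here is a minimal sketch of how such a function could be written with PHP's built-in string functions. The name remove_between() and its details (case-insensitive matching, leaving unterminated spans alone) are assumptions made for illustration; this is not the LIB_parse implementation.

```php
<?php
// Hypothetical stand-in for a remove()-style function; NOT LIB_parse's actual code.
// Deletes every span that starts with $open_tag and ends with $close_tag (inclusive).
function remove_between(string $web_page, string $open_tag, string $close_tag): string
{
    if ($open_tag === '' || $close_tag === '') {
        return $web_page; // Nothing sensible to match
    }
    $offset = 0;
    while (($start = stripos($web_page, $open_tag, $offset)) !== false) {
        // Look for the closing delimiter only after the opening one
        $end = stripos($web_page, $close_tag, $start + strlen($open_tag));
        if ($end === false) {
            break; // Unterminated span: leave the rest of the page untouched
        }
        $web_page = substr($web_page, 0, $start)
                  . substr($web_page, $end + strlen($close_tag));
        $offset = $start; // Resume scanning at the point where the span was cut
    }
    return $web_page;
}

$page        = 'Before<!-- hidden -->After<SCRIPT>alert(1);</SCRIPT>!';
$no_comments = remove_between($page, '<!--', '-->');
$no_script   = remove_between($no_comments, '<script', '</script>');
echo $no_script, "\n"; // prints "BeforeAfter!"
```

Matching on a partial opening tag such as "<img" rather than "<img>" lets the sketch remove tags regardless of the attributes they carry.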

Useful PHP Functions

In addition to the previously described parsing functions in LIB_parse, PHP also contains a multitude of built-in parsing functions. The following is a brief sample of the most valuable built-in PHP parsing functions, along with examples of how they are used.

Detecting Whether a String Is Within Another String

You can use the stristr() function to tell your webbot whether or not a string contains another string. The PHP community commonly uses the term haystack to refer to the entire unparsed text and the term needle to refer to the substring within the larger string. The function stristr() looks for an occurrence of needle in haystack. If needle is found, stristr() returns the portion of haystack that runs from the first occurrence of needle to the end of the string; if it is not found, stristr() returns false. In normal use, you're rarely concerned with the returned text itself. Generally, the fact that something was returned is taken as an indication that needle exists in haystack.

The stristr() function is handy if you want to detect whether or not a specific word is mentioned in a web page. For example, if you want to know if a web page mentions dogs, you can execute the script shown in Listing 4-11.

if(stristr($web_page, "dogs"))
    echo "This is a web page that mentions dogs.";
else
    echo "This web page does not mention dogs";

Listing 4-11: Using stristr() to see if a string contains another string

In this example, we're not specifically interested in what the stristr() function returns, but whether it returns anything at all. If something is returned, we know that the web page contained the word dogs.

The stristr() function is not case sensitive. If you need a case-sensitive version of stristr(), use strstr().
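The difference between the two functions, and one edge case in the "was anything returned?" test, can be sketched as follows. The sample strings are made up for illustration. Comparing the result against false with !== is a defensive habit, not something the listing above requires.

```php
<?php
$web_page = "Adopt one of our DOGS today!";

// stristr() ignores case, so the needle "dogs" matches "DOGS"
$found_any_case = stristr($web_page, "dogs") !== false;    // true

// strstr() is case sensitive, so the lowercase needle is not found
$found_exact_case = strstr($web_page, "dogs") !== false;   // false

// Edge case: the returned substring "0" is treated as false by PHP
$loose  = (bool) stristr("Item 0", "0");      // false, although "0" was found
$strict = stristr("Item 0", "0") !== false;   // true

var_dump($found_any_case, $found_exact_case, $loose, $strict);
```

A plain if(stristr(...)) test is fine for words like dogs; the strict comparison only matters when the text from the needle onward could be "0" or an empty string.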

Replacing a Portion of a String with Another String

The PHP built-in function str_replace() puts a new string in place of all occurrences of a substring within a string, as shown in Listing 4-12.

$org_string = "I wish I had a Cat.";

$result_string = str_replace("Cat", "Dog", $org_string);

# $result_string contains "I wish I had a Dog."

Listing 4-12: Using str_replace() to replace all occurrences of Cat with Dog
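As a small illustration of "all occurrences", the sketch below runs str_replace() on a made-up sentence containing the search text more than once. It also shows that str_replace() is case sensitive, and uses the built-in str_ireplace() as the case-insensitive counterpart.

```php
<?php
$org_string = "Cat people love a cat. Cat!";

// Case sensitive: only the capitalized "Cat" is replaced, but every occurrence of it
$sensitive = str_replace("Cat", "Dog", $org_string);
// $sensitive contains "Dog people love a cat. Dog!"

// Case insensitive: "Cat" and "cat" are both replaced
$insensitive = str_ireplace("cat", "Dog", $org_string);
// $insensitive contains "Dog people love a Dog. Dog!"

echo $sensitive, "\n", $insensitive, "\n";
```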

The str_replace() function is also useful when a webbot needs to remove a character or set of characters from a string. You do this by instructing str_replace() to replace text with an empty string, as shown in Listing 4-13.

$result = str_replace("$","","$100.00"); // Remove the dollar sign

# $result contains 100.00

Listing 4-13: Using str_replace() to remove leading dollar signs
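When several characters need to be stripped at once, str_replace() also accepts an array of search strings, and its optional fourth argument reports how many replacements were made. The sketch below assumes a hypothetical price string scraped from a page; the variable names are illustrative only.

```php
<?php
// Single quotes keep PHP from treating "$" as the start of a variable name
$price_text = '$1,250.00';

// Remove both the dollar sign and the thousands separator in one call
$clean = str_replace(array('$', ','), '', $price_text, $count);
// $clean contains "1250.00" and $count contains 2

$price = (float) $clean; // Now usable in arithmetic
echo $clean, " (", $count, " replacements)\n";
```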

Parsing Unformatted Text

The script in Listing 4-14 uses a variety of built-in functions, along with a few functions from LIB_http and LIB_parse, to create a string that contains unformatted text from a website. The result is the contents of the web page without any HTML formatting.

include("LIB_parse.php");
