
Webbots, Spiders, and Screen Scrapers - Michael Schrenk [20]

returns a string that contains everything before or after a delimiter term. This simple function can also be used to return the text between two terms. The function provided for that task is split_string(), shown in Listing 4-1.

/*

string split_string (string unparsed, string delimiter, BEFORE/AFTER, INCL/EXCL)

Where

unparsed is the string to parse

delimiter defines the boundary between the substring you want and the substring you don't want

BEFORE indicates that you want what is before the delimiter

AFTER indicates that you want what is after the delimiter

INCL indicates that you want to include the delimiter in the parsed text

EXCL indicates that you don't want to include the delimiter in the parsed text

*/

Listing 4-1: Using split_string()

Simply pass split_string() the string you want to split, the delimiter where you want the split to occur, whether you want the portion of the string that is before or after the delimiter, and whether or not you want the delimiter to be included in the returned string. Examples using split_string() are shown in Listing 4-2.

include("LIB_parse.php");

$string = "The quick brown fox";

# Parse what's before the delimiter, including the delimiter

$parsed_text = split_string($string, "quick", BEFORE, INCL);

// $parsed_text = "The quick"

# Parse what's after the delimiter, but don't include the delimiter

$parsed_text = split_string($string, "quick", AFTER, EXCL);

// $parsed_text = "brown fox"

Listing 4-2: Examples of split_string() usage
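
To make the BEFORE/AFTER and INCL/EXCL behavior concrete, here is a minimal sketch of how a function like split_string() could be written. It is not the LIB_parse implementation: the function name sketch_split_string(), the SKETCH_ constants, the case-insensitive search, and the choice to return the whole string when the delimiter is missing are all assumptions made for illustration.

```php
<?php
// Hypothetical stand-ins for the LIB_parse constants (names and values are assumptions)
const SKETCH_BEFORE = 'before';
const SKETCH_AFTER  = 'after';
const SKETCH_INCL   = true;
const SKETCH_EXCL   = false;

# Return the part of $unparsed before or after the first occurrence of $delimiter
function sketch_split_string(string $unparsed, string $delimiter, string $side, bool $include): string
{
    // Case-insensitive search for the first occurrence (an assumption of this sketch)
    $pos = ($delimiter === '') ? false : stripos($unparsed, $delimiter);

    // Delimiter not found: return the string unchanged (also an assumption)
    if ($pos === false) {
        return $unparsed;
    }

    $len = strlen($delimiter);

    if ($side === SKETCH_BEFORE) {
        // Keep everything up to the delimiter, optionally including it
        return substr($unparsed, 0, $include ? $pos + $len : $pos);
    }

    // Keep everything from the delimiter onward, optionally skipping it
    return substr($unparsed, $include ? $pos : $pos + $len);
}

$string = "The quick brown fox";

echo sketch_split_string($string, "quick", SKETCH_BEFORE, SKETCH_INCL), "\n";

// This sketch keeps the space that follows the delimiter, so trim it here
echo trim(sketch_split_string($string, "quick", SKETCH_AFTER, SKETCH_EXCL)), "\n";
```

The whole function reduces to one position lookup followed by one substr() call, which is why the same approach extends naturally to parsing between two delimiters.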

Parsing Text Between Delimiters: return_between()

Sometimes it is useful to parse text between two delimiters. For example, to parse a web page's title, you'd want to parse the text between the <title> and </title> tags. Your webbots can use the return_between() function in LIB_parse to do this.

The return_between() function uses a start delimiter and an end delimiter to define a particular part of a string your webbot needs to parse, as shown in Listing 4-3.

/*

string return_between (string unparsed, string start, string end, INCL/EXCL)

Where

unparsed is the string to parse

start identifies the starting delimiter

end identifies the ending delimiter

INCL indicates that you want to include the delimiters in the parsed text

EXCL indicates that you don't want to include delimiters in the parsed text

*/

Listing 4-3: Using return_between()

The script in Listing 4-4 uses return_between() to parse the HTML title of a web page.

# Include libraries

include("LIB_parse.php");

include("LIB_http.php");

# Download a web page

$web_page = http_get($target="http://www.nostarch.com", $referer="");

# Parse the title of the web page, inclusive of the title tags

$title_incl = return_between($web_page['FILE'], "<title>", "</title>", INCL);

# Parse the title of the web page, exclusive of the title tags

$title_excl = return_between($web_page['FILE'], "<title>", "</title>", EXCL);

# Display the parsed text

echo "title_incl = ".$title_incl;

echo "\n";

echo "title_excl = ".$title_excl;

Listing 4-4: Using return_between() to find the title of a web page

When Listing 4-4 is run in a shell, the results should look like Figure 4-1.

Figure 4-1. Examples of using return_between(), with and without returned delimiters
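
The inclusive and exclusive results can also be reproduced without downloading anything. The following is a minimal sketch of the idea behind return_between(), run against a literal HTML string. The function name sketch_return_between(), the boolean flag used in place of INCL/EXCL, the sample HTML, and the choice to return an empty string when a delimiter is missing are assumptions for illustration, not the LIB_parse code.

```php
<?php
# Return the text between the first $start and the next $end in $unparsed
function sketch_return_between(string $unparsed, string $start, string $end, bool $include): string
{
    // Empty delimiters cannot mark a boundary
    if ($start === '' || $end === '') {
        return '';
    }

    // Find the start delimiter (case-insensitive, an assumption of this sketch)
    $from = stripos($unparsed, $start);
    if ($from === false) {
        return '';
    }

    // Find the end delimiter, searching only after the start delimiter
    $inner = $from + strlen($start);
    $to = stripos($unparsed, $end, $inner);
    if ($to === false) {
        return '';
    }

    if ($include) {
        // Keep both delimiters in the result
        return substr($unparsed, $from, $to + strlen($end) - $from);
    }

    // Keep only the text between the delimiters
    return substr($unparsed, $inner, $to - $inner);
}

// Sample page content (made up for this example)
$html = "<html><head><title>Example Page</title></head><body></body></html>";

echo "title_incl = " . sketch_return_between($html, "<title>", "</title>", true) . "\n";
echo "title_excl = " . sketch_return_between($html, "<title>", "</title>", false) . "\n";
```

Searching for the end delimiter only after the start delimiter keeps the function from matching an earlier, unrelated occurrence of the end term.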

Parsing a Data Set into an Array: parse_array()

Sometimes the things your webbot needs to parse, like links, appear more than once in a web page. In these cases, a single parsed result isn't as useful as an array of results. Such a parsed array could contain all the links, meta tags, or references to images in a web page. The parse_array() function does essentially the same thing as the return_between() function, but it returns an array of all items that match the parse description or all occurrences of data between two delimiting strings. This function, for example, makes it extremely easy to extract all the links and images from a web page.
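
As a rough illustration of collecting every occurrence between two delimiting strings, the sketch below uses PHP's preg_match_all(). It is a stand-in for the idea only: the function name sketch_parse_array(), the regular-expression approach, the decision to keep the delimiters in each result, and the sample HTML are assumptions and may differ from how parse_array() in LIB_parse works.

```php
<?php
# Return an array of every substring that starts with $start and ends with $end
function sketch_parse_array(string $unparsed, string $start, string $end): array
{
    // Escape the delimiters so they are matched literally, then match lazily
    // between them; "s" lets "." span newlines, "i" ignores case
    $pattern = '/' . preg_quote($start, '/') . '.*?' . preg_quote($end, '/') . '/si';

    preg_match_all($pattern, $unparsed, $matches);

    // $matches[0] holds the full text of each match, delimiters included
    return $matches[0];
}

// Sample page content (made up for this example)
$html = '<p><a href="/one">One</a> and <a href="/two">Two</a></p><img src="a.png">';

$links = sketch_parse_array($html, "<a", "</a>");

foreach ($links as $link) {
    echo $link . "\n";
}
```

The lazy quantifier matters here: a greedy match would run from the first opening delimiter to the last closing one and return a single oversized result instead of one entry per link.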

The parse_array() function, shown in Listing 4-5, is most useful when your webbots need to parse the content of reoccurring tags. For example, returning an array
