Webbots, Spiders, and Screen Scrapers - Michael Schrenk [57]

$item_array = parse_array($news['FILE'], "<item>", "</item>");

for($xx=0; $xx<count($item_array); $xx++)
{
    $rss_array['ITITLE'][$xx] = return_between($item_array[$xx],
        "<title>", "</title>", EXCL);

    $rss_array['ILINK'][$xx] = return_between($item_array[$xx],
        "<link>", "</link>", EXCL);

    $rss_array['IDESCRIPTION'][$xx] = return_between($item_array[$xx],
        "<description>", "</description>", EXCL);

    $rss_array['IPUBDATE'][$xx] = return_between($item_array[$xx],
        "<pubDate>", "</pubDate>", EXCL);
}

return $rss_array;

}

Listing 12-4: Downloading the RSS feed and parsing data into an array

In addition to using the http_get() function in the LIB_http library, this script also employs the return_between() and parse_array() functions to ease the task of parsing the RSS data from the XML tags.
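To make the parsing step concrete, here is a minimal sketch of what helpers like these might do. The functions sketch_return_between() and sketch_parse_array() are hypothetical stand-ins written for illustration. They are not the library's own implementations, and the sample feed is invented.

```php
<?php
// Hypothetical stand-ins for illustration only; NOT the library's code.

// Return the text between the first $start and the next $end,
// excluding the delimiters themselves. Matching is case-insensitive.
function sketch_return_between(string $string, string $start, string $end): string
{
    $from = stripos($string, $start);
    if ($from === false) {
        return "";
    }
    $from += strlen($start);
    $to = stripos($string, $end, $from);
    if ($to === false) {
        return "";
    }
    return substr($string, $from, $to - $from);
}

// Return every substring that starts with $open_tag and ends with
// $close_tag (delimiters included), in document order.
function sketch_parse_array(string $string, string $open_tag, string $close_tag): array
{
    $pattern = "/" . preg_quote($open_tag, "/") . ".*?"
             . preg_quote($close_tag, "/") . "/si";
    preg_match_all($pattern, $string, $matches);
    return $matches[0];
}

// Invented sample feed
$feed = "<channel><title>Example Feed</title>"
      . "<item><title>First</title><link>http://example.com/1</link></item>"
      . "<item><title>Second</title><link>http://example.com/2</link></item>"
      . "</channel>";

$items  = sketch_parse_array($feed, "<item>", "</item>");
$titles = [];
foreach ($items as $item) {
    $titles[] = sketch_return_between($item, "<title>", "</title>");
}
```

Splitting the feed into items first matters because both the channel and each item carry a title tag; searching inside one item at a time keeps the two from being confused.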

After downloading and parsing the RSS feed, the data is formatted and displayed with the function in Listing 12-5. (PHP script appears in bold.)

function display_rss_array($rss_array)

{?>

{?>

Listing 12-5: Displaying the contents of $rss_array
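The markup of the listing above is not reproduced here, so the following is only a rough sketch of how an array shaped like $rss_array could be turned into an HTML table. The function name sketch_render_rss_array() and the sample data are hypothetical. Unlike the approach described in the text, this sketch escapes the feed text with htmlspecialchars() and leaves out CDATA handling to stay short.

```php
<?php
// Hypothetical sketch; not the book's display_rss_array().
function sketch_render_rss_array(array $rss_array): string
{
    // Escape feed text so it cannot inject markup into the page
    $esc = function ($text) {
        return htmlspecialchars(trim((string) $text));
    };

    $html  = "<table>\n";
    $html .= "<tr><th>" . $esc($rss_array['TITLE'] ?? "") . "</th></tr>\n";
    $html .= "<tr><td>" . $esc($rss_array['COPYRIGHT'] ?? "") . "</td></tr>\n";

    // One row per item; iterating by key tolerates gaps in the index
    foreach ($rss_array['ITITLE'] ?? [] as $xx => $title) {
        $link  = $esc($rss_array['ILINK'][$xx] ?? "");
        $html .= "<tr><td><a href=\"" . $link . "\">" . $esc($title) . "</a>"
               . " (" . $esc($rss_array['IPUBDATE'][$xx] ?? "") . ")<br>"
               . $esc($rss_array['IDESCRIPTION'][$xx] ?? "") . "</td></tr>\n";
    }
    return $html . "</table>\n";
}

// Invented sample data in the same shape as $rss_array
$sample = [
    'TITLE'        => 'Example Feed',
    'COPYRIGHT'    => 'Copyright Example Publisher',
    'ITITLE'       => [0 => 'First & foremost'],
    'ILINK'        => [0 => 'http://example.com/1'],
    'IDESCRIPTION' => [0 => 'Story text'],
    'IPUBDATE'     => [0 => 'Mon, 01 Jan 2007 00:00:00 GMT'],
];
$html = sketch_render_rss_array($sample);
```

Escaping is a deliberate trade-off: it keeps untrusted feed text from altering the page, but descriptions that contain their own HTML will show their tags literally.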

Dealing with CDATA

It's worth noting that the function strip_cdata_tags() is used to remove CDATA tags from the RSS data feed. XML uses CDATA tags to identify text that may contain characters or combinations of characters that could confuse parsers. CDATA tells parsers that the data encased in CDATA tags should not be interpreted as XML tags. Listing 12-6 shows the format for using CDATA.

<![CDATA[ ... ]]>

Listing 12-6: CDATA format

Since parsers ignore everything enclosed in CDATA tags, the script needs to strip off the tags to make the data displayable in a browser.
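The body of strip_cdata_tags() is not shown in this section. A minimal sketch of the idea, assuming the job is simply to delete the opening and closing CDATA markers, might look like the following. The name sketch_strip_cdata_tags() and the sample string are hypothetical.

```php
<?php
// Hypothetical sketch; the real strip_cdata_tags() may differ.
function sketch_strip_cdata_tags(string $string): string
{
    // Delete the CDATA markers and keep the text they enclosed
    return str_replace(["<![CDATA[", "]]>"], "", $string);
}

// Invented sample: a title wrapped in a CDATA section
$raw      = "<![CDATA[Bots & Spiders <Weekly>]]>";
$stripped = sketch_strip_cdata_tags($raw);

// Optional extra step (an assumption, not from the text): escape the
// result so characters such as < and & display literally in a browser
$safe = htmlspecialchars($stripped);
```

Text that was protected by CDATA often contains characters a browser would treat as markup, which is why the sketch shows escaping as a possible follow-up step.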

* * *

[40] See http://www.rssboard.org.

Adding Filtering to Your Aggregation Webbot

Your webbots can also modify or filter data received from RSS (or any other source). In this chapter's news aggregator, you could filter out (that is, discard) any stories that don't contain specific keywords or key phrases. For example, if you only want news stories that contain at least one of the terms webbots, web spiders, or spiders, you could create a filter array like the one shown in Listing 12-7.

$filter_array[]="webbots";

$filter_array[]="web spiders";

$filter_array[]="spiders";

Listing 12-7: Creating a filter array
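The keyword test behind a filter like this can be sketched as a small standalone function. The helper sketch_matches_filter() and the sample items are hypothetical; the sketch assumes a story is kept when it contains at least one term, matched without regard to case.

```php
<?php
// Hypothetical helper: true when $item contains at least one filter term.
function sketch_matches_filter(string $item, array $filter_array): bool
{
    foreach ($filter_array as $keyword) {
        // stripos() performs a case-insensitive substring search
        if (stripos($item, $keyword) !== false) {
            return true; // stop at the first matching term
        }
    }
    return false;
}

$filter_array = ["webbots", "web spiders", "spiders"];

// Invented sample items
$items = [
    "<item><title>Building Webbots in PHP</title></item>",
    "<item><title>Gardening tips</title></item>",
    "<item><title>How SPIDERS crawl</title></item>",
];

// Keep only matching items and re-index the result from zero
$kept = array_values(array_filter($items, function ($item) use ($filter_array) {
    return sketch_matches_filter($item, $filter_array);
}));
```

Returning on the first match means an item that contains several terms is handled only once, and array_values() leaves no gaps in the numbering of the surviving stories.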

We can use $filter_array to select articles for viewing by modifying the download_parse_rss() function used in Listing 12-4. This modification is shown in Listing 12-8.

function download_parse_rss($target, $filter_array)
{
    # Download the RSS page
    $news = http_get($target, "");

    # Parse title and copyright notice
    $rss_array['TITLE'] = return_between($news['FILE'],
        "<title>", "</title>", EXCL);

    $rss_array['COPYRIGHT'] = return_between($news['FILE'],
        "<copyright>", "</copyright>", EXCL);

    # Parse the items
    $item_array = parse_array($news['FILE'], "<item>", "</item>");

    for($xx=0; $xx<count($item_array); $xx++)
    {
        # Filter stories for relevance
        for($keyword=0; $keyword<count($filter_array); $keyword++)
        {
            if(stristr($item_array[$xx], $filter_array[$keyword]))
            {
                $rss_array['ITITLE'][$xx] = return_between($item_array[$xx],
                    "<title>", "</title>", EXCL);

                $rss_array['ILINK'][$xx] = return_between($item_array[$xx],
                    "<link>", "</link>", EXCL);

                $rss_array['IDESCRIPTION'][$xx] = return_between($item_array[$xx],
                    "<description>", "</description>", EXCL);

$rss_array['IPUBDATE'][$xx] = return_between($item_array[$xx],
