Webbots, Spiders, and Screen Scrapers - Michael Schrenk [57]
    $item_array = parse_array($news['FILE'], "<item>", "</item>");
    for($xx=0; $xx<count($item_array); $xx++)
        {
        $rss_array['ITITLE'][$xx] = return_between($item_array[$xx], "<title>", "</title>", EXCL);
        $rss_array['ILINK'][$xx] = return_between($item_array[$xx], "<link>", "</link>", EXCL);
        $rss_array['IDESCRIPTION'][$xx] = return_between($item_array[$xx], "<description>", "</description>", EXCL);
        $rss_array['IPUBDATE'][$xx] = return_between($item_array[$xx], "<pubDate>", "</pubDate>", EXCL);
        }
    return $rss_array;
    }

Listing 12-4: Downloading the RSS feed and parsing data into an array

In addition to using the http_get() function in the LIB_http library, this script also employs the return_between() and parse_array() functions to ease the task of parsing the RSS data from the XML tags.

After downloading and parsing the RSS feed, the data is formatted and displayed with the function in Listing 12-5. (PHP script appears in bold.)

function display_rss_array($rss_array) {?> {?> }?> }?>

Listing 12-5: Displaying the contents of $rss_array

Dealing with CDATA

It's worth noting that the function strip_cdata_tags() is used to remove CDATA tags from the RSS data feed. XML uses CDATA tags to identify text that may contain characters or combinations of characters that could confuse parsers. CDATA tells parsers that the data encased in CDATA tags should not be interpreted as XML tags. Listing 12-6 shows the format for using CDATA.

<![CDATA[ ... ]]>

Listing 12-6: CDATA format

Since parsers ignore everything inside CDATA tags, the script needs to strip off the tags to make the data displayable in a browser.

* * *

[40] See http://www.rssboard.org.

Adding Filtering to Your Aggregation Webbot

Your webbots can also modify or filter data received from RSS (or any other source). In this chapter's news aggregator, you could filter out (i.e., discard) any stories that don't contain specific keywords or key phrases. For example, if you only want news stories that contain the words webbots, web spiders, or spiders, you could create a filter array like the one shown in Listing 12-7.
$filter_array[]="webbots";
$filter_array[]="web spiders";
$filter_array[]="spiders";

Listing 12-7: Creating a filter array

We can use $filter_array to select articles for viewing by modifying the download_parse_rss() function used in Listing 12-4. This modification is shown in Listing 12-8.

function download_parse_rss($target, $filter_array)
    {
    # Download the RSS page
    $news = http_get($target, "");

    # Parse title and copyright notice
    $rss_array['TITLE'] = return_between($news['FILE'], "<title>", "</title>", EXCL);
    $rss_array['COPYRIGHT'] = return_between($news['FILE'], "<copyright>", "</copyright>", EXCL);

    # Parse the items
    $item_array = parse_array($news['FILE'], "<item>", "</item>");
    for($xx=0; $xx<count($item_array); $xx++)
        {
        # Filter stories for relevance
        for($keyword=0; $keyword<count($filter_array); $keyword++)
            {
            if(stristr($item_array[$xx], $filter_array[$keyword]))
                {
                $rss_array['ITITLE'][$xx] = return_between($item_array[$xx], "<title>", "</title>", EXCL);
                $rss_array['ILINK'][$xx] = return_between($item_array[$xx], "<link>", "</link>", EXCL);
                $rss_array['IDESCRIPTION'][$xx] = return_between($item_array[$xx], "<description>", "</description>", EXCL);
                $rss_array['IPUBDATE'][$xx] = return_between($item_array[$xx],