Webbots, Spiders, and Screen Scrapers - Michael Schrenk [56]
Today, nearly every news service provides information in the form of RSS. RSS feeds are actually web pages that package online content in eXtensible Markup Language (XML) format. Unlike HTML, XML typically lacks formatting information and surrounds data with tags that make parsing very easy. Generally, RSS feeds provide links to web pages and just enough information to let you know whether a link is worth clicking, though feeds can also include complete articles.
The first part of an RSS feed contains a header that describes the RSS data to follow, as shown in Listing 12-1.
<title>RSS feed title</title>
<link>www.Link_to_web_page.com</link>
<description>Description of RSS feed</description>
<copyright>Copyright notice</copyright>
<pubDate>Date of RSS publication</pubDate>
Listing 12-1: The RSS feed header describes the content to follow
Not all RSS feeds start with the same set of tags, but Listing 12-1 is representative of the tags you're likely to find on most feeds. In addition to the tags shown, you may also find tags that specify the language used or define the locations of associated images.
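As a rough illustration of how little work it takes to read header fields like those in Listing 12-1, the sketch below pulls the title and copyright notice out of a feed header with plain string functions. The helper text_between() and the sample header text are made up for this example; they are stand-ins and not routines from this book's libraries.

```php
<?php
// Hypothetical helper: return the text found between $start and $stop,
// or an empty string when either delimiter is missing.
function text_between(string $haystack, string $start, string $stop): string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return "";
    }
    $from += strlen($start);
    $to = strpos($haystack, $stop, $from);
    if ($to === false) {
        return "";
    }
    return trim(substr($haystack, $from, $to - $from));
}

// Made-up feed header (an assumption for illustration only)
$sample_header = "<title>Example News</title>\n"
               . "<link>http://www.example.com</link>\n"
               . "<description>Headlines from an example site</description>\n"
               . "<copyright>Copyright Example News</copyright>\n";

// Collect the two header fields into an array
$header = [
    'TITLE'     => text_between($sample_header, "<title>", "</title>"),
    'COPYRIGHT' => text_between($sample_header, "<copyright>", "</copyright>"),
];

echo $header['TITLE'] . "\n";
echo $header['COPYRIGHT'] . "\n";
```

Returning an empty string for a missing tag is one possible design choice; it lets a feed without a copyright notice still be processed.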
Following the header is a collection of items that contains the content of the RSS feed, as shown in Listing 12-2.
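<item>
<title>Title of item</title>
<link>URL of associated web page for item</link>
<description>Description of item</description>
<pubDate>Publication date of item</pubDate>
</item>
<item>
Other items may follow, defined as above
</item>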
Listing 12-2: Example of RSS item descriptions
Depending on the source, RSS feeds may also use industry-specific XML tags to describe item contents. The tags shown in Listing 12-2, however, are representative of what you should find in most RSS data.
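To make the item layout concrete, here is a minimal sketch that splits a feed body into items and reads the four common fields from each one. The sample data, the tag_text() helper, and the lowercase array keys are all assumptions invented for this example. It also assumes simple tags without attributes or CDATA sections, which real feeds do not always honor.

```php
<?php
// Made-up feed body with two items (illustrative data only)
$sample_items = <<<XML
<item>
<title>First headline</title>
<link>http://www.example.com/first</link>
<description>Summary of the first story</description>
<pubDate>Mon, 01 Jan 2007 08:00:00 GMT</pubDate>
</item>
<item>
<title>Second headline</title>
<link>http://www.example.com/second</link>
<description>Summary of the second story</description>
<pubDate>Tue, 02 Jan 2007 09:30:00 GMT</pubDate>
</item>
XML;

// Hypothetical helper: return the contents of the first <$tag> element in $xml
function tag_text(string $xml, string $tag): string
{
    $quoted  = preg_quote($tag, '#');
    $pattern = '#<' . $quoted . '>(.*?)</' . $quoted . '>#s';
    return preg_match($pattern, $xml, $m) ? trim($m[1]) : "";
}

// Collect every <item> block, then read the common fields from each one
preg_match_all('#<item>(.*?)</item>#s', $sample_items, $matches);
$items = [];
foreach ($matches[1] as $block) {
    $items[] = [
        'title'       => tag_text($block, 'title'),
        'link'        => tag_text($block, 'link'),
        'description' => tag_text($block, 'description'),
        'pubDate'     => tag_text($block, 'pubDate'),
    ];
}

foreach ($items as $item) {
    echo $item['title'] . " -> " . $item['link'] . "\n";
}
```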
Our project webbot takes three RSS feeds and consolidates them on a single web page, as shown in Figure 12-2.
Figure 12-2. The aggregation webbot
The webbot shown in Figure 12-2 summarizes news from three sources. It always shows current information because the webbot requests the current news from each source every time the web page is downloaded.
Writing the Aggregation Webbot
This webbot uses two scripts. The main script, shown in Listing 12-3, defines which RSS feeds to fetch and how to display them. Both scripts are available at this book's website. The PHP sections of this script appear in bold.
<?php
# Include libraries
include("LIB_http.php");
include("LIB_parse.php");
include("LIB_rss.php");
?>
<table>
<tr>
<td>
<?php
$target = "http://www.nytimes.com/services/xml/rss/nyt/RealEstate.xml";
$rss_array = download_parse_rss($target);
display_rss_array($rss_array);
?>
</td>
<td>
<?php
$target = "http://www.startribune.com/rss/1557.xml";
$rss_array = download_parse_rss($target);
display_rss_array($rss_array);
?>
</td>
<td>
<?php
$target = "http://www.mercurynews.com/mld/mercurynews/news/breaking_news/rss.xml";
$rss_array = download_parse_rss($target);
display_rss_array($rss_array);
?>
</td>
</tr>
</table>
Listing 12-3: Main aggregation webbot script, describing RSS sources and display format
As you can tell from the script in Listing 12-3, most of the work is done in the LIB_rss library, which we will explore next.
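Because the main script repeats the same three statements for every feed, one possible variation is to keep the feed addresses in an array and build the columns in a loop. The sketch below shows that idea only. stub_parse_rss() and render_feed_cell() are hypothetical stand-ins so the example runs without network access; they are not the LIB_rss functions, and the example.com addresses are placeholders rather than the feeds used in Listing 12-3.

```php
<?php
// Stand-in for the download-and-parse step: returns canned data so the
// sketch runs offline. Only the TITLE and COPYRIGHT keys are modeled.
function stub_parse_rss(string $target): array
{
    return [
        'TITLE'     => "Feed at " . $target,
        'COPYRIGHT' => "Copyright notice for " . $target,
    ];
}

// Hypothetical renderer: builds one table cell per feed
function render_feed_cell(array $rss_array): string
{
    return "<td valign=\"top\">"
         . "<b>" . htmlspecialchars($rss_array['TITLE']) . "</b><br>"
         . htmlspecialchars($rss_array['COPYRIGHT'])
         . "</td>";
}

// Placeholder feed addresses (assumptions for illustration)
$targets = [
    "http://www.example.com/feed_one.xml",
    "http://www.example.com/feed_two.xml",
    "http://www.example.com/feed_three.xml",
];

// Build one column per feed, then wrap the columns in a single-row table
$cells = "";
foreach ($targets as $target) {
    $cells .= render_feed_cell(stub_parse_rss($target));
}
$page = "<table><tr>" . $cells . "</tr></table>";

echo $page . "\n";
```

With this arrangement, adding a fourth source would mean adding one line to the $targets array instead of copying another block of statements.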
Downloading and Parsing the Target
As the name implies, the function download_parse_rss() downloads the target RSS feed and parses the results into an array for later processing, as shown in Listing 12-4.
function download_parse_rss($target)
{
# Download the RSS page
$news = http_get($target, "");
# Parse title and copyright notice
$rss_array['TITLE'] = return_between($news['FILE'],
"<title>", "</title>", EXCL);
$rss_array['COPYRIGHT'] = return_between($news['FILE'],
"<copyright>", "</copyright>", EXCL);
# Parse the items