Webbots, Spiders, and Screen Scrapers - Michael Schrenk [56]

blogs. After AOL and Sun Microsystems divided up Netscape, the RSS Advisory Board took ownership of the RSS specification.[40]

Today, nearly every news service provides information in the form of RSS. RSS feeds are actually web pages that package online content in eXtensible Markup Language (XML) format. Unlike HTML, XML typically lacks formatting information and surrounds data with tags that make parsing very easy. Generally, RSS feeds provide links to web pages and just enough information to let you know whether a link is worth clicking, though feeds can also include complete articles.

The first part of an RSS feed contains a header that describes the RSS data to follow, as shown in Listing 12-1.

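<title>RSS feed title</title>

<link>www.Link_to_web_page.com</link>

<description>Description of RSS feed</description>

<copyright>Copyright notice</copyright>

<lastBuildDate>Date of RSS publication</lastBuildDate>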

Listing 12-1: The RSS feed header describes the content to follow

Not all RSS feeds start with the same set of tags, but Listing 12-1 is representative of the tags you're likely to find on most feeds. In addition to the tags shown, you may also find tags that specify the language used or define the locations of associated images.

Following the header is a collection of items that contains the content of the RSS feed, as shown in Listing 12-2.

<item>

<title>Title of item</title>

<link>URL of associated web page for item</link>

<description>Description of item</description>

<pubDate>Publication date of item</pubDate>

</item>

<!-- Other items may follow, defined as above -->

Listing 12-2: Example of RSS item descriptions

Depending on the source, RSS feeds may also use industry-specific XML tags to describe item contents. The tags shown in Listing 12-2, however, are representative of what you should find in most RSS data.
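Because each value sits between a predictable pair of tags, pulling fields out of a feed takes very little code. The following is a minimal sketch of that idea, not the book's library code: text_between() is a hypothetical helper written for this example, and the sample feed and its example.com URLs are made up for illustration.

```php
<?php
// Hypothetical helper for this sketch (not part of the book's libraries):
// returns the text found between $start and $end, or "" if either is missing.
function text_between(string $text, string $start, string $end): string
{
    $from = strpos($text, $start);
    if ($from === false) {
        return "";
    }
    $from += strlen($start);
    $to = strpos($text, $end, $from);
    if ($to === false) {
        return "";
    }
    return substr($text, $from, $to - $from);
}

// Made-up sample feed, for illustration only.
$sample_feed = <<<XML
<rss version="2.0">
<channel>
<title>Example News</title>
<link>http://www.example.com/</link>
<description>Sample headlines</description>
<item>
<title>First story</title>
<link>http://www.example.com/first</link>
<description>Summary of the first story</description>
</item>
<item>
<title>Second story</title>
<link>http://www.example.com/second</link>
<description>Summary of the second story</description>
</item>
</channel>
</rss>
XML;

// The first <title> in the feed belongs to the header.
$feed_title = text_between($sample_feed, "<title>", "</title>");

// Isolate each <item> block, then read the fields inside it.
preg_match_all('#<item>(.*?)</item>#s', $sample_feed, $matches);
$items = [];
foreach ($matches[1] as $item_xml) {
    $items[] = [
        'title' => text_between($item_xml, "<title>", "</title>"),
        'link'  => text_between($item_xml, "<link>", "</link>"),
    ];
}

// Print the feed title followed by one line per item.
echo $feed_title, "\n";
foreach ($items as $item) {
    echo "- ", $item['title'], " (", $item['link'], ")\n";
}
```

Reading the header title first and then working item by item keeps the header's title and link from being confused with those of the individual items, since both use the same tag names.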

Our project webbot takes three RSS feeds and consolidates them on a single web page, as shown in Figure 12-2.

Figure 12-2. The aggregation webbot

The webbot shown in Figure 12-2 summarizes news from three sources. It always shows current information because the webbot requests the current news from each source every time the web page is downloaded.

Writing the Aggregation Webbot

This webbot uses two scripts. The main script, shown in Listing 12-3, defines which RSS feeds to fetch and how to display them. Both scripts are available at this book's website. The PHP sections of this script appear in bold.

<?php

# Include libraries

include("LIB_http.php");

include("LIB_parse.php");

include("LIB_rss.php");

?>

<?php

# Download, parse, and display the first feed

$target = "http://www.nytimes.com/services/xml/rss/nyt/RealEstate.xml";

$rss_array = download_parse_rss($target);

display_rss_array($rss_array);

?>

<?php

# Download, parse, and display the second feed

$target = "http://www.startribune.com/rss/1557.xml";

$rss_array = download_parse_rss($target);

display_rss_array($rss_array);

?>

<?php

# Download, parse, and display the third feed

$target = "http://www.mercurynews.com/mld/mercurynews/news/breaking_news/rss.xml";

$rss_array = download_parse_rss($target);

display_rss_array($rss_array);

?>

Listing 12-3: Main aggregation webbot script, describing RSS sources and display format

As you can tell from the script in Listing 12-3, most of the work is done in the LIB_rss library, which we will explore next.
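The main script repeats the same two calls for each feed, so the pattern can also be written as a loop over a list of targets. The sketch below is one possible arrangement under stated assumptions, not the book's code: aggregate_feeds() is a hypothetical helper, and the fetch and display steps are passed in as callables, with stand-ins and made-up example.com URLs so the example runs without the book's libraries or network access.

```php
<?php
// Hypothetical helper for this sketch: runs the same fetch-and-display
// steps for every feed and returns the number of feeds processed.
function aggregate_feeds(array $targets, callable $fetch_and_parse, callable $display): int
{
    $shown = 0;
    foreach ($targets as $target) {
        $rss_array = $fetch_and_parse($target);
        $display($rss_array);
        $shown++;
    }
    return $shown;
}

// Stand-in for the download-and-parse step; it only builds a fake title.
$fake_fetch = function (string $target): array {
    return ['TITLE' => "Feed from " . $target];
};

// Stand-in for the display step; it prints the title and records it.
$displayed = [];
$fake_display = function (array $rss_array) use (&$displayed): void {
    $displayed[] = $rss_array['TITLE'];
    echo $rss_array['TITLE'], "\n";
};

// Made-up feed addresses, for illustration only.
$targets = [
    "http://www.example.com/feed_one.xml",
    "http://www.example.com/feed_two.xml",
];

$count = aggregate_feeds($targets, $fake_fetch, $fake_display);
```

With the three libraries from Listing 12-3 included, the same helper could presumably be called as aggregate_feeds($targets, 'download_parse_rss', 'display_rss_array'), since PHP accepts a function name as a callable. Any page markup placed around each feed would then belong inside the loop.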

Downloading and Parsing the Target

As the name implies, the function download_parse_rss() downloads the target RSS feed and parses the results into an array for later processing, as shown in Listing 12-4.

function download_parse_rss($target)

{

# Download the RSS page

$news = http_get($target, "");

# Parse title and copyright notice

$rss_array['TITLE'] = return_between($news['FILE'],

"<title>", "</title>", EXCL);

$rss_array['COPYRIGHT'] = return_between($news['FILE'],

"<copyright>", "</copyright>", EXCL);

# Parse the items
