Webbots, Spiders, and Screen Scrapers - Michael Schrenk [61]

NNTP servers exchange news so frequently that newly submitted articles appear on news servers across the world almost immediately. In 1986, however, news servers often waited until the early morning hours to synchronize, when phone (modem) calls to the network were cheapest. If the newsgroup process seems odd by today's standards, remember that NNTP was optimized for use when networks were slower and more expensive.

While HTTP has superseded many older protocols (like Gopher[45]), newsgroups have survived and are still widely used today. Most modern communication applications like Microsoft Outlook and Mozilla Thunderbird include news clients in their basic configurations (see Figure 14-1).

Figure 14-1. A newsgroup as viewed in Mozilla Thunderbird, a typical news reader

While the number of active newsgroups is declining, there are still tens of thousands of newsgroups in use today. The news server I use (hosted by RoadRunner) subscribes to 26,365 newsgroups. Since newsgroups cover such a diverse set of topics (from alt.alien.visitors to alt.www.software.spiders.programming), you're apt to find one that interests you. Newsgroups are a fun source of homegrown information; however, as with many sources on the Internet, you need to take what you read with a grain of salt. Newsgroups allow anyone to make anonymous contributions, and conspiracy theories, spam, and self-promotion all thrive under those conditions.

* * *

[44] RFC 977 defines the original NNTP specification (http://www.ietf.org/rfc/rfc977.txt).

[45] Gopher was a predecessor to the World Wide Web, developed at the University of Minnesota (http://www.ietf.org/rfc/rfc1436.txt).

Webbots and Newsgroups

Newsgroups are a rich source of content for webbot developers. While less convenient than websites, news servers are not hard to access, especially when you have a set of functions that do most of the work for you. All of this chapter's example scripts use the LIB_nntp library. Functions in this library provide easy access to articles on news servers and create many opportunities for webbots. LIB_nntp contains functions that list the newsgroups hosted by a specific news server, list the articles available within a newsgroup, and download particular articles. As with all libraries used in this book, the latest version of LIB_nntp is available for download at the book's website.

Identifying News Servers

Before you use NNTP, you'll need to find an accessible news server. A Google search for free news servers will provide links to some, but keep in mind that not all news servers are equal. Since few news servers host all newsgroups, not every news server will have the group you're looking for. Many free news servers also limit the number of requests you can make in a day or suffer from poor performance. For these reasons, many people prefer to pay for access to reliable news servers. You might already have access to a premium news server through your ISP. Be warned, however, that some ISPs' news servers (like those hosted by RoadRunner and EarthLink) will not allow access if you are not directly connected to a subnet in their network.
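One quick way to see whether a news server will talk to you is to connect to it and read the greeting it sends. The following is only a minimal sketch, not part of LIB_nntp: the function names classify_nntp_greeting() and fetch_nntp_greeting() are hypothetical, and it assumes the server listens on the standard NNTP port (119) without encryption. The status codes it checks (200, 201, and 502) come from RFC 977.

```php
<?php
// Hypothetical helper (not part of LIB_nntp): interpret the three-digit
// status code at the start of an NNTP greeting line, as defined in RFC 977.
function classify_nntp_greeting($greeting)
{
    $code = (int) substr(trim($greeting), 0, 3);
    switch ($code) {
        case 200:
            return "ready, posting allowed";
        case 201:
            return "ready, read-only";
        case 502:
            return "access denied";
        default:
            return "unexpected response";
    }
}

// Hypothetical helper: connect to a news server and return its greeting
// line, or false if the connection fails or no greeting arrives.
// Assumes plain (unencrypted) NNTP on port 119.
function fetch_nntp_greeting($server, $port = 119, $timeout = 10)
{
    $socket = @fsockopen($server, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }
    stream_set_timeout($socket, $timeout);
    $greeting = fgets($socket, 1024);   // The server speaks first
    fwrite($socket, "QUIT\r\n");        // Close the session politely
    fclose($socket);
    return $greeting;
}
```

To try it, pass the name of your news server to fetch_nntp_greeting() and, if the result is not false, hand that result to classify_nntp_greeting(). A server that restricts access to its own subnet may refuse the connection outright or answer with a 502 response.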

Identifying Newsgroups

Your news bots should always verify that the group you want to access is hosted by your news server. The script in Listing 14-1 uses get_nntp_groups() to create an array containing all the newsgroups on a particular news server. (Remember to put the name of your news server in place of your.news.server below.) Putting the newsgroups in an array is handy, since it allows a webbot to examine groups iteratively.

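include("LIB_nntp.php");                   // Load the NNTP library

$server = "your.news.server";              // Replace with the name of your news server
$group_array = get_nntp_groups($server);   // Get the newsgroups hosted by the server
var_dump($group_array);                    // Display the array of newsgroups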

Listing 14-1: Requesting (and viewing) the newsgroups available on a news server
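Once the newsgroups are in an array, the verification described above might look something like the following sketch. The function group_is_hosted() is hypothetical and is not part of LIB_nntp. It assumes that each array element is a string that begins with the group name, possibly followed by the other whitespace-separated fields of an NNTP LIST response; if get_nntp_groups() returns a different structure on your system, the parsing will need to change. The sample array is made-up data used only for illustration.

```php
<?php
// Hypothetical helper (not part of LIB_nntp). Assumes each element of
// $group_array is a string whose first whitespace-separated field is the
// group name, as in an NNTP LIST line: "group last first posting-flag".
function group_is_hosted($group_array, $wanted_group)
{
    foreach ($group_array as $entry) {
        $fields = preg_split('/\s+/', trim($entry));
        if ($fields[0] === $wanted_group) {
            return true;
        }
    }
    return false;
}

// Made-up sample data standing in for the array from get_nntp_groups()
$sample_groups = array(
    "alt.alien.visitors 0000012345 0000012000 y",
    "alt.www.software.spiders.programming 0000000420 0000000400 y",
);

if (group_is_hosted($sample_groups, "alt.alien.visitors")) {
    echo "Group found on this server\n";
} else {
    echo "Group not hosted on this server\n";
}
```

In a real news bot, you would pass the $group_array from Listing 14-1 in place of $sample_groups and skip any further requests when the function returns false.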

The result of executing Listing 14-1 is shown in Figure 14-2.

Figure 14-2. Newsgroups hosted on a news server

Notice that Figure 14-2 only shows the newsgroups that hadn't already scrolled off the screen. In this example, my news server returned 46,626 groups. (It also required
