Webbots, Spiders, and Screen Scrapers - Michael Schrenk [62]
News servers terminate messages by sending a line that contains just a period (.), which you can see in the last array element in Figure 14-2. That lone period is the only sign your webbot will receive to tell it to stop looking for data. If your webbot reads buffers incorrectly, it will either hang indefinitely or return with incomplete data. The small function shown in Listing 14-2 (found in LIB_nntp) correctly reads data from an open NNTP network socket and recognizes the end-of-message indicator.
function read_nntp_buffer($socket)
{
    $this_line = "";
    $buffer = "";
    while ($this_line != ".\r\n")      // Read until lone . found on line
    {
        $this_line = fgets($socket);   // Read line from socket
        if ($this_line === false)      // Stop if the socket closes or times out
            break;
        $buffer = $buffer . $this_line;
    }
    return $buffer;
}
Listing 14-2: Reading NNTP data and identifying the end of messages
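One way to exercise the end-of-message logic without a live news server is to feed the same kind of loop a canned response. The sketch below is not part of LIB_nntp: the names read_until_lone_period() and fake_socket() are hypothetical, the response text is made up, and PHP's php://memory stream stands in for the network socket. It also reports whether the lone period was actually seen, so a truncated response can be told apart from a complete one.

```php
<?php
// Hypothetical variant of the Listing 14-2 loop that also reports whether
// the end-of-message line (a lone period) was actually seen.
function read_until_lone_period($stream)
{
    $buffer = "";
    $complete = false;
    while (($line = fgets($stream)) !== false) {
        $buffer .= $line;
        if ($line === ".\r\n") {       // Lone period: end of message
            $complete = true;
            break;
        }
    }
    return array('BUFFER' => $buffer, 'COMPLETE' => $complete);
}

// Hypothetical stand-in for a socket: an in-memory stream holding sample text.
function fake_socket($text)
{
    $stream = fopen("php://memory", "r+");
    fwrite($stream, $text);
    rewind($stream);
    return $stream;
}

// Made-up responses: one properly terminated, one cut off mid-line.
$full  = read_until_lone_period(fake_socket("215 list follows\r\nalt.test 10 1 y\r\n.\r\n"));
$short = read_until_lone_period(fake_socket("215 list follows\r\nalt.test 10"));

echo "Full response complete?  " . ($full['COMPLETE'] ? "yes" : "no") . "\n";
echo "Short response complete? " . ($short['COMPLETE'] ? "yes" : "no") . "\n";
```

On a real socket, a stalled server does not produce an immediate end-of-file the way the in-memory stream does; fgets() would instead wait until the socket's timeout expires before returning.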
The script in Listing 14-1 uses the function get_nntp_groups() to get an array of available groups hosted by your news server. The script for that function is shown below in Listing 14-3.
function get_nntp_groups($server)
{
    # Open socket connection to the news server
    $fp = fsockopen($server, 119, $errno, $errstr, 30);
    if (!$fp)
    {
        # If socket error, return the error instead of using a dead socket
        $return_array['ERROR'] = "ERROR: $errstr ($errno)";
        return $return_array;
    }
    # Tell server to return a list of hosted newsgroups
    $out = "LIST\r\n";
    fputs($fp, $out);
    $groups = read_nntp_buffer($fp);
    $groups_array = explode("\r\n", $groups);   // Convert to an array
    fputs($fp, "QUIT\r\n");                     // Log out
    fclose($fp);                                // Close socket
    return $groups_array;
}
Listing 14-3: A function that finds available newsgroups on a news server
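The array returned by get_nntp_groups() holds raw response lines, including status lines and the terminating period. As a rough sketch of how those lines might be turned into structured records, the hypothetical helper below assumes each newsgroup line follows the "group last first p" layout that the NNTP specification describes for LIST responses. The function name, the array keys, and the sample lines are all illustrative and are not part of LIB_nntp.

```php
<?php
// Hypothetical helper: convert raw LIST response lines into structured records.
function parse_group_lines(array $lines)
{
    $groups = array();
    foreach ($lines as $line) {
        // Expect "group last first p"; anything else (status lines,
        // the lone period, blank lines) is skipped.
        if (preg_match('/^(\S+) (\d+) (\d+) (\S+)$/', trim($line), $m)) {
            $groups[] = array(
                'NAME'    => $m[1],
                'LAST'    => (int) $m[2],   // Highest article number
                'FIRST'   => (int) $m[3],   // Lowest article number
                'POSTING' => $m[4],         // Posting flag, e.g. y or n
            );
        }
    }
    return $groups;
}

// Made-up sample of what the exploded response might contain.
$sample = array(
    "200 server ready",
    "215 list of newsgroups follows",
    "alt.vacation.las-vegas 0000001234 0000000001 y",
    "comp.lang.php 57 12 n",
    ".",
    "",
);
$parsed = parse_group_lines($sample);
echo count($parsed) . " groups parsed; first is " . $parsed[0]['NAME'] . "\n";
```

Filtering by pattern rather than by position keeps the helper indifferent to how many status lines precede the group list.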
As you'll learn, all NNTP commands follow a structure similar to the one used in Listing 14-3. Most NNTP commands require that you do the following:
Connect to the server (on port 119)
Issue a command, like LIST (followed by a carriage return/line feed)
Read the results (until encountering a line with a lone period)
End the session with a QUIT command
Close the network socket
Other NNTP commands that identify groups hosted by news servers are listed in RFC 977. You can use the basic structure of get_nntp_groups() as a guide to creating other functions that execute NNTP commands found in RFC 977.
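As a sketch of how that structure might be generalized, the two hypothetical functions below separate the connection handling from the command exchange. Neither name comes from LIB_nntp. The approach assumes the command produces a multi-line response that ends with a lone period (as LIST does); a command that answers with a single status line, or with an error code, would never send the terminator, so a fuller version would inspect the status code before waiting for it.

```php
<?php
// Hypothetical helper: issue one command on an already-open socket,
// read the multi-line response, and end the session.
function nntp_multiline_command($socket, $command)
{
    fputs($socket, $command . "\r\n");           // Issue the command
    $buffer = "";
    while (($line = fgets($socket)) !== false) { // Read the results
        $buffer .= $line;
        if ($line === ".\r\n") {                 // Lone period ends the response
            break;
        }
    }
    fputs($socket, "QUIT\r\n");                  // End the session
    return explode("\r\n", rtrim($buffer, "\r\n"));
}

// Hypothetical wrapper: connect, run one command, close the socket.
function nntp_request($server, $command, $port = 119)
{
    $socket = @fsockopen($server, $port, $errno, $errstr, 30);
    if (!$socket) {
        return array('ERROR' => "ERROR: $errstr ($errno)");
    }
    $lines = nntp_multiline_command($socket, $command);
    fclose($socket);                             // Close the network socket
    return $lines;
}
```

A call such as nntp_request("your.news.server", "HELP") would then return the server's response as an array of lines, or an array with an ERROR element if the connection fails.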
Finding Articles in Newsgroups
As you read earlier, newsgroup articles are distributed to every news server hosting a particular newsgroup and are physically stored on each of those servers. Each article has a sequential numeric identifier that identifies the article on a particular news server. You may request the range of numeric identifiers for articles (for a given newsgroup) with a script similar to the one in Listing 14-4.
include("LIB_nntp.php");
# Request article IDs
$server = "your.news.server";
$newsgroup = "alt.vacation.las-vegas";
$ids_array = get_nntp_article_ids($server, $newsgroup);
# Report Results
echo "\nInfo about articles in $newsgroup on $server\n";
echo "Code: ". $ids_array['RESPONSE_CODE']."\n";
echo "Estimated # of articles: ". $ids_array['EST_QTY_ARTICLES']."\n";
echo "First article ID: ". $ids_array['FIRST_ARTICLE']."\n";
echo "Last article ID: ". $ids_array['LAST_ARTICLE']."\n";
Listing 14-4: Requesting article IDs from a news server
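The body of get_nntp_article_ids() is not shown here, so the following is only a guess at the parsing step such a function needs. It assumes the server answers the NNTP GROUP command with a single status line of the form "211 n f l s" (estimated article count, first article number, last article number, group name), as the NNTP specification describes. The function name parse_group_status() is hypothetical, the array keys are borrowed from Listing 14-4, and the sample line is made up.

```php
<?php
// Hypothetical parser for the single-line reply to "GROUP <newsgroup>".
function parse_group_status($line)
{
    $parts = preg_split('/\s+/', trim($line));
    $result = array('RESPONSE_CODE' => isset($parts[0]) ? $parts[0] : '');

    // Only a 211 reply carries the article counts; other codes
    // (for example 411, no such newsgroup) leave them unset.
    if ($result['RESPONSE_CODE'] === '211' && count($parts) >= 4) {
        $result['EST_QTY_ARTICLES'] = (int) $parts[1];
        $result['FIRST_ARTICLE']    = (int) $parts[2];
        $result['LAST_ARTICLE']     = (int) $parts[3];
    }
    return $result;
}

// Made-up status line for illustration.
$info = parse_group_status("211 1234 3000234 3002322 alt.vacation.las-vegas\r\n");
echo "Code: " . $info['RESPONSE_CODE'] . "\n";
echo "Estimated # of articles: " . $info['EST_QTY_ARTICLES'] . "\n";
```

Because GROUP replies with one line rather than a period-terminated block, a function built on this parser would read a single line with fgets() instead of looping until a lone period appears.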
The result of running the script in Listing 14-4 is shown in Figure 14-3.
Figure 14-3. Executing get_nntp_article_ids() and displaying the results
This function returns data in an array, with elements containing a status code,[46] the estimated quantity of articles for that group on the server, the identifier of the first article in the newsgroup, and the identifier of the last article in the newsgroup. An estimate of the number of articles is provided because