Beautiful Code [104]
Let's look at a simple question first: which articles have been read the most? It may not be instantly obvious that this problem is about search, but it is. First of all, you have to search through the logfiles to find the lines that record someone fetching an article. Second, you have to search through those lines to find the name of the article they fetched. Third, you have to keep track, for each article, of how often it was fetched.
Here is an example of one line from one of these files, which wraps to fit the page in this book, but is a single long line in the file:
c80-216-32-218.cm-upc.chello.se - - [08/Oct/2006:06:37:48 -0700] "GET /ongoing/When/
200x/2006/10/08/Grief-Lessons HTTP/1.1" 200 5945 "http://www.tbray.org/ongoing/"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
Reading from left to right, this tells us that:
Somebody from an organization named chello in Sweden,
who provided neither a username nor a password,
contacted my weblog early in the morning of October 8, 2006 (my server's time zone is seven hours off Greenwich),
and requested a resource named /ongoing/When/200x/2006/10/08/Grief-Lessons
using the HTTP 1.1 protocol;
the request was successful and returned 5,945 bytes;
the visitor had been referred from my blog's home page,
and was using Internet Explorer 6 running on Windows XP.
This is an example of the kind of line I want: one that records the actual fetch of an article. Lots of other lines record fetches of stylesheets, scripts, pictures, and so on, as well as attacks by malicious users. You can spot the kind of line I want by the fact that the article's name starts with /ongoing/When/ and continues with elements for the decade, year, month, and day.
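To make the field-by-field reading concrete, here is a minimal sketch in Python that pulls the sample line apart with plain string operations and then checks whether the requested resource starts with /ongoing/When/. The function name split_log_line and the dictionary keys are made up for illustration, and the sketch assumes that no field contains an embedded double quote, which real logfiles do not guarantee.

```python
# The sample line from the text, joined back into one long line.
SAMPLE_LINE = (
    'c80-216-32-218.cm-upc.chello.se - - [08/Oct/2006:06:37:48 -0700] '
    '"GET /ongoing/When/200x/2006/10/08/Grief-Lessons HTTP/1.1" 200 5945 '
    '"http://www.tbray.org/ongoing/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"'
)


def split_log_line(line):
    """Split one logfile line into named fields (hypothetical helper).

    Assumes the layout of the sample line and no embedded double quotes.
    """
    # Splitting on '"' separates the quoted fields from the unquoted ones.
    prefix, request, status_bytes, referrer, _, agent = line.split('"')[:6]

    # Unquoted prefix: host, two identity fields, then the bracketed timestamp.
    host, ident, user, rest = prefix.split(" ", 3)
    timestamp = rest.strip().strip("[]")

    # Quoted request: method, resource, protocol.
    method, resource, protocol = request.split(" ")

    # Status and size are kept as strings; a size may not always be a number.
    status, size = status_bytes.split()

    return {
        "host": host,
        "ident": ident,
        "user": user,
        "timestamp": timestamp,
        "method": method,
        "resource": resource,
        "protocol": protocol,
        "status": status,
        "bytes": size,
        "referrer": referrer,
        "agent": agent,
    }


fields = split_log_line(SAMPLE_LINE)
# A first, crude test for "is this an article fetch?"
is_article = fields["resource"].startswith("/ongoing/When/")
```

The prefix test alone says nothing about the decade, year, month, and day elements that follow, which is where the matching gets fiddly.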
Our first step, then, should be to find lines that contain something like:
/ongoing/When/200x/2006/10/08/
Whatever language you're programming in, you could spend lots of time writing code to match this pattern character by character. Or you could apply regular expressions.
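To give a feel for the character-by-character approach, here is a minimal sketch in Python. The template string, with # standing for "any digit," and both function names are my own inventions for illustration; this is one of many ways to write such a matcher by hand.

```python
# '#' is a placeholder for a single digit; every other character is literal.
TEMPLATE = "/ongoing/When/###x/####/##/##/"
DIGITS = "0123456789"


def matches_at(text, start):
    """Check TEMPLATE against text, one character at a time, from `start`."""
    for offset, expected in enumerate(TEMPLATE):
        actual = text[start + offset]
        if expected == "#":
            if actual not in DIGITS:
                return False
        elif actual != expected:
            return False
    return True


def contains_article_prefix(line):
    """Return True if the dated article prefix appears anywhere in `line`."""
    last_start = len(line) - len(TEMPLATE)
    for start in range(last_start + 1):
        if matches_at(line, start):
            return True
    return False
```

Even this small matcher needs two functions and a private placeholder convention, and it still does nothing about the article name that follows the date.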
4.2.1. Regular Expressions
Regular expressions are special languages designed specifically for matching patterns in text. If you learn how to use them well, you'll save yourself immense amounts of time and irritation. I've never met a really accomplished programmer who wasn't a master of regular expressions (often called regexps for short). Chapter 1, by Brian Kernighan, is dedicated to the beauty of regular expressions.
Because the filenames on my web site match such a strict, date-based pattern, a very straightforward regular expression can find the logfile lines I'm interested in (other sites' logfiles might require a more elaborate one). Here it is:
"GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+ "
A glance at this line instantly reveals one of the problems with regular expressions: they're not the world's most readable text. Some people might challenge their appearance in a book called Beautiful Code. Let's put that issue aside for a moment and look at this particular expression. The only things you need to know are that in this particular flavor of regular expression:
\d
Means "match any digit, 0 through 9"
[^ .]
Means "match any character that's not a space or period"[*]
+
Means "match one or more instances of whatever came just before the +"
[*] People who have used regular expressions know that a period is a placeholder for "any character," but it's harder to remember that when a period is enclosed in square brackets, it loses the special meaning and refers to just a period.
That [^ .]+, then, means that the last slash has to be followed by a bunch of nonspace and nonperiod characters. There's a space after the + sign, so the regular expression stops when that space is found.
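As a rough illustration, here is how the expression might be applied in Python to tally fetches per article. Several things here are my assumptions rather than anything stated above: I treat the double quotes around the expression as delimiters that make the trailing space visible, I add parentheses to capture the date and article name, I pass re.ASCII so that \d means only 0 through 9, and the stylesheet line in the sample data is invented. The name count_article_fetches is likewise made up.

```python
import re
from collections import Counter

# The expression from the text, without its surrounding quotes, plus a
# capture group (my addition) around the date and article name.
# re.ASCII restricts \d to the digits 0 through 9.
ARTICLE_FETCH = re.compile(
    r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ",
    re.ASCII,
)


def count_article_fetches(lines):
    """Count how often each article appears in an iterable of logfile lines."""
    counts = Counter()
    for line in lines:
        match = ARTICLE_FETCH.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The sample line from the text, joined back into one long line.
article_line = (
    'c80-216-32-218.cm-upc.chello.se - - [08/Oct/2006:06:37:48 -0700] '
    '"GET /ongoing/When/200x/2006/10/08/Grief-Lessons HTTP/1.1" 200 5945 '
    '"http://www.tbray.org/ongoing/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"'
)

# An invented line, for illustration only: a stylesheet fetch.
stylesheet_line = (
    'host.example.com - - [08/Oct/2006:06:37:49 -0700] '
    '"GET /ongoing/ongoing.css HTTP/1.1" 200 1234 "-" "-"'
)

counts = count_article_fetches([article_line, stylesheet_line, article_line])
most_read = counts.most_common(1)
```

With this sample data, counts holds a single entry, 2006/10/08/Grief-Lessons, seen twice; the stylesheet line is skipped because it never reaches /ongoing/When/.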
This regular expression won't