3     puts $1
4   end
5 end
The differences are subtle. In line 2, I've added a pair of parentheses (in boldface) around the interesting part of the article name in the regular expression. In line 3, instead of printing out the whole value of line, I print out $1, which in Ruby (and several other regular-expression-friendly languages) means "the first place in the regular expression marked with parentheses." You can mark lots of different pieces of the expression, and thus use $2, $3, and so on.
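To make the $2 and $3 remark concrete, here is a minimal sketch with three marked pieces instead of one. The log line and the article_parts helper are made up for illustration; they are not part of the program above, and real logfile lines carry more fields than this.

```ruby
# Hypothetical helper: splits an article path into year, month/day, and name.
# Each pair of parentheses in the pattern fills the next numbered variable.
def article_parts(line)
  if line =~ %r{GET /(\d\d\d\d)/(\d\d/\d\d)/([^ .]+) }
    [$1, $2, $3]   # first, second, and third parenthesized pieces
  end              # falls through to nil when the line does not match
end

# Invented sample line, shaped like the article names in this chapter.
line = 'GET /2003/10/10/FooCampMacs HTTP/1.1'
puts article_parts(line).inspect   # prints ["2003", "10/10", "FooCampMacs"]
```

The program above needs only the one marked piece, so it sticks with $1.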
The first few lines of output produced by running this program over some logfile data look like this:
2003/10/10/FooCampMacs
2006/11/13/Rough-Mix
2003/05/22/StudentLookup
2003/11/13/FlyToYokohama
2003/07/31/PerlAngst
2003/05/21/RDFNet
2003/02/23/Democracy
2005/12/30/Spolsky-Recursion
2004/05/08/Torture
2004/04/27/RSSticker
Before we go to work determining the popularity of different articles, I'd like to argue that in some important ways, this code is beautiful. Take a moment and think of the code you'd have to write to look at an arbitrary chunk of text and do the same matching and selection work done by the parenthesized regexp. It would be quite a few lines of code, and it would be easy to get wrong. Furthermore, if the format of the logfile changed, fixing the pattern matcher would be error-prone and irritating.
Under the covers, the way that regular expressions work is also among the more wonderful things in computer science. It turns out that they can conveniently be translated into finite automata. These automata are mathematically elegant, and there are astoundingly efficient algorithms for matching them against the text you're searching. The great thing is that when you're running an automaton, you have to look only once at each character in the text you're trying to match. The effect is that a well-built regular expression engine can do pattern matching and selection faster than almost any custom code, even if it were written in hand-optimized assembly language. That's beautiful.
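The "look only once at each character" point is easier to see with a tiny automaton written out by hand. What follows is a sketch under my own assumptions: the state names, the TRANSITIONS table, and the dfa_match? helper are all invented here, and the machine recognizes only the anchored pattern \A\d+/\d+\z. It illustrates the idea; it does not describe how Ruby's own engine is implemented.

```ruby
# Hand-built automaton for the anchored pattern \A\d+/\d+\z.
# Each state maps a character class to the next state; a missing entry
# means there is no legal move and the text is rejected.
TRANSITIONS = {
  start:  { digit: :first },
  first:  { digit: :first, slash: :slash },
  slash:  { digit: :second },
  second: { digit: :second }
}.freeze
ACCEPTING = [:second].freeze

# Reduce a character to the class the transition table is keyed on.
def char_class(ch)
  case ch
  when '0'..'9' then :digit
  when '/'      then :slash
  else               :other
  end
end

# Walk the text once, left to right, one table lookup per character.
def dfa_match?(text)
  state = :start
  text.each_char do |ch|
    state = TRANSITIONS[state][char_class(ch)]
    return false if state.nil?   # no transition for this character
  end
  ACCEPTING.include?(state)
end

puts dfa_match?('2003/10')   # prints true
puts dfa_match?('2003/')     # prints false
```

The loop never backs up and never looks at a character twice, so the running time grows in step with the length of the text. With a regular expression, deriving this kind of machinery from the pattern is the engine's job, which is why the five-line program never has to spell any of it out.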
I think that the Ruby code is pretty attractive, too. Nearly every character of the program is doing useful work. Note that there are no semicolons on the ends of the lines, nor parentheses around the conditional block, and that you can write puts line instead of puts(line). Also, variables aren't declared—they're just used. This kind of stripped-down language design makes for programs that are shorter and easier to write, as well as (more important) easier to read and easier to understand.
Thinking in terms of time, regular expressions are a win/win. It takes the programmer way less time to write them than the equivalent code, it takes less time to deliver the program to the people waiting for it, it uses the computer really efficiently, and the program's user spends less time sitting there bored.
4.2.3. Content-Addressable Storage
Now we're approaching the core of our problem, computing the popularity of articles. We'll have to pull the article name out of each line, look it up to see how many times it's been fetched, add one to that number, and then store it away again.
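Those four steps can be sketched in a few lines. This is only a rough cut under stated assumptions: the count_fetches helper and the sample lines are invented for illustration, and Ruby's built-in Hash stands in for whatever lookup structure we eventually settle on.

```ruby
# Hypothetical helper: tallies how many times each article was fetched.
def count_fetches(lines)
  counts = Hash.new(0)   # an article not seen before starts at 0
  lines.each do |line|
    # Pull the article name out of the line...
    if line =~ %r{GET /(\d\d\d\d/\d\d/\d\d/[^ .]+) }
      # ...then look it up, add one, and store it away again.
      counts[$1] += 1
    end
  end
  counts
end

# Invented sample lines; real logfile lines carry more fields.
sample = [
  'GET /2003/10/10/FooCampMacs HTTP/1.1',
  'GET /2006/11/13/Rough-Mix HTTP/1.1',
  'GET /2003/10/10/FooCampMacs HTTP/1.1',
  'GET /favicon.ico HTTP/1.1'
]
puts count_fetches(sample).inspect
```

The lookup, the increment, and the store all hide inside the single expression counts[$1] += 1, and lines that don't name an article are simply skipped.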
This may be the most basic of search patterns: we start with a key (what we're using to search—in this case, an article name), and we're looking for a value (what we want to find—in this case, the number of times the article has been fetched). Here are some other examples:
Key               Value
Word              List of web pages containing the word
Employee number   Employee's personnel record
Passport number   "true" or "false," indicating whether the person with that passport should be subject to extra scrutiny
What programmers really want in this situation is a very old idea in computer science: content-addressable memory, also known as an associative store and various other permutations of those words. The idea is to put the key in and get the value out. There actually exists hardware which does just that; it mostly lives deep in the bowels of microprocessors, providing rapid access to page tables and memory caches.
The good news is that you, the