Beautiful Code [105]
4.2.2. Putting Regular Expressions to Work
A regular expression standing by itself, as shown above, can be used on the command line to search files. But it turns out that most modern computer languages allow you to use them directly in program code. Let's do that, and write a program that prints out only the lines that match the expression, which is to say a program that records all the times someone fetched an article from the weblog.
This example (and most other examples in this chapter) is in the Ruby programming language because I believe it to be, while far from perfect, the most readable of languages.
If you don't know Ruby, learning it will probably make you a better programmer. In Chapter 29, the creator of Ruby, Yukihiro Matsumoto (generally known as "Matz"), discusses some of the design choices that have attracted me and so many other programmers to the language.
Example 4-1 shows our first Ruby program, with added line numbers on the left side. (All the examples in this chapter are available from the O'Reilly web site for this book.)
Example 4-1. Printing article-fetch lines
1 ARGF.each_line do |line|
2 if line =~ %r{GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+ }
3 puts line
4 end
5 end
Running this program prints out a bunch of logfile lines that look like the first example. Let's have a line-by-line look at it:
Line 1
We want to read all the lines of the input, and we don't care whether they're from files named on the command line or are being piped in from another program on the standard input. The designers of Ruby believe strongly that programmers shouldn't have to write ugly code to deal with common situations, and this is a common situation. So, ARGF is a special variable that represents all the input sources. If the command line includes arguments, ARGF assumes they're names of files and opens them one by one; if there aren't any, it uses the standard input.
each_line is a method that you can call on pretty well any file-like object, such as ARGF. It reads the lines of input and passes them, one at a time, to a "block" of following code.
The following do says that the block getting the input stretches from there to the corresponding end, and the |line| asks that the each_line method load each line into the variable line before giving it to the block.
This kind of loop may surprise the eyes of a new convert to Ruby, but it's concise, powerful, and very easy to follow after just a bit of practice.
Line 2
This is a pretty straightforward if statement. The only magic is the =~, which means "matches" and expects to be followed by regular expression. You can tell Ruby that something is a regular expression by putting slashes before and after it—for example, /this-is-a-regex/. But the particular regular expression we want to use is full of slashes. So to use the slash syntax, you'd have to "escape" them by turning each / into \/, which would be ugly. In this case, therefore, the %r trick produces more beautiful code.
Line 3
We're inside the if block now. So, if the current line matches the regexp, the program executes puts line, which prints out the line and a line feed.
Lines 4 and 5
That's about all there is to it. The first end terminates the if, and the second terminates the do. They look kind of silly dangling off the bottom of the code, and the designers of Python have figured out a way to leave them out, which leads to some Python code being more beautiful than the corresponding Ruby.
So far, we've shown how regular expressions can be used to find the lines in the logfile that we're interested in. But what we're really interested in is counting the fetches for each article. The first step is to identify the article names. Example 4-2 is a slight variation on the previous program.
Example 4-2. Printing article names
1 ARGF.each_line do |line|
2 if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^