Beautiful Code
Getting real work done by scanning lines of textual input with regular expressions, and using a content-addressable store to build up results, was an approach first popularized by the awk programming language, whose name reflects the surnames of its inventors: Aho, Weinberger, and Kernighan.

This work, of course, was based on the then-radical Unix philosophy—due mostly to Ritchie and Thompson—that data should generally be stored in files in lines of text, and to some extent validated the philosophy.

Larry Wall took the ideas behind awk and, as the author of Perl, turned them into a high-performance, industrial-strength, general-purpose tool that doesn't get the credit it deserves. It has served as the glue holding together the world's Unix systems, and subsequently large parts of the first-generation Web.

Finding Things

4.3. Problem: Who Fetched What, When?

Running a couple of quick scripts over the logfile data reveals that there are 12,600,064 instances of an article fetch coming from 2,345,571 different hosts. Suppose we are interested in who was fetching what, and when; an auditor, a police officer, or a marketing professional might want to know.

So, here's the problem: given a hostname, report what articles were fetched from that host, and when. The result is a list; if the list is empty, no articles were fetched.

We've already seen that a language's built-in hash or equivalent data structure gives the programmer a quick and easy way to store and look up key/value pairs. So, you might ask, why not use it?

That's an excellent question, and we should give the idea a try. There are reasons to worry that it might not work very well, so in the back of our minds, we should be thinking of a Plan B. As you may recall if you've ever studied hash tables, in order to go fast, they need to have a small load factor; in other words, they need to be mostly empty. However, a hash table that holds 2.35 million entries and is still mostly empty is going to require the use of a whole lot of memory.
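Before worrying further, it's worth seeing how little code the built-in approach takes. Here is a minimal Ruby sketch (the hostname and article path are just illustrative sample values, not part of the original program):

```ruby
# A hash whose values are lists of [time, article] pairs; the default
# block makes a missing host read back as an empty list, which matches
# the problem statement: an empty list means no articles were fetched.
fetches = Hash.new { |h, k| h[k] = [] }
fetches["84.7.249.205"] << ["1166406040", "2003/03/27/Scanner"]
p fetches["84.7.249.205"].length   # 1
p fetches["no.such.host"]          # []
```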

To simplify things, I wrote a program that ran over all the logfiles and pulled out all the article fetches into a simple file; each line has the hostname, the time of the transaction, and the article name. Here are the first few lines:

crawl-66-249-72-77.googlebot.com 1166406026 2003/04/08/Riffs

egspd42470.ask.com 1166406027 2006/05/03/MARS-T-Shirt

84.7.249.205 1166406040 2003/03/27/Scanner

(The second field, the 10-digit number, is the standard Unix/Linux representation of time as the number of seconds since the beginning of 1970.)
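As a quick sanity check (not part of the original program), Ruby's Time.at turns such a timestamp back into a readable date; here, decoding the second field of the first sample line:

```ruby
# Decode an epoch-seconds timestamp into a human-readable UTC time.
t = Time.at(1166406026).utc
puts t   # 2006-12-18 01:40:26 UTC
```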

Then I wrote a simple program to read this file and load a great big hash. Example 4-5 shows the program.

Example 4-5. Loading a big hash

 1 class BigHash
 2
 3   def initialize(file)
 4     @hash = {}
 5     lines = 0
 6     File.open(file).each_line do |line|
 7       s = line.split
 8       article = s[2].intern
 9       if @hash[s[0]]
10         @hash[s[0]] << [ s[1], article ]
11       else
12         @hash[s[0]] = [ [ s[1], article ] ]
13       end
14       lines += 1
15       STDERR.puts "Line: #{lines}" if (lines % 100000) == 0
16     end
17   end
18
19   def find(key)
20     @hash[key]
21   end
22
23 end

The program should be fairly self-explanatory, but line 15 is worth a note. When you're running a big program that's going to take a lot of time, it's very disturbing when it works away silently, maybe for hours. What if something's wrong? What if it's going incredibly slow and will never finish? So, line 15 prints out a progress report after every 100,000 lines of input, which is reassuring.
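Before committing an hour of CPU time, a smoke test over just the three sample lines from earlier can confirm the class behaves as intended (the class body is repeated here, minus the progress report, so the snippet runs on its own; the temporary file stands in for the real fetch data):

```ruby
require "tempfile"

# Repeat of Example 4-5's class, without the progress counter, so this
# snippet is self-contained.
class BigHash
  def initialize(file)
    @hash = {}
    File.open(file).each_line do |line|
      s = line.split
      article = s[2].intern
      if @hash[s[0]]
        @hash[s[0]] << [ s[1], article ]
      else
        @hash[s[0]] = [ [ s[1], article ] ]
      end
    end
  end

  def find(key)
    @hash[key]
  end
end

# Write the three sample lines to a temporary fetch file.
f = Tempfile.new("fetches")
f.puts "crawl-66-249-72-77.googlebot.com 1166406026 2003/04/08/Riffs"
f.puts "egspd42470.ask.com 1166406027 2006/05/03/MARS-T-Shirt"
f.puts "84.7.249.205 1166406040 2003/03/27/Scanner"
f.close

big = BigHash.new(f.path)
p big.find("84.7.249.205")   # [["1166406040", :"2003/03/27/Scanner"]]
p big.find("no.such.host")   # nil
```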

Running this program was interesting. It took about 55 minutes of CPU time to load up the hash, and the program grew to occupy 1.56 GB of memory. A little calculation suggests that it costs around 680 bytes to store the information for each host, or slicing the data another way, about 126 bytes per fetch. This is a little scary, but probably reasonable for a hash table.

Retrieval performance was excellent. I ran 2,000 queries, half of which were randomly selected hosts from the log and thus succeeded,

