Complexity_ A Guided Tour - Melanie Mitchell [114]
Scale-Free Networks
I’m sure you have searched the World Wide Web, and you most likely use Google as your search engine. (If you’re reading this long after I wrote it in 2008, perhaps a new search engine has become predominant.) Back in the days of the Web before Google, search engines worked by simply looking up the words in your search query in an index that connected each possible English word to a list of Web pages that contained that word. For example, if your search query was the two words “apple records,” the search engine would give you a list of all the Web pages that included those words, in order of how many times those words appeared close together on the given page. You might be as likely to get a Web page about the historical price of apples in Washington State, or the fastest times recorded in the Great Apple Race in Tasmania, as you would a page about the famous record label formed in 1968 by the Beatles. It was very frustrating in those days to sort through a plethora of irrelevant pages to find the one with the information you were actually looking for.
In the 1990s Google changed all that with a revolutionary idea for presenting the results of a Web search, called “PageRank.” The idea was that the importance (and probable relevance) of a Web page is a function of how many other pages link to it (the number of “in-links”). For example, at the time I write this, the Web page with the American and Western Fruit Grower report about Washington State apple prices in 2008 has 39 in-links. The Web page with information about the Great Apple Race of Tasmania has 47 in-links. The Web page www.beatles.com has about 27,000 in-links. This page is among those presented at the top of the list for the “apple records” search. The other two are way down the list of approximately one million pages (“hits”) listed for this query. The original PageRank algorithm was a very simple idea, but it produced a tremendously improved search engine whereby the most relevant hits for a given query were usually at the top of the list.
If we look at the Web as a network, with nodes being Web pages and links being hyperlinks from one Web page to another, we can see that PageRank works only because this network has a particular structure: as in typical social networks, there are many pages with low degree (relatively few in-links), and a much smaller number of high-degree pages (i.e., relatively many in-links). Moreover, there is a wide variety in the number of in-links among pages, which allows ranking to mean something—to actually differentiate between pages. In other words, the Web has the skewed degree distribution and hub structure described above. It also turns out to have high clustering—different “communities” of Web pages have many mutual links to one another.
In network science terminology, the Web is a scale-free network. This has become one of the most talked-about notions in recent complex systems research, so let’s dig into it a bit, by looking more deeply at the Web’s degree distribution and what it means to be scale-free.
DEGREE DISTRIBUTION OF THE WEB
How can we figure out what the Web’s degree distribution is? There are two kinds of Web links: in-links and out-links. That is, suppose my page has a link to your page but not vice versa: I have an out-link and you have an in-link. One needs to be specific about which kinds of links are counted. The original PageRank algorithm looked only at in-links and ignored out-links—in this discussion I’ll do the same. We’ll call the number of in-links to a page the in-degree of that page.
Now, what is the Web’s in-degree distribution? It’s hard, if not impossible, to count all the pages