Data Mining - Mehmed Kantardzic [178]
PageRank is based on the random surfer model. If a surfer were to randomly select a starting Web page, and at each time step the surfer were to randomly select a link on the current Web page, then PageRank could be seen as the probability that this random surfer is on any given page. Some Web pages do not contain any hyperlinks. In this model it is assumed that the random surfer selects a random Web page when exiting pages with no hyperlinks. Additionally, there is some chance that the random surfer will stop following links and restart the process.
Computationally the PageRank (Pr) of page u can be computed as follows:
Here d is a dampening factor usually set to 0.85. N refers to the total number of nodes in the graph. The function In(u) returns the set of nodes with edges pointing into node u. |Out(v)| returns the number of nodes with edges pointing from v to that node. For example, if the Web-page connections are those given in Figure 11.4, and the current node under consideration were node B, then the following values would hold through all iterations: N = 3, In(B) = {A,C}, |Out(A)| = |{B,C}| = 2, and |Out(C)| = |{B}| = 1. The values for Pr(A) and Pr(C) would vary depending on the calculations from the previous iterations. The result is a recursive definition of PageRank. To calculate the PageRank of a given node, one must calculate the PageRank of all nodes with edges pointing into that given node.
Figure 11.4. First example used to demonstrate PageRank.
Often PageRank is calculated using an iterative approach where all nodes are given an initial value for Pr of 1/N. Then during a single iteration we calculate what the PageRank of each node would be according to the current values of all nodes linking to that node. This process is repeated until the change between iterations is below some predetermined threshold or the maximum number of iterations is achieved. Let us consider an example graph with three nodes as follows:
Initially the Pr values are as follows:
The first iteration is as follows:
The second iteration then shows the passing of these values through the graph:
If we carry this same procedure out to 100 iterations, we achieve the following results:
Additional iterations produce the same results. From this we can see a stable ordering emerge. B has the largest PageRank value having two in-links, more than any other. However, page A is not far behind, since page B has only a single link to page A without dividing its PageRank value among a number of outbound links.
Next we consider the example used for the HITS algorithm, which is applied on the graph in Figure 11.5a. The PageRank of this graph with a dampening factor of 0.85 is given in Figure 11.5b after running 100 iterations. A reader may, for practice, check the results after the first iteration, or implement the PageRank algorithm and check final results in Figure 11.5b.
Figure 11.5. An example graph and scoring for PageRank also used with HITS. (a) Subgraph of linked pages; (b) calculated PageRank for given graph.
From the values given in Figure 11.5b we can see that node 5 has the highest PageRank by far and also has the highest in-degree or edges pointing in to it. Surprising perhaps is that node 3 has the next highest score, having a score higher than node 4, which has more in-edges. The reason is that node 6 has only one out-edge pointing to node 3, while the edges pointing to node 4 are each one of multiple out-edges. Lastly, as expected, the lowest ranked edges are those with no in-edges, nodes 1 and 2.
One of the main contributions of Google’s founders is implementation and experimental evaluation of the PageRank algorithm. They included a database of web sites with 161 million links, and the algorithm converge in 45 iterations. Repeated experiments