pagerank, a Python code which demonstrates some of the issues involved in automatically ranking web pages.
For discussion, the web can be thought of as an enormous directed graph, whose nodes are web pages and whose directed links are hyperlinks, each embedded in one page and referring to another page.
A mathematical model of this situation is the incidence matrix A (more commonly called the adjacency matrix of the graph), such that A(I,J) = 1 if page I has a hyperlink to page J, and A(I,J) = 0 otherwise.
There is a related matrix, the transition matrix T, formed by taking the transpose of A, and then normalizing each column so that it has sum 1. Then T(I,J) is the probability that a hyperlink selected uniformly at random on page J will take you to page I. Note that a page with no hyperlinks produces a column of zeros, which cannot be normalized in this way.
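The construction of T from A can be sketched in a few lines of NumPy. This is only an illustration, not the code distributed here: the function name transition_matrix and the 3-page example graph are made up for the purpose, and columns belonging to pages with no hyperlinks are simply left as zeros, which is one possible convention among several.

```python
import numpy as np

def transition_matrix(A):
    """Transpose A, then scale each column to sum to 1.

    Columns for pages with no hyperlinks are left as zeros
    (an assumption of this sketch, not a rule from the text).
    """
    T = np.asarray(A, dtype=float).T.copy()
    col_sums = T.sum(axis=0)
    has_links = col_sums > 0
    T[:, has_links] /= col_sums[has_links]
    return T

# Hypothetical 3-page web: A[i, j] = 1 if page i links to page j.
A = np.array([
    [0, 1, 1],   # page 0 links to pages 1 and 2
    [0, 0, 1],   # page 1 links to page 2
    [1, 0, 0],   # page 2 links to page 0
])

T = transition_matrix(A)
print(T)
```

Here column 0 of T holds the value 0.5 in rows 1 and 2, since page 0 has two hyperlinks and each is equally likely to be chosen.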
Now imagine that we start at page I and move to page J by following a hyperlink selected at random from those on page I, that is, from the nonzero entries A(I,J) in row I. We can repeat this process indefinitely, barring the case where we reach a page with no hyperlinks.
We can model this process mathematically by starting with a vector x that is entirely 0 except for a 1 in position I. Then T*x is a vector such that (T*x)(J) is the probability that we will have moved to page J in one step. Similarly, T*T*x gives the probability distribution for our location at step 2, and so on. If we apply T enough times, then we may hope to reach a vector that no longer changes significantly, in which case we have found an eigenvector of T, with eigenvalue 1, that represents the limiting distribution of locations, assuming we start at page I.
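The repeated multiplication by T can be sketched as follows. This is a minimal illustration under stated assumptions: the function name power_iterate, the tolerance, the step limit, and the 3-page transition matrix are all hypothetical choices, and convergence is not guaranteed for every web graph.

```python
import numpy as np

# Hypothetical transition matrix for a 3-page web in which
# page 0 links to pages 1 and 2, page 1 to page 2, page 2 to page 0.
T = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

def power_iterate(T, start_page, tol=1e-10, max_steps=1000):
    """Apply T repeatedly to a unit vector until it stops changing."""
    x = np.zeros(T.shape[0])
    x[start_page] = 1.0
    for step in range(1, max_steps + 1):
        x_new = T @ x
        # Stop when the distribution changes by less than tol (1-norm).
        if np.abs(x_new - x).sum() < tol:
            return x_new, step
        x = x_new
    return x, max_steps

x, steps = power_iterate(T, start_page=0)
print(x, steps)
```

For this small example the limiting vector satisfies T*x = x, so it is an eigenvector of T with eigenvalue 1. A graph containing a pure cycle or a page with no hyperlinks may fail to settle down, which is why the step limit is there.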
We may also take a Markov chain approach. As a first version, we could start at an arbitrary page, and move repeatedly according to randomly selected hyperlinks. We keep track of the number of times we have visited each page, and when we feel we have explored enough, the counts divided by the total number of moves will give us an estimate of the visit probability of each page.
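This first version of the simulation might look like the sketch below. The function name random_walk_estimate, the dictionary representation of the hyperlinks, the fixed seed, and the example graph are assumptions made for illustration; the walk simply stops if it reaches a page with no hyperlinks.

```python
import random

# Hypothetical web: links[p] lists the pages that page p links to.
links = {0: [1, 2], 1: [2], 2: [0]}

def random_walk_estimate(links, start, num_moves, seed=0):
    """Estimate visit probabilities by following random hyperlinks."""
    rng = random.Random(seed)
    counts = {page: 0 for page in links}
    page = start
    moves_made = 0
    for _ in range(num_moves):
        if not links[page]:
            break  # no hyperlinks to follow, so the walk is stuck
        page = rng.choice(links[page])
        counts[page] += 1
        moves_made += 1
    if moves_made == 0:
        return {p: 0.0 for p in counts}
    # Visit counts divided by the total number of moves.
    return {p: c / moves_made for p, c in counts.items()}

estimate = random_walk_estimate(links, start=0, num_moves=100000)
print(estimate)
```

The estimates are random, so they vary with the seed and the number of moves. For this graph they should lie near 0.4, 0.2 and 0.4, the limiting distribution of the corresponding transition matrix.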
However, such a process is very vulnerable to being trapped in an isolated corner of the web, or to entering an inescapable cycle. A much more robust procedure results if we agree that each move is, with a 15 percent chance, a jump to a completely arbitrary page, and is otherwise a random hyperlink taken from the current page. This procedure is termed by MacCormick the "random surfer" procedure.
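A sketch of the random surfer procedure follows. Several details are assumptions of this illustration rather than statements from the text: the function name random_surfer_estimate, the choice to let a jump land on any page including the current one, and the rule that a page with no hyperlinks always forces a jump. The example graph is hypothetical, and is chosen so that pages 1 and 2 form a cycle that plain link-following could never leave.

```python
import random

# Hypothetical web: pages 1 and 2 link only to each other,
# and page 3 has no hyperlinks at all.
links = {0: [1], 1: [2], 2: [1], 3: []}

def random_surfer_estimate(links, num_moves, jump_prob=0.15, seed=0):
    """Estimate visit probabilities with occasional random jumps."""
    rng = random.Random(seed)
    pages = list(links)
    counts = {page: 0 for page in pages}
    page = rng.choice(pages)
    for _ in range(num_moves):
        # Jump with probability jump_prob, or when there is no
        # hyperlink to follow (an assumption of this sketch).
        if rng.random() < jump_prob or not links[page]:
            page = rng.choice(pages)
        else:
            page = rng.choice(links[page])
        counts[page] += 1
    return {p: c / num_moves for p, c in counts.items()}

estimate = random_surfer_estimate(links, num_moves=100000)
print(estimate)
```

Because of the jumps, every page receives some visits, even page 3, which no hyperlink points to. Pages 1 and 2 still collect most of the visits, as the cycle between them captures the surfer until the next jump.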
Statistics gathered by the eigenvector or Markov approach can be used to assess the relative "importance" or rank of each web page.
The computer code and data files made available on this web page are distributed under the GNU LGPL license.
pagerank is available in a MATLAB version and a Python version.