# pagerank

pagerank, an Octave code which demonstrates some of the issues involved in automatically ranking web pages.

For discussion, the web can be thought of as an enormous directed graph, comprising nodes (web pages), and directed links (hyperlinks embedded in one page that refer to another page.)

A mathematical model of this situation is the incidence matrix A such that A(I,J) = 1 if page I has a hyperlink to page J.

There is a related matrix, the transition matrix T, formed by taking the transpose of A, and then normalizing each column so that it has sum 1. Then we may suppose that, if one of the hyperlinks on page J is selected at random, then T(I,J) is the probability that selecting a hyperlink on page J will take you to page I.

Now imagine that we start at page I and move to page J in accordance with a randomly selected hyperlink A(I,J). We can repeat this process indefinitely, barring the case where we reach a page with no hyperlinks.

We can model this process mathematically by starting with a vector X that is entirely 0 except for a 1 in position I. Then T*x produces a vector such that (T*x)(j) is the probability that we will have moved to page J in one step. Similarly, T*T*x will give us the probability distribution for our location at step 2, and so on. If we take enough powers of T, then we may hope that we reach a result that does not change significantly, in which case we have discovered an eigenvector of T that represents the limiting distribution of locations assuming we start at page I.

We may also take a Markov chain approach. As a first version, we could start at an arbitrary page, and move repeatedly according to randomly selected hyperlinks. We keep track of the number of times we have visited each page, and when we feel we have explored enough, the counts divided by the total number of moves will give us an estimate of the visit probability.

However, such a process is very vulnerable to being trapped in an isolated corner of the web, or of hitting an inescapable cycle. A much more robust procedure is reached if we agree that a move consists of a 15 percent chance of jumping to a completely arbitrary page, or else of taking a random hyperlink from the current page. This procedure is termed by MacCormick the "random surfer" procedure.

Statistics gathered by the eigenvalue or Markov approach can be used to assert the relative "importance" or rank of each web page.

### Licensing:

The computer code and data files made available on this web page are distributed under the MIT license

### Languages:

pagerank is available in a MATLAB version and an Octave version and a Python version.

### Related Data and Programs:

power_method, an Octave code which carries out the power method for finding a dominant eigenvalue and its eigenvector.

### Reference:

1. John MacCormick,
Nine Algorithms That Changed the Future: The Ingenious Ideas that Drive Today's Computers,
Princeton University Press,
ISBN-13: 978-0691158198.
2. Cleve Moler,
Experiments with Matlab,
Chapter 7: Google PageRank
https://www.mathworks.com/moler/exm/chapters/pagerank.pdf
3. Xindong Wu, Vipin Kumar, J Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey McLachlan, Angus Ng, Bing Lu, Philip Yu, Zhi-Hua Zhou, Michael Steinbach, David Hand, Dan Steinberg,
Top 10 algorithms in data mining,
Knowledge and Information Systems,
Volume 14, Number 1, January 2008, pages 1-37.

### Source Code:

• power_rank.m, carries out the power method on the Google matrix.
• inc_to_google.m, computes the Google matrix from the incidence matrix.
• inc_to_transition.m, computes the transition matrix from the incidence matrix.
• mac1_inc.m, the incidence matrix for a tiny 5 node directed graph, MacCormick's example 1.
• mac2_inc.m, the incidence matrix for a 16 node directed graph, MacCormick's example 2.
• moler_inc.m, the incidence matrix for a 6 node directed graph discussed by Moler.
• power_rank.m, carries out the power method algorithm on the transition matrix.
• r8vec_print.m, prints an R8VEC.
• r8vec_print_transpose.m, prints an R8VEC "transposed".
• r8vec3_print.m, prints 3 R8VEC's as a table.
• sauer_inc.m, the incidence matrix for a 15 node directed graph.
• six_.m, the incidence matrix for a 6 node directed graph.
• surf_rank.m, carries out the "random surfer" algorithm using a mix of the transition matrix and random jumps.