Chapter 5 — PageRank, a way to estimate the importance of web pages
When a researcher wants to publish a paper in a specialized journal, he must go through review committees that evaluate the quality of his work according to precise criteria. To appear on television or in a newspaper, all one needs is a good story to tell.
Surfing the internet is like sex: everyone brags about doing more than they actually do. But in the case of the Internet, they brag far more.
Omnes Viae ad Googlem ducent (All roads lead to Google)
This chapter will attempt to lay the intuitive, axiomatic, and theoretical foundations of PageRank-type ranking methods, brought to prominence by the famous Google search engine [Goo98]. This painstaking survey work was greatly facilitated by Mohamed Bouklit, who, with the help of Alain Jean-Marie, cleared the ground before me [Bou01, BJ02], providing me with all the necessary references.
There already exist a number of surveys on PageRank (cf [BGS02, BGS03, LM04]), but we deemed this chapter necessary because it presents all PageRanks from a single perspective, that of the stochastic interpretation, and highlights several convergence results that cannot be found elsewhere.
5.1 A needle in a haystack…
One can find (just about) anything on the web; virtually all network users agree on this. Whether the information I am looking for exists on the web is no longer the crucial question. Nowadays, the problem has become: How do I find the page I am looking for?
- By knowing, one way or another, the address in question. For instance, by reading on a bag of flour the address of a site offering cooking recipes, by receiving an email from a friend recommending a web address, or by having the address in one’s Bookmarks…
- By navigating (surfing) the web, starting from a known address. If, for example, I am looking for a page about raccoons and I know of a zoology website, starting from that site may be a good way to find the page I am looking for.
- By using a search engine. In exchange for a query, that is, a set of words attempting to describe more or less precisely what I am looking for, the engine will return a list of pages likely to answer this query. According to [LG99], 85% of internet users rely on search engines.
Table 5.1: Number of answers returned by search engines for a few queries.

| Query | Yahoo | |
| --- | --- | --- |
| PageRank | 1 410 000 | 807 000 |
| Raton laveur | 18 100 | 25 300 |
| Amazon | 107 000 000 | 66 600 000 |
| Pâte à crêpes | 46 300 | 30 900 |
| Pâte à crêpe | 22 800 | 47 000 |
The whole challenge for search engines is to return the pages the user is actually looking for. However, the answers to a given query often number in the hundreds or even thousands, as shown in Table 5.1. On the other hand, users quickly lose patience, and it is estimated that 90% of users do not go beyond the first page of results.
The goal of search engines is therefore to display within the first ten to twenty results the documents that best answer the question posed. In practice, no sorting method is perfect, but their variety gives search engines the possibility of combining them to better refine their results. The main sorting methods are the following:
- Relevance ranking: relevance ranking is the oldest and most widely used sorting method. It is based on the number of occurrences of the search terms in the pages, their proximity, their position in the text [Sal89, YL95]… Unfortunately, this method has the drawback of being easy to manipulate by authors wishing to place their pages at the top of the list. To do so, it suffices to overload the page with important words, either in the header or in invisible text within the body of the page.
- URL analysis: one can also assign importance to a page based on its URL. For instance, depending on the context, URLs belonging to the Top Level Domain ".com" might be more important than others, as might URLs containing the string "home" [Li+00]. It has also been suggested that URLs at a low depth in the cluster tree are more important than others [CGP98].
- Inbound links: citation counting consists of assigning pages an importance proportional to the number of known links to that page. This method has been widely used in scientometrics to evaluate the importance of publications [Pri63].
- PageRank: as we shall see, PageRank is a kind of recursive generalization of citation counting.
5.2 The two axioms of PageRank
PageRank, introduced by Brin et al. in 1998 [Pag+98], is the ranking method that made the Google search engine [Goo98] distinctive. It is in fact an adaptation to the Web of various methods introduced by scientometricians since the 1950s. Two scientometric methods in particular should be mentioned:
- Citation counting: in 1963, Price published Little Science, Big Science [Pri63]. In this book, the first major work dealing with what would later become scientometrics, he proposed measuring the quality of scientific production through, among other things, a very simple technique, citation counting: one way to measure the quality of a publication is to count the number of times that publication is cited.
- Markovian modeling: Markov chains, described in Chapter 4, not only allow one to play Monopoly (registered trademark), but also to model the evolution of a population distributed among several states, provided one can estimate the transition probabilities between states. For example, Goffman proposed in 1971 the study of the evolution of research in various subfields of logic using Markov chains [Gof71].
Let us now see how Brin et al. adapted these concepts to estimating the importance of Web pages.
5.2.1 An important page is pointed to by important pages
Transposed directly to Web graphs, the citation counting method amounts to saying that the importance of a page is proportional to its in-degree, that is, to the number of pages that cite it through a hyperlink:

$$I(i) = \left|\{\, j \in V : j \to i \,\}\right|, \qquad (5.1)$$

where $j \to i$ means "$j$ points to $i$".
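To make citation counting concrete, here is a minimal sketch on an invented four-page graph (the page names and link structure are hypothetical, chosen only for illustration):

```python
# Citation counting: the importance of a page is its in-degree.
def citation_count(links):
    """links: dict mapping each page to the list of pages it points to."""
    importance = {p: 0 for p in links}
    for src, targets in links.items():
        for tgt in targets:
            if tgt != src:              # links from a page to itself are not counted
                importance[tgt] += 1
    return importance

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(citation_count(links))            # {'a': 1, 'b': 1, 'c': 3, 'd': 0}
```

Page "c", cited by three distinct pages, comes out on top; nothing, however, prevents an author from creating further pages whose only purpose is to cite a page of his choosing, which is precisely the weakness discussed next.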
Although this measure can indeed be used to estimate the importance of pages [CGP98], it is partly undermined by the absence of quality control. Indeed, when a researcher wants to publish a paper in a specialized journal, he must go through review committees that evaluate the quality of his work according to precise criteria. Because of this, the mere fact of being published gives a minimum level of importance to the articles in question, and there is some guarantee that the citations a paper receives are not entirely spurious. In the case of the Web, this safeguard does not exist: because of the low intellectual and material cost of a web page, anyone can link to anything without it necessarily having any real meaning. For instance, why not create a multitude of meaningless pages that cite me through hyperlinks, in order to inflate my own importance at will?
Brin et al. propose to counter this problem with a recursive description of importance: What is an important page? It is a page pointed to by important pages. Concretely, if a page $i$ is pointed to by pages $j_1, \ldots, j_k$, the importance of $i$ should be defined by:

$$I(i) = \sum_{\ell=1}^{k} I(j_\ell). \qquad (5.2)$$

It remains to define and solve Equation (5.2).
5.2.2 The random surfer: chance does things well
A romanticized view of the Web is the principle of hyperlink navigation: to find what he is looking for, the user navigates from page to page and click to click until reaching his destination. Of course, this is not what happens in practice, for both technical reasons (it is not always possible to reach a given page from another given page) and social reasons (use of search engines, user fatigue…).
Brin et al. had the idea of modeling the behavior of the clicking user by a Markov chain. All that was needed was to find the transition probabilities from one page to another. One of the simplest ways of looking at things is to consider that once on a given page, the user will click uniformly at random on one of the links contained in that page:

$$p_{i,j} = \begin{cases} \dfrac{1}{\deg^+(i)} & \text{if } i \to j,\\ 0 & \text{otherwise,} \end{cases} \qquad (5.3)$$

where $\deg^+(i)$ is the out-degree of page $i$.
This is the basis of the random surfer model. For Brin et al., given that Web graphs reflect a deliberate and thoughtful architecture, interesting pages should be structurally easy to access, just as a city is all the more accessible by the road network the more important it is. Therefore, since the random surfer lets himself be guided by the hyperlink network, statistically, he should land on a page all the more frequently the more important it is. Hence the idea of defining the importance of a Web page by the asymptotic probability of being on that page in the random surfer model.
5.2.3 Consistency of the two interpretations
Now that we have defined two ways of considering the importance of a page, let us observe that they coincide and describe the same phenomenon: indeed, in the random surfer model, it is possible to estimate the asymptotic probability $P(i)$ of being on a page $i$ as a function of those of the pages that point to $i$:

$$P(i) = \sum_{j \to i} \frac{P(j)}{\deg^+(j)}. \qquad (5.4)$$
One can see that we indeed have an importance transfer relation like the one defined by Equation (5.2), and that Equation (5.4) additionally obeys a conservation principle: a given page transmits all of its importance, which is equally distributed among the different pages it points to. Probability, viewed as importance, is thus transmitted through hyperlinks in the manner of a flow.
5.3 The classical models
Although PageRank is often referred to in the singular, there actually exists a multitude of PageRank(s). We shall see here how, starting from the theoretical random surfer model that we have just described, several variations have been introduced in order to adapt to the reality of Web graphs. This survey work has already been carried out for all PageRank(s) arising from explicitly stochastic processes [BGS02, BGS03, LM04], but we also wish to take into account sub-stochastic models here.
5.3.1 Ideal case
In the case where the Web graph that we wish to study is aperiodic and strongly connected, the principles seen in Section 5.2 apply directly: indeed, the task is to find a probability distribution $P$ on $V$ satisfying:

$$P(i) = \sum_{j \to i} \frac{P(j)}{\deg^+(j)}. \qquad (5.5)$$

This amounts to finding the asymptotic distribution of the homogeneous Markov chain whose transition matrix is:

$$A = (a_{i,j}), \qquad a_{i,j} = \begin{cases} \dfrac{1}{\deg^+(i)} & \text{if } i \to j,\\ 0 & \text{otherwise.} \end{cases}$$

As seen in Section 4.3, the distribution sequence

$$P_{k+1} = P_k A,$$

initiated by an arbitrary probability distribution $P_0$, converges geometrically to the unique distribution $P$ satisfying

$$P = PA,$$

that is, satisfying relation Equation (5.5).
This probability distribution is called the PageRank.
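On a small strongly connected and aperiodic graph, this fixed point can be approximated by directly iterating the distribution sequence. The following sketch uses an invented three-page graph:

```python
# Power iteration P_{k+1} = P_k A, with A stored implicitly as adjacency lists.
def pagerank_ideal(links, iters=200):
    pages = sorted(links)
    p = {q: 1.0 / len(pages) for q in pages}    # arbitrary initial distribution
    for _ in range(iters):
        nxt = {q: 0.0 for q in pages}
        for src, targets in links.items():
            share = p[src] / len(targets)       # uniform click on outgoing links
            for tgt in targets:
                nxt[tgt] += share
        p = nxt
    return p

# Strongly connected and aperiodic: the cycles a->b->c->a and a->c->a coexist.
pr = pagerank_ideal({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(pr)   # limit: a = 0.4, b = 0.2, c = 0.4
```

Here the limit can be checked by hand: solving $P = PA$ gives $P(a) = P(c) = 0.4$ and $P(b) = 0.2$.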
Remark 5.1.
As specified in our definition of a Web graph, links from a page to itself are not counted. According to [Pag+98], this allows the PageRank computation to be “smoothed out.”
Remark 5.2.
If the graph is not strongly connected, by Theorem 4.3, convergence still occurs, but there is neither uniqueness (the dimension of the solution space equals the number of recurrent strongly connected components) nor a guarantee that the support of the solution equals $V$ (in particular if transient components exist). The existence of periodicity, on the other hand, may prevent convergence, but Theorem 4.4 indicates how the problem can be circumvented.
5.3.2 Simple renormalization
Most of the time, a Web graph is not strongly connected. In particular, there exists a non-negligible number of dangling pages, which are either pages that actually contain no links, or simply pages that are known but not indexed. The rows of $A$ corresponding to these dangling pages therefore contain only $0$s, and $A$ is thus strictly sub-stochastic. Consequently, the sequence of $P_k$ will converge to a vector that is zero outside any recurrent strongly connected components on which $A$ is stochastic. To avoid this problem, one could consider recursively removing all dangling pages until a stochastic matrix is obtained, with the drawback of considering a smaller graph than the original. Another approach, proposed by [Pag+98], consists in renormalizing at each iteration:

$$P_{k+1} = \frac{P_k A}{\|P_k A\|_1}.$$
This iterative procedure is a power method ([Ste94]), so it is known to converge to an eigenvector associated with the largest eigenvalue of $A$. Two cases must then be considered:
- If the matrix $A$ is sub-irreducible, then its maximum eigenvalue is strictly less than $1$. By Theorem 4.6, the associated eigenspace has for dimension the number of pseudo-recurrent components, and its support is that generated by the union of the filters of the pseudo-recurrent components. In the particular case where $A$ is pseudo-irreducible, the eigenspace is a line, and there exists an associated strictly positive eigenvector: thanks to renormalization, everything proceeds as in the case of an irreducible stochastic matrix.
- If the matrix $A$ is not sub-irreducible, the graph contains at least one recurrent strongly connected component on which $A$ is stochastic. The maximum eigenvalue of $A$ is therefore $1$, which means that renormalization will not change the initial result: the eigenvector will be a linear combination of the eigenvectors on the different strongly connected components on which $A$ is stochastic.
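The renormalized iteration of Section 5.3.2 can be sketched as follows, on an invented graph with one dangling page (chosen so that the dominant eigenvector is unique and strictly dominant):

```python
# Power method with L1 renormalization for a sub-stochastic A
# (page "d" is dangling: it has no outgoing links, so its row is zero).
def pagerank_renormalized(links, pages, iters=500):
    p = {q: 1.0 / len(pages) for q in pages}
    for _ in range(iters):
        nxt = {q: 0.0 for q in pages}
        for src in pages:
            for tgt in links.get(src, []):
                nxt[tgt] += p[src] / len(links[src])
        norm = sum(nxt.values())        # < 1: probability mass dies on "d"
        p = {q: v / norm for q, v in nxt.items()}
    return p

pages = ["a", "b", "c", "d"]
links = {"a": ["b", "c", "d"], "b": ["c"], "c": ["a"]}
pr = pagerank_renormalized(links, pages)
print(pr)
```

Note that the dangling page "d" keeps a non-zero value: renormalization prevents the whole probability mass from vanishing through it.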
Equivalent stochastic process
The use of sub-stochastic matrices means that we lose the natural interpretation of the random surfer. However, it is sometimes possible to complete the matrix into an equivalent stochastic matrix, that is, to find a stochastic matrix $\bar{A}$ satisfying:

- $\bar{A} \geq A$;
- if $PA = \lambda P$, with $\lambda$ maximal, then $P\bar{A} = P$.

If there exists a unique maximal probability vector $P$ for $A$, the simplest way to define such a matrix is to consider the stochastic defect of $A$: if $A$ is sub-stochastic, the stochastic defect of $A$ is defined as the vector $c = \mathbf{1} - A\mathbf{1}$. The matrix

$$\bar{A} = A + cP$$

is indeed a stochastic matrix greater than $A$ (one checks that $\bar{A}\mathbf{1} = A\mathbf{1} + c = \mathbf{1}$), and it is easy to see that $P\bar{A} = PA + (Pc)P = \lambda P + (1-\lambda)P = P$: the stationary distribution of $\bar{A}$ is a probability distribution homogeneous to $P$, hence equal to it.
The matrix $\bar{A}$ has the advantage of giving a stochastic interpretation to the simple renormalization procedure: asymptotically, the random surfer follows at each step a link according to the probabilities given by $A$. When he does not know what to do, he zaps to another page according to the probability distribution $P$ that is the maximal eigenvector of $A$.
On the other hand, as soon as the maximal eigenspace has dimension greater than $1$, the problem is delicate, even very delicate in the case of pseudo-recurrent components whose filters have a non-zero intersection. We will therefore limit our study of equivalent stochastic processes to cases where the maximal eigenvector is unique.
5.3.3 Stochastic completion
In order to replace $A$ with a stochastic matrix, and to avoid establishing an uncontrolled equivalent stochastic process through simple renormalization, a natural idea is to add transitions to dangling pages. One possible method consists in modeling the use of the Back button, and will be the subject of Chapter 6. Another method, proposed initially by [Pag+98] (in the form of Algorithm 5.1) and studied among others by [LM04], consists in defining a default probability distribution $Z$ on $V$, and using it to model the behavior of the random surfer when he arrives on a dangling page. Concretely, each zero row in $A$ (which therefore corresponds to a dangling page) is replaced by the row $Z$.

This procedure can be generalized to complete any sub-stochastic matrix: the completion of $A$ by $Z$ is then the stochastic matrix

$$\hat{A} = A + cZ,$$

where $c = \mathbf{1} - A\mathbf{1}$ is the stochastic defect of $A$.
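The completion operation itself is a one-liner per row; a sketch on a toy matrix with a uniform default distribution:

```python
# Stochastic completion: redistribute each row's missing mass (its stochastic
# defect) according to a default distribution Z.
def complete(A, Z):
    """A: sub-stochastic matrix given as a list of rows; Z: probability vector."""
    completed = []
    for row in A:
        defect = 1.0 - sum(row)                  # 0 for already-stochastic rows
        completed.append([a + defect * z for a, z in zip(row, Z)])
    return completed

A = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]        # third page is dangling: zero row
Z = [1/3, 1/3, 1/3]
A_hat = complete(A, Z)
print(A_hat)                 # the dangling row becomes Z; other rows are unchanged
```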
Choice of $Z$ and interpretation

$Z$ represents the behavior of the random surfer when $A$ does not specify where he should go, that is, all page changes that are not due to the use of hyperlinks (manually typed address, Bookmarks, search engine query…). Generally, the uniform distribution is chosen for $Z$:

$$Z = \left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right),$$

which represents a zap anywhere at random on the known Web.
It has also been proposed to "personalize" $Z$, notably by [Pag+98]. For instance, it is possible to restrict the support of $Z$ to the home pages of sites. On the one hand, this avoids implicitly giving sites a partial importance measure proportional to the number of crawled pages (see Section 7.5). On the other hand, this obeys a certain natural intuition: when one breaks hyperlink navigation to zap to something else, it is likely that one will start from a home page. This intuition is confirmed by numerous studies that demonstrate the existence of pages that play the role of "hubs" for actual users [CP95, Kle98, Mil+04, WM04].
Irreducibility of $\hat{A}$
A probability distribution $Z$ is said to be covering if every page of $V$ is accessible, in the graph induced by $A$, from the support $S$ of $Z$. This is a natural condition to impose on any zap distribution, since it guarantees that all known pages are potentially accessible after a zap. The uniform distribution is obviously covering. The same holds for the distribution on home pages if all known pages of a site are accessible from the home page. It is also the case for the importance distribution if, and only if, all pages in $V$ have a non-zero in-degree.
Theorem 5.1.
Let $Z$ be a covering probability distribution, and $A$ a strictly sub-stochastic matrix. The completion $\hat{A}$ of $A$ by $Z$ is irreducible if, and only if, $A$ is sub-irreducible.
Proof.
If $A$ is sub-irreducible, then from any page $i$ of $V$, it is possible to access a page with a non-zero stochastic defect. Indeed, let $A'$ be an irreducible stochastic matrix such that $A' \geq A$, let $p$ be a page with non-zero stochastic defect, and consider a path connecting $i$ to $p$ in the graph induced by $A'$. Either this path exists in $A$, and we have what we need, or it does not exist, which implies that at least one of the pages on the path, reachable from $i$ in $A$, has a stochastic defect.

$\hat{A}$ is therefore irreducible, since any pair of pages is connected in the induced graph by at least one path passing through the support of $Z$: from any page one reaches a page with stochastic defect, from which the completion leads into the support of $Z$, and $Z$ being covering, every page is accessible from there.

Conversely, if $A$ is not sub-irreducible, there exists at least one stochastic strongly connected component strictly smaller than $V$. This component remains unchanged in $\hat{A}$, which proves that $\hat{A}$ is not irreducible.
□
5.3.4 Rank source: zap factor
In order to resolve the irreducibility problem, Brin et al. [Pag+98] propose another incorporation of the zap distribution $Z$. They replace $A$ by $A + \mathbf{1}Z$, and seek to solve:

$$P(A + \mathbf{1}Z) = \lambda P, \qquad \lambda \text{ maximal}, \quad \|P\|_1 = 1. \qquad (5.13)$$

If $Z > 0$, one then has the guarantee of operating on a positive, irreducible, and aperiodic matrix, since the underlying graph is a clique. The Perron-Frobenius theorem therefore ensures convergence of the iterative process to the eigenvector associated with the maximal eigenvalue. However, unless $Z = 0$, the new matrix is neither stochastic nor even sub-stochastic, which makes the interpretation in terms of a random surfer problematic. In order to more easily give an interpretation to the iterative process, Brin and Page normalize the matrix by taking the weighted average by $d$ of $A$ and $\mathbf{1}Z$:

$$A_d = dA + (1-d)\mathbf{1}Z.$$

The resulting matrix $A_d$ preserves the (sub-)stochasticity of $A$, and it is irreducible and aperiodic. If $A$ is stochastic, $A_1 = A$ corresponds to the ideal case of Section 5.3.1. The maximal eigenvector, which we shall call PageRank with zap factor, is then obtained by Algorithm 5.2. This vector represents the last academic version of PageRank presented by Brin and Page before Google's ranking methods became an industrial secret. It is what is generally referred to when one speaks of PageRank.
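The iteration behind Algorithm 5.2 can be sketched in a few lines (invented stochastic three-page matrix, uniform zap, d = 0.85):

```python
# PageRank with zap factor: iterate P_{k+1} = d * P_k A + (1-d) * Z.
# Valid as written when A is stochastic (then ||P_k||_1 stays equal to 1).
def pagerank_zap(A, Z, d=0.85, iters=100):
    n = len(A)
    p = Z[:]                                    # start from the zap distribution
    for _ in range(iters):
        nxt = [(1 - d) * Z[j] for j in range(n)]
        for i in range(n):
            for j in range(n):
                nxt[j] += d * p[i] * A[i][j]    # follow a hyperlink with prob. d
        p = nxt
    return p

A = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
Z = [1/3, 1/3, 1/3]
pr = pagerank_zap(A, Z)
print(pr)    # limit: (18/37, 19/74, 19/74)
```

On this toy example the fixed point can be solved exactly: the first page gets $18/37 \approx 0.486$ of the importance.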
Remark 5.3.
As an aside, let us note in the founding PageRank paper ([Pag+98]) a slight confusion: Algorithm 5.1 is proposed to solve Equation (5.13). Although it is specified that the introduction of $Z$ may have a slight impact on the result, the qualifier "slight" may be an understatement: in Equation (5.13), $Z$ ensures an aperiodic irreducible matrix. One therefore has the guarantee of a unique strictly positive eigenvector. In contrast, in Algorithm 5.1, one implicitly works on the completion of $A$ by $Z$, and we have just seen that the influence of $Z$ is negligible as soon as $A$ is not sub-irreducible (Theorem 5.1). Fortunately, the confusion was resolved in subsequent articles, notably with the normalization of the zap factor [BP98].
Interpretation
When $A$ is stochastic, the dual interpretation of $A_d$ is fairly straightforward:
- Importance transfer: with the zap factor, pages transmit only a fraction $d$ of their importance. In return, each page $i$ is assigned a Minimum Insertion PageRank (MIPR) equal to $(1-d)Z(i)$.
- Random surfer: at each step of the stochastic process described by $A_d$, the random surfer will click at random on one of the outgoing links, with probability $d$, or zap with probability $1-d$ somewhere on the graph according to the distribution $Z$.
5.3.5 Choice of $d$
Before going further, it is appropriate to discuss the choice of the parameter $d$ and the reasons that led to this choice. To begin with, let us state a universal and unalterable empirical reality: $d$ equals $0.85$, give or take a little. Since the beginnings of PageRank, $0.85$ has always been the reference value, and to my knowledge, practical PageRank computations following the rank source model seen in Section 5.3.4 always use a $d$ between $0.8$ and $0.9$.
Convergence/graph alteration tradeoff
By Theorem 5.2, if $A$ is stochastic, which can be assumed by performing a completion if necessary, the eigenvalues of $A_d$ other than $1$ are less than $d$ in absolute value. This guarantees Algorithms 5.2 and 5.3 a geometric convergence with ratio at most $d$. One therefore has an interest in choosing $d$ as small as possible… except that the smaller $d$ is, the greater the influence of the zap, which is a component external to the intrinsic Web graph. A small $d$ alters, or even distorts, the underlying graph. Choosing the largest $d$ that guarantees reasonable convergence therefore seems like a good tradeoff. Now, technical limitations mean that the number of iterations achievable by a search engine like Google is on the order of a hundred. $d = 0.85$ offers a precision of $10^{-6}$ after roughly $85$ iterations and of $10^{-9}$ after roughly $128$ iterations, and therefore appears to be heuristically the desired tradeoff. Indeed, as we shall see in Section 5.5, $10^{-6}$ corresponds to the differentiation threshold for a Web graph of one million pages, while $10^{-9}$ is the differentiation threshold for one billion pages.
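The arithmetic behind this tradeoff is easy to check: with a geometric ratio $d$, bringing the error factor below a precision $\varepsilon$ takes $\lceil \log \varepsilon / \log d \rceil$ iterations. A small sketch:

```python
import math

# Smallest k such that d**k <= eps, i.e. ceil(log(eps) / log(d)).
def iterations_needed(d, eps):
    return math.ceil(math.log(eps) / math.log(d))

for d in (0.5, 0.85, 0.99):
    print(d, iterations_needed(d, 1e-6), iterations_needed(d, 1e-9))
```

With $d = 0.85$, one finds 86 iterations for $10^{-6}$ and 128 for $10^{-9}$: both on the order of the hundred iterations mentioned above.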
Theorem 5.2.
Let $A$ be a stochastic matrix and $A_d = dA + (1-d)\mathbf{1}Z$. If $X$ is a left eigenvector of $A_d$ associated with an eigenvalue $\lambda \neq 1$, then $X$ is a left eigenvector of $A$, associated with the eigenvalue $\lambda/d$. In particular, $|\lambda| \leq d$.
Proof.
Since $X$ is a left eigenvector of $A_d$ associated with $\lambda$, we have

$$\lambda X = XA_d = dXA + (1-d)(X\mathbf{1})Z.$$

Multiplying on the right by $\mathbf{1}$ gives $\lambda X\mathbf{1} = X\mathbf{1}$. Since $\lambda \neq 1$, $X\mathbf{1} = 0$, hence

$$dXA = \lambda X:$$

$X$ is a left eigenvector of $A$ associated with $\lambda/d$, and since the eigenvalues of the stochastic matrix $A$ have modulus at most $1$, $|\lambda| \leq d$.
□
Remark 5.4.
In [Kam+03a] there is a proof of the fact that every eigenvalue of $A_d$ other than $1$ (which is simple) is at most $d$ in modulus. In [LM04], it is additionally shown that the secondary eigenvalues of $A_d$ are equal to $d$ times those of $A$ (the multiplicities of the eigenvalue $1$ of $A$ beyond the first being counted as secondary), and the authors claim that their proof is more compact than that of [Kam+03a]. Theorem 5.2 additionally shows that the secondary eigenvectors of $A_d$ are those of $A$, and we claim that our proof is more compact than that of [LM04]. All that remains is to find a theorem more precise than Theorem 5.2, with an even more compact proof…
5.4 Rank source and sub-stochastic matrices
In the rank source model seen in Section 5.3.4, we saw that while adding a zap factor associated with a covering distribution $Z$ guarantees the irreducibility of $A_d$, the stochasticity of $A_d$ remains that of $A$. When $A$ is sub-stochastic, one must therefore adapt, and we shall see in this section the main possible solutions.
5.4.1 Hybrid model: zap factor and renormalization
$A_d$ is pseudo-irreducible (since it is irreducible). A first possible method for obtaining a fixed point is therefore to apply the simple renormalization method (cf Section 5.3.2) to the matrix $A_d$.

The interpretation in terms of a random surfer is the same as in the case where $A$ is stochastic, except that the stochastic defect of $A_d$ must be completed by a zap according to the distribution defined by the maximal probability eigenvector associated with $A_d$.
5.4.2 Completion and rank source: compensation
Since stochastic completion is a relatively simple way of assimilating any sub-stochastic matrix to a stochastic matrix, it is interesting to hybridize the stochastic completion method with a zap-type method or virtual page addition (see Section 5.6). This avoids having to renormalize at each iteration, and provides greater consistency in terms of stochastic interpretation. Moreover, if one chooses a completion distribution equal to the zap distribution $Z$, one obtains a very simple algorithm (Algorithm 5.3): the compensation algorithm. The interpretation is equally simple: at each step of the stochastic process described by $\hat{A}_d$, the random surfer will click at random on one of the outgoing links (if any exist), with probability $d$. In all other cases, he will zap according to $Z$.
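A sketch of this compensation idea on an invented sub-stochastic matrix: the probability mass lost at each step, whether through the zap factor or through dangling pages, is given back via Z, so the iterate remains a probability distribution.

```python
# Compensated PageRank iteration: A may be sub-stochastic; whatever mass
# does not survive the step d * P_k A is redistributed according to Z.
def pagerank_compensated(A, Z, d=0.85, iters=100):
    n = len(A)
    p = Z[:]
    for _ in range(iters):
        step = [0.0] * n
        for i in range(n):
            for j in range(n):
                step[j] += d * p[i] * A[i][j]
        lost = 1.0 - sum(step)                  # compensation parameter
        p = [step[j] + lost * Z[j] for j in range(n)]
    return p

A = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.0, 0.0]]       # dangling third page
Z = [1/3, 1/3, 1/3]
pr = pagerank_compensated(A, Z)
print(pr)                    # sums to 1; the dangling page keeps a positive share
```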
5.4.3 Non-compensated PageRank
The non-compensated PageRank algorithm consists in computing the sequence $P_{k+1} = dP_kA + (1-d)Z$, exactly as in Algorithm 5.2, without worrying about renormalization. This yields a strictly positive vector, which can be used to define a PageRank, as shown by Theorem 5.3 and the interpretation that follows.
Theorem 5.3.
Let $A$ be a sub-stochastic matrix, $d$, $0 < d < 1$, a zap factor, and $Z$ a covering probability distribution for the graph induced by $A$.

The sequence $P_{k+1} = dP_kA + (1-d)Z$ converges geometrically, with ratio less than or equal to $d$, to a unique fixed point $P_\infty$, regardless of the initial vector $P_0$. $P_\infty$ is a strictly positive vector, and for any completion $\hat{A}$ of $A$, if $\hat{P}$ is the probability distribution associated with $\hat{A}_d$, then $P_\infty \leq \hat{P}$, with a totally strict inequality unless $A$ is stochastic (i.e. $\hat{A} = A$).
Proof.
Since $0 < d < 1$ and $A$ is sub-stochastic, the mapping $X \mapsto dXA + (1-d)Z$ is $d$-Lipschitz for the $L_1$ norm. It therefore has a unique fixed point $P_\infty$ toward which any sequence converges geometrically, with ratio less than or equal to $d$.

This fixed point satisfies $P_\infty = dP_\infty A + (1-d)Z$, and is therefore equal to:

$$P_\infty = (1-d)Z\sum_{k \geq 0} d^kA^k.$$

In particular, it is strictly positive, since, as $Z$ is covering, for every $i$ in $V$ there exist $j$ in the support of $Z$ and $k \geq 0$ such that $(A^k)_{j,i} > 0$, and therefore

$$P_\infty(i) \geq (1-d)\, d^k\, Z(j)\, (A^k)_{j,i} > 0.$$
□
5.4.4 Comparison: compensated or non-compensated algorithms?
There is a very strong link between the compensated algorithm and the non-compensated algorithm: they yield the same result up to normalization, as shown by Theorem 5.4.
Theorem 5.4.
Let $A$ be a sub-stochastic matrix, $Z$ a covering distribution. If $\tilde{P}$ is the PageRank obtained by compensation (cf Algorithm 5.3), and $P_\infty$ the non-compensated PageRank, fixed point of the mapping $X \mapsto dXA + (1-d)Z$, then $\tilde{P}$ is homogeneous to $P_\infty$.
Proof.
By passing to the limit, it is easy to see that $\tilde{P}$ satisfies

$$\tilde{P} = d\tilde{P}A + \left(1 - \|d\tilde{P}A\|_1\right)Z, \quad\text{i.e.}\quad \tilde{P}\,(I - dA) = \left(1 - \|d\tilde{P}A\|_1\right)Z. \qquad (5.17)$$

We deduce from the fixed-point equation of the non-compensated PageRank

$$P_\infty(I - dA) = (1-d)Z. \qquad (5.20)$$

Equation (5.17) and Equation (5.20) give us $\tilde{P} = \frac{1 - \|d\tilde{P}A\|_1}{1-d}\,P_\infty$, which is indeed homogeneous to $P_\infty$.
□
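This equivalence is easy to check numerically; below, both fixed points are computed on an invented sub-stochastic matrix, and the non-compensated vector, once renormalized, matches the compensated one:

```python
# Compensated vs non-compensated fixed points on a toy sub-stochastic matrix.
def iterate(A, Z, d, compensated, iters=300):
    n = len(A)
    p = Z[:]
    for _ in range(iters):
        step = [sum(d * p[i] * A[i][j] for i in range(n)) for j in range(n)]
        coeff = (1.0 - sum(step)) if compensated else (1.0 - d)
        p = [step[j] + coeff * Z[j] for j in range(n)]
    return p

A = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.0],        # sub-stochastic row
     [0.5, 0.5, 0.0]]
Z = [0.5, 0.25, 0.25]         # covering: Z > 0 everywhere
comp = iterate(A, Z, 0.85, compensated=True)
non = iterate(A, Z, 0.85, compensated=False)
s = sum(non)                  # < 1, since A is not stochastic
print([x / s for x in non])   # identical (up to rounding) to comp
```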
Convergence
The tests we have been able to perform show that with the choice of $Z$ as the initial distribution, there is no significant difference between the convergence of the compensation algorithm and the non-compensated one. In both cases, convergence is very fast during the first iterations, and then stabilizes toward a convergence with ratio $d$ (cf main loop of Figure 5.1).
Iteration speed
The compensated version must compute the compensation parameter $1 - \|dP_kA\|_1$ at each iteration, while the non-compensated version does not. Does this have a significant influence on performance?

- If the matrix $A$ does not fit in memory, the limiting factor in computing an iteration is the multiplication of $P_k$ by $A$. The time used for compensation is then negligible, and the iteration speed has no influence on the choice between the two algorithms.
- On the other hand, if $A$, or even simply the adjacency matrix, fits in memory, the computation and incorporation of the compensation parameter takes a duration comparable to that of computing $P_kA$. In fact, any norm computation adds a non-negligible overhead to the computation of an iteration, and even the computation of the convergence parameter $\|P_{k+1} - P_k\|_1$ considerably reduces performance. This leads to the SpeedRank algorithm (Algorithm 5.4), which computes the non-compensated PageRank by roughly estimating the number of iterations needed for good convergence. Our experiments revealed a substantial per-iteration speed gain of SpeedRank over Algorithm 5.3, which more than compensates for the slight overestimation of the number of iterations needed to converge.

In terms of performance, the compensated algorithm is therefore to be avoided when working on "small" graphs. One may be surprised in passing that Kamvar et al. use compensation in their BlockRank algorithm [Kam+03b], which is precisely based on the decomposition of PageRank on small graphs. For specialists in PageRank computation optimization (cf [Kam+03a]), it is curious to overlook such a speed gain…
Leaf-stripping and re-pluming
In practice, PageRank algorithms are rarely applied to the entire graph $G$. A technique known as "leaf-stripping and re-pluming" is very often used: the PageRank is first computed on the graph $G_F$ induced by $F$, the set of vertices of $G$ having at least one outgoing link. $G_F$ is called the stripped graph, or rake. Once good convergence is achieved, "re-pluming" is performed: we return to the graph $G$ and carry out a few PageRank iterations with the PageRank on $G_F$ as the initial estimate.

The problem is that the PageRank on the stripped graph is not necessarily a good estimate of the PageRank on $G$, as shown in Figure 5.1: because of re-pluming, the convergence factor is virtually reset. If one wants to once again reach the convergence criterion, almost as many iterations are needed as starting from the distribution $Z$.

The leaf-stripping and re-pluming method is nonetheless useful if one limits the number of iterations in the re-pluming phase: since the main loop operates on the rake, the iterations are faster. As for the final vector, although it is not a stationary vector, it lies halfway between the PageRank on the rake and that on $G$, and the fact that this vector is used in practice seems to indicate that the ranking it produces is worthy of interest.
5.5 Convergence in $L_1$-norm and convergence of the ranking
In all the PageRank algorithms we have presented, as in all those we are about to present, we use as a convergence criterion the convergence in $L_1$-norm of a sequence of positive vectors. Thus, when a rank source associated with a zap factor $d$ is used and the convergence criterion $\|P_{k+1} - P_k\|_1 \leq \varepsilon$ is met, we know that the error with respect to the limit vector is at most $\frac{\varepsilon}{1-d}$. However, only the ranking induced by the PageRank is of a priori interest, since the main purpose of PageRank is to provide an importance ordering on the Web pages under consideration.
5.5.1 Normalized Kendall distance
A first solution is to compare at each iteration the induced rankings, and to stop when there is no more change. One can also define a distance on rankings and replace convergence in $L_1$-norm with convergence on rankings. A fairly classical distance on rankings is the symmetric difference distance, or Kendall distance: if $\sigma_1$ and $\sigma_2$ are two rankings, presented as permutations, then the Kendall distance between these two permutations is the minimum number of inversions of two adjacent elements needed to go from one to the other. It can be shown that this distance is translation-invariant and that the distance from a permutation $\sigma$ to the identity is:

$$K(\sigma) = \left|\{(i, j) : i < j \text{ and } \sigma(i) > \sigma(j)\}\right|.$$

The distance between two permutations $\sigma_1$ and $\sigma_2$ is therefore $K(\sigma_1 \sigma_2^{-1})$.

Since the greatest possible distance between two permutations of size $n$ is that between two reversed rankings, namely $\frac{n(n-1)}{2}$, one may, if one wants a convergence criterion independent of the size of the ranking, consider the Kendall distance normalized by $\frac{n(n-1)}{2}$.
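The normalized Kendall distance is straightforward to compute by direct inversion counting (quadratic, which is enough for an illustration; efficient implementations use a merge-sort-style count):

```python
# Normalized Kendall distance between two rankings of the same items.
def kendall_normalized(r1, r2):
    """r1, r2: lists of items in ranked order. Returns a value in [0, 1]."""
    pos = {item: i for i, item in enumerate(r2)}
    n = len(r1)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if pos[r1[i]] > pos[r1[j]])
    return inversions / (n * (n - 1) / 2)       # max = n(n-1)/2 (reversed order)

print(kendall_normalized(["a", "b", "c"], ["a", "b", "c"]))  # 0.0
print(kendall_normalized(["a", "b", "c"], ["c", "b", "a"]))  # 1.0
```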
5.5.2 PageRank density
By a simple order-of-magnitude argument, it is possible to establish a link between $\varepsilon$ and the convergence of the ranking. The starting point is the study of the relationship between a page's rank and its PageRank. Figure 5.2 shows this relationship for two PageRank models that we shall study in more detail: the standard compensated PageRank with uniform zap, and the compensated PageRank with leaf-stripping/re-pluming technique, also with uniform zap. The zap factor is of course $d = 0.85$.

The regularity of the curves leads us to consider the mesoscopic density of pages at a given PageRank: we seek to know the number of pages whose PageRank lies between $x$ and $x + dx$. We place ourselves at the mesoscopic scale, that is, $dx$ is small compared to $x$, but the interval is large enough to contain many pages. Experimentally, we found that the mesoscopic hypothesis was entirely realistic on graphs with more than one million vertices. We also observed that there exists, for each PageRank model, a function $f$, relatively independent of the Web graph under consideration, such that, if $n$ is the number of pages in the graph, the number of pages whose PageRank lies between $x$ and $x + dx$ is approximately $n\,f(nx)\,d(nx)$.

$f$ is the normalized mesoscopic density (independent of the graph size $n$) typical of the PageRank model under consideration. Figure 5.3 shows experimental measurements of $f$ corresponding to the two models studied here.
5.6 Models with virtual page
Adding a zap factor is sometimes called the maximal irreducibility method, in the sense that if $Z > 0$, then the underlying graph becomes a clique. This method is often criticized for being too intrusive and distorting the structure of the Web graph. The completion method, by adding fictitious links, can also be considered intrusive.

An alternative is to employ so-called minimal irreducibility methods, that is, to add a virtual page that will play the role of a "hub." This type of method is used among others by [APC03, Tom03].
5.6.1 Virtual zap page
The principle of the virtual zap page is simple: add an $(n+1)$-th page, pointed to by all pages and pointing to all the others, the whole being controlled by the zap factor $\alpha$ and the zap distribution $z$.
Formally, if $S$ is a stochastic matrix (possibly completed), one considers the $(n+1) \times (n+1)$ matrix

$$S' = \begin{pmatrix} (1-\alpha)\,S & \alpha \mathbf{1} \\ z & 0 \end{pmatrix},$$

where each real page sends a fraction $\alpha$ of its weight to the virtual page, which redistributes it according to the zap distribution $z$.
The corresponding asymptotic probability vector is obtained by Algorithm 5.5.
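As a sketch of the construction (assuming a zap factor $\alpha$ and a zap distribution $z$ as above; the function names are illustrative, and the matrix is stored densely only because the example is tiny):

```python
import numpy as np

def virtual_page_matrix(S, alpha, z):
    """Build the (n+1)x(n+1) virtual zap page matrix: each real page sends
    a fraction alpha of its weight to the virtual page, which redistributes
    it according to the zap distribution z."""
    n = S.shape[0]
    Sp = np.zeros((n + 1, n + 1))
    Sp[:n, :n] = (1 - alpha) * S
    Sp[:n, n] = alpha          # real pages -> virtual page
    Sp[n, :n] = z              # virtual page -> zap distribution
    return Sp

def asymptotic_vector(Sp, iters=200):
    """Left power iteration pi <- pi Sp, starting from the uniform vector."""
    m = Sp.shape[0]
    pi = np.full(m, 1.0 / m)
    for _ in range(iters):
        pi = pi @ Sp
    return pi
```

On such examples, the asymptotic vector places a weight of $\alpha/(1+\alpha)$ on the virtual page, the remaining mass reproducing, up to normalization, the zap-factor PageRank.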
5.6.2 On the usefulness of the virtual page
Theorem 5.5 shows that a priori, as soon as $\alpha \le 1/2$, the use of a virtual zap page changes strictly nothing compared to adding a zap factor, both in terms of the asymptotic vector25 and of convergence (the eigenvalues other than $1$ are at most $1-\alpha$ in absolute value).
Theorem 5.5.
Let $S$ be a stochastic matrix and $S'$ the associated virtual-page matrix. Since $S'$ is stochastic, irreducible, and aperiodic, we know that $1$ is a simple and dominant eigenvalue. More precisely, if $(\lambda, v)$ satisfies $v S' = \lambda v$, with $v = (u, w)$, $u \in \mathbb{R}^n$ and $w \in \mathbb{R}$, then one of the following three cases holds:
- Either $\lambda = 1$, and $v$ is then proportional to $(\pi, \alpha)$, where $\pi$ is the probability distribution such that $\pi = (1-\alpha)\pi S + \alpha z$.
- Or $\lambda = -\alpha$.
- Or $\lambda$ is neither $1$ nor $-\alpha$. $u$ is then an eigenvector of $(1-\alpha) S$, $w$ is zero, and we have $u \mathbf{1} = 0$. In particular, $|\lambda| \le 1 - \alpha$.
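The spectral picture can be checked numerically on a small example. The sketch below (illustrative names; a hypothetical $3 \times 3$ stochastic matrix with uniform zap) builds the virtual-page matrix of Section 5.6.1 and verifies that its spectrum consists of $1$, $-\alpha$, and the eigenvalues $(1-\alpha)\mu$ for each eigenvalue $\mu \neq 1$ of $S$:

```python
import numpy as np

alpha = 0.15
# Hypothetical small stochastic matrix S and uniform zap distribution z.
S = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1/3, 1/3, 1/3]])
n = S.shape[0]
z = np.full(n, 1.0 / n)

# Virtual-page matrix: real pages -> virtual page with weight alpha,
# virtual page -> real pages according to z.
Sp = np.zeros((n + 1, n + 1))
Sp[:n, :n] = (1 - alpha) * S
Sp[:n, n] = alpha
Sp[n, :n] = z

eig = np.sort_complex(np.linalg.eigvals(Sp))
# Predicted spectrum: {1, -alpha} plus (1-alpha)*mu for each mu != 1 of S.
mu = np.linalg.eigvals(S)
expected = np.concatenate(([1.0, -alpha], (1 - alpha) * mu[~np.isclose(mu, 1)]))
assert np.allclose(eig, np.sort_complex(expected))
```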
5.7 Managing physical resources
Before closing this chapter, I would like to highlight some fundamental technical considerations. The reader may have noticed that none of the presented algorithms ever explicitly involves the matrices whose maximal eigenvector is being computed (completed, zap-compensated, or virtual-page matrices). This is a general phenomenon: if one wants all the algorithm’s constants to fit in main memory26, a minimum of thought is required. Indeed, one must remember that the adjacency matrix $A$ is a sparse matrix containing $dn$ non-zero elements, where $d$ is the average degree. $d$ being relatively constant (its exact value depends on whether unvisited pages are taken into account and whether link filtering is applied), this amounts to a size approximately linear in $n$, which is both a lot given the sizes of the graphs under consideration, and a minimum if one wants to use the structure of the Web. The implicit matrices generated by completion and by use of the zap factor are far from sparse, and their size can be quadratic in $n$. We see here the value of using rewriting tricks in PageRank algorithms so as never to involve a matrix less sparse than $A$.
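One such rewriting trick can be sketched as follows (a minimal illustration assuming a uniform zap distribution; `out_links` and `pagerank_step` are illustrative names): the rank-one zap term and the uniform completion of dangling pages reduce to $O(n)$ scalar corrections, so only the sparse adjacency structure is ever traversed.

```python
import numpy as np

def pagerank_step(x, out_links, alpha, n):
    """One zap-compensated iteration using only the sparse structure:
    the dense matrix (1-alpha)S + alpha*(1 z) is never materialized.
    out_links[i] is the list of pages pointed to by page i; the zap
    distribution is taken uniform here."""
    y = np.zeros(n)
    dangling = 0.0
    for i, succ in enumerate(out_links):
        if succ:
            share = x[i] / len(succ)
            for j in succ:
                y[j] += share
        else:
            dangling += x[i]       # mass of dangling pages
    y += dangling / n              # uniform completion of dangling pages
    return (1 - alpha) * y + alpha / n  # rank-one zap term, O(n) not O(n^2)
```

Each step costs $O(dn)$ time and $O(n)$ extra memory, which is exactly what makes the graph sizes discussed below reachable on commodity hardware.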
Personally, I perform my PageRank experiments with just the adjacency matrix, stored as a boolean sparse matrix, and a few vectors of size $n$. Algorithms 5.4 and 5.6 give examples of efficient implementations. It is thus possible to handle up to 8 million vertices on a home PC with 1 GB of main memory, while keeping all constants in memory and performing a minimum of operations27. To go further, one must use transparent graph compression techniques28.
- 1At least not applied to PageRanks.
- 2The problem is in fact less critical if duplicates are eliminated — see Table 2.1 — but it remains significant.
- 3For several decades now, a new field of research has emerged, devoted to the study of human intellectual production. The main branches of this meta-science are:
- Bibliometrics: defined in 1969 as “the application of mathematics and statistical methods to books, articles, and other means of communication.”
- Scientometrics: can be considered as bibliometrics specialized to the field of Scientific and Technical Information (STI). More generally, however, scientometrics refers to the application of statistical methods to quantitative data (economic, human, bibliographic) characterizing the state of science.
- Informetrics: a term adopted in 1987 by the F.I.D. (International Federation for Documentation) to designate all metric activities related to information, covering both bibliometrics and scientometrics.
- 4After all, is it not said that you can find anything on the Web?
- 5Cf [CP95, CM01, Mil+04, TG97, WM04].
- 6Very often, the initial probability vector is taken to be a zap distribution — see Section 5.3.3.
- 7Subject to aperiodicity. See Section 4.5.2 for possible periodicities.
- 8Given the way the graph is constructed, any recurrent strongly connected component not reduced to a single element induces a stochastic process.
- 9See Theorem 4.6.
- 10In other words, the recurrent strongly connected components.
- 11If $S$ is sub-stochastic and $z$ a probability distribution, the completed matrix is, by construction, always stochastic.
- 12Unknowingly? See Remark 5.3.
- 13If the zap distribution is merely covering, irreducibility is still guaranteed, since one can go from any page to any other page via a zapped page. However, aperiodicity is no longer necessarily guaranteed (there exist simple counterexamples, for instance a branch tree completed on its root), even if, empirically, the problem does not arise for Web graphs.
- 14The cases where is strictly sub-stochastic will be studied in detail in Section 5.4.
- 15“The use of [] may have a small impact on the influence of []”, op. cit.
- 16Recall that it suffices for this to have two pages that point only to each other.
- 17If $S$ has more than one recurrent strongly connected component, $1$ is a multiple eigenvalue.
- 18Indeed, PageRank must be recomputed periodically, and with several billion pages to process, each iteration takes a non-negligible amount of time.
- 19Note in passing that, in the case where $S$ is stochastic, we have here a very simple proof of convergence with ratio $1-\alpha$, but Theorem 5.2 nonetheless provides more information…
- 20In reality, things are slightly different. The ranking returned for a given query is presumably the result of combining several importance estimates, with relevance and PageRank being the main ones. Knowing the quantitative PageRank of pages can then be of interest.
- 21Thanks to François Durand and his master’s thesis report [Dur04] for introducing me to the Kendall distance.
- 22For each of the two PageRanks studied here, we have plotted only a single curve, but experimentally, the other samples studied generate extremely similar curves.
- 23In the case of PageRank with leaf-stripping/re-pluming, this holds for a given proportion of dangling pages. Empirically, this proportion is often a crawl invariant.
- 24Cf [Tom03].
- 25Indeed, up to a weight of $\alpha/(1+\alpha)$ on the virtual page, the maximal eigenvectors are identical.
- 26It is possible, even necessary for very large graphs, to perform PageRank computations without loading all constants into main memory and with optimized disk accesses [APC03], but seeking to minimize the size of constants remains very important when one wants to implement a PageRank algorithm.
- 27This may seem modest when one knows that one GB of main memory can store the PageRank of over a hundred million pages in double-precision floating point. However, one should always remember the colossal gain obtained when the adjacency matrix fits in main memory.
- 28See for instance the Webgraph project [BV].