Chapter 7 — Fine Decomposition of PageRank

I shall compose until decomposition

Serge Gainsbourg

It’s a psychofrakulator. It creates a cloud of radically fluctuating free-deviant chaotrons which penetrate the synaptic relays. It’s concatenated with a synchronous transport switch that creates a virtual tributary. It’s focused onto a biobolic reflector. And what happens is that hallucinations become reality and the brain is literally fried from within.

Mystery Men

The purpose of this chapter, which is an extension of [MV03b] and [MV04], is to study how PageRank behaves with respect to the site structure presented in Part I, Section 3.3. We will show that there exists a natural decomposition of PageRank into two terms, internal PageRank and external PageRank. This decomposition provides a better understanding of the roles played by internal and external links. A first application is an algorithm for estimating the local PageRank within a website. We will also show some quantitative results on the possibilities available to a website for increasing its own PageRank.

More precisely, Section 7.1 briefly presents the various existing contributions regarding the use of the Web’s site structure to compute PageRank. Section 7.2 specifies the hypotheses and conventions that will be used. The notions of internal and external PageRank will be introduced in Section 7.3, and applied to a theoretical decomposed PageRank algorithm in Section 7.4. Finally, after taking advantage in Section 7.5 of the study of local PageRank variations to introduce the zap factor into our model, we will give in Section 7.6 algorithms applicable to real-world graphs.

7.1 Previous and contemporaneous work

Among all published studies on PageRank, only a few attempt to take advantage of the site structure. [Kam+03b] gives a method for quickly computing a good approximation of PageRank using a partition into websites.

Bianchini et al. decompose PageRank into internal links, incoming links, outgoing links, and leaks [BGS02, BGS03]. This decomposition allows, among other things, a certain understanding of how a website can change its own PageRank, while providing stability results for PageRank in the face of changes in the internal link structure of a website.

Finally, Arasu et al. showed that a PageRank computation on the quotient graph over servers converged faster than on pages, and even faster when taking into account link multiplicity (see [Ara+01]).

Compared to these works, our approach consists of using, as in [BGS02, BGS03], an exact flow decomposition of PageRank, with more flexible assumptions on the zap distribution. From this decomposition, we derive an exact semi-distributed PageRank computation algorithm, which we hybridize with the algorithm proposed in [Kam+03b] in order to obtain a fast semi-distributed algorithm with few approximations.

7.2 Hypotheses

We have seen that a structural definition of a website could be a set of pages tightly interconnected by hyperlinks. The block structure of the adjacency matrix (see Figure 3.5 and Figure 3.6) allows us to hope for many methods to decompose a Web graph into websites, and Section 3.3 gives us one. For the remainder of this chapter, we will therefore assume that our Web graph is equipped with a partition that allows it to be decomposed into websites, denoted , with .

As a first step, we will place ourselves in the ideal case where the Web graph under consideration is strongly connected and aperiodic. We will also assume the absence of self-loops.

As was seen in Section 5.3.1, if we consider the matrix defined by

we know that there exists a unique probability distribution, denoted , satisfying

We will seek to highlight the connections between and .

7.3 Internal PageRank, external PageRank

7.3.1 Notation

For in , we denote by the element of such that . We also define as follows:

We will call the restriction of to the elements internal to the components of :

We also need to define the internal degree (resp. external degree ) of a vertex as its out-degree in the subgraph induced by (resp. induced by ).

We are now in a position to define the notions of internal and external PageRank, and to relate them to the PageRank defined by Equation (7.2).

The incoming internal PageRank (resp. incoming external PageRank ) of a page in is the probability, when the random surfer is in steady state, of arriving at from a page in (resp. from a page in ). From this definition, Equation (7.5) and Equation (7.6) are derived:
The outgoing internal PageRank (resp. outgoing external PageRank ) of a page in is the probability, when the random surfer is in steady state, of going from to a page in (resp. to a page in ). Equation (7.7) and Equation (7.8) formalize this definition:

7.3.2 Conservation laws

One can define a PageRank (possibly internal, outgoing, …) on a site as the sum of the PageRanks of its pages: . With this convention, we can state the internal and external conservation laws of a site:

Theorem 7.1.

Let be a site. The incoming external and outgoing external PageRanks of are equal:

The same holds for the incoming internal and outgoing internal PageRanks:

Proof.

Let us begin by proving the internal conservation law (Equation (7.10)):

Next, Equation (7.6) and Equation (7.8) allow us to write:

Equation (7.12) combined with the internal conservation law Equation (7.10) gives us the external conservation law:

□

Figure 7.1: External PageRank conservation law:

The external conservation law Equation (7.9) shows us that a site returns, through the outgoing external PageRank, exactly the PageRank it receives (the incoming external PageRank). As Lavoisier said, nothing is lost, nothing is created, everything is transformed. If we consider the PageRank flow on the quotient graph , there is therefore conservation of flow (see Figure 7.1). This observation is the basis of the decomposed PageRank computation.

Remark 7.1.

Another way, perhaps simpler, of proving Theorem 7.1 would have been to consider PageRank directly as a stationary flow. It is then obvious that the flow on any subset of is also stationary. We preferred the matrix approach because it is the one we will continue to use hereafter, even though we will always try to interpret the results in terms of flow whenever possible.

7.4 Decomposition of the PageRank computation

7.4.1 Relationship between external PageRank and PageRank

Starting from Equation (7.5) and Equation (7.6), we can write that , and therefore that , where is the identity matrix.

Lemma 7.1.

The matrix is invertible.

Proof.

It suffices to show that is sub-irreducible. This will prove that its spectral radius is strictly less than (Theorem 4.5, Remark 4.3), and therefore that is invertible.

We proceed by contradiction: if is not sub-irreducible, there exists at least one stochastic strongly connected component in the transition graph associated with . This component is necessarily internal to a site since there are no external links. It therefore also exists in , which can then only be irreducible if the component is all of , which is impossible (we assumed that was irreducible and that contained at least two sites).

□

Lemma Lemma 7.1 then allows us to express as a function of the incoming external PageRank :

To compute the PageRank of a site , it therefore suffices to know its internal structure, through the matrix , and the incoming external PageRank it receives from the others.

Remark 7.2.

The matrix , which like is a block diagonal matrix, can be interpreted as the transition matrix of all possible internal paths. Indeed, for , represents the probability of going from to via a path of length that follows only internal links (in particular, if .)

7.4.2 Transition matrix of external PageRank

We want to formalize the intuition of a site-to-site PageRank propagation given by the external PageRank conservation law (see Figure 7.1), and find a description of the relationships between the different components of . By combining Equation (7.6) and Equation (7.14), we obtain:

We can thus define the transition matrix of external PageRank:

This matrix possesses some very interesting properties:

Lemma 7.2.

The matrix is stochastic.

Proof.

is obviously nonnegative, so it suffices to show that the sum of each column of equals . To this end, we begin by rewriting :

Consider the sum of the column of associated with page :

Thus, the sum of each column of is zero, which shows that is stochastic, since the sum of each column equals .

□

Lemma 7.3.

Let be the set of pages with no incoming external link, and the set of pages with at least one incoming external link. If we reorder the pages according to , then can be written as

where is an irreducible stochastic matrix.

Proof.

The columns of corresponding to pages of are zero. The same is therefore true of those of , which shows that can be written in the form

is stochastic, since is. It remains to show that it is irreducible. Consider two vertices and in , and a path leading from to in . Let be the subsequence obtained by keeping in only the vertices of ( and ). Then, is a path in the graph induced by . Indeed, between and , there exists a subpath of consisting of an internal path within , followed by an external jump leading to . By the definition of , we therefore have

is thus indeed a path in the graph induced by , which shows that is irreducible.

Q.E.D.

□

therefore has a unique PageRank, which is zero on ¹ and equals the PageRank of on . Subject to aperiodicity, it can be computed iteratively. Only the coefficients of are needed to compute this PageRank. Although we have not conducted extensive research to estimate the size of , the few analyses we have been able to perform, both on crawls and on server logs (in particular those of INRIA), seem to indicate that one can expect .

7.4.3 Theoretical decomposed PageRank computation

From Equation (7.14) and Equation (7.15), we can establish a theoretical semi-distributed method for computing PageRank.

Each site computes, from its block of the internal transition matrix , its block of the matrix .
The different rows of can then be reconstructed and centralized.
The external PageRank associated with is then computed (here an aperiodicity assumption on is needed).
Each site S obtains its own PageRank , , by applying , , to its matrix .

Lemma 7.4.

The vector thus obtained is homogeneous to the PageRank associated with .

Proof.

Since is irreducible, it suffices to show that equals :

Q.E.D.

□

7.5 Intermezzo: modifying one’s own PageRank

Before tackling the central piece of this chapter, the FlowRank algorithm, we want to show that our decomposition of PageRank explains to what extent a website can modify its own PageRank, which will be an opportunity to gently introduce the zap factor into our model.

The results we are about to present make sense if one accepts the idea that a website can only modify its external PageRank with great difficulty, whereas the situation is entirely different for internal PageRank. Indeed, PageRank exchanges between websites are closely monitored by Google, which does not hesitate to penalize websites that exchange links for the sole purpose of increasing their external PageRank. Such PageRank factories, called link farms or nurseries, are generally assigned a PageRank of zero, and thus end up ranked behind all other pages².

7.5.1 Amplification coefficient

Consider a site , its PageRank and its incoming external PageRank . We define the amplification coefficient of as the ratio between PageRank and incoming external PageRank:

Since , depends only on the internal structure of and on the distribution of external PageRank over ³.

Knowledge of alone gives us an estimate of :

Lemma 7.5.

A bound on the amplification coefficient is

with and .

Proof.

If we consider the vector space associated with , for every elementary vector , , we have , and therefore for every vector defined on .

We deduce the first inequality of Equation (7.24):

as well as the second:

□

The consequence of Equation (7.24) is that as soon as , nothing prevents a site from amplifying PageRank arbitrarily. In the limiting case where (a site with no outgoing external link, for example a commercial site that does not want the user to go elsewhere), there is infinite amplification and a short-circuit phenomenon. We recover the well-known rank sink phenomenon (see [Pag+98]), seen this time from the perspective of amplification: a set of pages with no outgoing link will absorb and accumulate all the incoming external PageRank until the flow is exhausted.

Fortunately, over-amplification is controlled by the zap factor.

7.5.2 Introduction of the zap factor

From now on, we will leave the ideal setting where is strongly connected and aperiodic to consider an arbitrary Web graph . In particular, we must choose which PageRank model to adopt, and our choice naturally fell on the non-compensated PageRank with zap factor. Thanks to Theorem 5.4, we know that this is a model strictly equivalent to the -compensated PageRank generally used, with the difference that it allows working with a constant zap flow.

is therefore now the unique vector satisfying , where is a covering distribution and is the zap factor.

We also need to label the zap flow. We could split it into external and internal flow, depending on whether the zap takes us out of our current site or not, but we judged it more appropriate to consider the zap flow as entirely external, and to separate the external flow into external click flow and external zap flow. We will continue to reserve the terms incoming external PageRank and outgoing external PageRank for the external click flow, and by analogy with electricity we will define two new PageRank flows related to zap: the induced PageRank, denoted , which is the probability⁴ of arriving at a page by zap, and the dissipated PageRank, denoted , which is the probability⁵ of leaving a page by zap.

We now have a bestiary of six PageRanks, or rather six flows, which are summarized in Table 7.1 (recall: is the stochastic defect, defined by ).

flow	incoming	outgoing
internal
external (click)
external (zap)

Table 7.1: The six PageRank flows in the non-compensated model

Note in passing that .

The internal and external conservation laws still hold. This time we will not attempt to prove them by matrix calculation, and will content ourselves with justifying them by the fact that we are dealing with a stationary flow. In particular, the external conservation law on a site now reads:

This equation gives us an interesting result: if a site has a PageRank greater than , its outgoing external PageRank is less than its incoming external PageRank, with equality if, and only if, and . This means that a site can only hope to have a PageRank greater than the default PageRank on condition that it gives less than what it receives.

7.5.3 Zap and amplification coefficient

With the introduction of the zap factor, Equation (7.14) now becomes

The bound seen in Section 7.5.1 still holds upon replacing by and by the total incoming external PageRank , and setting by convention if . We thus obtain Lemma 7.6.

Lemma 7.6.

The amplification factor defined by satisfies

Proof.

We proceed exactly as in the proof of Lemma 7.5. If we consider a fixed site , and if we restrict and to their values on , the first inequality is obtained by writing

and the second similarly:

□

Numerical value

For a real website, it is entirely possible to have (a site with no internal link), or on the contrary (a site with no external link, and all of whose pages have at least one link). Thus, the amplification coefficient can vary between (the site derives no benefit from the PageRank it receives) and (maximum utilization of received PageRank). Since is a universal constant equal to , we conclude that for a fixed total incoming external PageRank, a site’s PageRank can vary depending on its structure by a factor of . For example, a very poorly structured site can, by restructuring itself, achieve a new PageRank equal to approximately ⁶ of the former PageRank.

Robustness of PageRank

Bianchini et al. [BGS02, BGS03] show that the effect a site can produce on the Web is controlled by the PageRank of that site. More precisely, if we consider a dynamic graph between two instants and , they proved that:

This result can also be deduced from Lemma 7.6: if a site changes between and , the largest possible relative variation is the one where we go from to . This modification of the site’s structure, which corresponds to creating a rank sink, can only be done at the expense of external PageRank, and therefore of incoming external PageRank, so we necessarily have , giving a variation of at most . Since Bianchini et al. work here in a compensated model (the sum of PageRanks is constant), a variation of in generates the same variation outside , which gives us inequality Equation (7.32).

7.5.4 Amplification of a given page

For a website, the interest of PageRank is above all to be visible to users. In particular, the administrator of a site will probably be less interested in a high PageRank across the entire site than in a very high PageRank on a few pages, or even on a single page. It is therefore better to concentrate one’s PageRank on a general homepage rather than distribute it among several specialized pages. We will therefore consider the following problem: consider a site of pages fed by an incoming external PageRank . How can we maximize the PageRank of a given page ?

The answer is not difficult once one notices that the optimal structure is the one where all pages of other than point to (and only ) and points to at least one other page of . thus recovers, up to dissipation, all the PageRank of the other pages, and recovers its own PageRank up to a factor of , which is the maximum possible in a graph where, let us recall, self-loops are not taken into account. We then have

In the particular case where is the uniform distribution, we thus obtain

with equality if, and only if, all the incoming external PageRank is concentrated on , that is, .

From all this, we deduce some strategies for improving the PageRank of a page , which corroborate the recommendations found on many websites devoted to PageRank improvement:

Ask the administrators of other sites to always point, as far as possible, to the main page rather than to a specific page. Optionally, set up scripts that redirect to the homepage any access from a page external to the site⁷.
Always remember to include links back to the homepage, and limit the depth of the site (and thus the dissipation) as much as possible. In the particular case of frame-based sites, include <noframe> tags inside which the star structure is explicitly represented.

Let us conclude with a few remarks valid when is the uniform distribution:

With the optimal strategy, a site consisting simply of two pages pointing to each other has a PageRank that is at least equal to the average PageRank, even if the incoming external PageRank is zero: .
If (for example a site that dynamically generates pages pointing to , such as a database query site with links other than forms), the ratio is approximately . It is therefore possible to increase one’s PageRank on linearly. In practice, this is valid provided that Google’s robots take the trouble to explore all the pages⁸, and that the sole purpose of all these pages is not to increase one’s PageRank⁹.

7.6 Real-world case: FlowRank and BlowRank

The objective of this section is to adapt the theoretical computation seen in Section 7.4.3 to real-world situations.

7.6.1 Equilibrium relations with the zap factor

Now that we have had time to become familiar with the induced and dissipated flows, we can rewrite the equations seen in the course of Section 7.4.

With the introduction of the zap factor, Equation (7.14), which describes the link between incoming external PageRank and PageRank, becomes as we have seen

The equilibrium relation for external PageRank, the equivalent of Equation (7.15), is obtained by combining Equation (7.35) with the definition of . We thus obtain:

Lemma 7.7.

The spectral radius of is less than .

Proof.

The proof is similar to that of Lemma 7.2: we show indeed that is substochastic (in the broad sense). We will show for this that the spectral radius of is less than . Since is nonnegative, it suffices, by the Perron-Frobenius theorem, to show that for every nonnegative vector , . To this end, we begin by rewriting :

Now consider a nonnegative vector . Since , , , and are all nonnegative vectors, we have

Since the spectral radius of is less than , we have , hence

Q.E.D.

□

Remark 7.3.

The inequality used in the proof of Lemma 7.7 is very crude. In practice, one can therefore legitimately expect the spectral radius of to be smaller than , and thus that Equation (7.36) allows finding with a convergence rate smaller than . The results presented by Arasu et al. (see [Ara+01]) support this and allow us to hope for very fast convergence in practice.

Application: estimating the PageRank of a website

For the administrator of a website , being able to estimate the importance of its pages without calling on external help or crawling the indexable Web can be of considerable interest. For example, this importance could be incorporated into an internal search engine. Now, according to Equation (7.35), it suffices for this to be able to estimate the incoming external PageRank on .

Indeed, if we define the function , inspired by Algorithm 5.4, as a function that returns, for nonnegative with spectral radius strictly less than and nonnegative, the vector satisfying , with precision , then

To estimate this incoming external PageRank, two local approaches are possible:

We have seen that PageRank is supposed to, to a certain extent, try to represent the behavior of real users (see Section 5.2.2). One can then take, as an estimate of the incoming external PageRank, the number of real clicks from outside the site to a page of the site, measured from the analysis of the Web server logs.
Thanks to, or because of, the factor , the ranking by in-degree is an approximation of the ranking by PageRank (see Section 5.3.5). By counting the number of external references associated with each page, obtained again through an analysis of the Web server logs, one obtains another estimate of .

Once one has an estimate of , it must be balanced against . One way, among many others, to achieve this balancing is to take a weighted average of the two vectors. For example, one could return as a normalized estimate of PageRank on

Note in passing that the rank source is then normalized to regardless of the size of the site under consideration. Since the goal is merely to estimate the relative importance of pages within , this poses no problem.

7.6.2 A first approach: FlowRank

We now have all the data in hand to construct the FlowRank algorithm. Let us first observe that thanks to the zap factor, is completely defined by Equation (7.36), whereas Equation (7.15) only guaranteed obtaining a vector up to a scalar multiple. In order to avoid explicitly inverting the matrices , for , we will resort to the function defined previously. Thanks to this function, all the values we need to know can be computed:

, where is the vector equal to on , elsewhere, gives the column of associated with . This computation can be limited, in both input and output, to the pages of . We thus obtain the matrix that we previously called , and which we will continue by abuse of notation to call . Note also that this computation can be performed locally within each site .
equals . This computation can also be distributed across all sites. Note that is zero outside , so we can restrict ourselves to considering , the restriction of to .
Once and have been computed, it is possible to compute , the restriction of to : .
The non-compensated PageRank is then given by . Once again, this step can be performed at the site level.

All these operations are summarized in Algorithm 7.1.

The FlowRank algorithm has numerous advantages:

By breaking down the SpeedRank computations at the site level, it becomes possible to store the matrix in RAM, which allows extremely fast computation, all the faster since SpeedRank performs no norm computation.
Apart from the computation of , which is global, all other computations can be performed in a decentralized manner at the site level. It is thus possible to parallelize a large portion of the computations.
To take into account the update of a given site, it is not necessary to rerun the entire algorithm (unlike the case of a fully centralized algorithm). It suffices to rerun the SpeedRank computations at the level of the site in question, and to update . Using the previous values as initial values for SpeedRank can make the operation very fast.

But it also has drawbacks. Indeed, SpeedRank computations must be performed. Even though SpeedRank, as its name suggests, is very fast, this number remains high. Likewise, the computation of is a SpeedRank on a matrix. If this matrix does not fit in RAM, much of the practical benefit of the decomposed computation is lost. To solve these problems, we will draw inspiration from a competing algorithm, the BlockRank algorithm.

7.6.3 Distributed algorithm of Kamvar et al.: BlockRank

In parallel with our work on the block structure of Web graphs [MV02, MV03a] and the possible applications to PageRank [MV03b, MV04], Kamvar et al. carried out very similar research. In [Kam+03b], they use a site decomposition to propose a semi-distributed algorithm for computing an estimate of PageRank: BlockRank. This algorithm is based on the computation of a local PageRank, which in our notation is the PageRank on , and of a site PageRank, based on a substochastic BlockRank matrix, denoted , defined on the quotient graph . The estimate of the PageRank of a page is then given by the product of the local PageRank of by the PageRank of , also called BlockRank¹⁰. Although FlowRank and BlockRank resemble each other at first glance, there are important differences that we wish to highlight:

Up to the errors introduced by , FlowRank gives the non-compensated PageRank¹¹, and not an approximation. Thus, the problem of underestimation of the PageRank of main pages encountered with the BlockRank algorithm [Kam+03b] does not arise. There is of course a price to pay in terms of algorithm complexity.
The keystone of FlowRank is non-compensated PageRank, and the SpeedRank algorithm that follows from it, whereas BlockRank stays with the idea of -compensated PageRank, which is three times slower on “small” graphs.
While the local PageRank within a site equals, up to normalization, , the global FlowRank matrix, , and the BlockRank one, , have almost nothing in common. For example, has transitions from a site to itself. One will also note that whereas FlowRank uses an external zap adapted to , the PageRank on uses a uniform zap.

7.6.4 Hybrid algorithm: BlowRank

The main advantage of BlockRank over FlowRank is the reduction from to made possible by the approximations. This gain applies both to the number of local computations and to the size of the global matrix. We are therefore tempted to reduce to a matrix. To do this, we must reduce the external flow information to a scalar per site , for example by replacing with . We must then define how the incoming external PageRank is injected inside each site. We will therefore assume that each site is equipped with a probability distribution that estimates the distribution of incoming external PageRank. This distribution, which we denote , may be the uniform distribution on , or more finely a distribution on the entry pages of the site, which are the most likely to be pointed to.

We can now consider the hybrid BlowRank algorithm (7.2), which differs from FlowRank by an imposed reorganization of the external flow at the entrance of each site: everything happens as if at the entrance of each site, the incoming external PageRank (by click) were collected and redistributed according to .

We thus obtain an algorithm that requires only local calls to , plus one global call on a matrix, which places it in terms of performance at the same level as BlockRank, while the approximations made are smaller, yielding a PageRank less perturbed relative to the global model.

¹This result is natural: a page with no incoming external link cannot receive incoming external PageRank.
²Note that this policy has been the subject of numerous lawsuits between Google and search engine optimization companies. Despite suspicions regarding Google’s impartiality when it comes to defining a nursery, Google has never been found guilty: a search engine ranks its results as it sees fit.
³Note that this distribution can to some extent be influenced by modifications to the structure of . But as we have seen, overly large variations can be a sign of a link farm.
⁴Even though the non-compensated model means that we should no longer speak of probability, we will sporadically allow ourselves to continue using this term, even though it is more correct to speak of flow.
⁵See preceding footnote.
⁶This lovely number is further proof of the necessity of having as the value of .
⁷This recommendation should be taken with caution: the way Google handles redirections is not very clear. Moreover, this can make navigation less ergonomic for the user.
⁸To my great regret, Google has not yet finished exploring the page that points to all pages (see Section 1.4 page Section 1.4), which explains why it does not yet have a maximal PageRank…
⁹Otherwise, beware the penalty.
¹⁰For full details, see [Kam+03b].
¹¹Let us recall once again that from a theoretical standpoint, there is strictly no difference between non-compensated PageRank and the -compensated PageRank traditionally used.