Web Graphs — PageRank-like Importance Measures

Chapter 6 — BackRank

– You’ve got to come back with me!
– Where?
– Back to the future.

Back to the Future

True novelty always arises from a return to the sources.

Edgar Morin, Amour, poésie, sagesse

We have just seen the general theoretical principles governing the various PageRank algorithms. In particular, it has become apparent that the problem of stochastic completion and that of PageRank diffusion can, and indeed should, be treated independently. In this chapter, which is an extension of a series of articles co-authored with Mohamed Bouklit [BM03, MB04], we shall study a way of performing stochastic completion while refining the random surfer model: taking into account the user’s ability to go back at any point during navigation using the Back button of their browser1. As we shall see, the PageRank algorithm resulting from this model has numerous advantages over classical PageRank variants.

6.1 On the importance of the Back button

In 1995, Catledge and Pitkow published a study showing that the Back button accounted for of all recorded navigation actions ( of actions being hyperlink usage) [CP95]. Another study, by Tauscher and Greenberg, dating from 1997, reported hyperlinks and Back [TG97]. Finally, let us cite a more recent study (2004) by Milic et al. [Mil+04], whose main results are presented in Table 6.1.

Navigation mode % Navigation mode %
Hyperlinks Bookmarks
Back button Manually typed URL
New session Start page
Other methods2 Refresh button
Forms Forward button
Table 6.1: Statistical study of real surfers’ navigation modes (after [Mil+04])

From all these studies, a certain decline in the use of the Back button between 1995 and 2004 seems to emerge, but according to [CM01], one must take into account the changes that occurred during this interval3, as well as changes in the experimental protocols used.

The other interesting piece of information these figures give us is that in all cases, the use of the Back button comes very clearly in second place, behind the use of hyperlinks which remains the preferred navigation mode. In particular, one should note that according to all studies, for every clicks on a hyperlink, there is at least use of the Back button. Yet, the use of the Back button has no equivalent in “classical” PageRank(s), whereas quantitatively less significant navigation modes such as manually typed URLs and Bookmarks can be taken into account in the computation algorithms through the zap vector.

Even though it is not really known to date whether a model closer to the real surfer necessarily yields a better PageRank, the importance of the Back button in real users’ behavior seemed to us a sufficient motivation to study the possibilities of incorporating it into a new PageRank model.

6.2 Previous and contemporary work

Kleinberg, in the HITS algorithm [Kle98], also uses the inlink matrix, and thus works on a model where following hyperlinks backward is implicitly taken into account. Nevertheless, the purpose of the HITS algorithm is not to model the Back button but to view the Web graph as a superposition of “hubs” and “authorities,” the former pointing to the latter.

On the other hand, Fagin et al., in [Fag+00] propose a very comprehensive study of the influence of introducing backward steps in a random walk, and propose a stochastic model of the random surfer that takes the Back button into account. In this model, an initial classical Markov chain is considered, to which an infinite-capacity history of visited pages is appended. To each page is associated a probability of going back, provided the history is non-empty. The main results obtained are:

The main drawback of Fagin et al.’s model is the prohibitive cost4 of computing asymptotic distributions.

In [Syd04], Sydow proposes to reduce the complexity by limiting the history size. He thus introduces a model where only the last visited page is kept in memory (the Back button cannot be used twice in a row), and where the Back probability is uniform5. The resulting algorithm exhibits performance comparable to that of a standard PageRank, in terms of convergence speed and resource requirements. The ranking obtained from the asymptotic probability distribution remains similar to that of a standard PageRank, with significant differences in the ranking of dominant pages.

Compared to Sydow’s algorithm, our approach to incorporating the Back button has the advantage of eliminating many problems associated with classical PageRank variants (stochastic completion, leaf trimming and restoration, convergence speed).

6.3 Modeling the Back button

Adopting an approach similar to [Syd04]6, we choose to introduce the possibility of going back with a limited history. Potentially, to reduce a stochastic process with memory to a memoryless Markov chain, one may need to consider the set of all paths of length in . In the case where , which is the one we shall study, the canonical working space is the set of hyperlinks. We shall introduce two intuitive models of the Back button for the case where , and show that for one of them, it is possible to reduce the working space from to . We shall also see that this latter case is compatible with the use of a zap factor.

6.3.1 Reversible model

In this model, we assume that at each step, the user can either choose an outgoing link from the current page, or press the Back button, which will take them back to the previous step. The Back button is treated as a hyperlink like any other. In particular, the probability of using the Back button is the same as that of clicking on any given link, if at least one outgoing link exists, and equals 1 in the absence of outgoing links. We believe this approach has two advantages over the uniform back probability chosen in [Syd04]:

Let us finally note that the term reversible is due to the fact that two consecutive uses of the Back button cancel each other out.

Formally, the stochastic process we have just introduced can be described as follows: for each , we define as the probability of being at at time . We also introduce the term , for , the probability of having moved from to between times and . is nonzero whenever or is in , and there is a very simple relationship between and :

where means pointed to by or pointing to . One notes that the use of the Back button requires working on the undirected graph induced by .

Similarly, it is possible to write an equation describing the terms : if , to go from to between times and , one can either follow the link from to (this requires being at at time and choosing among possibilities), or use the Back button (this requires having gone from to between times and , and choosing among possibilities). If , but , then the transition from to can only be performed via the Back button. There is no transition in other cases. We can therefore write:

Using Equation (6.1) and Equation (6.2), it is possible to carry out an iterative computation of , which can for example be initiated with the distribution

If is the undirected graph induced by , it thus seems necessary to use variables, compared to for standard PageRank. By substituting Equation (6.1) into Equation (6.2), it is possible to reduce the number of variables to , but this remains very large. This led us to consider the irreversible Back model7.
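Although we leave the reversible model at a descriptive stage, its surfer is straightforward to simulate. The sketch below is a minimal Monte Carlo illustration of the rules above, written in Python rather than the Matlab of our experiments; the names (`succ`, `reversible_surfer`) are ours, for illustration only.

```python
import random

def reversible_surfer(succ, start, steps, seed=0):
    """Count page visits of the reversible-model surfer.

    succ: dict mapping each page to the list of its out-neighbours.
    At each step the surfer picks uniformly among the d outgoing
    links plus the Back option (when a previous page exists), i.e.
    each with probability 1/(d+1); Back is forced at a dead end.
    """
    rng = random.Random(seed)
    visits = {p: 0 for p in succ}
    prev, cur = None, start
    for _ in range(steps):
        options = [("link", j) for j in succ[cur]]
        if prev is not None:
            options.append(("back", prev))
        if not options:
            break  # dead end reached with an empty history: the surfer is stuck
        kind, dest = rng.choice(options)
        # After a Back, the page we just left becomes the new "previous"
        # page, so a second Back would undo the first (reversibility).
        prev, cur = cur, dest
        visits[cur] += 1
    return visits
```

Counting visits over a long walk gives a rough empirical picture of the presence distribution that Equation (6.1) and Equation (6.2) compute exactly.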

6.3.2 Irreversible model

In this new model, we assume that it is not possible to use the Back button twice in a row: it is grayed out after use, and one must first follow at least one real link before being able to use it again. Although more complex in appearance, this model has definite advantages over the reversible model:

Figure 6.1: Back button and greenhouse effect: role of (ir)reversibility

In order to compute the evolution in such a model, we shall introduce:

probability of going from page to page between times and using a hyperlink.
probability of having arrived at page at time by using the Back button.

It is possible to describe as a function of the situation at time : to follow the link from to , either one arrived at by following a real link, and must choose among possibilities, or one arrived at via the Back button, which grayed it out and limits the possibilities to . We thus have, for ,

One will note that since , there is no ambiguity in dividing by . One will also note that the destination vertex does not appear in any way in the expression for . If is the set of pages with at least one link, and the set of dead-end pages, we shall therefore henceforth speak of , with , rather than , with .

Similarly, to have arrived at a page via the Back button, one must previously have gone from to a page pointed to by using a hyperlink, then have chosen the Back button among the possibilities. In particular, one cannot have arrived at a page in via the Back button, and is zero for . For , we can write:

where means pointed to by .

If we define the Back-attractiveness of a vertex belonging to by

it is then possible to rewrite Equation (6.5) and Equation (6.4) as follows:

By combining Equation (6.7) and Equation (6.8), we obtain Equation (6.9)

And what about PageRank in all this? The probability of being at a page is the sum of the probability of having arrived there via the Back button and that of having arrived there by following a link. By setting, by convention, for , we have:
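To make the preceding derivation concrete, here is one step of the irreversible process written directly from the verbal description, as a sketch: the names `q` (per-out-link transit probability), `b` (Back-arrival probability), `p` (presence probability) and `rho` (Back-attractiveness) are our own shorthand for the quantities of Equations (6.4) to (6.10), not necessarily the chapter’s exact notation.

```python
def irreversible_step(succ, pred, q, b):
    """One step of the irreversible Back-button process (no zap yet).

    q[i]: probability of crossing any given out-link of i (the same
          for all of i's successors, as noted in the text).
    b[i]: probability of having just arrived at i via the Back button
          (zero on dead ends, which have no link to come back from).
    """
    deg = {i: len(succ[i]) for i in succ}
    # Back-attractiveness of j: chance of pressing Back right after
    # entering j by a link (forced, i.e. 1, when j is a dead end).
    rho = {j: 1.0 if deg[j] == 0 else 1.0 / (deg[j] + 1) for j in succ}
    link_in = {i: sum(q[k] for k in pred[i]) for i in succ}
    new_q = {i: ((link_in[i] / (deg[i] + 1) + b[i] / deg[i])
                 if deg[i] else 0.0) for i in succ}
    new_b = {i: q[i] * sum(rho[j] for j in succ[i]) for i in succ}
    # Presence probability: arrived via Back, or via some in-link.
    p = {i: b[i] + link_in[i] for i in succ}
    return new_q, new_b, p
```

Iterating this step from a distribution supported on pages with outgoing links preserves total probability, which provides a quick sanity check on the recurrences.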

6.3.3 Incorporating the zap factor

Adding the irreversible Back button guarantees a stochastic process regardless of the initial graph, with the sole condition that the support of the initial distribution contains no dead-end page. The problem of short-distance dead ends, such as page in Figure 6.1, is solved, but irreducibility problems remain. Worse still, the Back button transforms medium- and long-distance dead ends into rank sinks (see Figure 6.2).

Figure 6.2: Recurrent strongly connected component created by the Back button

It is therefore necessary to introduce zap into the process. We have chosen a classical zap, with a factor and a distribution . We require that “zapping” deactivates the Back button: after a zap, it is not possible to go back. This has the practical advantage of not having to consider all the transitions implicitly generated by the zap, namely . This choice can also be interpreted at the surfer modeling level: according to Table 6.1, many real zaps actually correspond to starting a new session, and therefore to a history reset to .

Now that the framework is defined, it is possible to rewrite , for :

Similarly, we can rewrite the probability of being at time on a page with an empty history. By setting, by convention, for , we have:

And here, a problem arises: what happens if ? We will have a nonzero probability of being on a page in with no possibility of going back. In this situation, with probability , our surfer will “zap” elsewhere, and with probability … they will not know what to do!

We consider two methods to work around this problem:

6.4 Practical algorithm: BackRank

We have just defined a stochastic process, which thanks to the zap factor is irreducible and aperiodic. The Perron-Frobenius theorem therefore applies8 and allows us to assert that successive iterations of Equation (6.12) and Equation (6.11) will converge to a unique fixed point (up to renormalization). The initial conditions can for example be a distribution according to with an empty history ( and ).

6.4.1 Optimization

We shall assume here that we have chosen such that if .

From Equation (6.12) and Equation (6.11), can be defined recursively, for , by:

Equation (6.15) is a two-term recurrence, which is known to converge to a fixed point. Since only the fixed point is of interest, we can replace with while maintaining convergence to the same fixed point (Gauss-Seidel method). We thus obtain Equation (6.16).

Note in passing that the iterations are performed over vertices with nonzero out-degree: there is an implicit leaf trimming, similar to what is done for standard PageRank (see Section 5.4.4.3).

After the vector has converged to a vector , it remains to perform the “restoration” to obtain the asymptotic presence distribution :

We now have all the building blocks of BackRank (Algorithm 6.1). Note in passing that there is no need to perform renormalization, since there is convergence to a fixed point.
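As an illustration of the structure of Algorithm 6.1, here is a compact, self-contained sketch of the whole computation: irreversible Back button, zap with a continuation factor `c` and a distribution `z` vanishing on dead ends, implicit trimming (the loop only propagates flow from pages with outgoing links), and a one-pass restoration at the end. The variable names (`f` for per-out-link flow, `nb` for the “Back unavailable” mass, covering both Back arrivals and zap arrivals) are assumptions of ours, and the loop is written Jacobi-style for readability; the Gauss-Seidel variant of Equation (6.16) simply overwrites `f[i]` in place as soon as its new value is known.

```python
def backrank(succ, z, c=0.85, tol=1e-8, max_iter=1000):
    """BackRank sketch: irreversible Back button plus zap.

    succ: adjacency lists; z: zap distribution, assumed zero on dead
    ends (the choice made in Section 6.4.1); c: probability of
    continuing the walk rather than zapping.
    """
    deg = {i: len(succ[i]) for i in succ}
    pred = {i: [] for i in succ}
    for i in succ:
        for j in succ[i]:
            pred[j].append(i)
    # Back-attractiveness: chance of an immediate Back after entering
    # j by a link (forced when j is a dead end).
    rho = {j: 1.0 if deg[j] == 0 else 1.0 / (deg[j] + 1) for j in succ}
    f = {i: (z[i] / deg[i] if deg[i] else 0.0) for i in succ}
    nb = dict.fromkeys(succ, 0.0)
    for _ in range(max_iter):
        la = {i: sum(f[k] for k in pred[i]) for i in succ}  # link arrivals
        new_f = {i: (c * (la[i] / (deg[i] + 1) + nb[i] / deg[i])
                     if deg[i] else 0.0) for i in succ}
        new_nb = {i: (1 - c) * z[i] + c * f[i] * sum(rho[j] for j in succ[i])
                  for i in succ}
        delta = sum(abs(new_f[i] - f[i]) for i in succ)
        f, nb = new_f, new_nb
        if delta < tol:
            break
    # "Restoration": presence probabilities, dead ends included, in one pass.
    la = {i: sum(f[k] for k in pred[i]) for i in succ}
    return {i: la[i] + nb[i] for i in succ}
```

Note that dead-end pages receive their share of importance at the very end, through the single restoration pass, without a new series of iterations.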

6.4.2 Results

Once an algorithm is completely defined, it must be put to the test. In order to make comparisons, we needed a baseline, and we chose the -compensated PageRank defined in Algorithm 5.3. It is indeed the simplest PageRank guaranteeing full stochastic control, apart from the uncompensated PageRank of course9. The leaf trimming and restoration technique was added for fairness toward BackRank, which performs it implicitly, and also for realism (this is the method that appears to be used in practice for off-line computations on very large graphs, and BackRank is designed to handle very large graphs). For both algorithms, we used (unless stated otherwise) and is the uniform distribution over the rake .

The test was conducted on a crawl of 8 million URLs which, for technical reasons10, is split into two samples of million URLs.

Convergence

One of the first viability criteria for a PageRank-type algorithm is its convergence speed. Figure 6.3 compares, on a semi-logarithmic scale, the value of the convergence parameter after iterations. For BackRank, the condition is reached after iterations in both samples, whereas for PageRank, it takes between and iterations. We measured the ratio of the observed geometric decrease at , and found for standard PageRank (convergence is always less than , but tends asymptotically toward ) compared to for BackRank. We conjecture that this gap, which explains BackRank’s performance, is due to the fact that unlike PageRank, BackRank implicitly uses a partial Gauss-Seidel method. It has been shown that using a Gauss-Seidel-type method considerably improves PageRank convergence [Ara+01].
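The decrease ratio quoted above can be estimated mechanically from the successive L1 differences between power-method iterates; a small helper of our own, for illustration:

```python
def geometric_ratio(deltas, tail=5):
    """Estimate the geometric decrease ratio of a power-method run
    from its last `tail` successive differences ||x_{t+1} - x_t||_1."""
    last = [d for d in deltas[-tail:] if d > 0]
    ratios = [b / a for a, b in zip(last, last[1:])]
    return sum(ratios) / len(ratios)
```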

We should clarify that the algorithms used are, up to rewriting11, exactly those described in this work. In particular, the power methods employed are truly power methods. For a more complete study of BackRank’s convergence performance, one would need to investigate how it behaves under the many convergence acceleration methods that exist [Ara+01, Hav99, Kam+03a].

To conclude this convergence study, let us note that on the samples considered, one iteration of BackRank takes on average seconds compared to seconds for PageRank. This difference is due to the fact that all computation constants, including the adjacency matrix, fit in memory. The -compensation thus has a non-negligible cost within an iteration. In fact, even compared to an uncompensated PageRank with estimation, BackRank incurs only a small overhead, on the order of .

Figure 6.3: Compared convergence of BackRank and a standard PageRank

Restoration

The finalization of the PageRank computation, which we call restoration, is a phase that is rather poorly described. According to [Pag+98], after convergence of PageRank on the graph restricted to vertices with nonzero out-degree, dead-end pages are added back to the process, without affecting things significantly12. But in order to assign an importance to these new pages, at least one PageRank iteration must be performed, which requires modifying most of the constants. In our experiment, we chose to perform four iterations on the full graph after convergence on the trimmed graph. Besides slower iterations ( seconds, given that approximately of the pages in the samples were dead-end pages), we observe, as in Section 5.4.4.3, the reinitialization of the parameter (see Figure 6.3): after four iterations, we are at the same level as after eight iterations of the main loop, that is, far from a stationary state…

For BackRank, restoration is much less problematic, since it reduces to applying Equation (6.17) once and only once. There is therefore no need to start a new series of iterations. Let us nonetheless specify, for honesty’s sake, that since refers to and not , the variations in can be larger than . Experimentally, it indeed appears that when is around , the variations at the level of are on the order of (slightly lower in fact). This result is nevertheless more than acceptable, especially in light of the variations of generated by the PageRank restoration.

Figure 6.4: Overlap measurements between the top pages of BackRank and PageRank

Ranking

Technical performance is only half of the evaluation criteria for a PageRank-type algorithm. One must also test the relevance of the importance ranking induced by the obtained probability distribution. A first approach consists of quantitatively comparing the results returned by BackRank and those returned by the reference PageRank. Figure 6.4 presents such a comparison, showing the percentage of pages common to the top pages returned by BackRank and PageRank respectively. We observe significant fluctuations among the very top pages. As the number of pages considered increases, the overlap stabilizes, reaching a relatively stable value of when the number of pages considered approaches of the sample size ( pages). This tends to prove that BackRank yields results fairly close to those returned by PageRank, with some specificities.
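The overlap measurements of Figure 6.4 reduce to comparing the top-k prefixes of the two rankings; a minimal sketch, with hypothetical score dictionaries:

```python
def topk_overlap(scores_a, scores_b, k):
    """Fraction of pages common to the top-k of two score vectors,
    each given as a dict page -> importance."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k
```

Sweeping k from a handful of pages up to a sizable fraction of the sample yields curves of the kind shown in Figure 6.4.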

This stabilized overlap of for both samples intrigued us. Upon closer examination, we noticed that the overlap rate is a variable that depends only on (at least on our samples). Figure 6.5 shows this relationship between and the overlap for . One will note in particular that:

Figure 6.5: Influence of the zap factor on the overlap

We are aware that we are only providing here a few indicators of the quality of BackRank’s ranking. Strictly speaking, a complete validation would require incorporating BackRank into a search engine and conducting a series of satisfaction tests on a control population. In lieu of this, we shall use the reader as a control population, who can compare in Listings 6.1, 6.2, 6.3, and 6.4 the top 10 pages returned by BackRank and PageRank on the two samples considered. One will note for example that the main page of CNRS is ranked by BackRank, but not by PageRank (in fact, it is 16th), whereas for the Ministry of Education, the opposite is true (page also ranked 16th, but by BackRank this time).


http://messagerie.business-village.fr:80/svc/jpro/search
http://server.moselle.cci.fr:80/Fichier/index.html
http://messagerie.business-village.fr:80/svc/jpro/aide
http://backstage.vitaminic.fr:80/add_artist.shtml
http://backstage.vitaminic.fr:80/
http://vosdroits.service-public.fr:80/Index/IndexA.html
http://emploi.cv.free.fr:80/index.htm
http://server.moselle.cci.fr:80/Fichier/listenaf.html
http://ec.grita.fr:80/isroot/fruitine/blank.html
http://www.adobe.fr:80/products/acrobat/readstep.html
Listing 6.1: Sample 1: the 10 most important pages according to BackRank

http://backstage.vitaminic.fr:80/
http://backstage.vitaminic.fr:80/add_artist.shtml
http://forums.grolier.fr:8002/assemblee/nonmembers/
http://server.moselle.cci.fr:80/Fichier/listenaf.html
http://server.moselle.cci.fr:80/Fichier/index.html
http://messagerie.business-village.fr:80/svc/jpro/search
http://www.adobe.fr:80/products/acrobat/readstep.html
http://mac-texier.ircam.fr:80/index.html
http://mac-texier.ircam.fr:80/mail.html
http://bioscience.igh.cnrs.fr:80//current/currissu.htm
Listing 6.2: Sample 1: the 10 most important pages according to PageRank

http://www.fcga.fr:80/
http://www.moselle.cci.fr:80/Fichier/index.html
http://www.lhotellerie.fr:80/Menu.htm
http://www.ima.uco.fr:80/
http://www.info-europe.fr:80/europe.web/document.dir/actu.dir/
http://www.moselle.cci.fr:80/Fichier/listenaf.html
http://www.machpro.fr:80/sofetec.htm
http://www.cnrs.fr:80/
http://www.quartet.fr:80/ce/
http://www.gaf.tm.fr:80/espace-pro.htm
Listing 6.3: Sample 2: the 10 most important pages according to BackRank

http://www.moselle.cci.fr:80/Fichier/index.html
http://www.moselle.cci.fr:80/Fichier/listenaf.html
http://www.lhotellerie.fr:80/Menu.htm
http://www.machpro.fr:80/sofetec.htm
http://www.education.gouv.fr:80/default.htm
http://www.infini.fr:80/cgi-bin/lwgate.cgi
http://www.proto.education.gouv.fr:80/cgi-bin/ELLIB/Lire/Q1
http://www.dma.utc.fr:80/~ldebraux/Genealogie/Whole_File_Report.html
http://www.ldlc.fr:80/
http://www.ldlc.fr:80/contact.shtml
Listing 6.4: Sample 2: the 10 most important pages according to PageRank

  1. I apologize in advance to purists, but I confess to preferring the term Back over more verbose alternatives.
  2. Dynamic loading, automatic redirections, address auto-completion…
  3. On the scale of the Web, if million pages represent a twig, years represent an eternity. During this interval, browser ergonomics evolved (Bookmarks, URL auto-completion, history browsing…), and user behavior was also modified by the omnipresence of search engines. Thus, rather than going back, some users prefer to re-enter in their search engine the query that led them to the page they came from.
  4. With Web graphs, any algorithm whose complexity exceeds , or at most , is considered prohibitive.
  5. Since the history is limited, the limit distribution is not necessarily equal to the distribution of the classical model, even for .
  6. Unless it was Sydow who adopted an approach similar to ours, the research having been conducted independently, and each having discovered the other’s work at the thirteenth WWW conference.
  7. It may be possible to reduce the number of variables to , as we shall do with the irreversible model, but we have not yet formalized this reduction, and therefore leave the reversible Back model at a purely descriptive stage.
  8. Note in passing that we do not need to explicitly write the associated transition matrix, which is a square matrix of size . It suffices to know that this matrix exists and that it implicitly governs our process.
  9. We kept the compensation to make comparisons simpler, since both competitors thus produce a probability distribution.
  10. The algorithms run under Matlab.
  11. Algorithm 5.3 was rewritten following the model of Algorithm 5.5.
  12. “without affecting things significantly,” op. cit.