Chapter 6 — BackRank
– You’ve got to come back with me!
– Where?
– Back to the future.
True novelty always arises from a return to the sources.
We have just seen the general theoretical principles governing the various PageRank algorithms. In particular, it has become apparent that the problem of stochastic completion and that of PageRank diffusion can, and indeed should, be treated independently. In this chapter, which is an extension of a series of articles co-authored with Mohamed Bouklit [BM03, MB04], we shall study a way of performing stochastic completion while refining the random surfer model: taking into account the user’s ability to go back at any point during navigation using the Back button of their browser1. As we shall see, the PageRank algorithm resulting from this model has numerous advantages over classical PageRank variants.
6.1 On the importance of the Back button
In 1995, Catledge and Pitkow published a study showing that the Back button accounted for of all recorded navigation actions ( of actions being hyperlink usage) [CP95]. Another study, by Tauscher and Greenberg, dating from 1997, reported hyperlinks and Back [TG97]. Finally, let us cite a more recent study (2004) by Milic et al. [Mil+04], whose main results are presented in Table 6.1.
| Navigation mode | % | Navigation mode | % |
| Hyperlinks | Bookmarks | ||
| Back button | Manually typed URL | ||
| New session | Start page | ||
| Other methods2 | Refresh button | ||
| Forms | Forward button |
From all these studies, a certain decline in the use of the Back button between 1995 and 2004 seems to emerge, but according to [CM01], one must take into account the changes that occurred during this interval3, as well as changes in the experimental protocols used.
The other interesting piece of information these figures give us is that in all cases, the use of the Back button comes very clearly in second place, behind the use of hyperlinks which remains the preferred navigation mode. In particular, one should note that according to all studies, for every clicks on a hyperlink, there is at least use of the Back button. Yet, the use of the Back button has no equivalent in “classical” PageRank(s), whereas quantitatively less significant navigation modes such as manually typed URLs and Bookmarks can be taken into account in the computation algorithms through the zap vector.
Even though it is not really known to date whether a model closer to the real surfer necessarily yields a better PageRank, the importance of the Back button in real users’ behavior seemed to us a sufficient motivation to study the possibilities of incorporating it into a new PageRank model.
6.2 Previous and contemporary work
Kleinberg, in the HITS algorithm [Kle98], also uses the inlink matrix, and thus works on a model where following hyperlinks backward is implicitly taken into account. Nevertheless, the purpose of the HITS algorithm is not to model the Back button but to view the Web graph as a superposition of “hubs” and “authorities,” the former pointing to the latter.
On the other hand, Fagin et al., in [Fag+00] propose a very comprehensive study of the influence of introducing backward steps in a random walk, and propose a stochastic model of the random surfer that takes the Back button into account. In this model, an initial classical Markov chain is considered, to which an infinite-capacity history of visited pages is appended. To each page is associated a probability of going back, provided the history is non-empty. The main results obtained are:
- Depending on the back probabilities, the process can be transient (asymptotically, the starting page is “forgotten” and eventually ceases to influence the probability distribution) or ergodic (backward steps bring the surfer infinitely often back to the starting page), with a phase transition between the two.
- In the particular case where the back probability does not depend on the page, if , the process converges to the same distribution as the classical Markov chain (without the possibility of going back).
- The other cases are characterized, different types of convergence are considered, and the limit values (when they exist) are given.
The main drawback of Fagin et al.‘s model is the prohibitive cost4 of computing asymptotic distributions.
In [Syd04], Sydow proposes to reduce the complexity by limiting the history size. He thus introduces a model where only the last visited page is kept in memory (the Back button cannot be used twice in a row), and where the Back probability is uniform5. The resulting algorithm exhibits performance comparable to that of a standard PageRank, in terms of convergence speed and resource requirements. The ranking obtained from the asymptotic probability distribution remains similar to that of a standard PageRank, with significant differences in the ranking of dominant pages.
Compared to Sydow’s algorithm, our approach to incorporating the Back button has the advantage of eliminating many problems associated with classical PageRank variants (stochastic completion, leaf trimming and restoration, convergence speed).
6.3 Modeling the Back button
Adopting an approach similar to [Syd04]6, we choose to introduce the possibility of going back with a limited history. Potentially, to reduce a stochastic process with memory to a memoryless Markov chain, one may need to consider the set of all paths of length in . In the case where , which is the one we shall study, the canonical working space is the set of hyperlinks. We shall introduce two intuitive models of the Back button for the case where , and show that for one of them, it is possible to reduce the working space from to . We shall also see that this latter case is compatible with the use of a zap factor.
6.3.1 Reversible model
In this model, we assume that at each step, the user can either choose an outgoing link from the current page, or press the Back button, which will take them back to the previous step. The Back button is treated as a hyperlink like any other. In particular, the probability of using the Back button is the same as that of clicking on any given link, if at least exists, and equals in the absence of outgoing links. We believe this approach has two advantages over the uniform back probability chosen in [Syd04]:
- Intuitively, it seems logical that the use of the Back button should depend on the available choices on the current page, that is, on the number of outgoing links. This is partially confirmed by the behaviors observed in [Mil+04].
- Just like the virtual completion page introduced in Section 5.6, the Back button we have just modeled provides an escape even from pages without links, and creates a form of stochastic completion.
Let us finally note that the term reversible is due to the fact that two consecutive uses of the Back button cancel each other out.
Formally, the stochastic process we have just introduced can be described as follows: for each , we define as the probability of being at at time . We also introduce the term , for , the probability of having moved from to between times and . is nonzero whenever or is in , and there is a very simple relationship between and :
where means pointed to by or pointing to . One notes that the use of the Back button requires working on the undirected graph induced by .
Similarly, it is possible to write an equation describing the terms : if , to go from to between times and , one can either follow the link from to (this requires being at at time and choosing among possibilities), or use the Back button (this requires having gone from to between times and , and choosing among possibilities). If , but , then the transition from to can only be performed via the Back button. There is no transition in other cases. We can therefore write:
Using Equation (6.1) and Equation (6.2), it is possible to carry out an iterative computation of , which can for example be initiated with the distribution
If is the undirected graph induced by , it thus seems necessary to use variables, compared to for standard PageRank. By substituting Equation (6.1) into Equation (6.2), it is possible to reduce the number of variables to , but this remains very large. This led us to consider the irreversible Back model7.
6.3.2 Irreversible model
In this new model, we assume that it is not possible to use the Back button twice in a row: it is grayed out after use, and one must first follow at least one real link before being able to use it again. Although more complex in appearance, this model has definite advantages over the reversible model:
- It is closer to the behavior of the Back button in real browsers, which indeed becomes disabled when the history is exhausted. The use of the Back button after history exhaustion in the reversible model is more akin to the use of the Forward button in real browsers. And, if we are to believe [Mil+04] (see Table 6.1), the role of the Forward button remains relatively negligible.
- It is compatible with the effective introduction of a zap factor (see Section 6.3.3).
- Computing the asymptotic distribution, even with a zap factor, requires no more resources than in the case of a classical PageRank (see Section 6.4.1).
- The use of the Back button inevitably introduces a kind of “greenhouse effect” at dead-end pages, which can be problematic (an effect similar to that of recurrent strongly connected components described in the previous chapter). The irreversible model reduces this effect compared to the reversible model. Consider the example of Figure 6.1: a page has links, one to a dead-end page , the other to an escape page from which the entire graph is reachable. If the random surfer ends up at , they will necessarily return to via the Back button. Going back to then creates the beginning of a greenhouse effect, and the return to occurs with probability in the reversible model (2 favorable outcomes out of 3), compared to in the irreversible case: the probability of remaining “stuck” at is lower in the irreversible model.
In order to compute the evolution in such a model, we shall introduce:
- probability of going from page to page between times and using a hyperlink.
- probability of having arrived at page at time by using the Back button.
It is possible to describe as a function of the situation at time : to follow the link from to , either one arrived at by following a real link, and must choose among possibilities, or one arrived at via the Back button, which grayed it out and limits the possibilities to . We thus have, for ,
One will note that since , there is no ambiguity in dividing by . One will also note that the destination vertex does not appear in any way in the expression for . If is the set of pages with at least one link, and the set of dead-end pages, we shall therefore henceforth speak of , with , rather than , with .
Similarly, to have arrived at a page via the Back button, one must previously have gone from to a page pointed to by using a hyperlink, then have chosen the Back button among the possibilities. In particular, one cannot have arrived at a page in via the Back button, and is zero for . For , we can write:
where means pointed to by .
If we define the Back-attractiveness of a vertex belonging to by
it is then possible to rewrite Equation (6.5) and Equation (6.4) as follows:
By combining Equation (6.7) and Equation (6.8), we obtain Equation (6.9)
And what about PageRank in all this? The probability of being at a page is the sum of the probability of having arrived there via the Back button and that of having arrived there by following a link. By setting, by convention, for , we have:
6.3.3 Incorporating the zap factor
Adding the irreversible Back button guarantees a stochastic process regardless of the initial graph, with the sole condition that the support of the initial distribution contains no dead-end page. The problem of short-distance dead ends, such as page in Figure 6.1, is solved, but irreducibility problems remain. Worse still, the Back button transforms medium- and long-distance dead ends into rank sinks (see Figure 6.2).
It is therefore necessary to introduce zap into the process. We have chosen a classical zap, with a factor and a distribution . We require that “zapping” deactivates the Back button: after a zap, it is not possible to go back. This has the practical advantage of not having to consider all the transitions implicitly generated by the zap, namely . This choice can also be interpreted at the surfer modeling level: according to Table 6.1, many real zaps actually correspond to starting a new session, and therefore to a history reset to .
Now that the framework is defined, it is possible to rewrite , for :
Similarly, we can rewrite the probability of being at time on a page with an empty history. By setting, by convention, for , we have:
And here, a problem arises: what happens if ? We will have a nonzero probability of being on a page in with no possibility of going back. In this situation, with probability , our surfer will “zap” elsewhere, and with probability … they will not know what to do!
We consider two methods to work around this problem:
-
Choose such that . Since the only condition required of a zap distribution is that it be covering, this is entirely possible. A distribution over home pages will suffice, provided no home page is a single dead-end page. One can also choose the uniform distribution over defined by
-
Complete the stochastic process on the fly. We do know the probability of not knowing what to do between times and : . It then suffices to redistribute this probability, for example as zap according to . Equation Equation (6.12) then becomes:
6.4 Practical algorithm: BackRank
We have just defined a stochastic process, which thanks to the zap factor is irreducible and aperiodic. The Perron-Frobenius theorem therefore applies8 and allows us to assert that successive iterations of Equation (6.12) and Equation (6.11) will converge to a unique fixed point (up to renormalization). The initial conditions can for example be a distribution according to with an empty history ( and ).
6.4.1 Optimization
We shall assume here that we have chosen such that if .
From Equation (6.12) and Equation (6.11), can be defined recursively, for , by:
Equation (6.15) is a two-term recurrence, which is known to converge to a fixed point. Since only the fixed point is of interest, we can replace with while maintaining convergence to the same fixed point (Gauss-Seidel method). We thus obtain Equation (6.16).
Note in passing that the iterations are performed over vertices with nonzero out-degree: there is an implicit leaf trimming, similar to what is done for standard PageRank (see Section 5.4.4.3).
After the vector has converged to a vector , it remains to perform the “restoration” to obtain the asymptotic presence distribution :
We now have all the building blocks of BackRank (Algorithm 6.1). Note in passing that there is no need to perform renormalization, since there is convergence to a fixed point.
6.4.2 Results
Once an algorithm is completely defined, it must be put to the test. In order to make comparisons, we needed a baseline , and we chose the -compensated PageRank defined in Algorithm 5.3. It is indeed the simplest PageRank guaranteeing full stochastic control, apart from the uncompensated PageRank of course9. The leaf trimming and restoration technique was added for fairness toward BackRank, which performs it implicitly, and also for realism (this is the method that appears to be used in practice for off-line computations on very large graphs, and BackRank is designed to handle very large graphs). For both algorithms, we used (unless stated otherwise) and is the uniform distribution over the rake .
The test was conducted on a crawl of 8 million URLs which, for technical reasons10, is split into two samples of million URLs.
Convergence
One of the first viability criteria for a PageRank-type algorithm is its convergence speed. Figure 6.3 compares, on a semi-logarithmic scale, the value of the convergence parameter after iterations. For BackRank, the condition is reached after iterations in both samples, whereas for PageRank, it takes between and iterations. We measured the ratio of the observed geometric decrease at , and found for standard PageRank (convergence is always less than , but tends asymptotically toward ) compared to for BackRank. We conjecture that this gap, which explains BackRank’s performance, is due to the fact that unlike PageRank, BackRank implicitly uses a partial Gauss-Seidel method. It has been shown that using a Gauss-Seidel-type method considerably improves PageRank convergence [Ara+01].
We should clarify that the algorithms used are, up to rewriting11, exactly those described in this work. In particular, the power methods employed are truly power methods. For a more complete study of BackRank’s convergence performance, one would need to investigate how it behaves under the many convergence acceleration methods that exist [Ara+01, Hav99, Kam+03a].
To conclude this convergence study, let us note that on the samples considered, one iteration of BackRank takes on average seconds compared to seconds for PageRank. This difference is due to the fact that all computation constants, including the adjacency matrix, fit in memory. The -compensation thus has a non-negligible cost within an iteration. In fact, even compared to an uncompensated PageRank with estimation, BackRank incurs only a small overhead, on the order of .
Restoration
The finalization of the PageRank computation, which we call restoration, is a phase that is rather poorly described. According to [Pag+98], after convergence of PageRank on the graph restricted to vertices with nonzero out-degree, dead-end pages are added back to the process, without affecting things significantly12. But in order to assign an importance to these new pages, at least one PageRank iteration must be performed, which requires modifying most of the constants. In our experiment, we chose to perform four iterations on the full graph after convergence on the trimmed graph. Besides slower iterations ( seconds, given that approximately of the pages in the samples were dead-end pages), we observe, as in Section 5.4.4.3 page Section 5.4.4.3, the reinitialization of the parameter (see Figure 6.3): after four iterations, we are at the same level as after eight iterations of the main loop, that is, far from a stationary state…
For BackRank, restoration is much less problematic, since it reduces to applying Equation (6.17) once and only once. There is therefore no need to start a new series of iterations. Let us nonetheless specify, for honesty’s sake, that since refers to and not , the variations in can be larger than . Experimentally, it indeed appears that when is around , the variations at the level of are on the order of (slightly lower in fact). This result is nevertheless more than acceptable, especially in light of the variations of generated by the PageRank restoration.
Ranking
Technical performance is only half of the evaluation criteria for a PageRank-type algorithm. One must also test the relevance of the importance ranking induced by the obtained probability distribution. A first approach consists of quantitatively comparing the results returned by BackRank and those returned by the reference PageRank. Figure 6.4 presents such a comparison, showing the percentage of pages common to the top pages returned by BackRank and PageRank respectively. We observe significant fluctuations among the very top pages. As the number of pages considered increases, the overlap stabilizes, reaching a relatively stable value of when the number of pages considered approaches of the sample size ( pages). This tends to prove that BackRank yields results fairly close to those returned by PageRank, with some specificities.
This stabilized overlap of for both samples intrigued us. Upon closer examination, we noticed that the overlap rate is a variable that depends only on (at least on our samples). Figure 6.5 shows this relationship between and the overlap for . One will note in particular that:
- The overlap is a decreasing function of . This echoes the idea, discussed in Section 5.3.5, of the smoothing role of the zap factor.
- In particular, the overlap tends toward as tends toward , meaning that BackRank tends, just like PageRank, toward the in-degree distribution (see Section 5.3.5).
- The overlap tends toward as tends toward , which seems to indicate that BackRank is intrinsically half different from (or half similar to?) standard PageRank. One should however not forget that when , the ranking is ossified by rank sinks and no longer necessarily meaningful. Upon verification of a few URLs, this overlap indeed only means that BackRank does not get trapped in exactly the same sinks as PageRank.
We are aware that we are only providing here a few indicators of the quality of BackRank’s ranking. Strictly speaking, a complete validation would require incorporating BackRank into a search engine and conducting a series of satisfaction tests on a control population. In lieu of this, we shall use the reader as a control population, who can compare in lists Listing 6.1, Listing 6.2, Listing 6.3, and Listing 6.4 the top 10 pages returned by BackRank and PageRank on the two samples considered. One will note for example that the main page of CNRS is ranked by BackRank, but not by PageRank (in fact, it is 16th), whereas for the Ministry of Education, the opposite is true (page also ranked 16th, but by BackRank this time).
http://messagerie.business-village.fr:80/svc/jpro/search
http://server.moselle.cci.fr:80/Fichier/index.html
http://messagerie.business-village.fr:80/svc/jpro/aide
http://backstage.vitaminic.fr:80/add_artist.shtml
http://backstage.vitaminic.fr:80/
http://vosdroits.service-public.fr:80/Index/IndexA.html
http://emploi.cv.free.fr:80/index.htm
http://server.moselle.cci.fr:80/Fichier/listenaf.html
http://ec.grita.fr:80/isroot/fruitine/blank.html
http://www.adobe.fr:80/products/acrobat/readstep.html
http://backstage.vitaminic.fr:80/
http://backstage.vitaminic.fr:80/add_artist.shtml
http://forums.grolier.fr:8002/assemblee/nonmembers/
http://server.moselle.cci.fr:80/Fichier/listenaf.html
http://server.moselle.cci.fr:80/Fichier/index.html
http://messagerie.business-village.fr:80/svc/jpro/search
http://www.adobe.fr:80/products/acrobat/readstep.html
http://mac-texier.ircam.fr:80/index.html
http://mac-texier.ircam.fr:80/mail.html
http://bioscience.igh.cnrs.fr:80//current/currissu.htm
http://www.fcga.fr:80/
http://www.moselle.cci.fr:80/Fichier/index.html
http://www.lhotellerie.fr:80/Menu.htm
http://www.ima.uco.fr:80/
http://www.info-europe.fr:80/europe.web/document.dir/actu.dir/
http://www.moselle.cci.fr:80/Fichier/listenaf.html
http://www.machpro.fr:80/sofetec.htm
http://www.cnrs.fr:80/
http://www.quartet.fr:80/ce/
http://www.gaf.tm.fr:80/espace-pro.htm
http://www.moselle.cci.fr:80/Fichier/index.html
http://www.moselle.cci.fr:80/Fichier/listenaf.html
http://www.lhotellerie.fr:80/Menu.htm
http://www.machpro.fr:80/sofetec.htm
http://www.education.gouv.fr:80/default.htm
http://www.infini.fr:80/cgi-bin/lwgate.cgi
http://www.proto.education.gouv.fr:80/cgi-bin/ELLIB/Lire/Q1
http://www.dma.utc.fr:80/~ldebraux/Genealogie/Whole_File_Report.html
http://www.ldlc.fr:80/
http://www.ldlc.fr:80/contact.shtml
- 1I apologize in advance to purists, but I confess to preferring the term Back over more verbose alternatives.
- 2Dynamic loading, automatic redirections, address auto-completion…
- 3On the scale of the Web, if million pages represent a twig, years represent an eternity. During this interval, browser ergonomics evolved (Bookmarks, URL auto-completion, history browsing…), and user behavior was also modified by the omnipresence of search engines. Thus, rather than going back, some users prefer to re-enter in their search engine the query that led them to the page they came from.
- 4With Web graphs, any algorithm whose complexity exceeds , or at most , is considered prohibitive.
- 5Since the history is limited, the limit distribution is not necessarily equal to the distribution of the classical model, even for .
- 6Unless it was Sydow who adopted an approach similar to ours, the research having been conducted independently, and each having discovered the other’s work at the thirteenth WWW conference.
- 7It may be possible to reduce the number of variables to , as we shall do with the irreversible model, but we have not yet formalized this reduction, and therefore leave the reversible Back model at a purely descriptive stage.
- 8Note in passing that we do not need to explicitly write the associated transition matrix, which is a square matrix of size . It suffices to know that this matrix exists and that it implicitly governs our process.
- 9We kept the compensation to make comparisons simpler, since both competitors thus produce a probability distribution.
- 10The algorithms run under Matlab.
- 11Algorithm 5.3 was rewritten following the model of Algorithm 5.5.
- 12”without affecting things significantly,“ op. cit.