Appendix B — A Short Study of PageRank on the INRIA Website
Starting from a snapshot of the graph of the website http://www.inria.fr, we applied a PageRank-type algorithm to determine which pages received the highest ranking. Several interesting observations emerged:
- The removal of the leaves of the graph in the PageRank computation proved justified (by leaf, we mean a node with zero out-degree).
- It proved essential to remove self-links (links from a page to itself), as they cause a resonance phenomenon. On the INRIA graph, before this modification, the PageRank was largely dominated by a page that combines the roles of a self-referencing page and the root of a sink.
- The choice of the weight of the “random click” turns out to be crucial: if it is too small, sinks or near-sinks will absorb all the PageRank. If it is too large, the iterative aspect of PageRank will vanish, and the ranking will roughly correspond to a ranking by in-degree. In the case of INRIA, a random click probability of at each iteration appears to be a good compromise.
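The three observations above can be illustrated with a minimal sketch. It is not the code used for this study: the function names, the edge-set representation, the choice to prune leaves repeatedly until none remain, and the default `random_click=0.15` (a commonly quoted value, not the one retained for INRIA) are all assumptions made for illustration.

```python
from collections import defaultdict

def remove_self_links(edges):
    """Drop every link that goes from a page to itself."""
    return {(u, v) for (u, v) in edges if u != v}

def remove_leaves(edges):
    """Drop nodes with zero out-degree, together with the links into them.

    Assumption: pruning is repeated until no leaf remains, since removing
    a leaf can turn its predecessors into new leaves.
    """
    edges = set(edges)
    while True:
        sources = {u for (u, _) in edges}
        pruned = {(u, v) for (u, v) in edges if v in sources}
        if pruned == edges:
            return edges
        edges = pruned

def pagerank(edges, random_click=0.15, iterations=100):
    """Power iteration on the cleaned graph.

    random_click is the probability of jumping to a uniformly chosen page
    at each iteration (hypothetical default, not the INRIA value).
    """
    # Self-links go first: a page whose only out-link is a self-link
    # becomes a leaf and is then pruned as well.
    edges = remove_leaves(remove_self_links(edges))
    nodes = sorted({n for e in edges for n in e})
    if not nodes:
        return {}
    out_links = defaultdict(list)
    for u, v in edges:
        out_links[u].append(v)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every page receives the same share of the random clicks.
        new_rank = {n: random_click / len(nodes) for n in nodes}
        for u in nodes:
            targets = out_links[u]  # non-empty: leaves were pruned
            share = (1.0 - random_click) * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank

# Toy graph: a cycle a -> b -> c -> a, one self-link on a, one leaf d.
toy = {("a", "b"), ("b", "c"), ("c", "a"), ("a", "a"), ("c", "d")}
ranks = pagerank(toy)
```

On this toy graph the self-link and the leaf `d` are discarded, leaving a plain cycle on which the three remaining pages share the rank equally. Setting `random_click` close to 0 or close to 1 on a graph containing a near-sink is a simple way to reproduce the two failure modes described in the last observation.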
Results
The top ten URLs returned by our PageRank algorithm are worth analysing (see Table 2.1). While naturally correlated with the in-degree ranking, the PageRank ranking differs from it significantly (compare Table 2.1 and Table 2.2).
| URL (http://www.inria.fr/…) | Local PR | Google PR | In-deg |
| index.fr.html | | | 608 |
| rapportsactivite/RA94/RA94.kw.html | | | 327 |
| actualites/index.fr.html | | | 367 |
| fonctions/plan.fr.html | | | 297 |
| valorisation/index.fr.html | | | 302 |
| travailler/index.fr.html | | | 312 |
| recherche/index.fr.html | | | 297 |
| publications/index.fr.html | | | 294 |
| inria/index.fr.html | | | 229 |
| rapportsactivite/RA94/RA94.pers.html | | | 320 |
In terms of relevance, the pages returned by our PageRank appear to be well chosen overall (homepage in first place, followed by “index” or “site map” type pages), with the notable exception of two pages: rapportsactivite/RA94/RA94.kw.html and rapportsactivite/RA94/RA94.pers.html.
Upon verification, and as one might expect, these two pages turn out to be the two main nodes of a near-sink, namely rapportsactivite/RA94/. Because they combine a high in-degree with a location inside a near-sink, they appear very difficult to filter out using only a local PageRank.
| URL (http://www.inria.fr/…) | In-deg |
| index.fr.html | 608 |
| index.en.html | 391 |
| actualites/index.fr.html | 367 |
| rapportsactivite/RA94/RA94.kw.html | 327 |
| rapportsactivite/RA94/RA94.pers.html | 320 |
| travailler/index.fr.html | 312 |
| valorisation/index.fr.html | 302 |
| fonctions/recherche.fr.html | 299 |
| fonctions/annuaire.fr.html | 297 |
| fonctions/plan.fr.html | 297 |
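The difference between the two rankings can be quantified directly from the URLs listed in Table 2.1 and Table 2.2. The following sketch only compares the two top-ten lists as sets; the helper name `compare_top_lists` is an assumption made for illustration and is not part of the original study.

```python
def compare_top_lists(first, second):
    """Return (shared, only_first, only_second), preserving list order."""
    first_set, second_set = set(first), set(second)
    shared = [u for u in first if u in second_set]
    only_first = [u for u in first if u not in second_set]
    only_second = [u for u in second if u not in first_set]
    return shared, only_first, only_second

# Top ten by local PageRank (Table 2.1), paths relative to http://www.inria.fr/
pagerank_top10 = [
    "index.fr.html",
    "rapportsactivite/RA94/RA94.kw.html",
    "actualites/index.fr.html",
    "fonctions/plan.fr.html",
    "valorisation/index.fr.html",
    "travailler/index.fr.html",
    "recherche/index.fr.html",
    "publications/index.fr.html",
    "inria/index.fr.html",
    "rapportsactivite/RA94/RA94.pers.html",
]

# Top ten by in-degree (Table 2.2)
indegree_top10 = [
    "index.fr.html",
    "index.en.html",
    "actualites/index.fr.html",
    "rapportsactivite/RA94/RA94.kw.html",
    "rapportsactivite/RA94/RA94.pers.html",
    "travailler/index.fr.html",
    "valorisation/index.fr.html",
    "fonctions/recherche.fr.html",
    "fonctions/annuaire.fr.html",
    "fonctions/plan.fr.html",
]

shared, only_pagerank, only_indegree = compare_top_lists(
    pagerank_top10, indegree_top10
)
print(len(shared))    # 7
print(only_pagerank)  # the three pages promoted by PageRank
print(only_indegree)  # the three pages promoted by in-degree alone
```

Seven of the ten pages appear in both lists. The PageRank ranking replaces index.en.html, fonctions/recherche.fr.html and fonctions/annuaire.fr.html with recherche/index.fr.html, publications/index.fr.html and inria/index.fr.html, and the order of the shared pages differs as well.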
Comparison with Google
Google assigns a ranking of 9/10 to the INRIA homepage and 8/10 to the other top ten pages of the local PageRank, with the exception of rapportsactivite/RA94/RA94.kw.html and rapportsactivite/RA94/RA94.pers.html, which receive a score of 6/10. Two main observations:
- The two RA94 pages had a local PageRank nearly equal to that of the other top ten pages, the homepage excepted. Google’s global PageRank managed to isolate them. A plausible explanation is that numerous external pages probably link to the “index” type pages, whereas very few external pages are likely to point to RA94.
- The score of 6/10 assigned to the RA94 pages remains high, certainly higher than one would wish. Many homepages of websites considered more noteworthy do not achieve this score.