Chapter 1 — What is the Web?
I shall speak, for instance, of the Mind and of the “demiurge,” while carefully refraining from defining too precisely what I mean by these terms: because I do not clearly know… In this dreadful state of our ignorance, must we really define so precisely?
1.1 Genesis of the Web
Sources: A History of Networks [Gen04].
In the early 1960s, while a small melody drifted through the sky (a bird called Sputnik), the United States decided to develop a decentralised network capable of withstanding a nuclear attack that would destroy one or more of its nerve centres. The year was 1962.
A few years later, this project became ARPANET. It was 1969 and four American universities were connected by this network of a new kind. From that moment on, the network never stopped growing and evolving, until it became what is now called the Internet¹, that is to say a formidable network of networks.
In 1972, the Acceptable Use Policy (AUP) charter prohibited any commercial entity from connecting to the network.
In 1984, the milestone of 1,000 machines was reached and Centre Européen pour la Recherche Nucléaire (CERN) joined the Internet. Six years later, in 1990, while the number of connected computers reached 300,000, the largest Internet site in the world was that of CERN, the future birthplace of the Web, a vast worldwide collection of so-called hypertext and hypermedia documents distributed over the Internet.
That same year, 1990, the AUP ceased to exist, paving the way for what would become, a few years later, the Dot-com Bubble.
In 1991, Tim Berners-Lee of CERN introduced the concept of the World Wide Web (WWW), sometimes referred to simply as the Web. The World Wide Web is the part of the Internet where the navigation method is HyperText and the protocol is HyperText Transfer Protocol (HTTP).
The philosophy of HTTP lies in so-called hypertext links that connect pages to one another and allow navigation when selected. We speak of the “Web” — with a capital letter — even though it is in reality the “World Wide Web” or “W3”.
1.2 A Definition of the Web
Originally, as we have just seen, the Web was characterised by both a protocol, HTTP, and a language, HyperText Markup Language (HTML); the former serving to deliver “pages” (files) written in the latter, interpreted on the client side by Web browsers.
Nowadays, one may question the validity of this dual characterisation:
- HTML is an easy-to-learn language for writing structured documents linked to one another by hyperlinks. In practice, it is no longer used exclusively through HTTP and can be found on many other media (CD-ROM, DVD-ROM…) for purposes as numerous as they are varied (documentation, education, encyclopaedias, help…).
- In order to deliver multimedia content, HTTP is designed to serve any type of file. Images, sounds, videos, texts in various formats, archives, executables… are all accessible through the HTTP protocol, in a spirit more or less removed from the original hypertext navigation concept. This tendency toward “all-HTTP” leads to sometimes paradoxical situations. For instance, to transfer files (even large ones), there is a certain tendency to abandon the appropriate protocol, File Transfer Protocol (FTP), in favour of HTTP. As shown in Table 1.1², it appears that files traditionally transferred via FTP are now transferred via peer-to-peer (P2P), but also via HTTP (for example, most download servers on sourceforge.net use HTTP).
- Certain documents that are not HTML documents allow hyperlink navigation: proprietary formats (Acrobat PDF, Flash…) or newer languages (WML, XML…).
Table 1.1: Share of traffic carried by HTTP, FTP, and P2P.

| Year | HTTP | FTP        | P2P  |
|------|------|------------|------|
| 2001 | 13 % | 10 %       | 35 % |
| 2002 | 14 % | 2 %        | 50 % |
| 2004 | 20 % | negligible | 65 % |
One can glimpse the difficulty of finding a “good” definition of the Web. For the sake of simplicity more than anything else, throughout this thesis we shall continue to define the Web according to its initial dual characterisation. Thus, we shall call Web the set of documents written in HTML and available on the Internet via the HTTP protocol. We are aware of the extremely restrictive nature of this definition, but prefer to work on a well-defined and widely studied set rather than to seek an exhaustiveness that may not even be attainable.
1.3 Accessibility of the Web
Within the Web we have just defined, the problem of visibility and accessibility of pages now arises. What can we see of the Web and how can we access it? Several structurings of the Web based on these questions of visibility, accessibility, and indexability have been proposed.
1.3.1 Depths of the Web
Michael K. Bergman, in 2000, proposed the metaphor of depth to distinguish the different Webs [Ber00]. One thus distinguishes:
- The Surface Web
- the surface of the Web, according to Bergman, consists of all static and publicly available pages.
- The Deep Web
- conversely, the deep Web consists of dynamic websites and databases accessible through a Web interface.
This vision of the Web remains rather Manichaean. Danny Sullivan [Sul00] proposes a third kind of Web, the Shallow Web³, made up for instance of publicly available dynamic pages, such as those of the Internet Movie Database (IMDB) (http://www.imdb.com), or those of the site http://citeseer.ist.psu.edu/.
1.3.2 Visibility of the Web
Chris Sherman and Gary Price propose in their book The Invisible Web [SP01] an approach based on visibility by the major search engines. The equivalent of the Surface Web is, for Sherman and Price, the set of pages indexed by search engines. According to them, the rest of the Web then breaks down into four categories:
- The Opaque Web
- pages that could be indexed by search engines but are not (limits on the number of pages indexed per site, low indexing frequency, pages lacking inbound links and thus unreachable by crawling).
- The Private Web
- web pages that are available but voluntarily excluded by webmasters (passwords, meta tags, or exclusion files telling the search engine robot not to index them).
- The Proprietary Web
- pages accessible only to authorised persons (intranets, authentication systems…). The robot therefore cannot access them.
- The Truly Invisible Web
- content that cannot be indexed for technical reasons: for example, a file format unknown to the search engine, or dynamically generated pages (whose URLs include characters such as ? and &)…
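The last category above hinges on the query characters that early crawlers refused to follow. As a small illustration (not part of Sherman and Price's text), the heuristic such a crawler might apply can be sketched with Python's standard `urllib.parse`; the IMDB-style URL below is a hypothetical example:

```python
from urllib.parse import urlparse, parse_qs

def is_dynamic(url: str) -> bool:
    """Heuristic an early-2000s crawler might apply: treat any URL
    carrying a query string (the part after '?') as dynamically generated."""
    return urlparse(url).query != ""

print(is_dynamic("http://www.imdb.com/title/?id=tt0133093"))  # True
print(is_dynamic("http://www.example.org/page.html"))         # False

# The variables a dynamic page receives are visible in the URL itself:
print(parse_qs(urlparse("http://www.imdb.com/title/?id=tt0133093").query))
# {'id': ['tt0133093']}
```

A page flagged this way was simply skipped by the robot, which is why whole databases behind Web interfaces ended up in the "truly invisible" zone.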
1.3.3 Accessible Web
Each of the two approaches we have just seen has its advantages and disadvantages. While Bergman’s definition is fairly appealing (on the one hand, a static Web accessible by clicks; on the other, a dynamic Web reachable only through queries), it does not realistically describe the current Web. The approach of Sherman and Price, namely discrimination based on indexability by search engines, is more flexible, but does not, in my opinion, always associate the right causes with the right effects⁴.
A third approach, used by [BB98, Dah00, Hen+99], provides a kind of synthesis of the models just mentioned. It is the model of the accessible Web:
Definition 1.1.
The accessible Web is the set of Web pages that may be pointed to by a hyperlink.
More precisely:
- This is equivalent to considering as part of the accessible Web any page that can be accessed simply by typing the correct address — also referred to by the term Uniform Resource Locator (URL) [BMM94] — into a browser.
- Dynamic pages that do not have hidden variables, that is, whose possible variables are passed in the URL, are part of the accessible Web.
By convention, we shall exclude certain pages from our definition of the accessible Web:
- Error pages returned by a server (4xx errors, for example);
- Pages whose access by robots is forbidden by a robots.txt file;
- Pages protected by login and/or password (even if the login and password can be given in the URL).
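The robots.txt exclusion above is mechanical enough to sketch in code. Python's standard `urllib.robotparser` implements the robots exclusion protocol; the rules and URLs below are hypothetical examples, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved robot would skip the second URL, so under our
# convention it falls outside the accessible Web.
print(rp.can_fetch("*", "http://www.example.org/index.html"))      # True
print(rp.can_fetch("*", "http://www.example.org/private/x.html"))  # False
```

Note that the exclusion is purely conventional: the page itself remains reachable by typing its URL, which is precisely why we exclude it by fiat rather than by the definition.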
This definition obviously has its flaws as well. For example, it takes no account of the problem of duplicates (how should one treat two pages with strictly identical content, for instance two URLs corresponding to the same physical file?), nor of the temporal dynamicity of pages and their content (what does the accessibility of the front page of a newspaper mean? Or of a page that returns the current time and date?). Strictly speaking, we should thus speak of the Web accessible at a given instant t, and accept identifying a page with its URL despite the inevitable redundancies.

Even so, many grey areas persist. For instance, some ill-intentioned administrators return different content depending on whether the requester of the page is human or not, in order to deceive search engines; others return pages adapted to the visitor's browser; still others check that the visitor has indeed passed through the home page before navigating within the site, and redirect them there if not; finally, because of routing problems, it is entirely possible that at a given instant t, a server is perfectly visible from one IP address and inaccessible from another. In addition to the instant t, the address from which the request is made and all the information transmitted in the HTTP request must therefore also be specified.
1.4 Intermezzo: the page that linked to all pages
During a discussion with Jean-Loup Guillaume and Matthieu Latapy, in that hallowed place of scientific research known as the coffee machine room, while we were arguing at length about the bow-tie model proposed by Broder et al. [Bro+00]⁵, a whimsical idea fell from the cup: what if we set up a Web page capable of linking to all other pages?
After being first written by Matthieu Latapy, whose code I then took over, a somewhat improved version of the original page is now available at http://www.liafa.jussieu.fr/~fmathieu/arbre.php [GLM02].
The principle of this page is that of a typewriter, or rather a click-writer. To reach a given page, one simply clicks its letters one by one, the resulting address being kept in memory through a variable passed as a parameter in the URL. The page dynamically checks whether this address makes sense (whether it belongs to the indexable Web), and if so, a hyperlink to that address is inserted⁶. Figure 1.1 is a screenshot of the page that links to all pages in action.
Up to URL length limitations (and character escaping bugs that have not yet been fixed), the page’s purpose, namely to be connected by hyperlinks to virtually every page of the indexable Web, is achieved.
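The click-writer principle can be sketched in a few lines. This is an illustrative reconstruction, not the actual PHP code behind arbre.php: the alphabet, parameter name, and validity test are all assumptions made for the sketch.

```python
import string

# Assumed character set the page offers at each click (the real page's
# alphabet may differ).
ALPHABET = string.ascii_lowercase + string.digits + ":./-_~"

def child_links(prefix: str, base: str = "arbre.php?u=") -> list[str]:
    """The hyperlinks offered once the visitor has typed `prefix`:
    one link per character, each extending the prefix by one letter."""
    return [base + prefix + c for c in ALPHABET]

def looks_like_url(prefix: str) -> bool:
    """Crude placeholder for the page's 'does this address make sense'
    test, before it inserts a live hyperlink to the typed address."""
    return prefix.startswith("http://") and "." in prefix[len("http://"):]

links = child_links("http://free.f")
print(len(links) == len(ALPHABET))       # one outgoing link per character
print(looks_like_url("http://free.fr"))  # True: a hyperlink would appear
```

Since every URL is some finite sequence of clicks away, the page is, transitively, connected to the whole indexable Web, up to the URL length limits mentioned above.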
This page is above all a playful exercise, but it allows one to take a step back from a good number of received ideas:
- The primary goal of this page was to deliver a small jab at the interpretation generally given to the bow-tie model, and I believe this goal was achieved.
- It allows one to take with a minimum of hindsight all claims about the structure of the Web. After all, I can claim to know a portion of the Web of approximately k^L pages, give or take a constant factor, where L is the maximum number of characters that a URL can contain⁷ and k the number of characters that can be appended at each click. This result, far exceeding all estimates of the Web’s size put forward so far (see Section 2.2), also allows one to assert some astonishing statistics:
- The average degree of the Web I know is approximately k + ε, where k is the number of characters that can be appended at each click and ε a perturbation due to real pages. Moreover, contrary to everything previously believed, the degree distribution does not follow a power law but closely resembles a Dirac delta.
- There exists a strongly connected component in the Web I know, and any page in the Web I know almost certainly belongs to this component.
Of course, these results should not be taken at face value! I would be the first to find suspicious a paper claiming that the Web has more than a googol of pages⁸, or that heavy-tailed distributions do not exist. The page that links to all pages is merely a Saturday night idea⁹ that was put into practice, and I am perfectly aware of this. But it has the advantage of showing us in full light the immensity of the accessible Web (and in particular the impossibility of indexing it¹⁰) and of urging us to understand clearly that all meaningful results one can obtain about the Web in fact concern crawls that represent an infinitesimal fraction of the indexable Web in terms of the number of pages.
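The orders of magnitude behind these tongue-in-cheek claims are easy to check. In the sketch below, the alphabet size k and the maximum URL length L are assumptions chosen only to illustrate the scale, not the thesis's actual values:

```python
# Back-of-the-envelope arithmetic for the page that links to all pages.
k = 64    # assumed number of characters offered at each click
L = 2048  # assumed maximum URL length: a few kilobytes, as servers allow

# Number of candidate addresses of length 1..L: k + k^2 + ... + k^L,
# a geometric series.
n_urls = (k ** (L + 1) - k) // (k - 1)

# Breadth-first crawl time, at 10 pages per second, before every prefix
# shorter than a modest 14-character address has been visited.
pages_before = (k ** 15 - k) // (k - 1)
seconds = pages_before / 10
years = seconds / (365.25 * 24 * 3600)

print(n_urls > 10 ** 100)  # True: more pages than a googol
print(years > 1e8)         # True: crawling even short URLs exceeds
                           # the footnote's hundred million years
```

Whatever values one assumes for k and L, the conclusion is the same: the accessible Web defined this way is astronomically larger than any crawl can ever cover.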
To conclude this intermezzo, let us note that in terms of information theory, we doubt that the page that links to all pages and its dynamic pages are worth more than the few lines of code that lie behind them.
- ¹ The term Internet was apparently introduced in 1974 by Vinton Cerf and Bob Kahn. Note in passing that internet and Internet do not mean the same thing! One is a common noun designating a meta-network structure; the other is a proper noun for THE meta-network using the TCP/IP protocol.
- ² Many thanks to Philippe Olivier and Nabil Benameur for providing me with these data.
- ³ To remain faithful to Bergman’s analogy, one could translate Shallow Web as near space. I prefer the term swamp, which conveys rather well the effect of this zone of the Web on crawlers.
- ⁴ For example, all dynamic pages accessible from http://www.liafa.jussieu.fr/~fmathieu/arbre.php [GLM02] should logically belong to the opaque Web, whereas the categorisation of Sherman and Price seems to consign them to being completely invisible…
- ⁵ See Section 3.2.
- ⁶ The page that links to all pages also displays the Google PageRank (a score from 0 to 10) when available. It does not yet make coffee, alas.
- ⁷ The HTTP 1.1 protocol does not specify a maximum URL length (cf. [Fie+99]); the maximum length is therefore, in principle, as large as one wishes. In practice, each server has a fixed limit, on the order of a few kilobytes. Moreover, there was a time when sending a URL longer than a few KB was an effective way to crash an IIS server…
- ⁸ A googol is a number equal to 10^100. It does not honour the Russian writer Nikolai Gogol (1809–1852), but was coined in 1938 by Milton, the then nine-year-old nephew of the American mathematician Edward Kasner (1878–1955). Note that in French, googol is written… gogol! The name of the famous search engine is directly derived from the name invented by Kasner; there is in fact a legal dispute between Google and Kasner’s heirs. Finally, let us mention that the number represented by a 1 followed by a googol of 0s is called a googolplex.
- ⁹ I borrow this concept of a Saturday night idea from Michel de Pracontal. It is “the kind of idea that researchers sometimes discuss during their leisure hours, without taking them too seriously” (L’imposture scientifique en dix leçons, lesson 7, page 183).
- ¹⁰ To give a sense of scale, performing a breadth-first traversal of the page that links to all pages at a rate of 10 pages per second, it would take approximately 100 million years to manage to type http://free.fr. As for the address http://www.lirmm.fr, the age of the universe is far too short for a robot to find it solely by crawling the page that links to all pages.