On Peer-to-peer — Content Distribution, Acyclic Preference Networks

Chapter 1 — Introduction

This thesis presents a major part of the research I have conducted since my doctoral thesis, defended in December 2004. It is centered around peer-to-peer (P2P), with special attention dedicated to the topic of preference networks. My work outside the P2P field, mainly research around Google’s PageRank, is not developed here (references are available in the Curriculum Vitae attached to this thesis). The two main reasons for this editorial choice are the desire to present a body of work articulated around a common theme (P2P), as well as the interesting constraint posed by writing a thesis of limited size.

In this introductory chapter, I propose to share with the reader my conception of peer-to-peer, not as common sense understands it, but as a research domain. The following chapter will then present my positioning as a researcher within this domain and the structure of the rest of this thesis.

1.1 A subjective history of peer-to-peer

There exists a plethora of works trying to answer the question of what peer-to-peer is and how to approach it. The size, technicality, and medium are extremely variable: books [70, 72], online studies and wikibooks [69, 77], pedagogical introduction [76]…

Given all of this, it is difficult to offer a vision that has not already been proposed elsewhere. This is why I have chosen to present peer-to-peer starting from its emergence as a social phenomenon, viewed through the lens of my personal experience. I do not claim that my approach is recommended for someone who would like to learn about peer-to-peer: it is better in that case to start with one of the references above. Nevertheless, I believe that approaching things “through the wrong end of the telescope,” as I am about to do, allows for a better understanding of the path I have followed in my research, and thus brings out what unifies it.

My own analysis of the history of peer-to-peer is therefore the following: I tend to think that peer-to-peer was born at the moment when it was the only technical solution capable of meeting a social need. More precisely, let us go back about ten years or so. The democratization of Internet access, coupled with significant advances in signal processing (known to the general public through the acronyms MP3 and DiVX) was beginning to make it possible to retrieve audio and video “content” within reasonable timeframes.

Problem: where to retrieve the content from? At the time (the late 1990s), the digital economy was still in its infancy. Traditional media were still hesitant to take the digital leap, and there existed not a bit1 of a portal allowing access to content.

Initially, relatively old methods were employed within restricted circles of underground users: use of FTPz (hijacking an anonymous FTP server in order to use the storage and bandwidth resources of said server), attachments in discussion forums (the alt.binaries of Usenet), …

Then, with the Internet bubble came online storage sites. This solution was very quickly preferred over FTPz: no more need to scan IP addresses, and a lower risk of data loss. The most famous example of this era must be the site http://MySpace.com, which is not at all the current social networking site, but one that offered a certain number of MB of storage per account, with no limit on the number of accounts or on bandwidth2.

Alas, one day the bubble burst. Sites like http://MySpace.com closed, or were completely throttled in the best case. The absence of a business model around an unlimited and free storage offering, and the astronomical costs generated by abusive bandwidth usage, were certainly not unrelated to this. But a portion of Internet users had developed a taste for easy access to content, and did not want to go back. The need was there. It was then that, across various forums, an idea began to emerge: what if we pooled our storage and bandwidth resources? With enough participants, wouldn’t we have more resources than any server?

A first attempt, with the software Napster, had already been cut short due to legal problems. Nevertheless, driven by the need for content access, software like KaZaA or eDonkey [2, 3] began to emerge, while somewhere on the web, an unknown computer scientist named Bram Cohen was working on a file-sharing protocol and using files restricted to those over 18 to test what would become BitTorrent [23]3. It was 2001, and peer-to-peer (P2P) was born from the ashes of FTPz, MySpace, and Napster.

I must admit that when I first heard about this idea of using the users, my reaction was to think “that will never work”. I did not believe that users were ready to offer their bandwidth, with all the constraints that implied. Moreover, I had hastily estimated the performance of such a system using bandwidth conservation, and I had concluded that performance would be extremely poor. Since then, I have had time to realize my error and make amends, as I will discuss in more detail in Chapter 3.

Today, the success of P2P file-sharing applications, which use only the resources provided by the users, is no longer in question. The time has come for the development of new functionalities, such as streaming. Live broadcast applications (or rather slightly delayed broadcast) like CoolStreaming [78], PPLive [41], SopCast [7], TVants [8], or UUSee [9] are becoming increasingly popular, while video-on-demand broadcasting is beginning to appear [1, 4, 5].

1.2 Defining peer-to-peer

Now that the context is set, we can begin to reflect on the question of defining peer-to-peer. To illustrate the problem, allow me to recount the following anecdote: there has been much talk recently about the so-called DADVSI law relating to copyright and neighboring rights in the information society. One of the goals of this law was to counter illegal uses of P2P, or even potentially illegal P2P software. But to legislate, one must define, and this is what article 21 attempts to do, making punishable by three years of imprisonment and a 300,000 euro fine the act of “publishing, making available to the public, or communicating to the public, knowingly and in any form whatsoever, software manifestly intended for the unauthorized making available to the public of protected works or objects”.

This article generated much controversy, notably because as written, it included most Web server software, including the very well-known Apache software. It was thus necessary to exclude these software packages in an ad-hoc manner by adding a special clause. Nevertheless, I find the definition of P2P that emerges from this legal text interesting: P2P software is intended to make content available, the meaning of content being a priori as broad as possible.

This first definition, which I will call a motivation definition, is of course insufficient to define P2P, as the DADVSI law clearly demonstrated. One must therefore look further. This is generally where the definition by opposition comes in: contrary to the classic client/server paradigm, where each component of the system is either a server (content distributor) or a client (content consumer), the P2P paradigm considers a system where each participant can be both client and server simultaneously.

Much more targeted than the motivation definition, the definition by opposition to the client/server model nevertheless contains a flaw when one seeks to specify what it means to be able to be both client and server. Consider for example the ssh protocol, a network protocol allowing a client to connect to a server. If I open an ssh session from my computer to a friend’s computer, it is clear that I am not doing peer-to-peer. If that friend opens an ssh session back to my own computer in return, each of us will potentially have access to the other’s resources (each computer being both client and server), yet this still does not resemble peer-to-peer in the common sense. But if an entire social network starts using ssh sessions to exchange content4, it ends up forming a peer-to-peer network, or more precisely a friend-to-friend network in this specific case. The question is: where is the boundary?

Conversely, it is very easy to imagine client/server uses for P2P software. Thus, every week, I use the BitTorrent client to retrieve MP3s from a friend5. He has the files, I retrieve them, BitTorrent serving only to facilitate the transfer: the information needed to retrieve the files is concentrated in a small .torrent file that is very easy to produce and exchange by email. Here again, the boundary between the client/server paradigm and P2P is extremely blurry.

It therefore appears that even though it is a good approximation, the definition by opposition is imperfect. Just as one cannot define the essence of peer-to-peer as being tied to a technology, there is no simple architectural categorization (paradigm) into which peer-to-peer fits.

All of this explains why over time I have forged my own definition of peer-to-peer, which tries to best capture the fact that in my experience, P2P is first and foremost a decentralized way of approaching a distribution problem in a given hardware scenario. In other words, if one looks at the evolution of P2P from its beginnings to the present day, one sees a basic principle remaining the same beyond technological evolutions: if one wants to offer any service to a large number of users while having few or no servers at one’s disposal, the only solution is to take the resources from the only place where they exist, namely from the users (the peers) themselves.

This observation leads to a third and final definition of P2P, which I would describe as a definition by complexity: a system, a usage, a scenario is of the peer-to-peer type as soon as the way to answer the question
Who retrieves what from whom?
is non-trivial.

This definition may seem strange and counter-intuitive, but its advantages are certain: it frees itself from technological questions and transforms a semantic question (what is peer-to-peer?) into a more algorithmic problem. Thus, to know whether an assembly of ssh connections forms a P2P system, the definition I propose answers yes, provided that the system is more complex than a succession of independent bilateral distributions6. In other words, one can speak of peer-to-peer as soon as the search for a given file, or its transfer, uses more than one client/server connection.

One also obtains, and this is what interests me as a researcher, a way of defining what peer-to-peer research is: trying to understand the question who retrieves what from whom when it becomes complicated, decomposing it into sub-questions more specific to a given problem (search and distribution, availability and volatility, fairness and heterogeneity,…) and trying to answer them in the best possible way.

1.3 Research themes

Now that the subject of peer-to-peer research is open, I propose to survey the different themes and approaches on which a P2P researcher may work. I propose to employ a dichotomous classification based on four axes: localization/distribution, structured/unstructured, push/pull, theoretical/empirical. This classification is far from unique and it is not perfect, but I find it relatively simple and practical.

1.3.1 Localization and distribution

The first way to decompose the question Who retrieves what from whom? is to separate it into two sub-problems: localization and distribution. The localization phase consists of answering the question Who has what? One solution is for a central server to monitor all content present in the system, thus allowing any search to be answered: this is the operating principle of BitTorrent trackers [23] and eDonkey servers [3]. Of course, these solutions are centralized, and therefore not peer-to-peer (the question who has what is then trivial). A first step towards complexity is the use of several servers instead of just one (super-nodes). This reduces the concentration of information, which can be dangerous from a security standpoint, but one must begin to think about communication between super-nodes: we have a semi-centralized, semi-complex, in short semi-peer-to-peer solution. Finally, one can let the peers sort things out entirely among themselves to determine who has what. Among these peer-to-peer approaches to localization there is, for example, flooding, which consists of emitting a content request and passing it from neighbor to neighbor; but the most widely used approach is undoubtedly that of distributed hash tables [34, 61, 64, 68, 73, 79], which associate each piece of content with a peer (or even several) responsible for knowing who possesses that content.
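The principle of a distributed hash table can be sketched in a few lines: hash both peer identifiers and content names onto the same identifier ring, and make each key the responsibility of the peer that follows it clockwise (the successor rule used, among others, by Chord). The sketch below is purely illustrative (the name ToyDHT, the ring size, and the peer names are assumptions of this example, not the API of any deployed system):

```python
import hashlib
from bisect import bisect_right

def ring_position(name: str, bits: int = 16) -> int:
    """Map a name onto a 2**bits identifier ring using a cryptographic hash."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

class ToyDHT:
    """Each key is tracked by its clockwise successor peer on the ring."""

    def __init__(self, peers):
        # Sort peers by their position on the identifier ring.
        self.ring = sorted((ring_position(p), p) for p in peers)
        self.positions = [pos for pos, _ in self.ring]

    def responsible_peer(self, key: str) -> str:
        # First peer whose position follows the key's, wrapping around the ring.
        idx = bisect_right(self.positions, ring_position(key))
        return self.ring[idx % len(self.ring)][1]

dht = ToyDHT(["peer-a", "peer-b", "peer-c", "peer-d"])
# The peer returned is the one responsible for knowing who holds this content.
holder_directory = dht.responsible_peer("some-file.iso")
```

Every key thus deterministically maps to exactly one peer, without any central directory: any participant who knows the ring can compute where to ask the question Who has this content?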

Once the content is localized, the problem of its effective distribution arises: how to retrieve the content? This is a problem that varies enormously depending on what one means by distribution: downloading a small or large file, viewing a channel with slight delay, a video-on-demand… The subtleties related to the different types of distribution will be explored in depth in Chapter 3. For now, one should simply remember that since the exact problem to be solved varies, the palette of solutions is broad: broadcast trees, use of queues, reciprocal exchanges…

Most of the time, these two problems, localization and distribution, are independent. For example, while localization in a BitTorrent system was historically provided by a centralized tracker, recent versions also offer the possibility of using a DHT. Similarly, the choice of distribution protocol can adapt based on the result of localization: this is the multi-protocol approach, employed by MLDonkey [30].

Of course, there are exceptions, and the distinction between localization and distribution is not always airtight. Thus, for security and confidentiality reasons, the Freenet software [22] is designed so that distribution follows the same network path as the one created during localization, making the two problems inseparable.

1.3.2 Structured, unstructured

Another way to approach the who-what-from-whom question is to think about how the problem will be solved in practice. From this perspective, two major currents exist: structured and self-structured approaches.

The principle of structured approaches is to ensure that peers conform to a pre-determined structure. This structure will guide the peers in their behavior and their manner of exchanging messages or content. Most distributed hash tables use a structured algorithm [34, 61, 64, 68, 73, 79]. Structured solutions have also been proposed to solve the problem of slightly delayed broadcast [19, 35].

These structures are generally graphs (trees, trees with disjoint interior nodes [19], de Bruijn graphs [34, 35], regular graphs, …), sometimes built from a virtual topology (ring [61, 73], d-dimensional torus [64], hypercube [61, 68, 73, 79], …).

The strength of structured approaches is that the system’s behavior is determined by a structure known in advance, and that this structure can be designed to guarantee optimal, or near-optimal, performance. Their weakness comes from the difficulties of maintaining a structure in the face of peer volatility. Moreover, many structures assume that peers have similar capacities, and with rare exceptions [75], they adapt poorly to a population that is often heterogeneous in practice.

The other solution is to let the peers (more or less) sort things out, that is, to let them make decisions based on their own perception of the system. There is then no longer an explicit structure, but a fuzzy and implicit pseudo-structure resulting from the sum of local choices. This is why, moreover, I prefer the term self-structured to the term unstructured that is generally used. After all, a structure is, by definition, only the way in which a set is organized. For example, with all the properties it possesses (average degree, connected components, diameter, …), the Erdős–Rényi graph, which is built from local decisions (is this edge in the graph?), is for me the perfect example of self-structure.
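To make this notion of self-structure concrete, here is a minimal sketch (with parameters chosen purely for illustration) that builds an Erdős–Rényi graph from independent local coin flips and observes that a global, predictable property — an average degree close to p(n − 1) — emerges from the sum of those local decisions:

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=0):
    """Build G(n, p): each possible edge is included by an independent,
    purely local coin flip -- no global coordination whatsoever."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:  # local decision: is this edge in the graph?
            adj[u].add(v)
            adj[v].add(u)
    return adj

g = erdos_renyi(n=1000, p=0.01)
# A global property emerges: the average degree concentrates near p * (n - 1) = 9.99.
avg_degree = sum(len(neighbors) for neighbors in g.values()) / len(g)
```

No node "knows" the structure, yet the resulting graph has well-characterized average degree, connectivity threshold, and diameter — which is exactly the sense in which a sum of local choices still constitutes a structure.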

The self-structured approach, one of whose most representative examples is certainly BitTorrent, is mainly used in content distribution, whether in file sharing or for streaming. It is used in localization in certain systems like Freenet or Gnutella, and can appear in a hybridized form (as mentioned previously, no compartmentalization is completely rigid) in certain DHTs [34, 61].

By construction, self-structured approaches are adapted to the volatility and heterogeneity of peers. But since their structure is not known a priori, it is difficult to predict their performance, and validation must often be done empirically. However, advances have been made to give self-structured systems a theoretical foundation [14, 37, 62].

1.3.3 Push and Pull

One can also characterize, from a research perspective, the nature of a peer-to-peer system by looking at where decisions are made: at the client level (pull), at the server level (push), or a bit of both.

In a pull approach, the content requester will itself initiate the distribution with the content holders7, possibly deciding which part to retrieve from whom. This is the kind of approach used in the main current file-sharing systems (BitTorrent and eDonkey).

Conversely, the push philosophy dictates that the holders manage the proper functioning of the system, with requesters having a more passive role. Distribution systems based on broadcast trees, where content is routed from the source to the recipients, are generally considered typical of push [19, 35].

As can be seen in the previous examples, this classification is generally associated with the distribution problem. The reason is that the main localization solutions, DHTs (but also centralized localization solutions), do a bit of both: holders proactively transmit the description of their content to a third party, and requesters subsequently go fetch these descriptions. This publish/subscribe approach can nevertheless be interpreted, conceptually, as a special case of push/pull hybridization.

Another preconception (which I have intentionally reproduced in the preceding examples) consists of associating push distribution with structured systems and pull distribution with self-structured systems. Intuitively, this association is easy to justify: in a pull approach, each requester organizes on its own how it retrieves content, and one might think that too rigid a structure would risk reducing its room for maneuver. Conversely, letting a set of holders manage the distribution of content seems doomed to failure without a minimum of coordination, that is, of structure. And yet, epidemic broadcast algorithms [14], which I will discuss in more detail in Chapter 3, are clearly push-oriented (with a slight pull feedback, let us note), self-structured, and nevertheless destined for what I hope will be a brilliant future.
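A toy simulation illustrates why push can work without any explicit structure (this is a sketch of the generic gossip principle, not the specific protocol of [14]): every informed peer pushes the message to one peer chosen uniformly at random per round, and the classic result is that full dissemination takes a logarithmic number of rounds with high probability.

```python
import random

def push_gossip_rounds(n, seed=0):
    """Simulate push epidemic broadcast among n peers: each round, every
    informed peer pushes the content to one uniformly random peer."""
    rng = random.Random(seed)
    informed = {0}  # only the source holds the content initially
    rounds = 0
    while len(informed) < n:
        # Each informed peer makes a purely local, random choice of target.
        targets = {rng.randrange(n) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds

# The informed set can at most double each round, so at least
# ceil(log2(n)) rounds are needed; in practice O(log n) suffice.
rounds = push_gossip_rounds(n=10_000)
```

Each peer acts on its own perception (self-structured), the holders drive the transfer (push), and the system nevertheless converges quickly — which is precisely the counterexample to the preconception above.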

1.3.4 Theoretical, empirical

Beyond the three criteria I have just described, which are related to the specific problems of peer-to-peer, I believe it is also necessary to consider the distinction between theory and practice. This is a classification (even a divide sometimes) that has very distant epistemological origins, but whose repercussions on peer-to-peer are just as important as the distinctions mentioned above. To borrow a definition from Matthieu Latapy, the theoretical approach (or more generally fundamental research) primarily seeks to understand, while the goal of the empirical approach is above all to do (applied research).

At the risk of being somewhat caricatural, the proponent of a purely empirical approach will disparage any research that has not been validated by a series of experiments under real conditions. This is for example the position defended by Bram Cohen, the designer of BitTorrent8. Conversely, the purely theoretical approach will seek to propose solutions that can be proven to work, or models that can be argued to capture the essence of a phenomenon. This is how most of the structured solutions cited above proceed, as does the fluid modeling of BitTorrent proposed by Qiu and Srikant [63].

In absolute terms, each of the two approaches taken in its pure state is incomplete, and it is only when combined that they can fully express their potential: either by proposing a theory that makes it possible to understand observations (induction), or by trying to falsify (in the epistemological sense) a theory through a series of experiments. This is unfortunately not always easy, the increasing compartmentalization of domains not really helping to bridge this gap, and a given piece of research (including in peer-to-peer) will always be perceived with a more empirical or more theoretical coloring, the exact nuance depending of course on the eye of the observer.

1.3.5 Summary

As the examples given above have illustrated, the classification I propose is far from being a rigid compartmentalization, and it should rather be seen as a compass allowing one to try to orient oneself within the peer-to-peer research domain. Systems are becoming increasingly hybrid and mix supposedly antagonistic genres. It is no longer possible, and this is not necessarily a bad thing, to completely isolate themes as was the case in an era when the main topics addressed were DHT proposals (localization, structured, publish/subscribe, theoretical) and studies of real traces (localization or distribution depending on the trace, empirical).


  1. A neo-grammaticalization inspired by the history of negation in French.
  2. http://web.archive.org/web/20001019040949/www.myspace.com/Index.asp
  3. I had stumbled by chance upon Bram Cohen’s test page in 2004, but I am unable to find it again today, and I can therefore rely only on my memory to report this fact. I do not even remember the exact title of the work offered for download, other than that it must have been one opus of a themed series.
  4. This is the operating principle of existing software, such as, for example, Waste.
  5. I should specify that these are working recordings from rehearsals of the jazz band I am part of, therefore just as legal as they are unlistenable to the public.
  6. As an exercise, I leave it to the reader to have fun answering the question Is Skype a peer-to-peer application? using the three definitions I have just given.
  7. I recall that in peer-to-peer, one must take requester and holder from a logical standpoint, the holder of one part of content potentially being simultaneously a requester of another part of that same content.
  8. This position is perfectly apparent if one consults his blog, notably his post attacking the Avalanche protocol: http://bramcohen.livejournal.com/20140.html.