PPSP                                                      V. Grishchenko
Internet-Draft                                                 A. Bakker
Intended status: Informational                                  TU Delft
Expires: January 12, 2012                                  July 11, 2011

        The Generic Multiparty Transport Protocol (swift)
            <draft-grishchenko-ppsp-swift-02.txt>

Abstract

   swift is a generic multiparty (swarming) transport protocol.

   TCP, today's dominating transport protocol, is connection/
   conversation-oriented. But traffic-wise, the currently dominating
   usecase is content dissemination. There is a multitude of
   incompatible approaches to resolve that discrepancy above/below the
   transport layer: peer-to-peer, CDN, caches, mirrors, multicast, etc.
   swift aims at creating a single unified content-centric transport
   protocol serving as a lingua franca of content distribution. To
   implement that ultimate data cloud model, the protocol has to unify
   use cases of data download, video-on-demand and live streaming. It
   must work in the settings of client-server, peer-to-peer, CDN or
   peer-assisted networks, effectively blending those architectures.


Status of this memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


   Copyright (c) 2011 IETF Trust and the persons identified as the
 


Grishchenko, et al.     Expires January 12, 2012                [Page 1]

Internet-Draft                   swift                         July 2011


   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Requirements notation  . . . . . . . . . . . . . . . . . . . .  3
   2.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   3.  Design goals . . . . . . . . . . . . . . . . . . . . . . . . .  3
   4.  swift subsystems and design choices  . . . . . . . . . . . . .  5
     4.1.  The atomic datagram principle  . . . . . . . . . . . . . .  6
     4.2.  Handshake and multiplexing . . . . . . . . . . . . . . . .  7
     4.3.  Generic acknowledgments  . . . . . . . . . . . . . . . . .  8
     4.4.  Data integrity and on-demand Merkle hashes . . . . . . . .  9
       4.4.1. Peak hashes . . . . . . . . . . . . . . . . . . . . . . 10
       4.4.2. Hash trees for files  . . . . . . . . . . . . . . . . . 11
       4.4.3. Hash trees for streams  . . . . . . . . . . . . . . . . 12
     4.5.  Peer exchange and NAT hole punching  . . . . . . . . . . . 12
     4.6.  Data requests (HINTs)  . . . . . . . . . . . . . . . . . . 13
     4.7.  Subsetting of the protocol . . . . . . . . . . . . . . . . 13
     4.8.  Directory lists  . . . . . . . . . . . . . . . . . . . . . 13
   5. Enveloping  . . . . . . . . . . . . . . . . . . . . . . . . . . 14
     5.1.  IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
     5.2.  UDP  . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
     5.3.  TCP  . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
     5.4.  swift over RTP as PPSP . . . . . . . . . . . . . . . . . . 14
       5.4.1. Design  . . . . . . . . . . . . . . . . . . . . . . . . 15
       5.4.2. PPSP Requirements . . . . . . . . . . . . . . . . . . . 17
     5.5.  RTP over swift as PPSP . . . . . . . . . . . . . . . . . . 19
       5.5.1. PPSP Requirements . . . . . . . . . . . . . . . . . . . 19
     5.6.  swift over HTTP as PPSP  . . . . . . . . . . . . . . . . . 20
       5.6.1. Design  . . . . . . . . . . . . . . . . . . . . . . . . 21
       5.6.2. PPSP Requirements . . . . . . . . . . . . . . . . . . . 22
   6. Security Considerations . . . . . . . . . . . . . . . . . . . . 25
   7. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . 25
     7.1. 32 bit vs 64 bit  . . . . . . . . . . . . . . . . . . . . . 25
     7.2. IPv6  . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
     7.3. Congestion control algorithms . . . . . . . . . . . . . . . 25
     7.4. Piece picking algorithms  . . . . . . . . . . . . . . . . . 26
     7.5. Reciprocity algorithms  . . . . . . . . . . . . . . . . . . 26
     7.6. Different crypto/hashing schemes  . . . . . . . . . . . . . 26
 


Grishchenko, et al.     Expires January 12, 2012                [Page 2]

Internet-Draft                   swift                         July 2011


   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
   Authors' addresses . . . . . . . . . . . . . . . . . . . . . . . . 28




1.  Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].


2.  Introduction

   Historically, the Internet was based on end-to-end unicast and,
   considering the failure of multicast, was addressed by different
   technologies, which ultimately boiled down to maintaining and
   coordinating distributed replicas. On one hand, downloading from a
   nearby well-provisioned replica is somewhat faster and/or cheaper; on
   the other hand, it requires to coordinate multiple parties (the data
   source, mirrors/CDN sites/peers, consumers). As the Internet
   progresses to richer and richer content, the overhead of peer/replica
   coordination becomes dwarfed by the mass of the download itself.
   Thus, the niche for multiparty transfers expands. Still, current,
   relevant technologies are tightly coupled to a single usecase or even
   infrastructure of a particular corporation. The mission of our
   project is to create a generic content-centric multiparty transport
   protocol to allow seamless, effortless data dissemination on the Net.

                       TABLE 1. Usecases.
         | mirror-based   peer-assisted        peer-to-peer
   ------+----------------------------------------------------
   data  | SunSITE        CacheLogic VelociX   BitTorrent
   VoD   | YouTube        Azureus(+seedboxes)  SwarmPlayer
   live  | Akamai Str.    Octoshape, Joost     PPlive

   The protocol must be designed for maximum genericity, thus focusing
   on the very core of the mission, contain no magic constants and no
   hardwired policies. Effectively, it is a set of messages allowing to
   securely retrieve data from whatever source available, in parallel.
   The protocol must be able to run over IP as an independent transport
   protocol. For compatibility reasons, it must also run over UDP and
   TCP.


3.  Design goals

 


Grishchenko, et al.     Expires January 12, 2012                [Page 3]

Internet-Draft                   swift                         July 2011


   The technical focus of the swift protocol is to find the simplest
   solution involving the minimum set of primitives, still being
   sufficient to implement all the targeted usecases (see Table 1),
   suitable for use in general-purpose software and hardware (i.e. a web
   browser or a set-top box). The five design goals for the protocol
   are:

   1. Embeddable kernel-ready protocol.
   2. Embrace real-time streaming, in- and out-of-order download.
   3. Have short warm-up times.
   4. Traverse NATs transparently.
   5. Be extensible, allow for multitude of implementation over
      diverse mediums, allow for drop-in pluggability.

   Later in the draft, the objectives are referenced as (1)-(5).

   The goal of embedding (1) means that the protocol must be ready to
   function as a regular transport protocol inside a set-top box, mobile
   device, a browser and/or in the kernel space. Thus, the protocol must
   have light footprint, preferably less than TCP, in spite of the
   necessity to support numerous ongoing connections as well as to
   constantly probe the network for new possibilities. The practical
   overhead for TCP is estimated at 10KB per connection [HTTP1MLN]. We
   aim at <1KB per peer connected. Also, the amount of code necessary to
   make a basic implementation must be limited to 10KLoC of C.
   Otherwise, besides the resource considerations, maintaining and
   auditing the code might become prohibitively expensive.

   The support for all three basic usecases of real-time streaming,
   in-order download and out-of-order download (2) is necessary for the
   manifested goal of THE multiparty transport protocol as no single
   usecase dominates over the others.

   The objective of short warm-up times (3) is the matter of end-user
   experience; the playback must start as soon as possible. Thus any
   unnecessary initialization roundtrips and warm-up cycles must be
   eliminated from the transport layer.

   Transparent NAT traversal (4) is absolutely necessary as at least 60%
   of today's users are hidden behind NATs. NATs severely affect
   connection patterns in P2P networks thus impacting performance and
   fairness [MOLNAT,LUCNAT].

   The protocol must define a common message set (5) to be used by
   implementations; it must not hardwire any magic constants, algorithms
   or schemes beyond that. For example, an implementation is free to use
   its own congestion control, connection rotation or reciprocity
   algorithms. Still, the protocol must enable such algorithms by
 


Grishchenko, et al.     Expires January 12, 2012                [Page 4]

Internet-Draft                   swift                         July 2011


   supplying sufficient information. For example, trackerless peer
   discovery needs peer exchange messages, scavenger congestion control
   may need timestamped acknowledgments, etc.


4.  swift subsystems and design choices

   To large extent, swift's design is defined by the cornerstone
   decision to get rid of TCP and not to reinvent any TCP-like
   transports on top of UDP or otherwise. The requirements (1), (4), (5)
   make TCP a bad choice due to its high per-connection footprint,
   complex and less reliable NAT traversal and fixed predefined
   congestion control algorithms. Besides that, an important
   consideration is that no block of TCP functionality turns out to be
   useful for the general case of swarming downloads. Namely,
     1. in-order delivery is less useful as peer-to-peer protocols
     often employ out-of-order delivery themselves and in either case
     out-of-order data can still be stored;
     2. reliable delivery/retransmissions are not useful because
     the same data might be requested from different sources; as
     in-order delivery is not required, packet losses might be
     patched up lazily, without stopping the flow of data;
     3. flow control is not necessary as the receiver is much less
     likely to be saturated with the data and even if so, that
     situation is perfectly detected by the congestion control;
     4. TCP congestion control is less useful as custom congestion
     control is often needed [LEDBAT].
   In general, TCP is built and optimized for a different usecase than
   we have with swarming downloads. The abstraction of a "data pipe"
   orderly delivering some stream of bytes from one peer to another
   turned out to be irrelevant. In even more general terms, TCP
   supports the abstraction of pairwise _conversations_, while we need
   a content-centric protocol built around the abstraction of a cloud
   of participants disseminating the same _data_ in any way and order
   that is convenient to them.

   Thus, the choice is to design a protocol that runs on top of
   unreliable datagrams. Instead of reimplementing TCP, we create a
   datagram-based protocol, completely dropping the sequential data
   stream abstraction. Removing unnecessary features of TCP makes it
   easier both to implement the protocol and to verify it; numerous TCP
   vulnerabilities were caused by complexity of the protocol's state
   machine. Still, we reserve the possibility to run swift on top of TCP
   or HTTP. The draft itself assumes swift-over-UDP implementation; the
   necessary adjustments to run the protocol over IP, TCP, RTP or HTTP
   are listed in Sec. 5. A reference implementation of swift over UDP is
   available [SWIFTIMPL].

 


Grishchenko, et al.     Expires January 12, 2012                [Page 5]

Internet-Draft                   swift                         July 2011


   Pursuing the maxim of making things as simple as possible but not
   simpler, we fit the protocol into the constraints of the transport
   layer by dropping all the transmission's technical metadata except
   for the content's root hash (compare that to metadata files used in
   BitTorrent). Elimination of technical metadata is achieved through
   the use of Merkle [MERKLE,ABMRKL] hash trees, exclusively single-file
   transfers and other techniques. As a result, a transfer is identified
   and bootstrapped by its root hash only.

   To avoid the usual layering of positive/negative acknowledgment
   mechanisms we introduce a scale-invariant acknowledgment system (see
   Sec 4.4). The system allows for aggregation and variable level of
   detail in requesting, announcing and acknowledging data, serves
   in-order and out-of-order retrieval with equal ease. Besides the
   protocol's footprint, we also aim at lowering the size of a minimal
   useful interaction. Once a single datagram is received, it must be
   checked for data integrity, and then either dropped or accepted,
   consumed and relayed.

4.1.  The atomic datagram principle

   Ideally, every datagram sent must be independent of other datagrams,
   so each datagram SHOULD be processed separately and a loss of one
   datagram MUST NOT disrupt the flow. Thus, a datagram carries zero or
   more messages, and neither messages nor message interdependencies
   should span over multiple datagrams. In particular, any data piece is
   verified using uncle hash chains; all hashes necessary for verifying
   data integrity are put into the same datagram as the data (Sec. 4.3).
   As a general rule, if some additional data is still missing to
   process a message within a datagram, the message SHOULD be dropped.

   Each datagram starts with four bytes corresponding to the receiving
   channel number (Sec. 4.2). The rest of a datagram is a concatenation
   of messages. Each message within a datagram has fixed length,
   depending on the type of the message. The first byte of a message
   denotes its type. Integers are serialized in the network (big-endian)
   byte order. Variable-length messages, free-form text or JSON/bencoded
   objects are not allowed. Consider an example of an acknowledgment
   message (Sec 4.4). It has message type of 2 and a payload of a four-
   byte integer (say, 1); it might be written in hex as: "02 00000001".
   Later in the document, a hex-like two char per byte notation is used
   to represent message formats.

   In case a datagram has a piece of data, a sender MUST always put the
   data message (type id 1) in the tail of a datagram. Such a message
   consists of type id, bin number (see Sec. 4.3) and the actual data.
   Normally there is 1 kilobyte of data, except the case when file size
   is not a multiple of 1024 bytes, so the tail packet is somewhat
 


Grishchenko, et al.     Expires January 12, 2012                [Page 6]

Internet-Draft                   swift                         July 2011


   shorter. Example:
   01 00000000 48656c6c6f20776f726c6421
   (This message accommodates an entire file: "Hello world!")


4.2.  Handshake and multiplexing

   For the sake of simplicity, one transfer always deals with one file
   only. Retrieval of large collections of files is done by retrieving a
   directory list file and then recursively retrieving files, which
   might also turn to be directory lists (see Sec. 4.9). To distinguish
   different transfers between the same pair of peers, the protocol
   introduces an additional layer of multiplexing, the channels.
   "Channels" loosely correspond to TCP connections; "content" of a
   single "channel" is a single file. A channel is established with a
   handshake. To start a handshake, the initiating peer needs to know:
   (1) the IP address of a peer
   (2) peer's UDP port and
   (3) the root hash of the content (see Sec. 4.5).
   The handshake is made by a HANDSHAKE message, whose only payload is a
   channel number. HANDSHAKE message type is 0. The initiating handshake
   must be followed by the transfer's root hash.

   The initiator sends first datagram to its peer:
      00000000  04 7FFFFFFF 1234123412341234123412341234123412341234
      00 00000011
   (to unknown channel, handshake from channel 0x11, initiating a
   transfer of a file with a root hash 123...1234)

   Peer's response datagram:
      00000011  00 00000022  03 00000003
   (peer to the initiator: use channel number 0x22 for this transfer;
   I also have first 4 kilobytes of the file, see Sec. 4.3)

   At this point, the initiator knows that the peer really responds; for
   that purpose channel ids MUST be random enough to prevent easy
   guessing. So, the third datagram of a handshake MAY already contain
   some heavy payload. To minimize the number of initialization
   roundtrips, the first two datagrams MAY also contain some minor
   payload, e.g. a couple of HAVE messages roughly indicating the
   current progress of a peer or a HINT (see Sec. 4.7).
      00000022
   (this is a simple zero-payload keepalive datagram consisting of
   a 4-byte channel id only. At this point both peers have the
   proof they really talk to each other; three-way handshake is
   complete)

   In general, no error codes or responses are used in the protocol;
 


Grishchenko, et al.     Expires January 12, 2012                [Page 7]

Internet-Draft                   swift                         July 2011


   absence of any response indicates an error. Invalid messages are
   discarded. Explicit closing of a channel may be achieved by setting
   channel number to zero by a handshake message: 00 00000000.

   Simple NAT hole punching [SNP] introduces the scenario when both
   parties of the handshake are initiators. To avoid creation of two
   transfers in the case both initiating datagrams get through, both
   peers must then act as responding peers. Thus, once an initiating
   datagram is sent and another initiating "counter"-datagram is
   received, the initiating peer sends a response datagram with the same
   channel id as in the outstanding initiating datagram.


4.3.  Generic acknowledgments

   Generic acknowledgments came out of the need to simplify the
   data addressing/requesting/acknowledging mechanics, which tends
   to become overly complex and multilayered with the conventional
   approach. Take BitTorrent+TCP tandem for example:

   1. The basic data unit is of course a byte of content in a file.
   2. BitTorrent's highest-level unit is a "torrent", physically a
   byte range resulting from concatenation of content files.
   3. A torrent is divided into "pieces", typically about a thousand
   of them. Pieces are used to communicate own progress to other
   peers. Pieces are also basic data integrity units, as the torrent's
   metadata includes SHA1 hash for every piece.
   4. The actual data transfers are requested and made in 16KByte
   units, named "blocks" or chunks.
   5. Still, one layer lower, TCP also operates with bytes and byte
   offsets which are totally different from the torrent's bytes and
   offsets, as TCP considers cumulative byte offsets for all content
   sent by a connection, be it data, metadata or commands.
   6. Finally, another layer lower, IP transfers independent datagrams
   (typically around a kilobyte), which TCP then reassembles into
   continuous streams.

   Obviously, such addressing schemes need lots of mappings; from
   piece number and block to file(s) and offset(s) to TCP sequence
   numbers to the actual packets and the other way around. Lots of
   complexity is introduced by mismatch of bounds: packet bounds are
   different from file, block or hash/piece bounds. The picture is
   typical for a codebase which was historically layered.

   To simplify this aspect, we employ a generic content addressing
   scheme based on binary intervals (shortcutted "bins"). The base
   interval is 1KB "packet", the top interval is the complete 2**63
   range.  Till Sec. 4.4.1, any file is considered to be 2**k bytes long.
 


Grishchenko, et al.     Expires January 12, 2012                [Page 8]

Internet-Draft                   swift                         July 2011


   The binary tree of intervals is simple, well-understood, correlates
   well with machine representation of integers and the structure of
   Merkle hashes (Sec. 4.4). A novel addition to the classical scheme
   are "bin numbers", a scheme of numbering binary intervals which
   lays them out into a vector nicely. Bin numbering is done in the
   order of interval's "center", ascending, namely:

              7
        3          11
     1     5     9    13
   0  2  4  6   8 10 12 14

   The number 0xFFFFFFFF (32-bit) or 0xFFFFFFFFFFFFFFFF (64-bit) stands
   for an empty interval; 0x7FFF...FFF stands for "everything". In
   general, this numbering system allows to work with simpler data
   structures, e.g. to use arrays instead of binary trees in many cases.
   As a minor convenience, it also allows to use one integer instead of
   two to denote an interval. By requiring that every message uses bin
   numbers, we enforce genericity.

   Back to the acknowledgment message. A HAVE message (type 3) states
   that the sending peer obtained the specified bin and successfully
   checked its integrity:
   02 00000003
   (got/checked first four kilobytes of a file/stream)

   The data is acknowledged in terms of bins; as a result, every
   single packet is acknowledged logarithmic number of times. That
   provides some necessary redundancy of acknowledgments and
   sufficiently compensates unreliability of datagrams. Compare that
   e.g. to TCP acknowledgments, which are (linearly) cumulative.
   For keeping the state information, an implementation MAY use the
   "binmap" data structure, which is a hybrid of a bitmap and a binary
   tree, discussed in detail in [BINMAP].
   An ACK message (type 2) acknowledges data that was received from
   its addressee; to facilitate delay-based congestion control, an
   ACK message contains a timestamp:

   02 00000002 12345678
   (got the second kilobyte of the file from you; my microsecond
   timer was showing 0x12345678 at that moment)


4.4.  Data integrity and on-demand Merkle hashes

   The integrity checking scheme is unified for two usecases of download
   and streaming. Also, it works down to the level of a single datagram
   by employing Merkle hash trees [MERKLE]. Peers receive chains of
 


Grishchenko, et al.     Expires January 12, 2012                [Page 9]

Internet-Draft                   swift                         July 2011


   uncle hashes just in time to check the incoming data. As metadata is
   restricted to just a single root hash, newcomer peers derive the size
   of a file from hashes. That functionality heavily depends on the
   concept of peak hashes, discussed in Sec. 4.4.1. Any specifics
   related to the cases of file download and streaming is discussed in
   Sec. 4.4.2 and 4.4.3, respectively.

   Here, we discuss the common part of the workflow. As a general rule,
   the sender SHOULD prepend data with hashes which are necessary for
   verifying that data, no more, no less. While some optimistic
   optimizations are definitely possible, the receiver SHOULD drop data
   if it is impossible to verify it. Before sending a packet of data to
   the receiver, the sender inspects the receiver's previous
   acknowledgments to derive which hashes the receiver already has for
   sure. Suppose, the receiver had acknowledged bin 1 (first two
   kilobytes of the file), then it must already have uncle hashes 5, 11
   and so on. That is because those hashes are necessary to check
   packets of bin 1 against the root hash. Then, hashes 3, 7 and so on
   must be also known as they are calculated in the process of checking
   the uncle hash chain. Hence, to send bin 12 (i.e. the 7th kilobyte of
   data), the sender needs to prepend hashes for bins 14 and 9, which
   let the data be checked against hash 11 which is already known to the
   receiver. The sender MUST put into the datagram the chain of uncle
   hashes necessary for verification of the packet, always before the
   data message itself, i.e.:

   04 00000009 F01234567890ABCDEF1234567890ABCDEF123456
   04 0000000E 01234567890ABCDEF1234567890ABCDEF1234567
   (uncle hashes for the packet 12)
   01 0000000C DA1ADA1ADA1A...
   (packet 12 itself)

   The sender MAY optimistically skip hashes which were sent out in
   previous (still unacknowledged) datagrams. It is an optimization
   tradeoff between redundant hash transmission and possibility of
   collateral data loss in the case some necessary hashes were lost in
   the network so some delivered data cannot be verified and thus has to
   be dropped. In either case, the receiver builds the Merkle tree on-
   demand, incrementally, starting from the root hash, and uses it for
   data validation.


4.4.1. Peak hashes

   The concept of peak hashes enables two cornerstone features of swift:
   download/streaming unification and file size proving. Formally, peak
   hashes are hashes defined over filled bins, whose parent hashes are
   defined over incomplete (not filled) bins. A filled bin is a bin
 


Grishchenko, et al.     Expires January 12, 2012               [Page 10]

Internet-Draft                   swift                         July 2011


   which does not extend past the end of the file, or, more precisely,
   contains no empty packets. Practically, we use peaks to cover the
   data range with logarithmic number of hashes, so each hash is defined
   over a "round" aligned 2^k interval. As an example, suppose a file is
   7162 bytes long. That fits into 7 packets, the tail packet being 1018
   bytes long. The binary representation for 7 is 111. Here we might
   note that in general, every "1" in binary representation of the
   file's packet length corresponds to a peak hash. Namely, for this
   particular file we'll have three peaks, bin numbers 3, 9, 12. Thus,
   once a newcomer joins a swarm, the first peer who sends him data
   prepends it with peak hashes. The newcomer checks them against the
   root hash (see Sec 4.4.2).

   04 00000003 1234567890ABCDEF1234567890ABCDEF12345678
   04 00000009 234567890ABCDEF1234567890ABCDEF123456789
   04 0000000C 34567890ABCDEF1234567890ABCDEF1234567890
   (this sequence of peak hashes proves that a file is 7KB long)


4.4.2. Hash trees for files

   The Merkle hash tree covers the entire data range (2**63 bytes). 
   Every hash in the tree is defined in the usual way, as a SHA1 
   hash of a concatenation of two lower-level SHA1 hashes, which
   correspond to left and right data half-ranges respectively. 
   For example,
                hash_1 = SHA1 (hash_0+hash_2)
   where + stands for concatenation and hash_i stands for Merkle hash
   of the bin number i. Obviously, that does not hold for the
   base-layer hashes. Those are normal SHA1 hashes over 1KB data
   ranges ("packets"), except probably for the tail packet, which
   might have less than 1KB of data. The normal recursive formula does
   not apply to empty bins, i.e. bins that have no data absolutely;
   their hashes are just zeros.

   Lemma. Peak hashes could be checked against the root hash.
   Proof. (a) Any peak hash is always the left sibling. Otherwise, be
   it the right sibling, its left neighbor/sibling must also be
   defined over a filled bin, so their parent is also defined over a
   filled bin, contradiction. (b) For the rightmost peak hash, its
   right sibling is zero. (c) For any peak hash, its right sibling
   might be calculated using peak hashes to the left and zeros for
   empty bins. (d) Once the right sibling of the leftmost peak hash
   is calculated, its parent might be calculated. (e) Once that parent
   is calculated, we might trivially get to the root hash by
   concatenating the hash with zeros and hashing it repeatedly.

   Informally, the Lemma might be expressed as follows: peak hashes
 


Grishchenko, et al.     Expires January 12, 2012               [Page 11]

Internet-Draft                   swift                         July 2011


   cover all data, so the remaining hashes are either trivial (zeros) or
   might be calculated from peak hashes and zero hashes.

   Thus, once a peer gets peak hashes and checks them against the
   root hash, it learns the file size and it also gets practical
   anchors for building uncle chains during the transmission (as the
   root hash is too high in the sky). A newcomer peer MAY signal it
   already has peak hashes by acknowledging any bin, even the empty one:

   03 FFFFFFFF

   Otherwise, the first of the senders SHOULD bootstrap him with all the
   peak hashes.


4.4.3. Hash trees for streams

   In the case of live streaming a transfer is bootstrapped with a
   public key instead of a root hash, as the root hash is undefined or,
   more precisely, transient, as long as new data keeps coming.
   Streaming/download unification is achieved by sending signed peak
   hashes on-demand, ahead of the actual data. Similarly to the previous
   case, the sender might use acknowledgements to derive which data
   range the receiver has peak hashes for and to prepend the data hashes
   with the necessary (signed) peak hashes. Except for the fact that the
   set of peak hashes changes with the time, other parts of the
   algorithm work as described in 4.4.2. As we see, in both cases data
   length is not known on advance, but derived on-the-go from the peak
   hashes. Suppose, our 7KB stream extended to another kilobyte. Thus,
   now hash 7 becomes the only peak hash, eating hashes 3, 9 and 12. So,
   the source sends out a signed peak hash message (type 7) to announce
   the fact:

   07 00000007 1234567890ABCDEF1234567890ABCDEF12345678 SOME-SIGN-HERE


4.5.  Peer exchange and NAT hole punching

   Peer exchange messages are common for many peer-to-peer protocols. By
   exchanging peer IP addresses in gossip fashion, peers relieve central
   coordinating entities (the trackers) from unnecessary work. Following
   the example of BitTorrent, swift features two types of PEX messages:
   "peer connected" (type 5) and "peer disconnected" (type 6). Peers are
   represented as IPv4 address-port pairs:
   05 7F000000 1F40
   (connected to 127.0.0.1:8000)

   To unify peer exchange and NAT hole punching functionality, the
 


Grishchenko, et al.     Expires January 12, 2012               [Page 12]

Internet-Draft                   swift                         July 2011


   sending pattern of PEX messages is restricted. As swift handshake is
   able to do simple NAT hole punching [SNP] transparently, PEX messages
   must be emitted in the way to facilitate that. Namely, once peer A
   introduces peer B to peer C by sending a PEX message to C, it SHOULD
   also send a message to B introducing C. The messages SHOULD be within
   2 seconds from each other, but MAY and better not be simultaneous,
   leaving a gap of twice the "typical" RTT, i.e. 300-600ms. The peers
   are supposed to initiate handshakes to each other thus forming a
   simple NAT hole punching pattern where the introducing peer
   effectively acts as a STUN server. Still, peers MAY ignore PEX
   messages if uninterested in obtaining new peers or because of
   security considerations (rate limiting) or any other reason.


4.6.  Data requests (HINTs)

   While bulk download protocols normally do explicit requests for
   certain ranges of data (e.g. BitTorrent's REQUEST message), live
   streaming protocols quite often do without to save round trips.
   Explicit requests are often needed for security purposes; consider
   that BitTorrent can only verify hashes of complete pieces that might
   consist of multiple blocks requested from many peers. As swift has no
   such implications, it is supposed to work both ways. Namely, a peer
   SHOULD send out requested pieces, while it also may send some other
   data in case it runs out of requests or on some other reason. To
   emphasize that, request messages are named HINTs; their only purpose
   is to coordinate peers and to avoid unnecessary data retransmission.
   A peer SHOULD to process HINTs sequentially. HINT message type is 8.
   08 00000009
   (a peer requests fifth and sixth packets)


4.7.  Subsetting of the protocol

   As the same protocol is supposed to serve diverse usecases, different
   peers may support different subsets of messages. The supported subset
   SHOULD be signaled in the handshake packets. The SWIFT_MSGTYPE_RCVD
   message (type 9) serves exactly this purpose. It contains a 32-bit
   big-endian number with bits set to 1 at offsets corresponding to
   supported message type ids. E.g. for a tracker peer which receives
   only handshakes and (root) hashes, sends out handshakes and PEX
   messages, that message will look like: 09 00000011 Peers running over
   TCP may not accept ACK messages, etc.


4.8.  Directory lists

   Directory list files MUST start with magic bytes ".\n..\n". The rest
 


Grishchenko, et al.     Expires January 12, 2012               [Page 13]

Internet-Draft                   swift                         July 2011


   of the file is a newline-separated list of hashes and file names for
   the content of the directory. An example:

   .
   ..
   1234567890ABCDEF1234567890ABCDEF12345678  readme.txt
   01234567890ABCDEF1234567890ABCDEF1234567  big_file.dat


5. Enveloping

5.1.  IP

   The most theoretically correct way is to run swift on top of IP, as
   another transport protocol like TCP or UDP. Albeit, that option has
   significant downsides. First, that is inevitable NAT/firewall
   compatibility problems. Second, that necessitates in-kernel
   implementation for all peers.


5.2.  UDP

   Currently, swift-over-UDP is the default deployment option.
   Effectively, UDP allows the use of IP with minimal overhead, it also
   allows userspace implementations. Besides the classic 1KB packet
   scenario, the bin numbering allows to use swift over Jumbo
   frames/datagrams. Both data and acknowledgments may use e.g. 8KB
   packets instead of "standard" 1KB. Hashing scheme stays the same.
   Using swift with 512 or 256-byte packets is theoretically possible
   with 64-bit byte-precise bin numbers, but IP fragmentation might be a
   better method to achieve the same result.


5.3.  TCP

   If ran over TCP, swift becomes functionally equivalent to BitTorrent.
   Namely, most swift messages have corresponding BitTorrent messages
   and vice versa, except for BitTorrent's explicit interest
   declarations and choking/unchoking, which serve the classic
   implementation of the tit-for-tat algorithm [TIT4TAT].

5.4.  swift over RTP as PPSP

   In this section we sketch how swift can be carried over RTP [RFC3550]
   to form the Peer-to-Peer Streaming Protocol (PPSP) [I-D.ietf-ppsp-
   reqs] running over UDP. The PPSP charter requires existing media
   transfer protocols be used [PPSPCHART]. Hence, the general idea is to
   encode a swift datagram in an RTP packet by placing all the non-DATA
 


Grishchenko, et al.     Expires January 12, 2012               [Page 14]

Internet-Draft                   swift                         July 2011


   messages such as HINT and HAVE into its Extension header and send
   DATA messages in the payload field. This procedure creates an RTP
   packet that adheres to the current standard, but meets swift's atomic
   datagram principle. The latter means that a receiving peer can
   autonomously verify the payload as being correct content, thus
   preventing the spread of corrupt data (PPSP.SEC-REQ-4). 

   To illustrate the idea further, consider two peers A and B that both
   have part of the streaming content. Peer A sends an RTP packet to B
   containing HINT, HAVE and HASH and DATA messages. The HINT message
   conveys what chunks A would like to receive from B, the HAVE message
   informs B of what new chunks A has received from others since their
   last exchange and DATA is content that B has requested via a HINT in
   a previous RTP packet. B responds with a similar packet, thus
   creating a cycle of sharing behavior. 

   The use of ACK messages for reliability is left as a choice of the
   application using PPSP.

5.4.1. Design

   5.4.1.1. Joining a Swarm

   To commence a PPSP download a peer A must have the swarm ID of the
   stream and a list of one or more tracker contact points (e.g.
   host+port). The list of trackers is optional in the presence of a
   decentralized tracking mechanism. The swarm ID consists of the swift
   root hash of the content, divided in chunks of N kilobyte, with N=1K
   at present (see Discussion).

   Peer A now registers with the PPSP tracker following the tracker
   protocol (to be defined elsewhere) and receives the IP address and
   RTP port of peers already in the swarm, say B, C, and D. Peer A now
   sends an RTP packet containing a HANDSHAKE without channel
   information to B, C, and D. This serves as an end-to-end check that
   the peers are actually in the correct swarm. Optionally A could
   include a HINT message in some RTP packets if it wants to start
   receiving content immediately. B and C respond with a HANDSHAKE and
   HAVE messages. D sends just a HANDSHAKE and omits HAVE messages as a
   way of choking A.


   5.4.1.2. Exchanging Chunks

   In response to B and C, A sends new RTP packets to B and C with HINTs
   for disjunct sets of chunks. B and C respond with DATA messages and
   HAVE messages, updating their chunk availability. Upon receipt, A
   sends HAVE for the chunks received and new HINT messages to B and C.
 


Grishchenko, et al.     Expires January 12, 2012               [Page 15]

Internet-Draft                   swift                         July 2011


   When e.g. C finds that A obtained a chunk (from B) that C did not yet
   have, C's response includes a HINT for that chunk.

   D does not send HAVE messages, instead if D decides to unchoke peer
   A, it sends an RTP packet with HAVE messages to inform A of its
   current availability. If B or C decide to choke A they stop sending
   HAVE and DATA messages and A should then rerequest from other peers.
   They may continue to send HINT messages, or exponentially slowing
   KEEPALIVE messages such that A keeps sending them HAVE messages.

   Once A has received all content (video-on-demand use case) it stops
   sending messages to all other peers that have all content (a.k.a.
   seeders).


   5.4.1.3. Leaving a Swarm

   Peers can implicitly leave a swarm by stopping to respond to
   messages. Sending peers should remove these peers from the current
   peer list. This mechanism works for both graceful and ungraceful
   leaves (i.e., peer crashes or disconnects). When leaving gracefully,
   a peer should deregister from the tracker following the PPSP tracker
   protocol.

   More explicit graceful leaves could be implemented using RTCP. In
   particular, a peer could send a RTCP BYE on the RTCP port that is
   derivable from a peer's RTP port for all peers in its current peer
   list. However, to prevent malicious peers from sending BYEs a form of
   peer authentication is required (e.g. using public keys as peer IDs
   [PERMIDS].)


   5.4.1.4. Discussion

   In this draft we assume 1K chunks, but experiments should be
   conducted whether the resulting number of hash calculations per
   second needed to verify chunks are not excessive (e.g., for mobile
   hardware.) Multiple chunks can be transmitted in a single RTP
   packet.

   Normal RTP payloads can be of variable size. Because the hashing
   scheme can deal with variable-sized chunks, this could be an
   alternative. The dynamic length prediction of swift would no longer
   work, however, so content length would have to be explicit in the
   metadata. Storage on disk is also more complex with out-of-order
   chunk downloads.

   The current design uses just the Extension header fields in an RTP
 


Grishchenko, et al.     Expires January 12, 2012               [Page 16]

Internet-Draft                   swift                         July 2011


   packet. The sequence number is not used, but could be used to detect
   packet loss between peers. The timestamp, SSRC and CSRC fields are
   also unused by the PPSP. Using them for their original RTP purpose,
   however, creates a security vulnerability as malicious peers could
   modify them undetected. Hence, the alternative design in Sec. 5.5.
   should be considered, where RTP is carried over swift and its header
   is therefore also protected by the hashing scheme.

   At present, Peer IDs are not required in this design.


5.4.2. PPSP Requirements

   5.4.2.1. Basic Requirements

   - PPSP.REQ-1: The swift PEX message can be used as the basis for a   
   tracker protocol, to be discussed elsewhere.

   - PPSP.REQ-2: This draft preserves the properties of RTP, but extra
   mechanisms may be necessary to protect against faulty or malicious
   peers.

   - PPSP.REQ-3: This draft does not place requirements on peer IDs,
   IP+port is sufficient. 

   - PPSP.REQ-4: The content is identified by its root hash (video-on-
   demand) or a public key (live streaming).

   - PPSP.REQ-5: The content is partitioned in 1K chunks (see 5.4.1.4.)

   - PPSP.REQ-6: Each chunk is identified by a bin number (and its
   cryptographic hash.)

   - PPSP.REQ-7: The protocol is carried over UDP because RTP is.


   5.4.2.2. Peer Protocol Requirements

   The peer protocol requirements suggest the protocol should allow
   peers to distribute information about the chunk availability of
   others. We consider the use of such hearsay a risk for the protocol
   as it may be exploited by malicious peers.

   - PPSP.PP.REQ-1: A GET_HAVE would have to be added to request which
   chunks are available from a peer, if the proposed push-based HAVE
   mechanism is not sufficient. 

   - PPSP.PP.REQ-2: A set of HAVE messages satisfies this. 
 


Grishchenko, et al.     Expires January 12, 2012               [Page 17]

Internet-Draft                   swift                         July 2011


   - PPSP.PP.REQ-3: A GET_PEX would have to be added to request known
   peers from a peer, if the proposed push-based PEX mechanism is not
   sufficient.

   - PPSP.PP.REQ-4: HAVE messages convey current availability. New peer
   joins could be propagated using PEX, but this could be used as a DDoS
   attack mechanism. Reporting leaves via PEX also enables exploitation
   by malicious peers. A secure tracking / peer sampling protocol like
   [PUPPETCAST] may be needed for peer-address exchange to be safe.

   - PPSP.PP.REQ-5: A new PPSP specific Peer Report message would have
   to be added to RTCP. If forwarded by other peers cryptography is
   needed to prevent malicious peers from sending IP address changes.
   For example, provide each peer with a public/private key pair and
   send self-certified address changes [PERMIDS] that include the swarm
   ID and a timestamp.

   - PPSP.PP.REQ-6: Transmission and chunk requests are integrated in
   this protocol.


   5.4.2.3. Security Requirements

   - PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms
   [CLOSED] would have to be added. 

   - PPSP.SEC.REQ-2: As swift is carried over RTP, RTP encryption can be
   used. Alternatively, just the payload could be encrypted. The latter
   allows for caching servers that are part of the swarm but do not need
   access to the decryption keys (they need access to the swift HASHES
   in the Extention header to verify the packet's integrity).


   - PPSP.SEC.REQ-3: RTP encryption or IPsec [RFC4303] can be used.

   - PPSP.SEC.REQ-4: The Merkle tree hashing scheme prevents the
   indirect spread of corrupt content, as peers will only forward chunks
   to others if their integrity check out. Another protection mechanism
   is to not depend on hearsay (i.e., do not forward other peers'
   availability information), or to only use it when the information
   spread is self-certified by its subjects.

   Other attacks, such as a malicious peer claiming it has content but
   not replying, are still possible. Or wasting CPU and bandwidth at a
   receiving peer by sending packets where the DATA doesn't match the
   HASHes.


 


Grishchenko, et al.     Expires January 12, 2012               [Page 18]

Internet-Draft                   swift                         July 2011


   - PPSP.SEC.REQ-5: The Merkle tree hashing scheme allows a receiving
   peer to detect a malicious or faulty sender, which it can
   subsequently ignore. Spreading this knowledge to other peers such
   that they know about this bad behavior is hearsay. 


   - PPSP.SEC.REQ-6: A risk in peer-to-peer streaming systems is that
   malicious peers launch an Eclipse [ECLIPSE] attack on the initial
   injectors of the content (in particular in live streaming). The
   attack tries to let the injector upload to just malicious peers which
   then do not forward the content to others, thus stopping the
   distribution. An Eclipse attack could also be launched on an
   individual peer. Letting these injectors only use trusted trackers
   that provide true random samples of the population or using a secure
   peer sampling service [PUPPETCAST] can help negate such an attack. 


   - PPSP.SEC.REQ-7: swift supports decentralized tracking via PEX or
   additional mechanisms such as DHTs [SECDHTS], but self-certification
   of addresses is needed, see discussion of PPSP.PP.REQ-5 above.
   Content distribution can continue as long as there are peers that
   have it available.

   - PPSP.SEC.REQ-8: The verification of data via hashes obtained from a
   trusted source is well-established in the BitTorrent protocol
   [BITTORRENT]. The proposed Merkle tree scheme is a secure extension
   of this idea. Self-certification and not using hearsay are other
   lessons learned from existing distributed systems.



5.5.  RTP over swift as PPSP

   As discussed in Sec. 5.4.1.4. running swift over RTP means that
   important fields in the RTP header are not protected by the hashing
   scheme. This means that, in particular, the timestamp and PT fields
   could be maliciously altered. Hence, it would be safer to carry RTP
   as a payload inside swift (or to always encrypt a packet). Carrying
   RTP over swift would also allow decryption-less caching.

   In this design a chunk would be a complete RTP packet, which can be
   of variable size. The hashing scheme can deal with variable-sized
   chunks. The dynamic length prediction of swift would no longer work,
   so content length would have to be explicit in the metadata.

5.5.1. PPSP Requirements

   5.5.1.1. Basic Requirements
 


Grishchenko, et al.     Expires January 12, 2012               [Page 19]

Internet-Draft                   swift                         July 2011


   See Section 5.4.2.1.

   5.5.1.2. Peer Protocol Requirements

   - PPSP.PP.REQ-5: A new PPSP specific REPORT message would have to be
   added to swift. 

   For further discussion of the requirements, see See 5.4.2.2.


   5.5.1.3. Security Requirements

   - PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms
   [CLOSED] would have to be added. 

   - PPSP.SEC.REQ-2: A new encryption scheme for swift and/or its DATA
   messages would have to be selected, or IPsec [RFC4303] can be used.
   The scheme should allow decryptionless caching servers.

   - PPSP.SEC.REQ-3: Encryption similar to RTP encryption or IPsec
   [RFC4303] can be used between peers.

   For further discussion of the requirements, see See 5.4.2.3.





5.6.  swift over HTTP as PPSP

   In this section we sketch how swift can be carried over HTTP
   [RFC2616] to form the PPSP running over TCP. The general idea is to
   encode a swift datagram in HTTP GET and PUT requests and their
   replies by transmitting all the non-DATA messages such as HINTs and
   HAVEs as headers and send DATA messages in the body. This idea
   follows the atomic datagram principle for each request and reply. So
   a receiving peer can autonomously verify the message as carrying
   correct data, thus preventing the spread of corrupt data (PPSP.SEC-
   REQ-4). 

   A problem with HTTP is that it is a client/server protocol. To
   overcome this problem, a peer A uses a PUT request instead of a GET
   request if the peer B has indicated in a reply that it wants to
   retrieve a chunk from A. In cases where peer A is no longer
   interested in receiving requests from B (described below) B may need
   to establish a new HTTP connection to A to quickly download a chunk,
   instead of waiting for a convenient time when A sends another
   request. As an alternative design, two HTTP connections could be used
 


Grishchenko, et al.     Expires January 12, 2012               [Page 20]

Internet-Draft                   swift                         July 2011


   always.

5.6.1. Design

   5.6.1.1. Joining a Swarm

   To commence a PPSP download a peer A must have the swarm ID of the
   stream and a list of one or more tracker contact points, as above.
   The swarm ID as earlier also consists of the swift root hash of the
   content, divided in chunks of N kilobyte, with N=1K at present. 

   Peer A now registers with the PPSP tracker following the tracker
   protocol (to be defined elsewhere) and receives the IP address and
   HTTP port of peers already in the swarm, say B, C, and D. Peer A now
   establishes persistent HTTP connections with  B, C, D and sends GET
   requests with the Request-URI set to /<encoded roothash>. Optionally
   A could include a HINT message in some requests if it wants to start
   receiving content immediately. A HINT is encoded as a Range header
   with a new "bins" unit.

   B and C respond with a 200 OK reply with header-encoded HAVE
   messages. A HAVE message is encoded as e.g. an extended Accept-
   Ranges: header with the new bins unit and the possibility of listing
   the set of accepted bins. If no HINT/Range header was present in the
   request, the body of the reply is empty. D sends just a 200 OK reply
   and omits the HAVE/Accept-Ranges header as a way of choking A.

   5.6.1.2. Exchanging Chunks

   In response to B and C, A sends GET requests with Range headers,
   requesting disjunct sets of chunks. B and C respond with 206 Partial
   Content replies with the requested chunks in the body and Accept-
   Ranges headers, updating their chunk availability. The HASHES for the
   chunks are encoded in a new Content-Merkle header and the Content-
   Range is set to identify the chunk. A new "multipart-bin ranges"
   equivalent to the "multipart-bytes ranges" media type may be used to
   transmit multiple chunks in one reply.

   Upon receipt, A sends a new GET request with a HAVE/Accept-Ranges
   header for the chunks received and new HINT/Range headers to B and C.
   Now when e.g. C finds that A obtained a chunk (from B) that C did not
   yet have, C's response includes a HINT/Range for that chunk. In this
   case, A's next request to C is not a GET request, but a PUT request
   with the requested chunk sent in the body.

   Again, working around the fact that HTTP is a client/server protocol,
   peer A periodically sends HEAD requests to peer D (which was
   virtually choking A) that serve as keepalives and may contain
 


Grishchenko, et al.     Expires January 12, 2012               [Page 21]

Internet-Draft                   swift                         July 2011


   HAVE/Accept-Ranges headers. If D decides to unchoke peer A, it
   includes an Accept-Ranges header in the 200 OK reply to inform A of
   its current chunk availability. 

   If B or C decide to choke A they start responding with 204 No Content
   replies without HAVE/Accept-Ranges headers and A should then re-
   request from other peers. However, if their replies contain
   HINT/Range headers A should keep on sending PUT requests with the
   desired data (another client/server workaround). If not, A should
   slowly send HEAD requests as keepalive and content availability
   update.

   Once A has received all content (video-on-demand use case) it closes
   the persistent connections to all other peers that have all content
   (a.k.a. seeders).


   5.6.1.3. Leaving a Swarm

   Peers can explicitly leave a swarm by closing the connection. This
   mechanism works for both graceful and ungraceful leaves (i.e., peer
   crashes or disconnects). When leaving gracefully, a peer should
   deregister from the tracker following the PPSP tracker protocol.


   5.6.1.4. Discussion

   As mentioned earlier, this design suffers from the fact that HTTP is
   a client/server protocol. A solution where a peer establishes two
   HTTP connections with every other peer may be more elegant. The
   mapping of swift messages to headers remains the same:

   	HINT = Range
   	HAVE = Accept-Ranges
   	HASH = Content-Merkle
   	PEX = e.g. extended Content-Location

   The Content-Merkle header should include some parameters to indicate
   the hash function and chunk size (e.g. SHA1 and 1K) used to build the
   Merkle tree.


5.6.2. PPSP Requirements

   5.6.2.1. Basic Requirements

   - PPSP.REQ-1: The HTTP-based BitTorrent tracker protocol [BITTORRENT]
   can be used as the basis for a tracker protocol, to be discussed
 


Grishchenko, et al.     Expires January 12, 2012               [Page 22]

Internet-Draft                   swift                         July 2011


   elsewhere.

   - PPSP.REQ-2: This draft preserves the properties of HTTP, but extra
   mechanisms may be necessary to protect against faulty or malicious
   peers.

   - PPSP.REQ-3: This draft does not place requirements on peer IDs,
   IP+port is sufficient. 

   - PPSP.REQ-4: The content is identified by its root hash (video-on-
   demand) or a public key (live streaming).

   - PPSP.REQ-5: The content is partitioned in 1K chunks (see 5.6.1.1.)

   - PPSP.REQ-6: Each chunk is identified by a bin number (and its
   cryptographic hash.)

   - PPSP.REQ-7: The protocol is carried over TCP because HTTP is.


   5.6.2.2. Peer Protocol Requirements

   The peer protocol requirements suggest the protocol should allow
   peers to distribute information about the chunk availability of
   others. We consider the use of such hearsay a risk for the protocol
   as it may be exploited by malicious peers.

   - PPSP.PP.REQ-1: A HEAD request can be used to find out which chunks
   are available from a peer, which returns the new Accept-Ranges
   header.

   - PPSP.PP.REQ-2: The new Accept-Ranges header satisfies this. 

   - PPSP.PP.REQ-3: A GET with a request-URI requesting the peers of a
   resource (e.g. /<encoded roothash>/peers) would have to be added to
   request known peers from a peer, if the proposed push-based
   PEX/~Content-Location mechanism is not sufficient.

   - PPSP.PP.REQ-4: HAVE/Accept-Ranges headers convey current
   availability. New peer joins could propagated using PEX, but this
   could be used as a DDoS attack mechanism. Reporting leaves via PEX
   also enables exploitation by malicious peers. A secure tracking /
   peer sampling protocol like [PUPPETCAST] may be needed for peer-
   address exchange to be safe.

   - PPSP.PP.REQ-5: A new PPSP specific Peer-Report header would have to
   be added. If forwarded by other peers cryptography is needed to
   prevent malicious peers from sending IP address changes. For example,
 


Grishchenko, et al.     Expires January 12, 2012               [Page 23]

Internet-Draft                   swift                         July 2011


   provide each peer with a public/private key pair and send self-
   certified address changes [PERMIDS] that include the swarm ID and a
   timestamp. The peer acting as client would then have to reconnect to
   the other peer on its new address.

   - PPSP.PP.REQ-6: Transmission and chunk requests are integrated in
   this protocol.


   5.6.2.3. Security Requirements

   - PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms
   [CLOSED] would have to be added. 

   - PPSP.SEC.REQ-2: As swift is carried over HTTP, HTTPS encryption can
   be used instead. Alternatively, just the body could be encrypted. The
   latter allows for caching servers that are part of the swarm but do
   not need access to the decryption keys (they need access to the swift
   HASHES in the headers to verify the packet's integrity).

   - PPSP.SEC.REQ-3: HTTPS encryption or the content encryption
   facilities of HTTP can be used.

   - PPSP.SEC.REQ-4: The Merkle tree hashing scheme prevents the
   indirect spread of corrupt content, as peers will only forward
   content to others if its integrity checks out. Another protection
   mechanism is to not depend on hearsay (i.e., do not forward other
   peers' availability information), or to only use it when the
   information spread is self-certified by its subjects.

   Other attacks such as a malicious peer claiming it has content, but
   not replying are still possible. Or wasting CPU and bandwidth at a
   receiving peer by sending packets where the body doesn't match the
   HASH/Content-Merkle headers.


   - PPSP.SEC.REQ-5: The Merkle tree hashing scheme allows a receiving
   peer to detect a malicious or faulty sender, which it can
   subsequently close its connection to and ignore. Spreading this
   knowledge to other peers such that they know about this bad behavior
   is hearsay. 


   - PPSP.SEC.REQ-6: A risk in peer-to-peer streaming systems is that
   malicious peers launch an Eclipse [ECLIPSE] attack on the initial
   injectors of the content (in particular in live streaming). The
   attack tries to let the injector upload to just malicious peers which
   then do not forward the content to others, thus stopping the
 


Grishchenko, et al.     Expires January 12, 2012               [Page 24]

Internet-Draft                   swift                         July 2011


   distribution. An Eclipse attack could also be launched on an
   individual peer. Letting these injectors only use trusted trackers
   that provide true random samples of the population or using a secure
   peer sampling service [PUPPETCAST] can help negate such an attack. 


   - PPSP.SEC.REQ-7: swift supports decentralized tracking via PEX or
   additional mechanisms such as DHTs [SECDHTS], but self-certification
   of addresses is needed, see discussion of PPSP.PP.REQ-5 above.
   Content distribution can continue as long as there are peers that
   have it available.


   - PPSP.SEC.REQ-8: The verification of data via hashes obtained from a
   trusted source is well-established in the BitTorrent protocol
   [BITTORRENT]. The proposed Merkle tree scheme is a secure extension
   of this idea. Self-certification and not using hearsay are other
   lessons learned from existing distributed systems.



6. Security Considerations

   As any other network protocol, the swift faces a common set of
   security challenges. An implementation must consider the possibility
   of buffer overruns, DoS attacks and manipulation (i.e. reflection
   attacks). Any guarantee of privacy seems unlikely, as the user is
   exposing its IP address to the peers. A probable exception is the
   case of the user being hidden behind a public NAT or proxy.


7. Extensibility

7.1. 32 bit vs 64 bit

   While in principle the protocol supports bigger (>1TB) files, all
   the mentioned counters are 32-bit. It is an optimization, as using
   64-bit numbers on-wire may cost ~2% practical overhead. The 64-bit
   version of every message has typeid of 64+t, e.g. typeid 68 for
   64-bit hash message:
   44 000000000000000E 01234567890ABCDEF1234567890ABCDEF1234567

7.2. IPv6

   IPv6 versions of PEX messages use the same 64+t shift as in 7.1.


7.3. Congestion control algorithms
 


Grishchenko, et al.     Expires January 12, 2012               [Page 25]

Internet-Draft                   swift                         July 2011


   Congestion control algorithm is left to the implementation and may
   even vary from peer to peer. Congestion control is entirely
   implemented by the sending peer, the receiver only provides clues,
   such as hints, acknowledgments and timestamps. In general, it is
   expected that servers would use TCP-like congestion control schemes
   such as classic AIMD or CUBIC [CUBIC]. End-user peers are expected to
   use weaker-than-TCP (least than best effort) congestion control, such
   as [LEDBAT] to minimize seeding counter-incentives.


7.4. Piece picking algorithms

   Piece picking entirely depends on the receiving peer. The sender peer
   is made aware of preferred pieces by the means of HINT messages. In
   some scenarios it may be beneficial to allow the sender to ignore
   those hints and send unrequested data.


7.5. Reciprocity algorithms

   Reciprocity algorithms are the sole responsibility of the sender
   peer. Reciprocal intentions of the sender are not manifested by
   separate messages (as BitTorrent's CHOKE/UNCHOKE), as it does not
   guarantee anything anyway (the "snubbing" syndrome).


7.6. Different crypto/hashing schemes

   Once a flavor of swift will need to use a different crypto scheme
   (e.g., SHA-256), a message should be allocated for that. As the root
   hash is supplied in the handshake message, the crypto scheme in use
   will be known from the very beginning. As the root hash is the
   content's identifier, different schemes of crypto cannot be mixed in
   the same swarm; different swarms may distribute the same content
   using different crypto.


References

[RFC2119] Key words for use in RFCs to Indicate Requirement Levels
[HTTP1MLN] Richard Jones. "A Million-user Comet Application with
    Mochiweb", Part 3. http://www.metabrew.com/article/
    a-million-user-comet-application-with-mochiweb-part-3
[MOLNAT] J.J.D. Mol, J.A. Pouwelse, D.H.J. Epema and H.J. Sips:
    "Free-riding, Fairness, and Firewalls in P2P File-Sharing"
[LUCNAT] submitted
[BINMAP] V. Grishchenko, J. Pouwelse: "Binmaps: hybridizing bitmaps
    and binary trees"
 


Grishchenko, et al.     Expires January 12, 2012               [Page 26]

Internet-Draft                   swift                         July 2011


    http://bouillon.math.usu.ru/articles/binmaps-alenex.pdf
[SNP] B. Ford, P. Srisuresh, D. Kegel: "Peer-to-Peer Communication
    Across Network Address Translators",
    http://www.brynosaurus.com/pub/net/p2pnat/
[MERKLE] Merkle, R. "A Digital Signature Based on a Conventional
    Encryption Function". Proceedings CRYPTO'87, Santa Barbara, CA,
    USA, Aug 1987. pp 369-378.
[ABMRKL] Arno Bakker: "Merkle hash torrent extension", BEP 30,
    http://bittorrent.org/beps/bep_0030.html
[CUBIC] Injong Rhee, and Lisong Xu: "CUBIC: A New TCP-Friendly
    High-Speed TCP Variant",
    http://www4.ncsu.edu/~rhee/export/bitcp/cubic-paper.pdf
[LEDBAT] S. Shalunov: "Low Extra Delay Background Transport (LEDBAT)"
    http://www.ietf.org/id/draft-ietf-ledbat-congestion-00.txt
[TIT4TAT] Bram Cohen: "Incentives Build Robustness in BitTorrent", 2003,
    http://www.bittorrent.org/bittorrentecon.pdf
[BITTORRENT] B. Cohen, "The BitTorrent Protocol Specification",
     February 2008, http://www.bittorrent.org/beps/bep_0003.html
[RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
    Jacobson, "RTP: A Transport Protocol for Real-Time
    Applications", STD 64, RFC 3550, July 2003.
[I-D.ietf-ppsp-reqs] Zong, N., Zhang, Y., Pascual, V., Williams, C.,
    and L. Xiao, "P2P  Streaming Protocol (PPSP) Requirements", 
    draft-ietf-ppsp-reqs-03 (work in progress), July 2011.
[PPSPCHART] Stiemerling et al. "Peer to Peer Streaming Protocol (ppsp)
    Description of Working Group" 
   http://datatracker.ietf.org/wg/ppsp/charter/
[PERMIDS] A. Bakker et al. "Next-Share Platform M8--Specification
    Part", App. C. P2P-Next project deliverable D4.0.1 (revised), 
    June 2009.
    http://www.p2p-next.org/download.php?id=E7750C654035D8C2E04D836243E6526E
[PUPPETCAST] A. Bakker and M. van Steen. "PuppetCast: A Secure Peer
    Sampling Protocol". Proceedings 4th Annual European Conference on
    Computer Network Defense (EC2ND'08), pp. 3-10, Dublin, Ireland,
    11-12 December 2008.
[CLOSED] N. Borch, K. Michell, I. Arntzen, and D. Gabrijelcic: "Access
    control to BitTorrent swarms using closed swarms". In Proceedings
    of the 2010 ACM workshop on Advanced video streaming techniques
    for peer-to-peer networks and social networking (AVSTP2P '10).
    ACM, New York, NY, USA, 25-30. 
    http://doi.acm.org/10.1145/1877891.1877898 
[ECLIPSE] E. Sit and R. Morris, "Security Considerations for
    Peer-to-Peer Distributed Hash Tables", IPTPS '01: Revised Papers
    from the First International Workshop on Peer-to-Peer Systems, pp.
    261-269, Springer-Verlag, London, UK, 2002.
[SECDHTS] G. Urdaneta, G. Pierre, M. van Steen, "A Survey of DHT
   Security Techniques", ACM Computing Surveys, vol. 43(2), June 2011.
[HTTP] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, 
 


Grishchenko, et al.     Expires January 12, 2012               [Page 27]

Internet-Draft                   swift                         July 2011


   P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1",
   RFC2616, June 1999.
[SWIFTIMPL] V. Grishchenko, et al. "Swift M40 reference implementation",
   http://swarmplayer.p2p-next.org/download/Next-Share-M40.tar.bz2
   (subdirectory Next-Share/TUD/swift-trial-r2242/), July 2011.

Authors' addresses

   A. Bakker
   Technische Universiteit Delft 
   Department EWI/ST/PDS
   Room HB 9.160
   Mekelweg 4
   2628CD Delft
   The Netherlands

   Email: arno@cs.vu.nl


































Grishchenko, et al.     Expires January 12, 2012               [Page 28]