Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                  BT & UCL
Expires: April 20, 2006                                       A. Jacquet
                                                            A. Salvatori
                                                                      BT
                                                        October 17, 2005

     Re-ECN: Adding Accountability for Causing Congestion to TCP/IP
                    draft-briscoe-tsvwg-re-ecn-tcp-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 20, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This document introduces a new feedback protocol for explicit congestion notification (ECN), termed re-ECN. It arranges the ECN field of each packet so that, as it arrives at each router, the relative rates of each codepoint will give a truthful prediction of congestion on the remainder of the path. It also outlines mechanisms at the network edge that ensure the dominant selfish strategy of both

Briscoe, et al.          Expires April 20, 2006                 [Page 1]

Internet-Draft   Re-ECN: Adding Accountability to TCP/IP    October 2005

   network domains and end-points will be to set these codepoints honestly and to respond correctly to path congestion, despite conflicting interests.
   Although these mechanisms influence incentives, they use engineering mechanisms like throttling and dropping, rather than requiring changes to end-user pricing. The protocol can be deployed incrementally around unmodified routers without requiring changes to IP.

Authors' Statement: Status

   This document is posted as an initial Internet-Draft with the intent (at least that of the authors) to eventually progress to standards track. However, it proposes using the ECN codepoints currently set aside for the experimental ECN nonce. The protocol proposed here aims to allow networks to be able to police cheating senders and receivers and to police neighbouring networks. On the other hand, the ECN nonce aims to allow senders to detect cheating receivers. Although the proposed scheme addresses a much more pressing problem, compromises in its strength have had to be introduced in order to make it incrementally deployable. We therefore seek the opinion of the Internet Community on whether the resulting strength is sufficient to warrant standards action.

Table of Contents

   1. Introduction
   2. Requirements notation
   3. Protocol Overview
      3.1. Imprecise Protocol Overview
      3.2. Precise Protocol Overview
   4. Transport Layers
      4.1. TCP
         4.1.1. Re-ECT mode: Full re-ECN capable transport
         4.1.2. Re-ECT-Compat mode: Re-ECT Sender with a Vanilla or Nonce ECT Receiver
         4.1.3. Capability Negotiation
         4.1.4. Flow Start
      4.2. Other Transports
         4.2.1. Guidelines for Adding Re-feedback to Other Transports
   5. Network Layer
   6. Non-Issues
   7. Applications
      7.1. Policing Per-Flow Congestion Response
         7.1.1. Incentive Framework
         7.1.2. Dropper
         7.1.3. Per-flow Policer
         7.1.4. Inter-domain Policing
         7.1.5. Limitations
         7.1.6. Other Applications
   8. Incremental Deployment
   9. Architectural Rationale {ToDo:}
   10. Simulations
   11. Related Work
   12. IANA Considerations
   13. Security Considerations
   14. Conclusions
   15. Acknowledgements
   16. References
      16.1. Normative References
      16.2. Informative References
   Authors' Addresses
   Intellectual Property and Copyright Statements

1. Introduction

   The current Internet architecture trusts hosts to respond voluntarily to congestion.
   Limited evidence shows that the large majority of end-points on the Internet comply with a TCP-friendly response to congestion. But telephony (and increasingly video) services over the best efforts Internet are attracting the interest of major commercial operations. Most of these applications do not respond to congestion at all. Those that can switch to lower-rate codecs still have a lower bound below which they become unresponsive. Even TCP-friendly applications can cause a disproportionate amount of congestion, simply by using multiple flows or by transferring data continuously. Also, of course, the Internet Architecture has few defences against denial of service attacks that combine both problems: unresponsiveness to congestion and flooding with multiple flows.

   Applications that need (or choose) to be unresponsive to congestion can effectively steal whatever share of bottleneck resources they want from responsive flows. Whether or not such free-riding is common, inability to prevent it increases the risk of poor returns for investors in network infrastructure, leading to under-investment. An increasing proportion of unresponsive, free-riding demand coupled with persistent under-supply is a broken economic cycle. Therefore, if the current, largely co-operative consensus continues to erode, congestion collapse could become more common in various parts of the Internet [RFC3714].

   The problem is architectural because the information needed for policing is at the other end of the Internet from the point where control is needed. Policing is only truly effective at the first ingress into an internetwork, but path congestion is only visible at the last egress. We believe non-architectural approaches (see Section 11) to this problem are unlikely to offer more than partial solutions.

   This document proposes a simple realignment of the Internet's feedback architecture, termed 're-feedback' [Re-fb]. The word is short for either receiver-aligned or re-inserted feedback.
   Changing the Internet's feedback architecture seems to imply considerable upheaval. But, this document proposes changes that could be deployed incrementally at the transport layer (we focus on TCP), around unmodified routers using the existing fields in IP (v4 or v6). However, we stress that the limited space in the IP header heavily reduces the responsiveness of the scheme for dealing with deliberate dynamic attacks (see Section 7.1.5 for this and other limitations).

   Conceptually, the solution could hardly be simpler. Packet header fields accumulate path information as data traverses the path, just as can already be done with time to live (TTL) or explicit congestion notification (ECN) [RFC3168] (this document focuses only on ECN). The ECN marking rate currently always starts at the datum of zero at the sender. Instead we expect the sender to try to make packets arrive at the destination with a marking rate averaging around an agreed datum. We define one of the ECT codepoints as negative so that we can set this datum to zero. As each receiver feeds back congestion marking arriving in packets, the sender continuously adjusts subsequent packets in order to continue to hit the zero target on average.

   For flows from transports using re-feedback each packet arrives at each network element carrying a view of its own downstream path, albeit a round trip ago and averaged over multiple packets. Most usefully, full path congestion becomes visible at the first ingress.

   "Accountability for congestion" implies being able to identify who is responsible. But congestion is a link/network/transport layer issue---not appropriate layers for adding individual or organisational identity. The approach we take simply relocates information about path congestion from the egress to the ingress.
   Then congestion can be associated with the interface that is directly and ultimately responsible for causing the congestion (whether intentionally or not). Identifying the ingress interface is a sufficient hook for tracing the (ir)responsible party if required. More usefully, it is exactly the place to police or throttle in order to directly mitigate congestion, rather than having to trace the (ir)responsible party.

   Importantly, the scheme is recursive: a whole network harbouring users causing congestion in downstream networks can be held responsible or policed. But the scheme correctly discounts congestion in networks upstream of the inter-domain interface, which is their own problem.

   That is all well and good, but we still don't seem to have solved the problem. It seems naive to hold the sender accountable by trusting fields that depend on the honesty of both the sender and receiver---those with most to gain from lying. However, having re-aligned the congestion marking datum to the receiver, we show how the egress operator can ensure that end users will lose more than they gain by being dishonest, so the dominant strategy will be honesty.

   Operators can deploy 'droppers' (Section 7.1.2) at their egress interfaces to their end-customers to check whether traffic leaving the internetwork is hitting the zero target. Flows persistently below the target at the destination must come from a source that is deliberately understating path congestion. The dropper can apply sanctions to these flows---to ensure they lose more than they gain.

   Further, in Section 7.1.1 we explain in outline why re-aligning feedback allows us to arrange for honesty to be everyone's dominant strategy---not only end-users, but also networks.
   Building on the resulting trustworthiness of downstream path congestion metrics at the ingress, it is possible to build a rate equation policer (or other types of congestion-based policers depending on what the network operator wants to achieve). The details are outside the scope of this document, but we describe the general principles in Section 7.1.3 using TCP as a concrete example. We also describe an example passive bulk policer for inter-domain boundaries.

   The structure of the rest of this document is as follows. First we provide an overview of how the re-ECN protocol works as a whole using TCP/IP as an example (Section 3). Then we describe the network layer functions generic to all transports (Section 5) before describing the changes necessary in TCP and guidelines on adding the re-feedback capability to other transports (Section 4). We complete the protocol engineering sections of the document by clarifying why some issues such as encryption and tunnels are actually non-issues by deliberate design (Section 6).

   The rest of the document starts by describing accountability applications that can be built over the protocol, such as the dropper and policer already outlined, but also briefly outlines other possibilities like DoS mitigation and using congestion accountability to achieve end-to-end differentiated QoS (Section 7). Then deployment issues discussed throughout the document are brought together in Section 8, which leads in to a brief section explaining the somewhat subtle rationale for the design, from an architectural perspective (Section 9). We end by referring to various simulations (Section 10), describing related work (Section 11), listing security considerations (Section 13) and finally drawing conclusions (Section 14).

2. Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3. Protocol Overview

   First we briefly recap the essentials of the ECN protocol [RFC3168]. Two bits in the IP header (v4 or v6) are assigned to the ECN field. The sender clears the field to "00" (Not-ECT) if either end-point transport is not ECN-capable. Otherwise it indicates an ECN-capable transport (ECT) using either of the two code-points "10" or "01" (ECT(0) and ECT(1) resp.). Routers probabilistically set "11" if congestion is experienced (CE).

   The choice of two ECT code-points permitted future flexibility, optionally allowing the sender to encode the experimental ECN nonce [RFC3540] in the packet stream. The ECN nonce is an elegant scheme that allows the sender to detect if a receiver tries to claim no congestion was experienced when in fact it was (whether drop or ECN marking). The sender chooses between the two ECT codepoints in a pseudo-random sequence. Then, whenever the network marks a packet with CE, the receiver has to guess which ECT codepoint was overwritten, with only a 50:50 chance of being correct each time.

   We use the flexibility of the two ECT codepoints originally provided for the ECN nonce, in a scheme we call re-ECN. However, the re-ECN protocol addresses a much wider range of cheating problems, which includes the one addressed by the ECN nonce.

   The assumption behind the ECN nonce is that a sender will want to detect whether a receiver is suppressing congestion feedback. This is only true if the sender's interests are aligned with the network's, or with the community of users as a whole. This may be true for certain large senders, who are under close scrutiny and have a reputation to maintain. But we have to deal with a more hostile world, where traffic may be dominated by peer-to-peer transfers, rather than downloads from a few popular sites.
   Often the 'natural' self-interest of a sender is not aligned with the interests of other users. It wishes to transfer data quickly to the receiver as much as the receiver wants the data quickly.

   The re-ECN protocol enables policing of an agreed rate-response to congestion (e.g. TCP-friendliness) at the sender's interface with the internetwork. It also ensures downstream networks can police their upstream neighbours, to encourage them to police their users in turn. But most importantly, it requires the sender to declare path congestion to the network and it can remove traffic at the egress if this declaration is dishonest. So it can police correctly, irrespective of whether the receiver tries to suppress congestion feedback or whether the sender ignores genuine congestion feedback.

   Here we only discuss TCP/IP, not other IP transports. No changes to the IP or TCP wire protocols are REQUIRED, beyond those specified already for ECN [RFC3168]. No changes to the handling of IP in senders, receivers or routers are REQUIRED and the TCP receiver does not need changing either, only the TCP sender. However, later, we define RECOMMENDED changes to both the IP and TCP wire-protocols and to the TCP receiver (Section 8 gives the incremental deployment strategy).

   The re-ECN protocol makes no changes to, and has no effect on, the TCP congestion control algorithm. Re-ECN is only concerned with setting the proportions of ECT(0) and ECT(1), which is completely orthogonal to congestion control.

3.1. Imprecise Protocol Overview

   We will first give an imprecise but intuitive overview of the re-ECN protocol, before being more exact.

   The general idea is to encode remaining downstream path congestion into the three ECN codepoints. Currently the ECN field only encodes the path congestion that has already been experienced upstream.
   Although we do not change the behaviour of ECN-capable routers, we need to highlight the way they accumulate ECN marking along a path. ECN-capable routers encode a time-varying congestion signal into a stream of packets by varying the rate at which they set the CE codepoint. Each ECN-capable router marks some packets with CE, the marking probability increasing with the length of the queue at its egress link (the RED algorithm [RFC2309]). The combined effect of the packet marking of all the routers along the path signals congestion of the whole path to the receiver. So, for example, if one router early in a path is marking 1% of packets and another later in a path is marking 2%, flows that pass through both routers will experience approximately 3% marking.

   With current ECN, the TCP receiver echoes CE marked packets back to the sender. The idea of re-ECN is for the sender at the head of the path to re-echo each echoed CE packet back into the forward data path using the ECT(0) codepoint. In the above example it would set the ECT(0) codepoint on approximately 3% of the packets it sends. It sets the remaining 97% to the other ECT codepoint, ECT(1).

   Then downstream congestion can be measured to be the rate of ECT(0) marking minus the rate of CE marking (note, we are still being imprecise). Remaining downstream congestion can be measured at any point along the path, not just at the sender.

       ^
       |
       |            ECT(0) marking rate
    3% |--------------------------------+=====
       |                                |
    2% |                                |
       |         CE marking rate        |
    1% |        +-----------------------+
       |        |
    0% +---------------------------------------->
       ^        0           ^           1    ^   resource index
       |        ^           |           ^    |
       0        |           1           |    2   observation points
              1.00%                   2.00%      marking rate

              Figure 1: A 2-Router Example (Imprecise)

   Figure 1 shows an example to illustrate this. The horizontal axis represents the index of each congestible resource (typically queues) along a path through the Internet.
   The two superimposed plots show the rate of each ECN codepoint observed along this path.

   +-------------------+-----------------------+
   | Observation point | Downstream congestion |
   +-------------------+-----------------------+
   | 0                 | 3% - 0% = 3%          |
   | 1                 | 3% - 1% = 2%          |
   | 2                 | 3% - 3% = 0%          |
   +-------------------+-----------------------+

   Effectively, using the ECT(0) codepoint, the re-ECN sender superimposes the whole-path congestion rate into the congestion marking signal, without disturbing the signal representing upstream congestion given by CE marking. So all along the path, the whole-path congestion can be used as a reference against which to compare the upstream congestion. The difference predicts downstream congestion for the rest of the path.

   So downstream congestion, upstream congestion and whole path congestion are all encoded in 1.5 bits (3 of the 4 ECN codepoints). Being able to encode downstream congestion directly in the packet stream is the key to adding accountability to IP, as outlined in the introduction.

3.2. Precise Protocol Overview

   We will now repeat this description with more precision, because, of course, there is a delay between the two signals and binary marking is probabilistic not additive. We will end up able to precisely measure a moving average of downstream congestion at any point on a path.

   Attending so closely to precision issues might seem pedantic, but it avoids systematic bias in downstream congestion averages. The idea is on average to hit a target of zero downstream congestion at the destination. An average even slightly below zero then signifies someone is cheating. If we recommend a sender design that introduces a systematic bias, the distinction between cheating and honesty will be blurred.

   First some notation.
   j represents the index of each resource (typically queues) along a path, ranging from 0 at the first router to n-1 at the last. We use m_j to represent the rate at which a router marks packets using resource j. u_j is the rate of CE marking observable in packet headers arriving at resource j (before marking). Similarly, z_j is the rate of ECT(0) marking. (To aid readability, think m for *m*arking rate, u for *u*pstream congestion, z for ECT *z*ero.)

   All measurements are in terms of bytes, not packets, assuming that line resources are more congestible than packet processing. Observed rates of each particular codepoint (u, z as well as h and v below) have dimensions of data rate [b/s]. Router marking rate m is a dimensionless fraction, being the ratio of two data rates (marked and total).

   We define what is effectively a virtual header field h, where h_j = u_j - z_j at any node j on the path. As with current ECN, no packets are sent with CE set: u_0 = 0. And TCP feeds back to the source any CE arriving at the destination in the echo congestion experienced (ECE) field (see Section 4.1 for how the accuracy of ECE feedback can be improved).

   The sender tries to arrange the starting value h_0 of this virtual header so that it will reach zero at the destination. In other words, it tries to ensure the rates of ECT(0) and CE at the destination are equal: z_n = u_n. Of course it cannot achieve this for certain, but it can on average over a long enough time (we have to assume congestion processes are stationary over the short times we will be averaging, which is a fairly safe assumption).

   Even though the sender will only hit the target on average, it is important to avoid any systematic bias. So the sender MUST allow for some ECT(0) packets being changed to CE by downstream routers. If the rate of ECT(0) set by the sender is z_0, and, by the end of the path, any packet (of whatever value of ECT) might be changed to CE with a probability u_n, then the remaining rate of packets that have avoided having their ECT(0) marking removed is,

      z_n = z_0(1 - u_n).

   But we want z_n = u_n. So

      u_n = z_0(1 - u_n)

   or

      z_0 = u_n / (1 - u_n).                                 .......(1)

   So, to avoid bias, rather than just re-echoing ECE, the sender should slightly inflate the rate at which it sends ECT(0) by a factor of 1/(1 - u_n) (Section 4.1 gives a simple TCP ack handler algorithm to do this).

   To be absolutely clear, all the marking rates we discuss here result from the behaviour of simple protocol handler algorithms---we are not saying the protocol handlers have to work with these rates directly.

       ^
       | 3.07%      ECT(0) marking rate
       |--------________________________
    3% |              3.04%             +=====
       |                                | 2.98%
    2% |                                |
       |         CE marking rate        |
    1% |        +-----------------------+
       | 0.00%  |       1.00%
    0% +---------------------------------------->
       ^        0           ^          n-1   ^   resource index, j
       |        ^           |           ^    |
      j=0       |           j           |   j=n  observation points
       |      1.00%         |         2.00%  |   marking rate, m
     0.00%                1.00%            2.98% upstr congestion, u
     3.07%                3.04%            2.98% rate of ECT(0), z
    -3.07%               -2.04%            0.00% virtual header, h
     2.98%                2.00%            0.00% downstr congestion, v

      Figure 2: Measuring Upstream and Downstream Congestion (Precise)

   Figure 2 repeats our example from Figure 1, but with more precision. It shows how combining 1% and 2% marking leads to slightly less than 3% whole-path marking:

      u_n = 1 - (1 - m_0)(1 - m_1)
          = 100% - 99.00% x 98.00%
          = 2.98%

   The figure also shows the sender slightly inflating the rate at which it sets ECT(0) to 3.07% in order to hit the 2.98% target.

      z_0 = u_n / (1 - u_n)              .......From Equation (1)
          = 2.98% / 97.02%
          = 3.07%

   We now introduce the notation v for the downstream congestion metric we need for accountability applications.
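
   The arithmetic of the two-router example above can be checked with a few lines of code. The following Python sketch is illustrative only and is not part of the protocol: the function names are hypothetical, and it assumes marking at each resource is independent, as the product form for u_n implies. It should reproduce the u and z rows of Figure 2.

```python
def whole_path_marking(marking_rates):
    """Combine per-resource marking rates m_j into the whole-path CE rate u_n."""
    unmarked = 1.0
    for m in marking_rates:
        unmarked *= 1.0 - m          # fraction still unmarked after this resource
    return 1.0 - unmarked


def sender_ect0_rate(u_n):
    """Equation (1): rate z_0 of ECT(0) the sender sets so that z_n = u_n."""
    return u_n / (1.0 - u_n)


def ect0_rates_along_path(z_0, marking_rates):
    """ECT(0) rate z_j seen at each observation point j = 0..n."""
    rates = [z_0]
    for m in marking_rates:
        # Each resource overwrites a fraction m of the remaining ECT(0) with CE.
        rates.append(rates[-1] * (1.0 - m))
    return rates


marking_rates = [0.01, 0.02]         # m_0 and m_1 from the example
u_n = whole_path_marking(marking_rates)
z_0 = sender_ect0_rate(u_n)
z = ect0_rates_along_path(z_0, marking_rates)

print(f"u_n = {u_n:.2%}, z_0 = {z_0:.2%}")
print("z_j =", [f"{rate:.2%}" for rate in z])
```

   The last element of z equals u_n, which is the zero target (h_n = u_n - z_n = 0) at the destination.
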
   For low levels of congestion (|h_j| << 1), the virtual headers h arriving in packets at any node on the path are themselves a good approximation to downstream congestion, that is v ~ -h (we have deliberately used a definition of h that usually makes it numerically negative). In other words, downstream congestion can be approximated simply by subtracting the rate of CE from that of ECT(0). But, because the rate of ECT(0) had to be inflated by the sender, it needs deflating again if a precise measure of downstream congestion is required. The following formula does the necessary deflation (derived in Appendix A.1 of [Re-fb]):

      v_j = 1 - 1 / (1 - h_j).                               .......(2)

   The last two rows in Figure 2 show the virtual header and this precise downstream congestion metric for our example scenario.

   Note that Equation (2) deliberately implies that the sender cannot declare downstream congestion of more than 50%, as it MUST NOT set the virtual header outside the bounds -1 <= h_j <= 0. We discuss this saturation in Section 7.1.5.

4. Transport Layers

4.1. TCP

   The ECN field in the IPv4 and IPv6 wire protocols and the names of the codepoints within it remain unchanged from their definition in [RFC3168].

   Re-ECN capability at the sender is essential. At the receiver it is optional, as long as the receiver has a basic ('vanilla flavour') ECN-capable transport (ECT) conforming to [RFC3168]. Given that the number of combinations of sender and receiver capability has grown to 16, we give a table below summarising what happens with each combination. The sender of the TCP half-connection is host S and the receiver is R. There is a column for each flavour of host capability. The last column gives the mode the half-connection is in after the TCP handshake (we have made up the names for the nonce-related modes --- the ECN nonce RFC [RFC3540] doesn't specify behaviour in all cases).
   Only the first three rows concern us here, as all the rest are already outlined in the specifications of earlier flavours of ECN.

   +--------+-----------+-----+---------+---------------+
   | Re-ECT | ECT-Nonce | ECT | Not-ECT | Mode          |
   +--------+-----------+-----+---------+---------------+
   | SR     |           |     |         | Re-ECT        |
   | S      | R         |     |         | Re-ECT-Compat |
   | S      |           | R   |         | Re-ECT-Compat |
   | S      |           |     | R       | Not-ECT       |
   |        | SR        |     |         | ECT-Nonce     |
   |        | S         | R   |         | Half-Nonce    |
   |        | S         |     | R       | Not-ECT       |
   |        |           | SR  |         | ECT           |
   |        |           | S   | R       | Not-ECT       |
   |        |           |     | SR      | Not-ECT       |
   | R      | S         |     |         | ECT-Nonce     |
   | R      |           | S   |         | ECT-Nonce?    |
   | R      |           |     | S       | Not-ECT       |
   |        | R         | S   |         | ECT-Nonce?    |
   |        | R         |     | S       | Not-ECT       |
   |        |           | R   | S       | Not-ECT       |
   +--------+-----------+-----+---------+---------------+

   o  Re-ECT: Full re-ECN capable transport

   o  Re-ECT-Compat: Re-ECN sender in compatibility mode with a vanilla [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable receiver.

   We will describe what happens in these two modes, then describe how they are negotiated.

4.1.1. Re-ECT mode: Full re-ECN capable transport

   In full Re-ECT mode, for each half connection, both sender and receiver maintain an unsigned integer counter we will call ECI (echo congestion increment). It maintains a count, modulo 8, of how many times a CE marked packet has arrived at the receiver during the half-connection. Conceptually the three TCP header flags used for ECN-related functions in previous versions of ECN (NS, CWR and ECE) are used as a 3-bit field for the receiver to repeatedly tell the sender the current value of ECI. This conceptual field is shown in Figure 4. For comparison, Figure 3 shows how the TCP header is actually defined at the moment, including the flag added to support the ECN nonce.

   Every time a CE marked packet arrives at the receiver, the receiver transport increments its local value of ECI modulo 8 and immediately echoes its value to the sender in this conceptual ECI field in the TCP header. It repeats the same value of ECI in every subsequent ACK until the next CE event, when it increments ECI again.

   The increment of the local ECI values is modulo 8 so the field value simply wraps round back to zero when it overflows. The least significant bit is to the right (labelled bit 9).

   {ToDo: Unfortunately the 3-bit field crosses a byte boundary - how important is this?}

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 3: The (post-ECN Nonce) definition of bytes 13 and 14 of the TCP Header

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 4: Our alternative conceptual view of bytes 13 and 14 of the TCP Header

   On the arrival of every ACK, the sender compares the ECI field with its own copy, then replaces its local copy with that from the ACK. The difference is assumed to be the number of CE marked packets that have arrived at the receiver since the last ACK (but see below for the sender's safety strategy). For each increment of the ECI field (or detection of a drop), the sender MUST set the ECT(0) codepoint in the IP header of the next packet it sends, effectively re-echoing each increment to ECI. Otherwise the data sender sends all packets with ECT(1) set.
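
   The ECI exchange described above can be sketched in a few lines. The following Python is a minimal illustration, not a specification: the class and method names are hypothetical, drop detection is not modelled, and queueing one ECT(0) per outstanding increment is our assumption about how "the next packet" generalises when one ACK reports several increments.

```python
ECI_MODULUS = 8  # ECI is a 3-bit counter, so it wraps at 8


class EciReceiver:
    """Receiver side: counts CE-marked arrivals modulo 8."""

    def __init__(self):
        self.eci = 0

    def on_data_packet(self, ce_marked):
        """Process one arriving packet; return the ECI value echoed in the ACK."""
        if ce_marked:
            self.eci = (self.eci + 1) % ECI_MODULUS
        return self.eci  # repeated unchanged until the next CE event


class EciSender:
    """Sender side: turns ECI increments into ECT(0) re-echoes."""

    def __init__(self):
        self.eci = 0           # local copy of the receiver's counter
        self.pending_ect0 = 0  # increments not yet re-echoed (assumed queue)

    def on_ack(self, ack_eci):
        """Return how many new CE marks this ACK reports."""
        new_marks = (ack_eci - self.eci) % ECI_MODULUS  # modular difference
        self.eci = ack_eci
        self.pending_ect0 += new_marks
        return new_marks

    def next_codepoint(self):
        """Choose the ECN codepoint for the next outgoing packet."""
        if self.pending_ect0 > 0:
            self.pending_ect0 -= 1
            return "ECT(0)"
        return "ECT(1)"


# Five packets arrive, three of them CE marked; only the last ACK gets through.
receiver = EciReceiver()
echoes = [receiver.on_data_packet(ce) for ce in (False, True, True, False, True)]
sender = EciSender()
reported = sender.on_ack(echoes[-1])
codepoints = [sender.next_codepoint() for _ in range(4)]
```

   Because the difference is taken modulo 8, the sender still recovers the right count when the counter wraps, provided fewer than 8 increments occur between the ACKs it receives.
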
   As we have already emphasised in the protocol overview, the re-ECN protocol makes no changes to, and has no effect on, the TCP congestion control algorithm. So, each increment of ECI (or detection of a drop) also triggers the standard TCP congestion response, but with no more than one congestion response per round trip, as usual.

   We chose this method for echoing congestion marking because a re-ECN sender needs to know about every CE mark arriving at the receiver, not just whether at least one arrives within a round trip time. But pure ACKs are not protected by TCP reliable delivery, so we repeat the same ECI value in every ACK until it changes. Even if many ACKs in a row are lost, as soon as one gets through, the ECI field it repeats from previous ACKs that didn't get through will update the sender on how many CE marks arrived since the last ACK got through.

   The sender will only lose a record of the arrival of a CE mark if /all/ the ACKs are lost (and all of them were pure ACKs) for a stream of data long enough to contain 8 or more CE marks. To protect against this extremely unlikely event, if the sender receives an ACK that acknowledges a sequence number 8 or more segments higher than the previously ack'd sequence number it should conservatively behave as if all the intervening ACKs echoed a new CE mark. {ToDo: This behaviour is ultra-ultra-conservative, so we need to check whether this ultra-safety is really necessary.}

   In order to slightly inflate the rate of ECT(0) marking, as described in the precise protocol overview above (Section 3.2), for each half-connection the TCP sender maintains a single EWMA, U, of the number of ACKed packets between successive increments of ECI (assuming equal-sized packets). It sets an extra ECT(0) in sent data every (U-1) increments of the ECI field.
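
   One way the inflation rule above might be realised is sketched below. This Python is an illustration under stated assumptions, not the ack handler of this document: the class name, the weight of 1/8, the initial value of U and the use of floating point are all hypothetical choices, and packets are assumed to be of equal size.

```python
class Ect0Inflator:
    """Tracks the EWMA U and decides when an extra ECT(0) is due."""

    def __init__(self, weight=0.125, initial_gap=100.0):
        self.weight = weight             # EWMA weight (assumed value)
        self.u = initial_gap             # U: avg ACKed packets per ECI increment
        self.acked_since_mark = 0        # packets ACKed since the last increment
        self.marks_since_extra = 0.0     # increments since the last extra ECT(0)

    def on_acked_packets(self, count=1):
        self.acked_since_mark += count

    def on_eci_increment(self):
        """Return how many ECT(0) packets to send for this increment (1 or 2)."""
        # Update U with the latest gap between increments.
        self.u += self.weight * (self.acked_since_mark - self.u)
        self.acked_since_mark = 0
        self.marks_since_extra += 1
        ect0_to_send = 1                 # the plain re-echo
        # One extra ECT(0) every (U - 1) increments; skip if U <= 1 (saturated).
        if self.u > 1 and self.marks_since_extra >= self.u - 1:
            self.marks_since_extra -= self.u - 1
            ect0_to_send += 1
        return ect0_to_send


# Steady state: one CE mark per 33 ACKed packets, i.e. u_n = 1/33.
inflator = Ect0Inflator(initial_gap=33.0)
total_ect0 = 0
total_packets = 0
for _ in range(320):
    inflator.on_acked_packets(33)
    total_packets += 33
    total_ect0 += inflator.on_eci_increment()
ect0_rate = total_ect0 / total_packets
```

   With u_n = 1/33, Equation (1) gives z_0 = (1/33) / (32/33) = 1/32, which is the ECT(0) rate the loop above converges on.
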
Maintaining this EWMA can be done with a shift and an add per packet, by choosing a power of two for the weight.

4.1.2. Re-ECT-Compat mode: Re-ECT Sender with a Vanilla or Nonce ECT Receiver

If the half-connection is in Re-ECT-Compat mode, the receiver will not understand re-ECN but the sender can infer enough from the vanilla ECN feedback to set the ECT(0) marking rate reasonably well. Essentially, every time the receiver toggles the ECE field from 0 to 1, the Re-ECN sender does the same as it would do in full Re-ECT mode. That is, it sets ECT(0) on the next packet and maintains the two counters that enable it to inflate the rate of ECT(0).

If a CE marked packet arrives at the receiver within a round trip time of a previous mark, the receiver will still be echoing ECE for the last CE mark. Therefore, such a mark will be missed by the sender. Of course, this isn't of concern for congestion control, but it does mean that the rate of ECT(0) will be occasionally understated. If there is a dropper at the egress, flows in Re-ECT-Compat mode may be mistaken for very lightly cheating flows and suffer a small number of packet drops.

We expect Re-ECN would be deployed for some time before policers and droppers start to enforce it. So, given there is not much ECN deployment yet anyway, this minor problem may affect only a very small proportion of flows, reducing to nothing over the years as vanilla ECN sites upgrade. {ToDo: This decision will need to be reviewed in the light of experience at the time of re-ECN deployment.}

Re-ECT-Compat mode is OPTIONAL. Re-ECN implementers who want to keep their code simple MAY choose not to implement this mode. If they do not, a re-ECN sender SHOULD fall back to vanilla ECT mode in the presence of an ECN-capable receiver.
It MAY choose to fall back to the ECT-Nonce mode, but if implementers don't want to be bothered with this mode, they probably won't want to bother with the nonce either.

4.1.3. Capability Negotiation

During the TCP hand-shake at the start of a connection, the originator of the connection (host A) indicates that it has a re-ECN-capable transport (Re-ECT) by setting the TCP flags NS=1, CWR=1 and ECE=1 in the initial SYN. A responding Re-ECT host (host B) should return a SYN ACK with flags NS=0, CWR=1 and ECE=0. We would also like to reserve the combination NS=1, CWR=1 and ECE=0 for future Re-ECN use {ToDo: describe this future use}. These handshakes are summarised in the table below. The handshakes used for the other flavours of ECN are also shown for comparison.

+---------+----+-----+-----+-----------------------------+
| Phase   | NS | CWR | ECE | Condition                   |
+---------+----+-----+-----+-----------------------------+
| SYN     | 1  | 1   | 1   | A is Re-ECT                 |
| SYN ACK | 0  | 1   | 0   | B is Re-ECT                 |
| SYN ACK | 1  | 1   | 0   | Reserved: future Re-ECT use |
| SYN ACK | 0  | 0   | 1   | B is ECT                    |
| SYN ACK | 1  | 0   | 1   | B is ECT-Nonce              |
+---------+----+-----+-----+-----------------------------+

The rationale for choosing these particular combinations of flags is as follows.

Choice of SYN flags: The Re-ECN sender can work with vanilla ECN receivers so we wanted to use the same flags as would be used in an ECN-setup SYN [RFC3168]. But at the same time, we wanted a receiver that is Re-ECT to be able to recognise that the sender is also Re-ECT. We believe setting NS=1 achieves both these objectives, as it should be ignored by vanilla ECT receivers and by ECT-Nonce receivers, but senders that are not Re-ECT should not set NS=1. {ToDo: At the time ECN was defined, the NS flag was not
defined, so setting NS=1 should be ignored by existing ECT receivers (but you never know).}

Choice of SYN ACK flags: The sender needs to be able to determine whether the receiver is Re-ECT. The original ECN specification required an ECT receiver to respond to an ECN-setup SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1. There is no room to modify this by setting the NS flag, as that is already set in the SYN ACK of an ECT-Nonce receiver. So we used the only combination of CWR and ECE that would not be used by existing TCP receivers: CWR=1 and ECE=0. The original ECN specification defines this combination as a non-ECN-setup SYN ACK, which remains true for vanilla and Nonce ECTs. But for re-ECN we define it as a Re-ECN-setup SYN ACK. We didn't use a SYN ACK with both CWR and ECE cleared to 0 because that would be the likely response from most Not-ECT receivers. And we didn't use a SYN ACK with both CWR and ECE set to 1 either, as at least one broken receiver implementation echoes whatever flags were in the SYN into its SYN ACK.

Choice of Reserved SYN ACK: {ToDo:}

4.1.4. Flow Start

Note that at the network layer a TCP SYN should have the ECN field cleared (Not-ECT). Contrary to the original ECN specification [RFC3168], a SYN ACK SHOULD have ECT set at the network layer, as currently being proposed [I-D.kuzmanovic-ecn-syn]. Specifically, a Re-ECT receiver SHOULD set the ECN field of the SYN ACK to ECT(0).

Once the TCP flow starts, the TCP sender SHOULD set ECT(0) on the first packet it sends (after the SYN). The TCP sender for each half-connection MUST then follow the above procedure for determining whether to set ECT(1) or ECT(0) at the network layer for each subsequent packet.

The first SYN and the first ECT packet in each half-connection SHOULD have the CR flag cleared to 0 at the network layer (see Section 5).
Subsequent packets MUST have the CR flag set to 1.

The above behaviour for the very first ECT packet of a flow is necessary because the transport does not have the benefit of ECN feedback. So ECT(0) is set rather than ECT(1) as a conservative estimate that there might be congestion. And the CR flag is cleared to indicate that the ECT field has been set without certain knowledge of the path.

It might seem pedantic worrying about these single packets, but this behaviour ensures the system is safe, even if the application mix on the Internet evolves to the point where the majority of flows consist of a single packet. It also allows denial of service attacks to be more easily isolated and prevented.

Note that we have said SHOULD rather than MUST for this behaviour with initial ECT packets. This is to entertain the possibility of the TCP transport having the benefit of other knowledge of the path, which it re-uses from one flow to the benefit of a newly starting flow. For instance, multiple flows between the same hosts might use a Congestion Manager [RFC3124], or a proxy host might aggregate congestion information for large numbers of flows (indeed, if Internet traffic evolves to be dominated by single-packet flows we will need plenty of these).

4.2. Other Transports

4.2.1. Guidelines for Adding Re-feedback to Other Transports

Re-ECT sender transports that have established that the receiver transport is at least ECN-capable (not necessarily Re-ECN capable) MUST set the ECT(0) codepoint at least as often as the CE codepoint arrives at the receiver. As with the current ECN protocol, whenever ECT(0) is not set, the ECT(1) codepoint should be set.

If the sender transport does not have sufficient feedback to even estimate the path's CE rate, it SHOULD set ECT(0) continuously.
If the sender transport has some, perhaps stale, feedback to estimate the path's CE rate, the transport SHOULD set ECT(0) and ECT(1) in equal proportions. Alternating them, starting with ECT(0), would be sufficient. From Equation (2) above, the sender will then be declaring to the network ingress that it believes downstream congestion is v_0 = 50% or 33% respectively. Note: these estimates are NOT used for congestion control in the Re-ECN protocol. If this is an under-declaration of the actual (but unknown) path congestion, it will simply result in a strongly positive virtual header, h, at the destination.

A Re-ECT sender transport that cannot estimate the CE rate at the receiver MUST also clear the Certain flag (CR) (see Section 5).

{ToDo: Give a brief outline of what would be expected for each of the following:

o UDP Fire and Forget (e.g. DNS)

o UDP Streaming with no F/b

o UDP Streaming with F/b

o DCCP

o RSVP and/or NSIS: A separate I-D is in preparation describing how re-ECN can be used in an edge-to-edge rather than end-to-end scenario. It can then be used by downstream networks to police whether upstream networks are blocking new flows when downstream congestion is too high, even though the congestion is in other operators' downstream networks. This relates to current work in progress on Admission Control over Diffserv using Pre-Congestion Notification, being reported to the IETF TSVWG [CL-arch].

}

5. Network Layer

The ECN field in the IP header remains unchanged. However, the semantics of the two ECT codepoints change. Previously routers and middleboxes treated these two codepoints identically, not taking any note of which codepoint was set. With the re-ECN protocol, routers and middleboxes may infer downstream or full-path congestion by monitoring the relative rates of the ECN codepoints.
Routers and middleboxes that monitor this information SHOULD ignore packets with the CR flag (defined below) cleared. The CR flag MUST NOT be used to determine the treatment of a single packet. It is only meaningful to support the monitoring of averages over multiple packets.

For IPv4, this document defines a new CR (Certain) control flag in place of the reserved control flag at bit 48 of the IPv4 header (counting from 0). Alternatively, some would call this bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4 header.

  0   1   2
+---+---+---+
| C | D | M |
| R | F | F |
+---+---+---+

Figure 5: New Definition of the Control Flags at the start of Byte 7 of the IPv4 Header

{ToDo: Include the IPv6 extension header design, including support for the CR flag. Also its integrated support for a future multi-bit congestion notification field, with a TTL hop count scheme to check that all routers on the path support it (similar to Quick-Start). So, if the whole path of routers doesn't support the extension, the end-points can fall back to re-ECN (or drop).}

Guidelines on setting the CR flag are given in Section 4.2.1.

When set, the CR flag also serves as an indication that the transports are re-ECN capable (Re-ECT). More generally, it will imply that the transport understands and is using re-feedback of other fields in the IP header, such as the TTL (see [Re-fb]), although this document does not define re-feedback behaviour for the TTL field.

{ToDo: Describe how the sender falls back to clearing this flag if packets don't appear to be getting through (to work round a firewall discarding a packet it considers unusual).}

{ToDo: We expect there will be other claims pending on the use of this flag (we know of at least one {ToDo add ref to Adams}).
There is a possibility that this flag as defined can simultaneously serve other purposes, particularly where the start of a flow needs distinguishing from packets later in the flow. For instance, it could have been useful in tag switching? It is (nearly) Clark's state set-up bit to protect against memory exhaustion attacks?}

6. Non-Issues

{ToDo: This section will explain why the addition of Re-ECN does not interact with any of the following:

o Integration with congestion notification in various link layers (Ethernet, ATM (and MPLS if it had a congestion notification capability added, which is not precluded for the EXP field [RFC3270]))

o Tunnels, and Overlays that wish to support congestion notification (see also the brief discussion of edge-to-edge support for Re-ECN in RSVP or NSIS transports earlier)

o Encryption and IPSec

}

7. Applications

7.1. Policing Per-Flow Congestion Response

7.1.1. Incentive Framework

{ToDo: This section will largely repeat the discussion in Sections 3 "Incentives" and 3.1 "The Case Against Classic Feedback" of [Re-fb], but putting it into a standardisation context.}

7.1.2. Dropper

{ToDo: This section will largely repeat the discussion in Section 3.2 "Honest Congestion Reporting" of [Re-fb], as an example implementation of a 'dropper' (but for 2-bit Re-ECN) that could be deployed at the egresses from the internetwork to detect cheating packets and flows.}

7.1.3. Per-flow Policer

{ToDo: This section will largely repeat the discussion in Section 3.3 "Fair Congestion Response" of [Re-fb], as an example implementation of a 'policer' (but for 2-bit Re-ECN) that could be deployed at the ingresses into the internetwork to detect flows not complying with the agreed rate response to congestion.}

7.1.4. Inter-domain Policing

{ToDo: This section will largely repeat the discussion in Section 3.4 "Interdomain incentive mechanisms" of [Re-fb], to give examples of how downstream networks can police the aggregate congestion response of their upstream neighbours, against different contractual arrangements. The goal is to ensure the upstream network in turn polices its upstream networks, eventually ensuring upstream networks will suffer financially if they do not police the rate response to congestion of their users.}

7.1.5. Limitations

{ToDo: This is the most critical section for the IETF audience, but unfortunately we have not had time to write it. This section will discuss the limitations of the re-feedback approach, particularly having fitted it into the one remaining spare codepoint and bit of the IP header:

o The ability of malicious users to turn off ECT, given Not-ECT traffic cannot be policed, and the implications on how we may have to treat Not-ECT traffic in the longer term.

o Re-feedback for TTL would also be desirable at the same time.

o The ability of malicious users to launch dynamically changing attacks, exploiting the time it takes to detect an attack, given ECN marking is binary.

o Truncation vs. Drop: The issue over whether it would be useful to truncate rather than drop packets that appear to be malicious, so that the feedback loop can continue even though useful data can be removed.

o The apparently inherent need for at least some flow state at the egress dropper given the binary marking environment, and the consequent vulnerability to state exhaustion attacks.

o Issues around saturation of the number range the Re-ECN protocol uses to signal marking rates (These issues are believed to be safe, but they need airing anyway).

}

7.1.6. Other Applications

{ToDo: Other applications of Re-ECN will be briefly outlined here (largely drawing from section 3 of [Re-fb]), such as:

o Per-user (rather than per-flow) long term congestion policing

o DDoS Mitigation

o E2e QoS

o Traffic Engineering

o Inter-Provider Service Monitoring

}

8. Incremental Deployment

{ToDo: This section will bring together the features related to incremental deployment in the protocol specification as defined, to describe the overall deployment strategy. Particularly, the final step described in the next paragraph is neat:}

We deliberately chose the rate of ECT(0) for z, rather than ECT(1). Existing ECN sources set ECT(0) at either 50% (the nonce) or 100% (the default). So they will appear to a re-feedback policer as very highly congested paths. When policers are first deployed they can be configured permissively, allowing through both `legacy' ECN and misbehaving re-ECN flows. Then, the more strictly the threshold is set, the more legacy ECN sources will gain by upgrading to re-ECN. Thus, towards the end of the voluntary incremental deployment period, legacy transports can be given progressively stronger encouragement to upgrade.

9. Architectural Rationale

{ToDo:}

10. Simulations

This section will refer to the simulations of policer and dropper performance done since those in section 5 "Dropper Performance" of [Re-fb]. Some are in submission to a conference.

11. Related Work

This section will largely be similar to the section 6 "Related Work" of [Re-fb], but in a more standards-related context.

12. IANA Considerations

{ToDo:} This memo includes no request to IANA (yet).

13. Security Considerations

{ToDo:}

14. Conclusions

{ToDo:}

15. Acknowledgements

{ToDo:}

16. References

16.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, April 1998.

[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

[RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit Congestion Notification (ECN) Signaling with Nonces", RFC 3540, June 2003.

16.2. Informative References

[CL-arch] Briscoe, B. and others, "A framework for admission control over Diffserv using Pre-congestion notification", draft-briscoe-tsvwg-cl-architecture-01 (work in progress), July 2005.

[I-D.kuzmanovic-ecn-syn] Kuzmanovic, A., "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", draft-kuzmanovic-ecn-syn-00 (work in progress), October 2005.

[RFC3124] Balakrishnan, H. and S. Seshan, "The Congestion Manager", RFC 3124, June 2001.

[RFC3270] Le Faucheur, F., Wu, L., Davie, B., Davari, S., Vaananen, P., Krishnan, R., Cheval, P., and J. Heinanen, "Multi-Protocol Label Switching (MPLS) Support of Differentiated Services", RFC 3270, May 2002.

[RFC3714] Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion Control for Voice Traffic in the Internet", RFC 3714, March 2004.

[Re-fb] Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C., Salvatori, A., Soppera, A., and M. Koyabe, "Policing Congestion Response in an Internetwork Using Re-Feedback", ACM SIGCOMM CCR 35(4):277-288, August 2005.
Authors' Addresses

Bob Briscoe
BT & UCL
B54/77, Adastral Park
Martlesham Heath
Ipswich IP5 3RE
UK

Phone: +44 1473 645196
Email: bob.briscoe@bt.com
URI: http://www.cs.ucl.ac.uk/staff/B.Briscoe/

Arnaud Jacquet
BT
B54/70, Adastral Park
Martlesham Heath
Ipswich IP5 3RE
UK

Phone: +44 1473 647284
Email: arnaud.jacquet@bt.com
URI:

Alessandro Salvatori
BT
B54/77, Adastral Park
Martlesham Heath
Ipswich IP5 3RE
UK

Phone: ?
Email: sandr8@gmail.com
URI: ?

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.
Disclaimer of Validity

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the Internet Society.