Congestion Exposure (ConEx) M. Kuehlewind, Ed.
Internet-Draft University of Stuttgart
Intended status: Experimental R. Scheffenegger
Expires: January 16, 2014 NetApp, Inc.
July 15, 2013

TCP modifications for Congestion Exposure


Congestion Exposure (ConEx) is a mechanism by which senders inform the network about the congestion encountered by previous packets on the same flow. This document describes the necessary modifications to use ConEx with the Transmission Control Protocol (TCP).

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 16, 2014.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents ( in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction

Congestion Exposure (ConEx) is a mechanism by which senders inform the network about the congestion encountered by previous packets on the same flow. ConEx concepts and use cases are further explained in [RFC6789]. The abstract ConEx mechanism is explained in [draft-ietf-conex-abstract-mech]. This document describes the necessary modifications to use ConEx with the Transmission Control Protocol (TCP).

ConEx is defined as a destination option for IPv6 [draft-ietf-conex-destopt]. The use of four bits have been defined, namely the X (ConEx-capable), the L (loss experienced), the E (ECN experienced) and C (credit) bit.

The ConEx signal is based on loss or Explicit Congestion Notification (ECN) marks [RFC3168] as a congestion indication. This congestion information is retrieved by the sender based on existing feedback mechanisms from the receiver to the sender in TCP.

This document describes mechanisms for both TCP with and without the Selective Acknowledgment (SACK) extension [RFC2018]. However, ConEx benefits from more accurate information about the number of packets dropped in the network. We therefore recommend using the SACK extension when using TCP with ConEx.

While loss-based congestion feedback should be minimized, ECN could actually provide more fine-grained feedback information. ConEx-based traffic measurement or management mechanism would benefit from this. Unfortunately the current ECN does not reflect multiple congestion markings which occur within the same Round-Trip Time (RTT). A more accurate feedback extension to ECN is defined in a separate document [draft-kuehlewind-tcpm-accurate-ecn], as this is also useful for other mechanisms.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2. Sender-side Modifications

A ConEx sender MUST negotiate for both SACK and ECN or the more accurate ECN feedback in the TCP handshake if these TCP extension are available at the sender. Thus a ConEx SHOULD also implement SACK and ECN. Depending on the capability of the receiver, the following operation modes exist:

A ConEx sender MUST expose all congestion information to the network according to the congestion information received by ECN or based on loss information provided by the TCP feedback loop. A TCP sender SHOULD account congestion byte-wise (and not packet-wise). A sender MUST mark subsequent packets (after the congestion notification) with the respective ConEx bit in the IP header.

With SACK only the number of lost payload bytes is known, but not the number of packets carrying these bytes. With classic ECN only an indication is given that a marking occurred which is not giving an exact number of payload bytes nor packets. As network congestion is usually byte-congestion [draft-briscoe-tsvwg-byte-pkt-mark], the exact number of bytes should be taken into account if available to make the ConEx signal as exact as possible.

The congestion accounting based on different operation modes is described in the next section and the handling of the IPv6 bits itself in the subsequent section afterwards.

3. Accounting congestion

A ConEx sender, thats accounts congestion byte-wise based on the congestion information received by ECN or loss detection provided by TCP, will maintain two different counters. These counters hold the number outstanding bytes that need to be ConEx marked either with the E bit or the L bit.

The outstanding bytes accounted based on ECN feedback information are maintained in the congestion exposure gauge (CEG). The accounting of these bytes from the ECN feedback is explained in more detail next in Section 3.1.

The outstanding bytes for congestion indications based on loss are maintained in the loss exposure gauge (LEG) and the accounting is explained in subsequent to the CEG accounting in Section 3.2.

Furthermore, those counters will be reduced every time a ConEx capable packet with the E or L bit set is sent. This is explained from both counters in Section 4.1.

Usually all bytes of an IP packet must be accounted. Therefore the sender SHOULD take the headers into account, too. If equal sized packets, or at least equally distributed packet sizes can be assumed, the sender MAY only account the TCP payload bytes. In this case there should be about the same number of ConEx marked packets as the original packets that were causing the congestion. Thus both contain about the same number of header bytes. This case is assumed in the following sections.

Otherwise if this is not the case and a sender sends different sized packets (with unequally distributed packet sizes), the sender needs to memorize or estimate the number of ECN-marked or lost packets. A sender might be able to reconstruct the number of packets and thus the header bytes if the packet sizes of the last RTT are known. Otherwise if no additional information is available the worst case number of headers should be estimated in a conservative way based on a minimum packet size (of all packets sent in the last RTT). If the number of ConEx marked packets is smaller (or larger) than the estimated number of ECN-marked or lost packets, the additional header bytes should the added to (or can be subtracted from) the respective counter.

3.1. ECN

ECN [RFC3168] is an IP/TCP mechanism that allows network nodes to mark packets with the Congestion Experienced (CE) mark instead of (early) dropping them when congestion occurs. As soon as a CE mark is seen at the receiver, with classic ECN it will feed this information back to the sender by setting the Echo Congestion Experienced (ECE) bit in the TCP header of all subsequent ACKs until a packet with Congestion Window Reduced (CWR) bit in the TCP header is received to acknowledge the reception of the congestion notification. The sender sets the CWR bit in the TCP header once when the first ECE of a congestion notification is received.

A receiver can support 'classic' ECN, a more accurate ECN feedback scheme, or neither. In the case ECN is not supported at all, of course, no ECN marks will occur, thus the E bit will never be set. Otherwise, a ConEx sender must maintain a counter, the congestion exposure gauge (CEG), for the number of outstanding bytes that have to be ConEx marked with the E bit.

The CEG is increased when ECN information is received from an ECN-capable receiver supporting the 'classic' ECN scheme or the accurate ECN feedback scheme. When the ConEx sender receives an ACK indicating one or more segments were received with a CE mark, CEG is increased by the appropriate number of bytes.

Unfortunately in case of duplicate acknowledgements the number of newly acknowledged bytes will be zero even though (CE marked) data has been received. Therefore, we increase the CEG by DeliveredData, as defined below:

DeliveredData covers the number of bytes which has been newly delivered to the receiver. Therefore on each arrival of an ACK, DeliveredData will be calculated by the newly acknowledged bytes (acked_bytes) as indicated by the current ACK, relative to all past ACKs. Moreover with SACK, DeliveredData is increased by the number of bytes provided by (new) SACK information (SACK_diff). Note, if less unacknowledged bytes are announced in the new SACK information than in the previous ACK, SACK_diff can be negative. In this case, data is newly acknowledged (in acked_byte), that has previously already been accounted to DeliveredData based on SACK information. Without SACK, DeliveredData is estimated to be 1 SMSS on duplicate acknowledgements. For the subsequent partial or full ACK, DeliveredData is estimated to be the the newly acknowledged bytes, minus one SMSS for each preceding duplicate ACK.

DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS - (is_after_dup)*num_dup*1SMSS

Thus is_dup is one if the current ACK is a duplicated ACK without SACK, and zero otherwise. is_after_dup is only one for the next full or partial ACK after a number of duplicated ACKs without SACK and num_dup counts the number of duplicated ACKs in a row.

The two cases, with and without more accurate ECN depending on the receiver capability, are discussed in the following sections.

3.1.1. Accurate ECN feedback

With a more accurate ECN feedback scheme either the number of marked packets/received CE marks or directly the number of marked bytes is known. In the later case the CEG can directly be increased by the number of marked bytes. Otherwise if D is assumed to be the number of marks, the gauge CEG has to be increased by the amount of bytes sent which were marked:

CEG += min(SMSS*D, DeliveredData)

3.1.2. Classic ECN support

A ConEx sender that communicates with a classic ECN receiver (conforming to [RFC3168] or [RFC5562]) MAY run in one of these modes:

3.2. Loss Detection with/without SACK

For all the data segments that are determined by a ConEx sender as lost, (at least) the same number of TCP payload bytes MUST be be sent with the ConEx L bit set. Loss detection typically happens by use of duplicate ACKs, or the firing of the retransmission timer. A ConEx sender MUST maintain a loss exposure gauge (LEG), indicating the number of outstanding bytes that must be sent with the ConEx L bit. When a data segment is retransmitted, LEG will be increased by the size of the TCP payload bytes containing the retransmission, assuming equal sized segments such that the retransmitted packet will have the same number of header as the original ones. When sending subsequent segments, the ConEx L bit is set as long as LEG is positive, and LEG is decreased by the size of the sent TCP payload bytes with the ConEx L bit set.

Any retransmission may be spurious. To accommodate that, a ConEx sender SHOULD make use of heuristics to detect such spurious retransmissions (e.g. F-RTO [RFC5682], DSACK [RFC3708], and Eifel [RFC3522], [RFC4015]). When such a heuristic has determined, that a certain number of packets were retransmitted erroneously, the ConEx sender should subtract the payload size of these TCP packets from LEG.

Note that the above heuristics delays the ConEx signal by one segment, and also decouples them from the retransmissions themselves, as some control packets (e.g. pure ACKs, window probes, or window updates) may be sent in between data segment retransmissions. A simpler approach would be to set the ConEx signal for each retransmitted data segment. However, it is important to remember, that a ConEx signal and TCP segments do not natively belong together.

If SACK is not available or SACK information has been reset for any reason, spurious retransmission are more likely. In this case it might be valuable to slightly delay the ConEx loss feedback until a spurious retransmission might be detected. But the ConEx signal MUST NOT be delayed more than one RTT.

4. Setting the ConEx Bits

ConEx is defined as a destination option for IPv6 [draft-ietf-conex-destopt]. The use of four bits have been defined, namely the X (ConEx-capable), the L (loss experienced), the E (ECN experienced) and C (credit) bit.

By setting the X bit a packet is marked as ConEx-capable. All packets carrying payload MUST be marked with the X bit set including retransmissions. No congestion feedback information are available about control packets as pure ACKs which are not carrying any payload. Thus these packet should not be taken into account when determining ConEx information. These packet MUST carry a ConEx Destination Option with the X bit unset.

4.1. Setting the E and the L Bit

As long as the CEG or LEG counter is positive, ConEx-capable packets SHOULD be marked with E or L respectively, and the CEG or LEG counter is decreased by the TCP payload bytes carried in this packet. If the CEG or LEG counter is negative, the respective counter SHOULD be reset to zero within one RTT after it was decreased the last time or one RTT after recovery if no further congestion occurred.

4.2. Credit Bits

The ConEx abstract mechanism requires that the transport SHOULD signal sufficient credit in advance to cover any reasonably expected congestion during its feedback delay. To be very conservative the number of credits would need to equal the number of packets in flight, as every packet could get lost or congestion marked. With a more moderate view, only an increase in the sending rate should cause loss while the number of ECN markings within one RTT depends on parameterization of the used Active Queue management (AQM). The average or maximum number of ECN marks per congestion event could potentially be estimated over time. This case is not further expanded here.

In TCP Slow Start the sending rate will increase exponentially and that means double every RTT. Thus the number of credits should equal at least half the number of packets in flight in every RTT. If the used AQM is not overly aggressive with ECN marking, maintaining the number of credit as half the number of packets in flight should be sufficient for both, congestion signaled by loss or ECN. Under the assumption that all ConEx marks will not get invalid for the whole Slow Start phase, marks of a previous RTT have to be added up. Thus the marking of every fourth packet will allow sufficient credits in Slow Start as it can be seen in Figure Figure 1.

RTT1  |------XC------>|
      |------X------->|   credit=1  in_flight=3
      |               |
RTT2  |------X------->|
      |------XC------>|   credit=3  in_flight=6
      |               |
RTT3  |------X------->|
      |------XC------>|   credit=6  in_flight=12
      |      .        |
      |      :        |

Figure 1: Credits in Slow Start (with an initial window of 3)

Moreover, a ConEx sender should maintain a counter of the sent credits c. In Congestion Avoidance phase, the sender should to monitor the number of packets in flight f. If f every gets larger than c, the ConEx sender should send new credits.

The audit might loose state due to e.g. rerouting or memory limitation. Therefore, the sender needs to detect this case and resend credits. Thus a ConEx sender should reset the credit count c if losses occur in two subsequent RTTs (assuming that the sending rate was correctly reduced based on the received congestion signal).

4.3. Loss of ConEx information

The audit can have wrong information if e.g. ConEx got lost on the channel (or a wrong number of ConEx marking has been estimated by the sender due to a lack of feedback information). In this case the audit might penalize a sender wrongly. The ConEx sender should detect this case and send further credits which should solve the situation (see Section 4.2).

5. Timeliness of the ConEx Signals

ConEx signals will anyway be evaluated with a slight time delay of about one RTT by a network node. Therefore, it is not absolutely necessary to immediately signal ConEx bits when they become known (e.g. L and E bits), but a sender SHOULD sent the ConEx signaling with the next available packet. In cases where it is preferable to slightly delay the ConEx signal, the sender MUST NOT delay the ConEx signal more than one RTT.

Multiple ConEx bits may become available for signaling at the same time, for example when an ACK is received by the sender, that indicates that at least one segment has been lost, and that one or more ECN marks were received at the same time. This may happen during excessive congestion, where buffer queues overflow and some packets are marked, while others have to be dropped nevertheless. Another possibility when this may happen are lost ACKs, so that a subsequent ACK carries summary information not previously available to the sender. As ConEx-capable packet can carry different ConEx marks at the same time, these information do not need to be distributed over several packets and thus can be sent without further delay.

6. Acknowledgements

The authors would like to thank Bob Briscoe who contributed with this initial ideas and valuable feedback. Moreover, thanks to Jana Iyengar who provided valuable feedback.

7. IANA Considerations

This document does not have any requests to IANA.

8. Security Considerations

With some of the advanced ECN compatibility modes it is possible to miss congestion notifications. Thus a sender will not decrease its sending rate. If the congestion is persistent, the likelihood to receive a congestion notification increases. In the worst case the sender will still react correctly to loss. This will prevent a congestion collapse.

9. References

9.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP Selective Acknowledgment Options", RFC 2018, October 1996.
[RFC5681] Allman, M., Paxson, V. and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009.
[draft-ietf-conex-destopt] Krishnan, S., Kuehlewind, M. and C. Ucendo, "IPv6 Destination Option for ConEx", Internet-Draft draft-ietf-conex-destopt-04, March 2013.
[draft-ietf-conex-abstract-mech] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) Concepts and Abstract Mechanism", Internet-Draft draft-ietf-conex-abstract-mech-06, October 2012.

9.2. Informative References

[RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S. and K. Ramakrishnan, "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, June 2009.
[RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm for TCP", RFC 3522, April 2003.
[RFC3708] Blanton, E. and M. Allman, "Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream Control Transmission Protocol (SCTP) Duplicate Transmission Sequence Numbers (TSNs) to Detect Spurious Retransmissions", RFC 3708, February 2004.
[RFC4015] Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm for TCP", RFC 4015, February 2005.
[RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K. and M. Hata, "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts with TCP", RFC 5682, September 2009.
[RFC6789] Briscoe, B., Woundy, R. and A. Cooper, "Congestion Exposure (ConEx) Concepts and Use Cases", RFC 6789, December 2012.
[I-D.briscoe-tsvwg-re-ecn-tcp] Briscoe, B., Jacquet, A., Moncaster, T. and A. Smith, "Re-ECN: Adding Accountability for Causing Congestion to TCP/IP", Internet-Draft draft-briscoe-tsvwg-re-ecn-tcp-09, October 2010.
[draft-kuehlewind-tcpm-accurate-ecn] Kuehlewind, M. and R. Scheffenegger, "More Accurate ECN Feedback in TCP", Internet-Draft draft-kuehlewind-tcpm-accurate-ecn-02, Jun 2013.
[draft-briscoe-tsvwg-byte-pkt-mark] Briscoe, B. and J. Manner, "Byte and Packet Congestion Notification", Internet-Draft draft-briscoe-tsvwg-byte-pkt-mark-010, May 2013.
[DCTCP] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S. and M. Sridharan, "DCTCP: Efficient Packet Transport for the Commoditized Data Center", Jan 2010.

Appendix A. Revision history

RFC Editior: This section is to be removed before RFC publication.

00 ... initial draft, early submission to meet deadline.

01 ... refined draft, updated LEG "drain" from per-packet to RTT-based.

02 ... added Section 4.3 and expanded discussion about ECN interaction.

03 ... expanded the discussion around credit bits.

04 ... review comments of Jana addressed. (Change in full compliance mode.)

Authors' Addresses

Mirja Kuehlewind (editor) University of Stuttgart Pfaffenwaldring 47 Stuttgart, 70569 Germany EMail:
Richard Scheffenegger NetApp, Inc. Am Euro Platz 2 Vienna, 1120 Austria Phone: +43 1 3676811 3146 EMail: