More Accurate ECN Feedback in
TCPIndependentUKietf@bobbriscoe.nethttp://bobbriscoe.net/EricssonGermanyietf@kuehlewind.netNetAppViennaAustriaRichard.Scheffenegger@netapp.com
Transport
TCP Maintenance & Minor Extensions (tcpm)Congestion Control and ManagementCongestion NotificationFeedbackReliableOrderedProtocolECNExplicit Congestion Notification (ECN) is a mechanism where network
nodes can mark IP packets instead of dropping them to indicate incipient
congestion to the end-points. Receivers with an ECN-capable transport
protocol feed back this information to the sender. ECN is specified for
TCP in such a way that only one feedback signal can be transmitted per
Round-Trip Time (RTT). Recent new TCP mechanisms like Congestion
Exposure (ConEx), Data Center TCP (DCTCP) or Low Latency Low Loss
Scalable Throughput (L4S) need more accurate ECN feedback information
whenever more than one marking is received in one RTT. This document
specifies a scheme to provide more than one feedback signal per RTT in
the TCP header. Given TCP header space is scarce, it allocates a
reserved header bit, that was previously used for the ECN-Nonce which
has now been declared historic. It also overloads the two existing ECN
flags in the TCP header. The resulting extra space is exploited to feed
back the IP-ECN field received during the 3-way handshake as well.
Supplementary feedback information can optionally be provided in a new
TCP option, which is never used on the TCP SYN.Explicit Congestion Notification (ECN) is a
mechanism where network nodes can mark IP packets instead of dropping
them to indicate incipient congestion to the end-points. Receivers with
an ECN-capable transport protocol feed back this information to the
sender. In RFC 3168, ECN was specified for TCP in such a way that only
one feedback signal could be transmitted per Round-Trip Time (RTT).
Recently, proposed mechanisms like Congestion Exposure (ConEx ), DCTCP or L4S need to know when more than one
marking is received in one RTT which is information that cannot be
provided by the feedback scheme as specified in . This document specifies an update to the ECN
feedback scheme of RFC 3168 that provides more accurate information and
could be used by these and potentially other future TCP extensions. A
fuller treatment of the motivation for this specification is given in
the associated requirements document .This documents specifies a standards track scheme for ECN feedback in
the TCP header to provide more than one feedback signal per RTT. It will
be called the more accurate ECN feedback scheme, or AccECN for short.
This document updates RFC 3168 with respect to negotiation and use of
the feedback scheme for TCP. All aspects of RFC 3168 other than the TCP
feedback scheme, in particular the definition of ECN at the IP layer,
remain unchanged by this specification. gives a more detailed specification of
exactly which aspects of RFC 3168 this document updates.AccECN is intended to be a complete replacement for classic TCP/ECN
feedback, not a fork in the design of TCP. AccECN feedback complements
TCP's loss feedback and it can coexist alongside 'classic' TCP/ECN feedback. So its applicability is intended to
include all public and private IP networks (and even any non-IP networks
over which TCP is used today), whether or not any nodes on the path
support ECN, of whatever flavour. This document uses the term Classic
ECN when it needs to distinguish the RFC 3168 ECN TCP feedback scheme
from the AccECN TCP feedback scheme.AccECN feedback overloads the two existing ECN flags in the TCP
header and allocates the currently reserved flag (previously called NS)
in the TCP header, to be used as one three-bit counter field indicating
the number of congestion experienced marked packets. Given the new
definitions of these three bits, both ends have to support the new wire
protocol before it can be used. Therefore during the TCP handshake the
two ends use these three bits in the TCP header to negotiate the most
advanced feedback protocol that they can both support, in a way that is
backward compatible with .AccECN is solely a change to the TCP wire protocol; it covers the
negotiation and signaling of more accurate ECN feedback from a TCP Data
Receiver to a Data Sender. It is completely independent of how TCP might
respond to congestion feedback, which is out of scope, but ultimately
the motivation for accurate ECN feedback. Like Classic ECN feedback,
AccECN can be used by standard Reno congestion control to respond to the existence of at least one
congestion notification within a round trip. Or, unlike Reno, AccECN can
be used to respond to the extent of congestion notification over a round
trip, as for example DCTCP does in controlled environments . For congestion response, this specification refers
to RFC 3168, or ECN experiments such as those referred to in , namely: a TCP-based Low Latency Low Loss Scalable
(L4S) congestion control ; or
Alternative Backoff with ECN (ABE) .It is recommended that the AccECN protocol is implemented alongside
SACK and the experimental ECN++ protocol , which allows the ECN
capability to be used on TCP control packets. Therefore, this
specification does not discuss implementing AccECN alongside , which was an earlier experimental protocol with
narrower scope than ECN++.The following introductory section outlines the goals of AccECN
(). Then terminology is defined () and a recap of existing prerequisite
technology is given (). gives an informative overview of
the AccECN protocol. Then gives the
normative protocol specification, and clarifies which aspects of RFC 3168 are
updated by this specification. assesses the interaction of AccECN
with commonly used variants of TCP, whether standardized or not. summarizes the features and properties of
AccECN. summarizes the protocol
fields and numbers that IANA will need to assign and points to the aspects of the
protocol that will be of interest to the security community. gives pseudocode examples for
the various algorithms that AccECN uses and explains why AccECN uses flags in
the main TCP header and quantifies the space left for future use. enumerates requirements that a candidate
feedback scheme will need to satisfy, under the headings: resilience,
timeliness, integrity, accuracy (including ordering and lack of bias),
complexity, overhead and compatibility (both backward and forward). It
recognizes that a perfect scheme that fully satisfies all the
requirements is unlikely and trade-offs between requirements are
likely. presents the properties of
AccECN against these requirements and discusses the trade-offs
made.The requirements document recognizes that a protocol as ubiquitous
as TCP needs to be able to serve as-yet-unspecified requirements.
Therefore an AccECN receiver aims to act as a generic (dumb) reflector
of congestion information so that in future new sender behaviours can
be deployed unilaterally.The more accurate ECN feedback scheme will
be called AccECN for short.the ECN protocol specified in .the feedback aspect of the ECN
protocol specified in , including
generation, encoding, transmission and decoding of feedback, but
not the Data Sender's subsequent response to that feedback.A TCP acknowledgement, with or without a data
payload (ACK=1).A TCP acknowledgement without a data
payload.A packet or segment
that passes the acceptability tests in
and .The TCP stack that originates a
connection.The TCP stack that responds to a
connection request.The endpoint of a TCP half-connection
that receives data and sends AccECN feedback.The endpoint of a TCP half-connection
that sends data and receives AccECN feedback.The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14 when, and only when, they
appear in all capitals, as shown here.ECN uses two bits in the IP header. Once
ECN has been negotiated with the receiver at the transport layer, an
ECN sender can set two possible codepoints (ECT(0) or ECT(1)) in the
IP header to indicate an ECN-capable transport (ECT). If both ECN bits are
zero, the packet is considered to have been sent by a Not-ECN-capable
Transport (Not-ECT). When a network node experiences congestion, it
will occasionally either drop or mark a packet, with the choice
depending on the packet's ECN codepoint. If the codepoint is Not-ECT,
only drop is appropriate. If the codepoint is ECT(0) or ECT(1), the
node can mark the packet by setting both ECN bits, which is termed
'Congestion Experienced' (CE), or loosely a 'congestion mark'. summarises these codepoints.IP-ECN codepointCodepoint nameDescription0b00Not-ECTNot ECN-Capable Transport0b01ECT(1)ECN-Capable Transport (1)0b10ECT(0)ECN-Capable Transport (0)0b11CECongestion ExperiencedIn the TCP header the first two bits in byte 14 are defined as
flags for the use of ECN (CWR and ECE in ). A TCP client
indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
the SYN/ACK. On reception of a CE-marked packet at the IP layer, the
Data Receiver starts to set the Echo Congestion Experienced (ECE) flag
continuously in the TCP header of ACKs, which ensures the signal is
received reliably even if ACKs are lost. The TCP sender confirms that
it has received at least one ECE signal by responding with the
congestion window reduced (CWR) flag, which allows the TCP receiver to
stop repeating the ECN-Echo flag. This always leads to a full RTT of
ACKs with ECE set. Thus any additional CE markings arriving within
this RTT cannot be fed back.The last bit in byte 13 of the TCP header was defined as the Nonce
Sum (NS) for the ECN Nonce . In the absence of
widespread deployment RFC 3540 has been reclassified as historic and the respective flag has been marked as
"reserved", making this TCP flag available for use by the AccECN
experiment instead.This section provides an informative overview of the AccECN protocol
that will be normatively specified in Like the original TCP approach, the Data Receiver of each TCP
half-connection sends AccECN feedback to the Data Sender on TCP
acknowledgements, reusing data packets of the other half-connection
whenever possible.The AccECN protocol has had to be designed in two parts:an essential part that re-uses ECN TCP header bits to feed back
the number of arriving CE marked packets. This provides more
accuracy than classic ECN feedback, but limited resilience against
ACK loss;a supplementary part using a new AccECN TCP Option that provides
additional feedback on the number of bytes that arrive marked with
each of the three ECN codepoints (not just CE marks). This provides
greater resilience against ACK loss than the essential feedback, but
it is more likely to suffer from middlebox interference. The two part design was necessary, given limitations on the
space available for TCP options and given the possibility that certain
incorrectly designed middleboxes prevent TCP using any new options.The essential part overloads the previous definition of the three
flags in the TCP header that had been assigned for use by ECN. This
design choice deliberately replaces the classic ECN feedback protocol,
rather than leaving classic ECN feedback intact and adding more accurate
feedback separately because:this efficiently reuses scarce TCP header space, given TCP option
space is approaching saturation;a single upgrade path for the TCP protocol is preferable to a
fork in the design;otherwise classic and accurate ECN feedback could give
conflicting feedback on the same segment, which could open up new
security concerns and make implementations unnecessarily
complex;middleboxes are more likely to faithfully forward the TCP ECN
flags than newly defined areas of the TCP header.AccECN is designed to work even if the supplementary part is removed
or zeroed out, as long as the essential part gets through.AccECN is a change to the wire protocol of the main TCP header,
therefore it can only be used if both endpoints have been upgraded to
understand it. The TCP client signals support for AccECN on the
initial SYN of a connection and the TCP server signals whether it
supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
client uses to signal AccECN support have been carefully chosen so
that a TCP server will interpret them as a request to support the most
recent variant of ECN feedback that it supports. Then the client falls
back to the same variant of ECN feedback.An AccECN TCP client does not send the new AccECN Option on the SYN
as SYN option space is limited. The TCP server sends the AccECN Option
on the SYN/ACK and the client sends it on the first ACK to test
whether the network path forwards the option correctly.A Data Receiver maintains four counters initialized at the start of
the half-connection. Three count the number of arriving payload bytes
marked CE, ECT(1) and ECT(0) respectively. The fourth counts the
number of packets arriving marked with a CE codepoint (including
control packets without payload if they are CE-marked).The Data Sender maintains four equivalent counters for the half
connection, and the AccECN protocol is designed to ensure they will
match the values in the Data Receiver's counters, albeit after a
little delay.Each ACK carries the three least significant bits (LSBs) of the
packet-based CE counter using the ECN bits in the TCP header, now
renamed the Accurate ECN (ACE) field (see later). The 24 LSBs of each byte counter
are carried in the AccECN Option.With both the ACE and the AccECN Option mechanisms, the Data
Receiver continually repeats the current LSBs of each of its
respective counters. There is no need to acknowledge these continually
repeated counters, so the congestion window reduced (CWR) mechanism is
no longer used. Even if some ACKs are lost, the Data Sender should be
able to infer how much to increment its own counters, even if the
protocol field has wrapped.The 3-bit ACE field can wrap fairly frequently. Therefore, even if
it appears to have incremented by one (say), the field might have
actually cycled completely then incremented by one. The Data Receiver
is not allowed to delay sending an ACK to such an extent that the ACE
field would cycle. However cycling is still a possibility at the Data
Sender because a whole sequence of ACKs carrying intervening values of
the field might all be lost or delayed in transit.The fields in the AccECN Option are larger, but they will increment
in larger steps because they count bytes not packets. Nonetheless,
their size has been chosen such that a whole cycle of the field would
never occur between ACKs unless there had been an infeasibly long
sequence of ACK losses. Therefore, as long as the AccECN Option is
available, it can be treated as a dependable feedback channel.If the AccECN Option is not available, e.g. it is being stripped by
a middlebox, the AccECN protocol will only feed back information on CE
markings (using the ACE field). Although not ideal, this will be
sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
will ever indicate more severe congestion than CE, even though future
uses for ECT(0) or ECT(1) are still unclear .
Because the 3-bit ACE field is so small, when it is the only field
available the Data Sender has to interpret it assuming the most likely
wrap, but with a degree of conservatism.Certain specified events trigger the Data Receiver to include an
AccECN Option on an ACK. The rules are designed to ensure that the
order in which different markings arrive at the receiver is
communicated to the sender (as long as options are reaching the sender
and as long as there is no ACK loss). Implementations are encouraged
to send an AccECN Option more frequently, but this is left up to the
implementer.The CE packet counter in the ACE field and the CE byte counter in
the AccECN Option both provide feedback on received CE-marks. The CE
packet counter includes control packets that do not have payload data,
while the CE byte counter solely includes marked payload bytes. If
both are present, the byte counter in the option will provide the more
accurate information needed for modern congestion control and policing
schemes, such as L4S, DCTCP or ConEx. If the option is stripped, a
simple algorithm to estimate the number of marked bytes from the ACE
field is given in .Feedback in bytes is recommended in order to protect against the
receiver using attacks similar to 'ACK-Division' to artificially
inflate the congestion window, which is why
now recommends that TCP counts acknowledged bytes not packets.The ACE field provides information about CE markings on both data
and control packets. According to the Data
Sender is meant to set control packets to Not-ECT. However, mechanisms
in certain private networks (e.g. data centres) set control packets to
be ECN capable because they are precisely the packets that performance
depends on most.For this reason, AccECN is designed to be a generic reflector of
whatever ECN markings it sees, whether or not they are compliant with
a current standard. Then as standards evolve, Data Senders can upgrade
unilaterally without any need for receivers to upgrade too. It is also
useful to be able to rely on generic reflection behaviour when senders
need to test for unexpected interference with markings (for instance
, and of the present document and para 2
of Section 20.2 of ).The initial SYN is the most critical control packet, so AccECN
provides feedback on its ECN marking. Although RFC 3168 prohibits an
ECN-capable SYN, providing feedback of ECN marking on the SYN supports
future scenarios in which SYNs might be ECN-enabled (without
prejudging whether they ought to be). For instance, updates this aspect of RFC 3168 to allow
experimentation with ECN-capable TCP control packets.Even if the TCP client (or server) has set the SYN (or SYN/ACK) to
not-ECT in compliance with RFC 3168, feedback on the state of the ECN
field when it arrives at the receiver could still be useful, because
middleboxes have been known to overwrite the ECN IP field as if it is
still part of the old Type of Service (ToS) field . If a TCP client has set the SYN to Not-ECT,
but receives feedback that the ECN field on the SYN arrived with a
different codepoint, it can detect such middlebox interference and
send Not-ECT for the rest of the connection. Today, if a TCP server
receives ECT or CE on a SYN, it cannot know whether it is invalid (or
valid) because only the TCP client knows whether it originally marked
the SYN as Not-ECT (or ECT). Therefore, prior to AccECN, the server's
only safe course of action was to disable ECN for the connection.
Instead, the AccECN protocol allows the server to feed back the
received ECN field to the client, which then has all the information
to decide whether the connection has to fall-back from supporting ECN
(or not).Given the ECN Nonce has been
reclassified as historic , the present
specification re-allocates the TCP flag at bit 7 of the TCP header,
which was previously called NS (Nonce Sum), as the AE (Accurate ECN)
flag (see IANA Considerations in ) as shown below.During the TCP handshake at the start of a connection, to request
more accurate ECN feedback the TCP client (host A) MUST set the TCP
flags AE=1, CWR=1 and ECE=1 in the initial SYN segment.If a TCP server (B) that is AccECN-enabled receives a SYN with
the above three flags set, it MUST set both its half connections
into AccECN mode. Then it MUST set the TCP flags on the SYN/ACK to
one of the 4 values shown in the top block of to confirm that it supports
AccECN. The TCP server MUST NOT set one of these 4 combination of
flags on the SYN/ACK unless the preceding SYN requested support for
AccECN as above.A TCP server in AccECN mode MUST set the AE, CWR and ECE TCP
flags on the SYN/ACK to the value in that feeds back the IP-ECN field
that arrived on the SYN. This applies whether or not the server
itself supports setting the IP-ECN field on a SYN or SYN/ACK (see
for rationale).Once a TCP client (A) has sent the above SYN to declare that it
supports AccECN, and once it has received the above SYN/ACK segment
that confirms that the TCP server supports AccECN, the TCP client
MUST set both its half connections into AccECN mode.Once in AccECN mode, a TCP client or server has the rights and
obligations to participate in the ECN protocol defined in .The procedure for the client to follow if a SYN/ACK does not
arrive before its retransmission timer expires is given in .The three flags set to 1 to indicate AccECN support on the SYN
have been carefully chosen to enable natural fall-back to prior
stages in the evolution of ECN, as above. tabulates all the negotiation
possibilities for ECN-related capabilities that involve at least one
AccECN-capable host. The entries in the first two columns have been
abbreviated, as follows: More Accurate ECN Feedback (the present
specification)ECN Nonce feedback 'Classic' ECN feedback Not-ECN-capable. Implicit congestion
notification using packet drop.ABSYN A->BSYN/ACK B->AFeedback ModeAE CWR ECEAE CWR ECEAccECNAccECN1 1 10 1 0AccECN (no ECT on SYN)AccECNAccECN1 1 10 1 1AccECN (ECT1 on SYN)AccECNAccECN1 1 11 0 0AccECN (ECT0 on SYN)AccECNAccECN1 1 11 1 0AccECN (CE on SYN)AccECNNonce1 1 11 0 1(Reserved)AccECNECN1 1 10 0 1classic ECNAccECNNo ECN1 1 10 0 0Not ECNNonceAccECN0 1 10 0 1classic ECNECNAccECN0 1 10 0 1classic ECNNo ECNAccECN0 0 00 0 0Not ECNAccECNBroken1 1 11 1 1Not ECN is divided into blocks
each separated by an empty row.The top block shows the case already described in where both endpoints support
AccECN and how the TCP server (B) indicates congestion
feedback.The second block shows the cases where the TCP client (A)
supports AccECN but the TCP server (B) supports some earlier
variant of TCP feedback, indicated in its SYN/ACK. Therefore, as
soon as an AccECN-capable TCP client (A) receives the SYN/ACK
shown it MUST set both its half connections into the feedback
mode shown in the rightmost column. If it has set itself into
classic ECN feedback mode it MUST then comply with .The server response
called 'Nonce' in the table is now historic. For an AccECN
implementation, there is no need to recognize or support ECN
Nonce feedback , which has been
reclassified as historic . AccECN is
compatible with alternative ECN feedback integrity approaches
(see ).The third block shows the cases where the TCP server (B)
supports AccECN but the TCP client (A) supports some earlier
variant of TCP feedback, indicated in its SYN.When an AccECN-enabled TCP server (B) receives a
SYN with AE,CWR,ECE = 0,1,1 it MUST do one of the
following:set both its half connections into the classic ECN
feedback mode and return a SYN/ACK with AE, CWR, ECE = 0,0,1
as shown. Then it MUST comply with .set both its half-connections into No ECN mode and return
a SYN/ACK with AE,CWR,ECE = 0,0,0, then continue with ECN
disabled. This latter case is unlikely to be desirable, but
it is allowed as a possibility, e.g. for minimal TCP
implementations.When an AccECN-enabled TCP server (B) receives a SYN
with AE,CWR,ECE = 0,0,0 it MUST set both its half connections
into the Not ECN feedback mode, return a SYN/ACK with AE,CWR,ECE
= 0,0,0 as shown and continue with ECN disabled.The fourth block displays a combination labelled `Broken'.
Some older TCP server implementations incorrectly set the
reserved flags in the SYN/ACK by reflecting those in the SYN.
Such broken TCP servers (B) cannot support ECN, so as soon as an
AccECN-capable TCP client (A) receives such a broken SYN/ACK it
MUST fall back to Not ECN mode for both its half connections and
continue with ECN disabled.The following additional rules do not fit the structure of the
table, but they complement it:An originating AccECN Host (A),
having sent a SYN with AE=1, CWR=1 and ECE=1, might receive
another SYN from host B. Host A MUST then enter the same
feedback mode as it would have entered had it been a responding
host and received the same SYN. Then host A MUST send the same
SYN/ACK as it would have sent had it been a responding host.Many TCP
implementations create a new TCP connection if they receive an
in-window SYN packet during TIME-WAIT state. When a TCP host
enters TIME-WAIT or CLOSED state, it should ignore any previous
state about the negotiation of AccECN for that connection and
renegotiate the feedback mode according to .If a TCP server that implements AccECN receives a SYN with the
three TCP header flags (AE, CWR and ECE) set to any combination
other than 000, 011 or 111, it MUST negotiate the use of AccECN as
if they had been set to 111. This ensures that future uses of the
other combinations on a SYN can rely on consistent behaviour from
the installed base of AccECN servers.For the avoidance of doubt, the behaviour described in the
present specification applies whether or not the three remaining
reserved TCP header flags are zero.If the sender of an AccECN SYN times out before receiving the
SYN/ACK, the sender SHOULD attempt to negotiate the use of AccECN at
least one more time by continuing to set all three TCP ECN flags on
the first retransmitted SYN (using the usual retransmission
time-outs). If this first retransmission also fails to be
acknowledged, the sender SHOULD send subsequent retransmissions of
the SYN with the three TCP-ECN flags cleared (AE=CWR=ECE=0). A
retransmitted SYN MUST use the same ISN as the original SYN.Retrying once before fall-back adds delay in the case where a
middlebox drops an AccECN (or ECN) SYN deliberately. However,
current measurements imply that a drop is less likely to be due to
middlebox interference than other intermittent causes of loss, e.g.
congestion, wireless interference, etc.Implementers MAY use other fall-back strategies if they are found
to be more effective (e.g. attempting to negotiate AccECN on the SYN
only once or more than twice (most appropriate during high levels of
congestion). However, other fall-back strategies will need to follow
all the rules in ,
which concern behaviour when SYNs or SYN/ACKs negotiating different
types of feedback have been sent within the same connection.Further it may make sense to also remove any other new or
experimental fields or options on the SYN in case a middlebox might
be blocking them, although the required behaviour will depend on the
specification of the other option(s) and any attempt to co-ordinate
fall-back between different modules of the stack.Whichever fall-back strategy is used, the TCP initiator SHOULD
cache failed connection attempts. If it does, it SHOULD NOT give up
attempting to negotiate AccECN on the SYN of subsequent connection
attempts until it is clear that the blockage is persistently and
specifically due to AccECN. The cache should be arranged to expire
so that the initiator will infrequently attempt to check whether the
problem has been resolved.The fall-back procedure if the TCP server receives no ACK to
acknowledge a SYN/ACK that tried to negotiate AccECN is specified in
. describes the only ways
that a host can enter AccECN mode, whether as a client or as a
server.As a Data Sender, a host in AccECN mode has the rights and
obligations concerning the use of ECN defined below, which build on
those in as updated by :Using ECT:It can set an ECT codepoint in the IP header of packets
to indicate to the network that the transport is capable and
willing to participate in ECN for this packet.It does not have to set ECT on any packet (for instance
if it has reason to believe such a packet would be
blocked).Switching feedback negotiation (e.g. fall-back):It SHOULD NOT set ECT on any packet if it has received at
least one valid SYN or Acceptable SYN/ACK with AE=CWR=ECE=0.
A "valid SYN" has the same port numbers and the same ISN as
the SYN that caused the server to enter AccECN mode.It MUST NOT send an ECN-setup SYN within the same connection as it has sent
a SYN requesting AccECN feedback.It MUST NOT send an ECN-setup SYN/ACK within the same connection as it has sent
a SYN/ACK agreeing to use AccECN feedback.The above rules are necessary because, when one peer
negotiates the feedback mode in two different types of
handshake, it is not possible for the other peer to know for
certain which handshake packet(s) the other end eventually
receives or in which order it receives them. So the two peers
can end up using difference feedback modes without knowing
it.Congestion response:It is still obliged to respond appropriately to AccECN
feedback with congestion indications on packets it had
previously sent, as defined in Section 6.1 of and updated by Sections 2.1 and 4.1 of
.The commitment to respond appropriately to incoming
indications of congestion remains even if it sends a SYN
packet with AE=CWR=ECE=0, in a later transmission within the
same TCP connection.Unlike an RFC 3168 data sender, it MUST NOT set CWR to
indicate it has received and responded to indications of
congestion (for the avoidance of doubt, this does not
preclude it from setting the bits of the ACE counter field,
which includes an overloaded use of the same bit).As a Data Receiver:a host in AccECN mode MUST feed back the information in the
IP-ECN field on incoming packets using Accurate ECN feedback, as
specified in below.if it receives an ECN-setup SYN or ECN-setup SYN/ACK during the same connection as it receives a
SYN requesting AccECN feedback or a SYN/ACK agreeing to use
AccECN feedback, it MUST reset the connection with a RST
packet.If for any reason it is not willing to provide ECN feedback
on a particular TCP connection, to indicate this unwillingness
it SHOULD clear the AE, CWR and ECE flags in all SYN and/or
SYN/ACK packets that it sends.it MUST NOT use reception of packets with ECT set in the
IP-ECN field as an implicit signal that the peer is ECN-capable.
Reason: ECT at the IP layer does not explicitly confirm the peer
has the correct ECN feedback logic, and the packets could have
been mangled at the IP layer.Each Data Receiver of each half connection maintains four counters,
r.cep, r.ceb, r.e0b and r.e1b:The Data Receiver MUST increment the CE packet counter (r.cep),
for every Acceptable packet that it receives with the CE code
point in the IP ECN field, including CE marked control packets but
excluding CE on SYN packets (SYN=1; ACK=0).The Data Receiver MUST increment the r.ceb, r.e0b or r.e1b byte
counters by the number of TCP payload octets in Acceptable packets
marked respectively with the CE, ECT(0) and ECT(1) codepoint in
their IP-ECN field, including any payload octets on control
packets, but not including any payload octets on SYN packets
(SYN=1; ACK=0).Each Data Sender of each half connection maintains four counters,
s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent
counters at the Data Receiver.A Data Receiver feeds back the CE packet counter using the Accurate
ECN (ACE) field, as explained in . And it
feeds back all the byte counters using the AccECN TCP Option, as
specified in .Whenever a host feeds back the value of any counter, it MUST report
the most recent value, no matter whether it is in a pure ACK, an ACK
with new payload data or a retransmission. Therefore the feedback
carried on a retransmitted packet is unlikely to be the same as the
feedback on the original packet.When a host first enters AccECN mode, in its role as a Data
Receiver it initializes its counters to r.cep = 5, r.e0b = 1 and
r.ceb = r.e1b.= 0, Non-zero initial values are used to support a stateless handshake
(see ) and to be
distinct from cases where the fields are incorrectly zeroed (e.g. by
middleboxes - see ).When a host enters AccECN mode, in its role as a Data Sender it
initializes its counters to s.cep = 5, s.e0b = 1 and s.ceb = s.e1b.=
0. After AccECN has been negotiated on the SYN and SYN/ACK, both
hosts overload the three TCP flags (AE, CWR and ECE) in the main TCP
header as one 3-bit field. Then the field is given a new name, ACE,
as shown in .The original definition of these three flags in the TCP header,
including the addition of support for the ECN Nonce, is shown for
comparison in . This specification
does not rename these three TCP flags to ACE unconditionally; it
merely overloads them with another name and definition once an
AccECN connection has been established.With one exception (), a host
with both of its half-connections in AccECN mode MUST interpret the
AE, CWR and ECE flags as the 3-bit ACE counter on a segment with the
SYN flag cleared (SYN=0). On such a packet, a Data Receiver MUST
encode the three least significant bits of its r.cep counter into
the ACE field that it feeds back to the Data Sender. A host MUST NOT
interpret the 3 flags as a 3-bit ACE field on any segment with SYN=1
(whether ACK is 0 or 1), or if AccECN negotiation is incomplete or
has not succeeded.Both parts of each of these conditions are equally important. For
instance, even if AccECN negotiation has been successful, the ACE
field is not defined on any segments with SYN=1 (e.g. a
retransmission of an unacknowledged SYN/ACK, or when both ends send
SYN/ACKs after AccECN support has been successfully negotiated
during a simultaneous open).A TCP client (A) in AccECN mode MUST feed back which of the 4
possible values of the IP-ECN field was on the SYN/ACK by writing
it into the ACE field of a pure ACK with no SACK blocks using the
binary encoding in (which
is the same as that used on the SYN/ACK in ). This shall be called the
handshake encoding of the ACE field, and it is the only exception
to the rule that the ACE field carries the 3 least significant
bits of the r.cep counter on packets with SYN=0.Normally, a TCP client acknowledges a SYN/ACK with an ACK that
satisfies the above conditions anyway (SYN=0, no data, no SACK
blocks). If an AccECN TCP client intends to acknowledge the
SYN/ACK with a packet that does not satisfy these conditions (e.g.
it has data to include on the ACK), it SHOULD first send a pure
ACK that does satisfy these conditions (see ), so that it can feed back
which of the four values of the IP-ECN field arrived on the
SYN/ACK. A valid exception to this "SHOULD" would be where the
implementation will only be used in an environment where mangling
of the ECN field is unlikely.IP-ECN codepoint on SYN/ACKACE on pure ACK of SYN/ACKr.cep of client in AccECN modeNot-ECT0b0105ECT(1)0b0115ECT(0)0b1005CE0b1106When an AccECN server in SYN-RCVD state receives a pure ACK
with SYN=0 and no SACK blocks, instead of treating the ACE field
as a counter, it MUST infer the meaning of each possible value of
the ACE field from , which
also shows the value that an AccECN server MUST set s.cep to as a
result.Given this encoding of the ACE field on the ACK of a SYN/ACK is
exceptional, an AccECN server using large receive offload (LRO)
might prefer to disable LRO until such an ACK has transitioned it
out of SYN-RCVD state.ACE on ACK of SYN/ACKIP-ECN codepoint on SYN/ACK inferred by servers.cep of server in AccECN mode0b000{Notes 1, 3}Disable ECN0b001{Notes 2, 3}50b010Not-ECT50b011ECT(1)50b100ECT(0)50b101Currently Unused {Note 2}50b110CE60b111Currently Unused {Note 2}5{Note 1}: If the server is in AccECN mode, the value of zero
raises suspicion of zeroing of the ACE field on the path (see
).{Note 2}: If the server is in AccECN mode, these values are
Currently Unused but the AccECN server's behaviour is still
defined for forward compatibility. Then the designer of a future
protocol can know for certain what AccECN servers will do with
these codepoints.{Note 3}: In the case where a server that implements AccECN is
also using a stateless handshake (termed a SYN cookie) it will not
remember whether it entered AccECN mode. The values 0b000 or 0b001
will remind it that it did not enter AccECN mode, because AccECN
does not use them (see for details). If a
stateless server that implements AccECN receives either of these
two values in the ACK, its action is implementation-dependent and
outside the scope of this spec, It will certainly not take the
action in the third column because, after it receives either of
these values, it is not in AccECN mode. I.e., it will not disable
ECN (at least not just because ACE is 0b000) and it will not set
s.cep.Whenever the Data Receiver sends an ACK with SYN=0 (with or
without data), unless the handshake encoding in applies, the Data Receiver MUST
encode the least significant 3 bits of its r.cep counter into the
ACE field (see ).Whenever the Data Sender receives an ACK with SYN=0 (with or
without data), it first checks whether it has already been
superseded by another ACK in which case it ignores the ECN
feedback. If the ACK has not been superseded, and if the special
handshake encoding in does not
apply, the Data Sender decodes the ACE field as follows (see for examples).It takes the least significant 3 bits of its local s.cep
counter and subtracts them from the incoming ACE counter to
work out the minimum positive increment it could apply to
s.cep (assuming the ACE field only wrapped at most once).It then follows the safety procedures in to calculate or estimate how
many packets the ACK could have acknowledged under the
prevailing conditions to determine whether the ACE field might
have wrapped more than once.The encode/decode procedures during the three-way handshake are
exceptions to the general rules given so far, so they are spelled
out step by step below for clarity:If a TCP server in AccECN mode receives a CE mark in the
IP-ECN field of a SYN (SYN=1, ACK=0), it MUST NOT increment
r.cep (it remains at its initial value of 5). Reason: It would be redundant for the server
to include CE-marked SYNs in its r.cep counter, because it
already reliably delivers feedback of any CE marking on the
SYN/ACK using the encoding in . This also ensures that,
when the server starts using the ACE field, it has not
unnecessarily consumed more than one initial value, given they
can be used to negotiate variants of the AccECN protocol (see
).If a TCP client in AccECN mode receives CE feedback in the
TCP flags of a SYN/ACK, it MUST NOT increment s.cep (it
remains at its initial value of 5), so that it stays in step
with r.cep on the server. Nonetheless, the TCP client still
triggers the congestion control actions necessary to respond
to the CE feedback.If a TCP client in AccECN mode receives a CE mark in the
IP-ECN field of a SYN/ACK, it MUST increment r.cep, but no
more than once no matter how many CE-marked SYN/ACKs it
receives (i.e. incremented from 5 to 6, but no further).
Reason: Incrementing r.cep ensures the
client will eventually deliver any CE marking to the server
reliably when it starts using the ACE field. Even though the
client also feeds back any CE marking on the ACK of the
SYN/ACK using the encoding in , this ACK is not delivered
reliably, so it can be considered as a timely notification
that is redundant but unreliable. The client does not
increment r.cep more than once, because the server can only
increment s.cep once (see next bullet). Also, this limits the
unnecessarily consumed initial values of the ACE field to
two.If a TCP server in AccECN mode and in SYN-RCVD state
receives CE feedback in the TCP flags of a pure ACK with no
SACK blocks, it MUST increment s.cep (from 5 to 6). The TCP
server then triggers the congestion control actions necessary
to respond to the CE feedback.Reasoning: The TCP server can only increment
s.cep once, because the first ACK it receives will cause it to
transition out of SYN-RCVD state. The server's congestion
response would be no different even if it could receive
feedback of more than one CE-marked SYN/ACK.Once the TCP server transitions to ESTABLISHED
state, it might later receive other pure ACK(s) with the
handshake encoding in the ACE field. The conditions for this
to occur are quite unusual, but not impossible, e.g. a SYN/ACK
(or ACK of the SYN/ACK) that is delayed for longer than the
server's retransmission timeout; or packet duplication by the
network. Nonetheless, once in the ESTABLISHED state, the
server will consider the ACE field to be encoded as the normal
ACE counter on all packets with SYN=0 (given it will be
following the above rule in this bullet). The server MAY
include a test to avoid this case. required the Data Receiver to
initialize the r.cep counter to a non-zero value. Therefore, in
either direction the initial value of the ACE counter ought to be
non-zero.If AccECN has been successfully negotiated, the Data Sender
SHOULD check the value of the ACE counter in the first packet
(with or without data) that arrives with SYN=0. If the value of
this ACE field is zero (0b000), the Data Sender disables sending
ECN-capable packets for the remainder of the half-connection by
setting the IP/ECN field in all subsequent packets to Not-ECT.
Usually, the server checks the ACK of the SYN/ACK from the
client, while the client checks the first data segment from the
server. However, if reordering occurs, "the first packet ... that
arrives" will not necessarily be the same as the first packet in
sequence order. The test has been specified loosely like this to
simplify implementation, and because it would not have been any
more precise to have specified the first packet in sequence order,
which would not necessarily be the first ACE counter that the Data
Receiver fed back anyway, given it might have been a
retransmission.The possibility of re-ordering means that there is a small
chance that the ACE field on the first packet to arrive is
genuinely zero (without middlebox interference). This would cause
a host to unnecessarily disable ECN for a half connection.
Therefore, in environments where there is no evidence of the ACE
field being zeroed, implementations can skip this test.Note that the Data Sender MUST NOT test whether the arriving
counter in the initial ACE field has been initialized to a
specific valid value - the above check solely tests whether the
ACE fields have been incorrectly zeroed. This allows hosts to use
different initial values as an additional signalling channel in
future.The value of the ACE field on the SYN/ACK indicates the value
of the IP/ECN field when the SYN arrived at the server. The client
can compare this with how it originally set the IP/ECN field on
the SYN. If this comparison implies an unsafe transition (see
below) of the IP/ECN field, for the remainder of the connection
the client MUST NOT send ECN-capable packets, but it MUST continue
to feed back any ECN markings on arriving packets.The value of the ACE field on the last ACK of the 3WHS
indicates the value of the IP/ECN field when the SYN/ACK arrived
at the client. The server can compare this with how it originally
set the IP/ECN field on the SYN/ACK. If this comparison implies an
unsafe transition of the IP/ECN field, for the remainder of the
connection the server MUST NOT send ECN-capable packets, but it
MUST continue to feed back any ECN markings on arriving
packets.The ACK of the SYN/ACK is not reliably delivered (nonetheless,
the count of CE marks is still eventually delivered reliably). If
this ACK does not arrive, the server can continue to send
ECN-capable packets without having tested for mangling of the
IP/ECN field on the SYN/ACK.Invalid transitions of the IP/ECN field are defined in and repeated here for convenience:the not-ECT codepoint changes;either ECT codepoint transitions to not-ECT;the CE codepoint changes.RFC 3168 says that a router that changes ECT to not-ECT is
invalid but safe. However, from a host's viewpoint, this
transition is unsafe because it could be the result of two
transitions at different routers on the path: ECT to CE (safe)
then CE to not-ECT (unsafe). This scenario could well happen where
an ECN-enabled home router congests its upstream mobile broadband
bottleneck link, then the ingress to the mobile network clears the
ECN field .Once a Data Sender has entered AccECN mode it SHOULD check
whether all feedback received for the first three or four round
indicated that every packet it sent was CE-marked. If so, for the
remainder of the connection, the Data Sender SHOULD NOT send
ECN-capable packets, but it MUST continue to feed back any ECN
markings on arriving packets.The above fall-back behaviours are necessary in case mangling
of the IP/ECN field is asymmetric, which is currently common over
some mobile networks . Then one end
might see no unsafe transition and continue sending ECN-capable
packets, while the other end sees an unsafe transition and stops
sending ECN-capable packets.If too many CE-marked segments are acknowledged at once, or if
a long run of ACKs is lost or thinned out, the 3-bit counter in
the ACE field might have cycled between two ACKs arriving at the
Data Sender. The following safety procedures minimize this
ambiguity.An AccECN Data Receiver:SHOULD immediately send an ACK whenever a data packet
marked CE arrives after the previous data packet was not
CE.MUST immediately send an ACK once 'n' CE marks have
arrived since the previous ACK, where 'n' SHOULD be 2 and
MUST be no greater than 6.These rules for when to send an ACK are designed to be
complemented by those in ,
which concern whether the AccECN TCP Option ought to be included
on ACKs.For the avoidance of doubt, the change-triggered ACK
mechanism is deliberately worded to solely apply to data
packets, and to ignore the arrival of a control packet with no
payload, because it is important that TCP does not acknowledge
pure ACKs. The change-triggered ACK approach can lead to some
additional ACKs but it feeds back the timing and the order in
which ECN marks are received with minimal additional complexity.
If only CE marks are infrequent, or there are multiple marks in
a row, the additional load will be low. Other marking patterns
could increase the load significantly.Even though the first bullet is stated as a "SHOULD", it is
important for a transition to immediately trigger an ACK if at
all possible, so that the Data Sender can rely on
change-triggered ACKs to detect queue growth as soon as
possible, e.g. at the start of a flow. This requirement can only
be relaxed if certain offload hardware needed for high
performance cannot support change-triggered ACKs (although high
performance protocols such as DCTCP already successfully use
change-triggered ACKs). One possible compromise would be for the
receiver to heuristically detect whether the sender is in
slow-start, then to implement change-triggered ACKs while the
sender is in slow-start, and offload otherwise.If the Data Sender has not received AccECN TCP Options to
give it more dependable information, and it detects that the ACE
field could have cycled, it SHOULD deem whether it cycled by
taking the safest likely case under the prevailing conditions.
It can detect if the counter could have cycled by using the jump
in the acknowledgement number since the last ACK to calculate or
estimate how many segments could have been acknowledged. An
example algorithm to implement this policy is given in . An implementer MAY develop an
alternative algorithm as long as it satisfies these
requirements.If missing acknowledgement numbers arrive later (reordering)
and prove that the counter did not cycle, the Data Sender MAY
attempt to neutralize the effect of any action it took based on
a conservative assumption that it later found to be
incorrect.The Data Sender can estimate how many packets (of any
marking) an ACK acknowledges. If the ACE counter on an ACK seems
to imply that the minimum number of newly CE-marked packets is
greater that the number of newly acknowledged packets, the Data
Sender SHOULD believe the ACE counter, unless it can be sure
that it is counting all control packets correctly.The AccECN Option is defined as shown in . The initial 'E' of each field name
stands for 'Echo'. shows two option field orders;
order 0 and order 1. They both consists of three 24-bit fields.
Order 0 provides the 24 least significant bits of the r.e0b, r.ceb
and r.e1b counters, respectively. Order 1 provides the same fields,
but in the opposite order. On each packet, the Data Receiver can use
whichever order is more efficient. When a Data Receiver sends an AccECN Option, it MUST set the Kind
field to TBD0 if using Order 0, or to TBD1 if using Order 1. These
two new TCP Option Kinds are registered in and called respectively
AccECN0 and AccECN1.Note that there is no field to feed back Not-ECT bytes.
Nonetheless an algorithm for the Data Sender to calculate the number
of payload bytes received as Not-ECT is given in .Whenever a Data Receiver sends an AccECN Option, the rules in
expect it to usually send a
full-length option. To cope with option space limitations, it can
omit unchanged fields from the tail of the option, as long as it
preserves the order of the remaining fields and includes any field
that has changed. The length field MUST indicate which fields are
present as follows:LengthType 0Type 111EE0B, ECEB, EE1BEE1B, ECEB, EE0B8EE0B, ECEBEE1B, ECEB5EE0BEE1B2(empty)(empty)The empty option of Length=2 is provided to allow for a case
where an AccECN Option has to be sent (e.g. on the SYN/ACK to test
the path), but there is very limited space for the option.All implementations of a Data Sender that read any AccECN Option
MUST be able to read in AccECN Options of any of the above lengths.
For forward compatibility, if the AccECN Option is of any other
length, implementations MUST use those whole 3-octet fields that fit
within the length and ignore the remainder of the option.The AccECN Option has to be optional to implement, because both
sender and receiver have to be able to cope without the option
anyway - in cases where it does not traverse a network path. It is
RECOMMENDED to implement both sending and receiving of the AccECN
Option. If sending of the AccECN Option is implemented, the
fall-backs described in this document will need to be implemented as
well (unless solely for a controlled environment where path
traversal is not considered a problem). Even if a developer does not
implement sending of the AccECN Option, it is RECOMMENDED that they
still implement logic to receive and understand any AccECN Options
sent by remote peers.If a Data Receiver intends to send the AccECN Option at any time
during the rest of the connection it is strongly recommended to also
test path traversal of the AccECN Option as specified in .Whenever the Data Receiver includes any of the counter fields
(ECEB, EE0B, EE1B) in an AccECN Option, it MUST encode the 24
least significant bits of the current value of the associated
counter into the field (respectively r.ceb, r.e0b, r.e1b).Whenever the Data Sender receives ACK carrying an AccECN
Option, it first checks whether the ACK has already been
superseded by another ACK in which case it ignores the ECN
feedback. If the ACK has not been superseded, the Data Sender MUST
decode the fields in the AccECN Option as follows. For each field,
it takes the least significant 24 bits of its associated local
counter (s.ceb, s.e0b or s.e1b) and subtracts them from the
counter in the associated field of the incoming AccECN Option
(respectively ECEB, EE0B, EE1B), to work out the minimum positive
increment it could apply to s.ceb, s.e0b or s.e1b (assuming the
field in the option only wrapped at most once). gives an example
algorithm for the Data Receiver to encode its byte counters into
the AccECN Option, and for the Data Sender to decode the AccECN
Option fields into its byte counters.Note that, as specified in ,
any data on the SYN (SYN=1, ACK=0) is not included in any of the
locally held octet counters nor in the AccECN Option on the
wire.The TCP client MUST NOT include the AccECN TCP Option on the
SYN. (A fall-back strategy for the loss of the SYN (possibly due
to middlebox interference) is specified in .)A TCP server that confirms its support for AccECN (in
response to an AccECN SYN from the client as described in ) SHOULD include an AccECN TCP
Option on the SYN/ACK.A TCP client that has successfully negotiated AccECN SHOULD
include an AccECN Option in the first ACK at the end of the
3WHS. However, this first ACK is not delivered reliably, so the
TCP client SHOULD also include an AccECN Option on the first
data segment it sends (if it ever sends one).A host MAY NOT include an AccECN Option in any of these three
cases if it has cached knowledge that the packet would be likely
to be blocked on the path to the other host if it included an
AccECN Option.If
after the normal TCP timeout the TCP server has not received an
ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been
lost, e.g. due to congestion, or a middlebox might be blocking
the AccECN Option. To expedite connection setup, the TCP server
SHOULD retransmit the SYN/ACK repeating the same AE, CWR and ECE
TCP flags as on the original SYN/ACK but with no AccECN Option.
If this retransmission times out, to expedite connection setup,
the TCP server SHOULD disable AccECN and ECN for this connection
by retransmitting the SYN/ACK with AE=CWR=ECE=0 and no AccECN
Option.Implementers MAY use other fall-back strategies if they are
found to be more effective (e.g. retrying the AccECN Option for
a second time before fall-back - most appropriate during high
levels of congestion). However, other fall-back strategies will
need to follow all the rules in , which concern
behaviour when SYNs or SYN/ACKs negotiating different types of
feedback have been sent within the same connection.If the TCP client detects that the first data segment it sent
with the AccECN Option was lost, it SHOULD fall back to no
AccECN Option on the retransmission. Again, implementers MAY use
other fall-back strategies such as attempting to retransmit a
second segment with the AccECN Option before fall-back, and/or
caching whether the AccECN Option is blocked for subsequent
connections. further
discusses caching of TCP parameters and status information.If a host falls back to not sending the AccECN Option, it
will continue to process any incoming AccECN Options as
normal.Either host MAY include the AccECN Option in a subsequent
segment to retest whether the AccECN Option can traverse the
path.If the TCP server receives a second SYN with a request for
AccECN support, it should resend the SYN/ACK, again confirming
its support for AccECN, but this time without the AccECN Option.
This approach rules out any interference by middleboxes that may
drop packets with unknown options, even though it is more likely
that the SYN/ACK would have been lost due to congestion. The TCP
server MAY try to send another packet with the AccECN Option at
a later point during the connection but should monitor if that
packet got lost as well, in which case it SHOULD disable the
sending of the AccECN Option for this half-connection.Similarly, an AccECN end-point MAY separately memorize which
data packets carried an AccECN Option and disable the sending of
AccECN Options if the loss probability of those packets is
significantly higher than that of all other data packets in the
same connection.If the TCP client has successfully negotiated AccECN but does
not receive an AccECN Option on the SYN/ACK (e.g. because is has
been stripped by a middlebox or not sent by the server), the
client switches into a mode that assumes that the AccECN Option
is not available for this half connection.Similarly, if the TCP server has successfully negotiated
AccECN but does not receive an AccECN Option on the first
segment that acknowledges sequence space at least covering the
ISN, it switches into a mode that assumes that the AccECN Option
is not available for this half connection.While a host is in this mode that assumes incoming AccECN
Options are not available, it MUST adopt the conservative
interpretation of the ACE field discussed in . However, it cannot make any
assumption about support of outgoing AccECN Options on the other
half connection, so it SHOULD continue to send the AccECN Option
itself (unless it has established that sending the AccECN Option
is causing packets to be blocked as in ).If a host is in the mode that assumes incoming AccECN Options
are not available, but it receives an AccECN Option at any later
point during the connection, this clearly indicates that the
AccECN Option is not blocked on the respective path, and the
AccECN endpoint MAY switch out of the mode that assumes the
AccECN Option is not available for this half connection.For a related test for invalid initialization of the ACE
field, see required the Data Receiver
to initialize the r.e0b counter to a non-zero value. Therefore,
in either direction the initial value of the EE0B field in the
AccECN Option (if one exists) ought to be non-zero. If AccECN
has been negotiated:the TCP server MAY check the initial value of the EE0B
field in the first segment that acknowledges sequence space
that at least covers the ISN plus 1. If the initial value of
the EE0B field is zero, the server will switch into a mode
that ignores the AccECN Option for this half connection.the TCP client MAY check the initial value of the EE0B
field on the SYN/ACK. If the initial value of the EE0B field
is zero, the client will switch into a mode that ignores the
AccECN Option for this half connection.While a host is in the mode that ignores the AccECN Option it
MUST adopt the conservative interpretation of the ACE field
discussed in .Note that the Data Sender MUST NOT test whether the arriving
byte counters in the initial AccECN Option have been initialized
to specific valid values - the above checks solely test whether
these fields have been incorrectly zeroed. This allows hosts to
use different initial values as an additional signalling channel
in future. Also note that the initial value of either field
might be greater than its expected initial value, because the
counters might already have been incremented. Nonetheless, the
initial values of the counters have been chosen so that they
cannot wrap to zero on these initial segments.When the AccECN Option is available it supplements but does
not replace the ACE field. An endpoint using AccECN feedback
MUST always consider the information provided in the ACE field
whether or not the AccECN Option is also available.If the AccECN option is present, the s.cep counter might
increase while the s.ceb counter does not (e.g. due to a
CE-marked control packet). The sender's response to such a
situation is out of scope, and needs to be dealt with in a
specification that uses ECN-capable control packets.
Theoretically, this situation could also occur if a middlebox
mangled the AccECN Option but not the ACE field. However, the
Data Sender has to assume that the integrity of the AccECN
Option is sound, based on the above test of the well-known
initial values and optionally other integrity tests ().If either end-point detects that the s.ceb counter has
increased but the s.cep has not (and by testing ACK coverage it
is certain how much the ACE field has wrapped), this invalid
protocol transition has to be due to some form of feedback
mangling. So, the Data Sender MUST disable sending ECN-capable
packets for the remainder of the half-connection by setting the
IP/ECN field in all subsequent packets to Not-ECT.If the Data Receiver intends to use the AccECN TCP Option to
provide feedback, the following rules determine when a Data
Receiver in AccECN mode sends an ACK with the AccECN TCP Option,
and which fields to include:If an arriving packet
increments a different byte counter to that incremented by the
previous packet, the Data Receiver SHOULD immediately send an
ACK with an AccECN Option, without waiting for the next
delayed ACK (this is in addition to the safety recommendation
in against ambiguity of the
ACE field). Even though this bullet is
stated as a "SHOULD", it is important for a transition to
immediately trigger an ACK if at all possible, as already
argued when specifying change-triggered ACKs for the ACE.Otherwise, if arriving
packets continue to increment the same byte counter, the Data
Receiver can include an AccECN Option on most or all (delayed)
ACKs, but it does not have to.It SHOULD include a counter that has continued to
increment on the next scheduled ACK following a
change-triggered ACK;while the same counter continues to increment, it
SHOULD include the counter every n ACKs as consistently as
possible, where n can be chosen by the implementer;It SHOULD always include an AccECN Option if the r.ceb
counter is incrementing and it MAY include an AccECN
Option if r.ec0b or r.ec1b is incrementingIt SHOULD, include each counter at least once for every
2^22 bytes incremented to prevent overflow during
continual repetition.If the smallest allowed AccECN Option would leave
insufficient space for two SACK blocks on a particular ACK,
the Data Receiver MUST give precedence to the SACK option
(total 18 octets), because loss feedback is more critical.It MAY exclude
counter(s) that have not changed for the whole connection (but
beacons still include all fields - see below). It SHOULD
include counter(s) that have incremented at some time during
the connection. It MUST include the counter(s) that have
incremented since the previous AccECN Option and it MUST only
truncate fields from the right-hand tail of the option to
preserve the order of the remaining fields (see );Nonetheless, it
MUST include a full-length AccECN TCP Option on at least three
ACKs per RTT, or on all ACKs if there are less than three per
RTT (see for an example
algorithm that satisfies this requirement).The above rules complement those in , which determine when to generate an
ACK irrespective of whether an AccECN TCP Option is to be
included.The following example series of arriving IP/ECN fields
illustrates when a Data Receiver will emit an ACK with an AccECN
Option if it is using a delayed ACK factor of 2 segments and
change-triggered ACKs: 01 -> ACK, 01, 01 -> ACK, 10 ->
ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 -> ACK.Even though first bullet is stated as a "SHOULD", it is
important for a transition to immediately trigger an ACK if at all
possible, so that the Data Sender can rely on change-triggered
ACKs to detect queue growth as soon as possible, e.g. at the start
of a flow. This requirement can only be relaxed if certain offload
hardware needed for high performance cannot support
change-triggered ACKs (although high performance protocols such as
DCTCP already successfully use change-triggered ACKs). One
possible experimental compromise would be for the receiver to
heuristically detect whether the sender is in slow-start, then to
implement change-triggered ACKs while the sender is in slow-start,
and offload otherwise.For the avoidance of doubt, this change-triggered ACK mechanism
is deliberately worded to ignore the arrival of a control packet
with no payload, which therefore does not alter any byte counters,
because it is important that TCP does not acknowledge pure ACKs.
The change-triggered ACK approach can lead to some additional ACKs
but it feeds back the timing and the order in which ECN marks are
received with minimal additional complexity. If only CE marks are
infrequent, or there are multiple marks in a row, the additional
load will be low. Other marking patterns could increase the load
significantly, Investigating the additional load is a goal of the
proposed experiment.Implementation note: sending an AccECN Option each time a
different counter changes and including a full-length AccECN
Option on every delayed ACK will satisfy the requirements
described above and might be the easiest implementation, as long
as sufficient space is available in each ACK (in total and in the
option space). gives an example
algorithm to estimate the number of marked bytes from the ACE
field alone, if the AccECN Option is not available.If a host has determined that segments with the AccECN Option
always seem to be discarded somewhere along the path, it is no
longer obliged to follow the above rules.A large class of middleboxes split TCP connections. Such a
middlebox would be compliant with the AccECN protocol if the TCP
implementation on each side complied with the present AccECN
specification and each side negotiated AccECN independently of the
other side.Another large class of middleboxes intervenes to some degree at
the transport layer, but attempts to be transparent (invisible) to
the end-to-end connection. A subset of this class of middleboxes
attempts to `normalize' the TCP wire protocol by checking that all
values in header fields comply with a rather narrow and often
outdated interpretation of the TCP specifications. To comply with
the present AccECN specification, such a middlebox MUST NOT change
the ACE field or the AccECN Option.A middlebox claiming to be transparent at the transport layer
MUST forward the AccECN TCP Option unaltered, whether or not the
length value matches one of those specified in , and whether or not the initial values of
the byte-counter fields are correct. This is because blocking
apparently invalid values does not improve security (because AccECN
hosts are required to ignore invalid values anyway), while it
prevents the standardized set of values being extended in future
(because outdated normalizers would block updated hosts from using
the extended AccECN standard).A node that implements ACK filtering (aka. thinning or
coalescing) and itself also implements ECN marking will not need to
filter ACKs from connections that use AccECN feedback. Therefore,
such a node SHOULD detect connections that have negotiated the use
of AccECN feedback during the handshake (see ) and it SHOULD preserve the timing
of each ACK (if it coalesced ACKs it would not be AccECN-compliant,
but the requirement is stated as a "SHOULD" in order to allow leeway
for pre-existing ACK filtering functions to be brought into line).
A node that implements ACK filtering and does not itself
implement ECN marking does not need to treat AccECN connections any
differently from other TCP connections. Nonetheless, it is
RECOMMENDED that such nodes implement ECN marking and comply with
the requirements of the previoius paragraph. This should be a better
way than ACK filtering to improve the performance of AccECN TCP
connections.The rationale for these requirements is that AccECN feedback
provides sufficient information to a data receiver for it to be able
to monitor ECN marking of the ACKs it has sent, so that it can thin
the ACK stream itself. This will eventually mean that ACK filtering
in the network gives no performance advantage. Then TCP will be able
to maintain its own control over ACK coalescing. This will also
allow the TCP Data Sender to use the timing of ACK arrivals to more
reliably infer further information about the path congestion
level.Note that the specification of AccECN in TCP does not presume to
rely on the above ACK filtering behaviour in the network, because it
has to be robust against pre-existing network nodes that still
filter AccECN ACKs, and robust against ACK loss during overload.
Section 5.2.1 of gives best current
practice on ACK filtering (aka. thinning or coalescing). It
gives no advice on ACKs carrying ECN feedback, because at the time
is said that "ECN remain areas of ongoing research". This section
updates that advice for a TCP connection that supports AccECN
feedback.Hardware to offload certain TCP processing represents another
large class of middleboxes (even though it is often a function of a
host's network interface and rarely in its own 'box'). The ACE field changes with every received CE marking, so today's
receive offloading could lead to many interrupts in high congestion
situations. Although that would be useful (because congestion
information is received sooner), it could also significantly
increase processor load, particularly in scenarios such as DCTCP or
L4S where the marking rate is generally higher.Current offload hardware ejects a segment from the coalescing
process whenever the TCP ECN flags change. Thus Classic ECN causes
offload to be inefficient. In data centres it has been fortunate for
this offload hardware that DCTCP-style feedback changes less often
when there are long sequences of CE marks, which is more common with
a step marking threshold (but less likely the more short flows are
in the mix). The ACE counter approach has been designed so that
coalescing can continue over arbitrary patterns of marking and only
needs to stop when the counter wraps. Nonetheless, until the
particular offload hardware in use implements this more efficient
approach, it is likely to be more efficient for AccECN connections
to implement this counter-style logic using software segmentation
offload.ECN encodes a varying signal in the ACK stream, so it is
inevitable that offload hardware will ultimately need to handle any
form of ECN feedback exceptionally. The ACE field has been designed
as a counter so that it is straightforward for offload hardware to
pass on the highest counter, and to push a segment from its cache
before the counter wraps. The purpose of working towards
standardized TCP ECN feedback is to reduce the risk for hardware
developers, who would otherwise have to guess which scheme is likely
to become dominant. The above process has been designed to enable a continuing
incremental deployment path - to more highly dynamic congestion
control. Once DCTCP offload hardware supports AccECN, it will be
able to coalesce efficiently for any sequence of marks, instead of
relying for efficiency on the long marking sequences from step
marking. In the next stage, DCTCP marking can evolve from a step to
a ramp function. That in turn will allow host congestion control
algorithms to respond faster to dynamics, while being backwards
compatible with existing host algorithms.Normative statements in the following sections of RFC3168 are updated
by the present AccECN specification: The whole of "6.1.1 TCP Initialization" of is updated by
of the present specification.In "6.1.2. The TCP Sender" of , all
mentions of a congestion response to an ECN-Echo (ECE) ACK packet
are updated by of the present
specification to mean an increment to the sender's count of
CE-marked packets, s.cep. And the requirements to set the CWR flag
no longer apply, as specified in of the present
specification. Otherwise, the remaining requirements in "6.1.2. The
TCP Sender" still stand.It will be noted
that RFC 8311 already updates, or potentially updates, a number of
the requirements in "6.1.2. The TCP Sender". Section 6.1.2 of RFC
3168 extended standard TCP congestion control to cover ECN marking as well as packet drop.
Whereas, RFC 8311 enables experimentation with alternative responses
to ECN marking, if specified for instance by an experimental RFC on
the IETF document stream. RFC 8311 also strengthened the statement
that "ECT(0) SHOULD be used" to a "MUST" (see for the details).The whole of "6.1.3. The TCP Receiver" of is updated by of
the present specification, with the exception of the last paragraph
(about congestion response to drop and ECN in the same round trip),
which still stands. Incidentally, this last paragraph is in the
wrong section, because it relates to TCP sender behaviour.The following text within "6.1.5. Retransmitted TCP packets":
"the TCP data receiver SHOULD ignore the ECN field on
arriving data packets that are outside of the receiver's current
window." is updated by more stringent acceptability tests for any
packet (not just data packets) in the present specification.
Specifically, in the normative specification of AccECN () only 'Acceptable' packets contribute to the
ECN counters at the AccECN receiver and defines an Acceptable packet as one
that passes the acceptability tests in both
and .Sections 5.2, 6.1.1, 6.1.4, 6.1.5 and 6.1.6 of prohibit use of ECN on TCP control packets and
retransmissions. The present specification does not update that
aspect of RFC 3168, but it does say what feedback an AccECN Data
Receiver should provide if it receives an ECN-capable control packet
or retransmission. This ensures AccECN is forward compatible with
any future scheme that allows ECN on these packets, as provided for
in section 4.3 of and as proposed in .This section is informative, not normative.A TCP server can use SYN Cookies (see Appendix A of ) to protect itself from SYN flooding attacks. It
places minimal commonly used connection state in the SYN/ACK, and
deliberately does not hold any state while waiting for the subsequent
ACK (e.g. it closes the thread). Therefore it cannot record the fact
that it entered AccECN mode for both half-connections. Indeed, it
cannot even remember whether it negotiated the use of classic ECN
.Nonetheless, such a server can determine that it negotiated AccECN
as follows. If a TCP server using SYN Cookies supports AccECN and if
it receives a pure ACK that acknowledges an ISN that is a valid SYN
cookie, and if the ACK contains an ACE field with the value 0b010 to
0b111 (decimal 2 to 7), it can assume that:the TCP client must have requested AccECN support on the
SYNit (the server) must have confirmed that it supported
AccECNTherefore the server can switch itself into AccECN mode, and
continue as if it had never forgotten that it switched itself into
AccECN mode earlier.If the pure ACK that acknowledges a SYN cookie contains an ACE
field with the value 0b000 or 0b001, these values indicate that the
client did not request support for AccECN and therefore the server
does not enter AccECN mode for this connection. Further, 0b001 on the
ACK implies that the server sent an ECN-capable SYN/ACK, which was
marked CE in the network, and the non-AccECN client fed this back by
setting ECE on the ACK of the SYN/ACK.AccECN is compatible (at least on paper) with the most commonly
used TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It
is also compatible with the recent promising experimental TCP options
TCP Fast Open (TFO ) and Multipath TCP (MPTCP
). AccECN is friendly to all these protocols,
because space for TCP options is particularly scarce on the SYN, where
AccECN consumes zero additional header space.When option space is under pressure from other options, provides guidance on how important it
is to send an AccECN Option and whether it needs to be a full-length
option.Implementers of TFO need to take careful note of the recommendation
in . That section recommends that,
if the client has successfully negotiated AccECN, when acknowledging
the SYN/ACK, even if it has data to send, it sends a pure ACK
immediately before the data. Then it can reflect the IP-ECN field of
the SYN/ACK on this pure ACK, which allows the server to detect ECN
mangling.Three alternative mechanisms are available to assure the integrity
of ECN and/or loss signals. AccECN is compatible with any of these
approaches:The Data Sender can test the integrity of the receiver's ECN
(or loss) feedback by occasionally setting the IP-ECN field to a
value normally only set by the network (and/or deliberately
leaving a sequence number gap). Then it can test whether the Data
Receiver's feedback faithfully reports what it expects (similar to
para 2 of Section 20.2 of ). Unlike the
ECN Nonce , this approach does not waste
the ECT(1) codepoint in the IP header, it does not require
standardization and it does not rely on misbehaving receivers
volunteering to reveal feedback information that allows them to be
detected. However, setting the CE mark by the sender might conceal
actual congestion feedback from the network and should therefore
only be done sparingly.Networks generate congestion signals when they are becoming
congested, so networks are more likely than Data Senders to be
concerned about the integrity of the receiver's feedback of these
signals. A network can enforce a congestion response to its ECN
markings (or packet losses) using congestion exposure (ConEx)
audit . Whether the receiver or a
downstream network is suppressing congestion feedback or the
sender is unresponsive to the feedback, or both, ConEx audit can
neutralize any advantage that any of these three parties would
otherwise gain. ConEx is a change to the
Data Sender that is most useful when combined with AccECN. Without
AccECN, the ConEx behaviour of a Data Sender would have to be more
conservative than would be necessary if it had the accurate
feedback of AccECN.The TCP authentication option (TCP-AO )
can be used to detect any tampering with AccECN feedback between
the Data Receiver and the Data Sender (whether malicious or
accidental). The AccECN fields are immutable end-to-end, so they
are amenable to TCP-AO protection, which covers TCP options by
default. However, TCP-AO is often too brittle to use on many
end-to-end paths, where middleboxes can make verification fail in
their attempts to improve performance or security, e.g. by
resegmentation or shifting the sequence space.Originally the ECN Nonce was
proposed to ensure integrity of congestion feedback. With minor
changes AccECN could be optimized for the possibility that the ECT(1)
codepoint might be used as an ECN Nonce. However, given RFC 3540 has
been reclassified as historic, the AccECN design has been generalized
so that it ought to be able to support other possible uses of the
ECT(1) codepoint, such as a lower severity or a more instant
congestion signal than CE.This section is informative not normative. It describes how well the
protocol satisfies the agreed requirements for a more accurate ECN
feedback protocol .From each ACK, the Data Sender can infer the
number of new CE marked segments since the previous ACK. This
provides better accuracy on CE feedback than classic ECN. In
addition if the AccECN Option is present (not blocked by the network
path) the number of bytes marked with CE, ECT(1) and ECT(0) are
provided.The AccECN scheme is divided into two parts.
The essential part reuses the 3 flags already assigned to ECN in the
IP header. The supplementary part adds an additional TCP option
consuming up to 11 bytes. However, no TCP option is consumed in the
SYN.The order in which marks arrive at the Data
Receiver is preserved in AccECN feedback, because the Data Receiver
is expected to send an ACK immediately whenever a different mark
arrives.While the same ECN markings are arriving
continually at the Data Receiver, it can defer ACKs as TCP does
normally, but it will immediately send an ACK as soon as a different
ECN marking arrives.Change-Triggered ACKs are
intended to enable latency-sensitive uses of ECN feedback by
capturing the timing of transitions but not wasting resources while
the state of the signalling system is stable. Within the constraints
of the change-triggered ACK rules, the receiver can control how
frequently it sends the AccECN TCP Option and therefore to some
extent it can control the overhead induced by AccECN.All information is provided based on
counters. Therefore if ACKs are lost, the counters on the first ACK
following the losses allows the Data Sender to immediately recover
the number of the ECN markings that it missed. And if data or ACKs
are reordered, stale congestion information can be identified and
ignored.Because feedback is based on
repetition of counters, random losses do not remove any information,
they only delay it. Therefore, even though some ACKs are
change-triggered, random losses will not alter the proportions of
the different ECN markings in the feedback.If space is limited in some
segments (e.g. because more options are needed on some segments,
such as the SACK option after loss), the Data Receiver can send
AccECN Options less frequently or truncate fields that have not
changed, usually down to as little as 5 bytes. However, it has to
send a full-sized AccECN Option at least three times per RTT, which
the Data Sender can rely on as a regular beacon or checkpoint.Ordering
information and the timing of transitions cannot be communicated in
three cases: i) during ACK loss; ii) if something on the path strips
the AccECN Option; or iii) if the Data Receiver is unable to support
Change-Triggered ACKs. Following ACK reordering, the Data Sender can
reconstruct the order in which feedback was sent, but not until all
the missing feedback has arrived.An AccECN implementation solely involves
simple counter increments, some modulo arithmetic to communicate the
least significant bits and allow for wrap, and some heuristics for
safety against fields cycling due to prolonged periods of ACK loss.
Each host needs to maintain eight additional counters. The hosts
have to apply some additional tests to detect tampering by
middleboxes, but in general the protocol is simple to understand,
simple to implement and requires few cycles per packet to
execute.AccECN is compatible with at least three
approaches that can assure the integrity of ECN feedback. If the
AccECN Option is stripped the resolution of the feedback is
degraded, but the integrity of this degraded feedback can still be
assured.If only one endpoint supports
the AccECN scheme, it will fall-back to the most advanced ECN
feedback scheme supported by the other end.If the AccECN Option is
stripped by a middlebox, AccECN still provides basic congestion
feedback in the ACE field. Further, AccECN can be used to detect
mangling of the IP ECN field; mangling of the TCP ECN flags;
blocking of ECT-marked segments; and blocking of segments carrying
the AccECN Option. It can detect these conditions during TCP's 3WHS
so that it can fall back to operation without ECN and/or operation
without the AccECN Option.The behaviour of endpoints and
middleboxes is carefully defined for all reserved or currently
unused codepoints in the scheme. Then, the designers of security
devices can understand which currently unused values might appear in
future. So, even if they choose to treat such values as anomalous
while they are not widely used, any blocking will at least be under
policy control not hard-coded. Then, if previously unused values
start to appear on the Internet (or in standards), such policies
could be quickly reversed.This document reassigns bit 7 of the TCP header flags to the AccECN
experiment. This bit was previously called the Nonce Sum (NS) flag , but RFC 3540 has been reclassified as historic . The flag will now be defined as:BitNameReference7AE (Accurate ECN)RFC XXXX[TO BE REMOVED: IANA is requested to update the existing entry in the
Transmission Control Protocol (TCP) Header Flags registration
(https://www.iana.org/assignments/tcp-header-flags/tcp-header-flags.xhtml#tcp-header-flags-1)
for Bit 7 to "AE (Accurate ECN), previously used as NS (Nonce Sum) by
[RFC3540], which is now Historic [RFC8311]" and change the reference to
this RFC-to-be instead of RFC8311.]This document also defines two new TCP options for AccECN, assigned
values of TBD0 and TBD1 (decimal) from the TCP option space. These
values are defined as:KindLengthMeaningReferenceTBD0NAccurate ECN Order 0 (AccECN0)RFC XXXXTBD1NAccurate ECN Order 1 (AccECN1)RFC XXXX[TO BE REMOVED: This registration should take place at the following
location:
http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1
]Early implementations using experimental option 254 per with the single magic number 0xACCE (16 bits), as
allocated in the IANA "TCP Experimental Option Experiment Identifiers
(TCP ExIDs)" registry, SHOULD migrate to use these new option kinds
(TBD0 & TBD1).[TO BE REMOVED: The description of the 0xACCE value in the TCP ExIDs
registry should be changed to "AccECN (current and new implementations
SHOULD use option kinds TBD0 and TBD1)" at the following location:
https://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-exids
]If ever the supplementary part of AccECN based on the new AccECN TCP
Option is unusable (due for example to middlebox interference) the
essential part of AccECN's congestion feedback offers only limited
resilience to long runs of ACK loss (see ). These problems are unlikely to be due to
malicious intervention (because if an attacker could strip a TCP option
or discard a long run of ACKs it could wreak other arbitrary havoc).
However, it would be of concern if AccECN's resilience could be
indirectly compromised during a flooding attack. AccECN is still
considered safe though, because if the option is not presented, the
AccECN Data Sender is then required to switch to more conservative
assumptions about wrap of congestion indication counters (see and ). describes how a TCP
server can negotiate AccECN and use the SYN cookie method for mitigating
SYN flooding attacks.There is concern that ECN markings could be altered or suppressed,
particularly because a misbehaving Data Receiver could increase its own
throughput at the expense of others. AccECN is compatible with the three
schemes known to assure the integrity of ECN feedback (see for details). If the AccECN Option is
stripped by an incorrectly implemented middlebox, the resolution of the
feedback will be degraded, but the integrity of this degraded
information can still be assured.There is a potential concern that a receiver could deliberately omit
the AccECN Option pretending that it had been stripped by a middlebox.
No known way can yet be contrived to take advantage of this downgrade
attack, but it is mentioned here in case someone else can contrive
one.The AccECN protocol is not believed to introduce any new privacy
concerns, because it merely counts and feeds back signals at the
transport layer that had already been visible at the IP layer.We want to thank Koen De Schepper, Praveen Balasubramanian, Michael
Welzl, Gorry Fairhurst, David Black, Spencer Dawkins, Michael Scharf,
Michael Tuexen, Yuchung Cheng, Kenjiro Cho, Olivier Tilmans, Ilpo
Järvinen and Neal Cardwell for their input and discussion. The idea
of using the three ECN-related TCP flags as one field for more accurate
TCP-ECN feedback was first introduced in the re-ECN protocol that was
the ancestor of ConEx.Bob Briscoe was part-funded by the Comcast Innovation Fund, the
European Community under its Seventh Framework Programme through the
Reducing Internet Transport Latency (RITE) project (ICT-317700) and
through the Trilogy 2 project (ICT-317756), and the Research Council of
Norway through the TimeIn project. The views expressed here are solely
those of the authors.Mirja Kuehlewind was partly supported by the European Commission
under Horizon 2020 grant agreement no. 688421 Measurement and
Architecture for a Middleboxed Internet (MAMI), and by the Swiss State
Secretariat for Education, Research, and Innovation under contract no.
15.0268. This support does not imply endorsement.Comments and questions are encouraged and very welcome. They can be
addressed to the IETF TCP maintenance and minor modifications working
group mailing list <tcpm@ietf.org>, and/or to the authors.Measuring ECN++: Good News for ++, Bad News for ECN over
MobileUC3MSimulaSimulaUC3MSimulaThis appendix is informative, not normative. It gives example
algorithms that would satisfy the normative requirements of the AccECN
protocol. However, implementers are free to choose other ways to
implement the requirements.The
example algorithms below show how a Data Receiver in AccECN mode could
encode its CE byte counter r.ceb into the ECEB field within the AccECN
TCP Option, and how a Data Sender in AccECN mode could decode the ECEB
field into its byte counter s.ceb. The other counters for bytes marked
ECT(0) and ECT(1) in the AccECN Option would be similarly encoded and
decoded.It is assumed that each local byte counter is an unsigned integer
greater than 24b (probably 32b), and that the following constant has
been assigned:DIVOPT = 2^24Every time a CE marked data segment arrives, the Data Receiver
increments its local value of r.ceb by the size of the TCP Data.
Whenever it sends an ACK with the AccECN Option, the value it writes
into the ECEB field is ECEB = r.ceb % DIVOPTwhere '%' is the remainder operator.On the arrival of an AccECN Option, the Data Sender first makes
sure the ACK has not been superseded in order to avoid winding the
s.ceb counter backwards. It uses the TCP acknowledgement number and
any SACK options to calculate newlyAckedB, the amount of new data that
the ACK acknowledges in bytes (newlyAckedB can be zero but not
negative). If newlyAckedB is zero, either the ACK has been superseded
or CE-marked packet(s) without data could have arrived. To break the
tie for the latter case, the Data Sender could use timestamps (if
present) to work out newlyAckedT, the amount of new time that the ACK
acknowledges. If the Data Sender determines that the ACK has been
superseded it ignores the AccECN Option. Otherwise, the Data Sender
calculates the minimum non-negative difference d.ceb between the ECEB
field and its local s.ceb counter, using modulo arithmetic as
follows:For example, if s.ceb is 33,554,433 and ECEB is 1461 (both
decimal), thenThe example algorithms below show how a Data Receiver in AccECN
mode could encode its CE packet counter r.cep into the ACE field, and
how the Data Sender in AccECN mode could decode the ACE field into its
s.cep counter. The Data Sender's algorithm includes code to
heuristically detect a long enough unbroken string of ACK losses that
could have concealed a cycle of the congestion counter in the ACE
field of the next ACK to arrive.Two variants of the algorithm are given: i) a more conservative
variant for a Data Sender to use if it detects that the AccECN Option
is not available (see and ); and ii) a less conservative
variant that is feasible when complementary information is available
from the AccECN Option.It is assumed that each local packet counter is a sufficiently
sized unsigned integer (probably 32b) and that the following
constant has been assigned:DIVACE = 2^3Every time an Acceptable CE marked packet arrives (), the Data Receiver increments
its local value of r.cep by 1. It repeats the same value of ACE in
every subsequent ACK until the next CE marking arrives, whereACE = r.cep % DIVACE.If the Data Sender received an earlier value of the counter that
had been delayed due to ACK reordering, it might incorrectly
calculate that the ACE field had wrapped. Therefore, on the arrival
of every ACK, the Data Sender ensures the ACK has not been
superseded using the TCP acknowledgement number, any SACK options
and timestamps (if available) to calculate newlyAckedB, as in . If the ACK has not been
superseded, the Data Sender calculates the minimum difference d.cep
between the ACE field and its local s.cep counter, using modulo
arithmetic as follows: expects the Data Sender to
assume that the ACE field cycled if it is the safest likely case
under prevailing conditions. The 3-bit ACE field in an arriving ACK
could have cycled and become ambiguous to the Data Sender if a row
of ACKs goes missing that covers a stream of data long enough to
contain 8 or more CE marks. We use the word `missing' rather than
`lost', because some or all the missing ACKs might arrive
eventually, but out of order. Even if some of the missing ACKs were
piggy-backed on data (i.e. not pure ACKs) retransmissions will not
repair the lost AccECN information, because AccECN requires
retransmissions to carry the latest AccECN counters, not the
original ones.The phrase `under prevailing conditions' allows for
implementation-dependent interpretation. A Data Sender might take
account of the prevailing size of data segments and the prevailing
CE marking rate just before the sequence of missing ACKs. However,
we shall start with the simplest algorithm, which assumes segments
are all full-sized and ultra-conservatively it assumes that ECN
marking was 100% on the forward path when ACKs on the reverse path
started to all be dropped. Specifically, if newlyAckedB is the
amount of data that an ACK acknowledges since the previous ACK, then
the Data Sender could assume that this acknowledges newlyAckedPkt
full-sized segments, where newlyAckedPkt = newlyAckedB/MSS. Then it
could assume that the ACE field incremented byFor example, imagine an ACK acknowledges newlyAckedPkt=9 more
full-size segments than any previous ACK, and that ACE increments by
a minimum of 2 CE marks (d.cep=2). The above formula works out that
it would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8)
= 2). However, if ACE increases by a minimum of 2 but acknowledges
10 full-sized segments, then it would be necessary to assume that
there could have been 10 CE marks (because 10 - ((10-2) % 8) =
10).ACKs that acknowledge a large stretch of packets might be common
in data centres to achieve a high packet rate or might be due to ACK
thinning by a middlebox. In these cases, cycling of the ACE field
would often appear to have been possible, so the above algorithm
would be over-conservative, leading to a false high marking rate and
poor performance. Therefore it would be reasonable to only use
dSafer.cep rather than d.cep if the moving average of newlyAckedPkt
was well below 8.Implementers could build in more heuristics to estimate
prevailing average segment size and prevailing ECN marking. For
instance, newlyAckedPkt in the above formula could be replaced with
newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing
segment size and p is the prevailing ECN marking probability.
However, ultimately, if TCP's ECN feedback becomes inaccurate it
still has loss detection to fall back on. Therefore, it would seem
safe to implement a simple algorithm, rather than a perfect one.The simple algorithm for dSafer.cep above requires no monitoring
of prevailing conditions and it would still be safe if, for example,
segments were on average at least 5% of full-sized as long as ECN
marking was 5% or less. Assuming it was used, the Data Sender would
increment its packet counter as follows:s.cep += dSafer.cepIf missing acknowledgement numbers arrive later (due to
reordering), says "the Data
Sender MAY attempt to neutralize the effect of any action it took
based on a conservative assumption that it later found to be
incorrect". To do this, the Data Sender would have to store the
values of all the relevant variables whenever it made assumptions,
so that it could re-evaluate them later. Given this could become
complex and it is not required, we do not attempt to provide an
example of how to do this.When the AccECN Option is available on the ACKs before and after
the possible sequence of ACK losses, if the Data Sender only needs
CE-marked bytes, it will have sufficient information in the AccECN
Option without needing to process the ACE field. If for some reason
it needs CE-marked packets, if dSafer.cep is different from d.cep,
it can determine whether d.cep is likely to be a safe enough
estimate by checking whether the average marked segment size (s =
d.ceb/d.cep) is less than the MSS (where d.ceb is the amount of
newly CE-marked bytes - see ). Specifically, it could use
the following algorithm:The chart below shows when the above algorithm will consider
d.cep can replace dSafer.cep as a safe enough estimate of the number
of CE-marked packets:The following examples give the reasoning behind the algorithm,
assuming MSS=1460 [B]:if d.cep=0, dSafer.cep=8 and d.ceb=1460, then s=infinity and
sSafer=182.5.Therefore even though the
average size of 8 data segments is unlikely to have been as
small as MSS/8, d.cep cannot have been correct, because it would
imply an average segment size greater than the MSS.if d.cep=2, dSafer.cep=10 and d.ceb=1460, then s=730 and
sSafer=146.Therefore d.cep is safe
enough, because the average size of 10 data segments is unlikely
to have been as small as MSS/10.if d.cep=7, dSafer.cep=15 and d.ceb=10200, then s=1457 and
sSafer=680.Therefore d.cep is safe
enough, because the average data segment size is more likely to
have been just less than one MSS, rather than below MSS/2.If pure ACKs were allowed to be ECN-capable, missing ACKs would
be far less likely. However, because
currently precludes this, the above algorithm assumes that pure ACKs
are not ECN-capable.If the AccECN Option is not available, the Data Sender can only
decode CE-marking from the ACE field in packets. Every time an ACK
arrives, to convert this into an estimate of CE-marked bytes, it needs
an average of the segment size, s_ave. Then it can add or subtract
s_ave from the value of d.ceb as the value of d.cep increments or
decrements. Some possible ways to calculate s_ave are outlined below.
The precise details will depend on why an estimate of marked bytes is
needed.The implementation could keep a record of the byte numbers of all
the boundaries between packets in flight (including control packets),
and recalculate s_ave on every ACK. However it would be simpler to
merely maintain a counter packets_in_flight for the number of packets
in flight (including control packets), which is reset once per RTT.
Either way, it would estimate s_ave as:s_ave ~= flightsize / packets_in_flight,where flightsize is the variable that TCP already maintains
for the number of bytes in flight. To avoid floating point arithmetic,
it could right-bit-shift by lg(packets_in_flight), where lg() means
log base 2.An alternative would be to maintain an exponentially weighted
moving average (EWMA) of the segment size:s_ave = a * s + (1-a) * s_ave,where a is the decay constant for the EWMA. However, then it
is necessary to choose a good value for this constant, which ought to
depend on the number of packets in flight. Also the decay constant
needs to be power of two to avoid floating point arithmetic. requires a Data Receiver to
beacon a full-length AccECN Option at least 3 times per RTT. This
could be implemented by maintaining a variable to store the number of
ACKs (pure and data ACKs) since a full AccECN Option was last sent and
another for the approximate number of ACKs sent in the last round trip
time:For optimized integer arithmetic, BEACON_FREQ = 4 could be used,
rather than 3, so that the division could be implemented as an integer
right bit-shift by lg(BEACON_FREQ).In certain operating systems, it might be too complex to maintain
acks_in_round. In others it might be possible by tagging each data
segment in the retransmit buffer with the number of ACKs sent at the
point that segment was sent. This would not work well if the Data
Receiver was not sending data itself, in which case it might be
necessary to beacon based on time instead, as follows:This time-based approach does not work well when all the ACKs are
sent early in each round trip, as is the case during slow-start. In
this case few options will be sent (evtl. even less than 3 per RTT).
However, when continuously sending data, data packets as well as ACKs
will spread out equally over the RTT and sufficient ACKs with the
AccECN option will be sent.A Data Sender in AccECN mode can infer the amount of TCP payload
data arriving at the receiver marked Not-ECT from the difference
between the amount of newly ACKed data and the sum of the bytes with
the other three markings, d.ceb, d.e0b and d.e1b. Note that, because
r.e0b is initialized to 1 and the other two counters are initialized
to 0, the initial sum will be 1, which matches the initial offset of
the TCP sequence number on completion of the 3WHS.For this approach to be precise, it has to be assumed that spurious
(unnecessary) retransmissions do not lead to double counting. This
assumption is currently correct, given that RFC 3168 requires that the
Data Sender marks retransmitted segments as Not-ECT. However, the
converse is not true; necessary retransmissions will result in
under-counting.However, such precision is unlikely to be necessary. The only known
use of a count of Not-ECT marked bytes is to test whether equipment on
the path is clearing the ECN field (perhaps due to an out-dated
attempt to clear, or bleach, what used to be the ToS field). To detect
bleaching it will be sufficient to detect whether nearly all bytes
arrive marked as Not-ECT. Therefore there should be no need to keep
track of the details of retransmissions.AccECN uses a rather unorthodox approach to negotiate the highest
version TCP ECN feedback scheme that both ends support, as justified
below. It follows from the original TCP ECN capability negotiation
, in which the client set the 2 least
significant of the original reserved flags in the TCP header, and fell
back to no ECN support if the server responded with the 2 flags
cleared, which had previously been the default.ECN originally used header flags rather than a TCP option because
it was considered more efficient to use a header flag for 1 bit of
feedback per ACK, and this bit could be overloaded to indicate support
for ECN during the handshake. During the development of ECN, 1 bit
crept up to 2, in order to deliver the feedback reliably and to work
round some broken hosts that reflected the reserved flags during the
handshake.In order to be backward compatible with RFC 3168, AccECN continues
this approach, using the 3rd least significant TCP header flag that
had previously been allocated for the ECN nonce (now historic). Then,
whatever form of server an AccECN client encounters, the connection
can fall back to the highest version of feedback protocol that both
ends support, as explained in .If AccECN had used the more orthodox approach of a TCP option, it
would still have had to set the two ECN flags in the main TCP header,
in order to be able to fall back to Classic RFC 3168 ECN, or to
disable ECN support, without another round of negotiation. Then AccECN
would also have had to handle all the different ways that servers
currently respond to settings of the ECN flags in the main TCP header,
including all the conflicting cases where a server might have said it
supported one approach in the flags and another approach in the new
TCP option. And AccECN would have had to deal with all the additional
possibilities where a middlebox might have mangled the ECN flags, or
removed the TCP option. Thus, usage of the 3rd reserved TCP header
flag simplified the protocol.The third flag was used in a way that could be distinguished from
the ECN nonce, in case any nonce deployment was encountered. Previous
usage of this flag for the ECN nonce was integrated into the original
ECN negotiation. This further justified the 3rd flag's use for AccECN,
because a non-ECN usage of this flag would have had to use it as a
separate single bit, rather than in combination with the other 2 ECN
flags.Indeed, having overloaded the original uses of these three flags
for its handshake, AccECN overloads all three bits again as a 3-bit
counter.Of the 8 possible codepoints that the 3 TCP header flags can
indicate on the SYN/ACK, 4 already indicated earlier (or broken)
versions of ECN support. In the early design of AccECN, an AccECN
server could use only 2 of the 4 remaining codepoints. They both
indicated AccECN support, but one fed back that the SYN had arrived
marked as CE. Even though ECN support on a SYN is not yet on the
standards track, the idea is for either end to act as a dumb
reflector, so that future capabilities can be unilaterally deployed
without requiring 2-ended deployment (justified in ).During traversal testing it was discovered that the ECN field in
the SYN was mangled on a non-negligible proportion of paths. Therefore
it was necessary to allow the SYN/ACK to feed all four IP/ECN
codepoints that the SYN could arrive with back to the client. Without
this, the client could not know whether to disable ECN for the
connection due to mangling of the IP/ECN field (also explained in
). This development consumed the
remaining 2 codepoints on the SYN/ACK that had been reserved for
future use by AccECN in earlier versions.Despite availability of usable TCP header space being extremely
scarce, the AccECN protocol has taken all possible steps to ensure
that there is space to negotiate possible future variants of the
protocol, either if the experiment proves that a variant of AccECN is
required, or if a completely different ECN feedback approach is
needed:When the AccECN capability
is negotiated during TCP's 3WHS, the rows in tagged as 'Nonce' and 'Broken'
in the column for the capability of node B are unused by any
current protocol in the RFC series. These could be used by TCP
servers in future to indicate a variant of the AccECN protocol. In
recent measurement studies in which the response of large numbers
of servers to an AccECN SYN has been tested, e.g. , a very small number of SYN/ACKs arrive
with the pattern tagged as 'Nonce', and a small but more
significant number arrive with the pattern tagged as 'Broken'. The
'Nonce' pattern could be a sign that a few servers have
implemented the ECN Nonce , which has now
been reclassified as historic , or it
could be the random result of some unknown middlebox behaviour.
The greater prevalence of the 'Broken' pattern suggests that some
instances still exist of the broken code that reflects the
reserved flags on the SYN.The requirement
not to reject unexpected initial values of the ACE counter (in the
main TCP header) in the last para of ensures that 3 unused
codepoints on the ACK of the SYN/ACK, 6 unused values on the first
SYN=0 data packet from the client and 7 unused values on the first
SYN=0 data packet from the server could be used to declare future
variants of the AccECN protocol. The word 'declare' is used rather
than 'negotiate' because, at this late stage in the 3WHS, it would
be too late for a negotiation between the endpoints to be
completed. A similar requirement not to reject unexpected initial
values in the TCP option ()
is for the same purpose. If traversal of the TCP option were
reliable, this would have enabled a far wider range of future
variation of the whole AccECN protocol. Nonetheless, it could be
used to reliably negotiate a wide range of variation in the
semantics of the AccECN Option.Five codepoints out of
the 8 possible in the 3 TCP header flags used by AccECN are unused
on the initial SYN (in the order AE,CWR,ECE): 001, 010, 100, 101,
110. ensures that the
installed base of AccECN servers will all assume these are
equivalent to AccECN negotiation with 111 on the SYN. These
codepoints would not allow fall-back to Classic ECN support for a
server that did not understand them, but this approach ensures
they are available in future, perhaps for uses other than ECN
alongside the AccECN scheme. All possible combinations of SYN/ACK
could be used in response except either 000 or reflection of the
same values sent on the SYN. Of course,
other ways could be resorted to in order to extend AccECN or ECN
in future, although their traversal properties are likely to be
inferior. They include a new TCP option; using the remaining
reserved flags in the main TCP header (preferably extending the
3-bit combinations used by AccECN to 4-bit combinations, rather
than burning one bit for just one state); a non-zero urgent
pointer in combination with the URG flag cleared; or some other
unexpected combination of fields yet to be invented.