ECN++: Adding Explicit Congestion Notification (ECN)
to TCP Control Packets
Universidad Carlos III de
Madrid
Av. Universidad 30
Leganes
Madrid
28911
SPAIN
34 91 6249500
marcelo@it.uc3m.es
http://www.it.uc3m.es
Independent
UK
ietf@bobbriscoe.net
http://bobbriscoe.net/
This document describes an experimental modification to ECN when used
with TCP. It allows the use of ECN on the following TCP packets: SYNs,
pure ACKs, Window probes, FINs, RSTs and retransmissions.
RFC 3168 specifies support of Explicit
Congestion Notification (ECN) in IP (v4 and v6). By using the ECN
capability, network elements (e.g. routers, switches) performing Active
Queue Management (AQM) can use ECN marks instead of packet drops to
signal congestion to the endpoints of a communication. This results in
lower packet loss and increased performance. RFC 3168 also specifies
support for ECN in TCP, but solely on data packets. For various reasons
it precludes the use of ECN on TCP control packets (TCP SYN, TCP
SYN-ACK, pure ACKs, Window probes) and on retransmitted packets. RFC
3168 is silent about the use of ECN on RST and FIN packets. RFC 5562
is an experimental modification to ECN that
enables ECN support for TCP SYN-ACK packets.
This document defines an experimental modification to ECN that shall be called ECN++. It enables ECN support on
all the aforementioned types of TCP packet. RFC 5562 (which was called
ECN+) is obsoleted by the present specification, because it has the same
goal of enabling ECT, but on only one type of control packet. The
mechanisms proposed in this document have been defined conservatively
and with safety in mind, possibly in some cases at the expense of
performance.
ECN++ uses a sender-only deployment model. It works whether the two
ends of the TCP connection use classic ECN feedback or experimental
Accurate ECN feedback (AccECN), the two ECN feedback mechanisms
for TCP being standardized at the time of writing.
Using ECN on initial SYN packets provides significant benefits, as we
describe in the next subsection. However, only AccECN provides a way to
feed back whether the SYN was CE marked, and RFC 3168 does not.
Therefore, implementers of ECN++ are RECOMMENDED to also implement
AccECN. Conversely, if AccECN (or an equivalent safety mechanism) is not
implemented with ECN++, this specification rules out ECN on the SYN.
ECN++ is designed for compatibility with a number of latency
improvements to TCP such as TCP Fast Open (TFO), initial window of 10
SMSS (IW10) and Low Latency Low Loss Scalable Transport (L4S), but they
can all be implemented and deployed independently. RFC 8311 is a
standards track procedural device that relaxes requirements in RFC 3168
and other standards track RFCs that would otherwise preclude the
experimental modifications needed for ECN++ and other ECN
experiments.
The absence of ECN support on TCP control packets and
retransmissions has a potential harmful effect. In any ECN deployment,
non-ECN-capable packets suffer a penalty when they traverse a
congested bottleneck. For instance, with a drop probability of 1%, 1%
of connection attempts suffer a timeout of about 1 second before the
SYN is retransmitted, which is highly detrimental to the performance
of short flows. TCP control packets, particularly TCP SYNs and
SYN-ACKs, are important for performance, so dropping them is best
avoided.
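As a rough illustration of the penalty described above, the following sketch estimates the mean extra connection set-up delay caused by SYN loss. It assumes independent losses, a 1 second initial retransmission timeout that doubles on each retry, and a cap on the number of retries; the function name and these parameters are illustrative assumptions, not part of this specification.

```python
def expected_syn_delay(drop_prob: float, initial_rto: float = 1.0,
                       max_retries: int = 5) -> float:
    """Mean extra delay (seconds) before a SYN gets through.

    Assumes each SYN is dropped independently with probability drop_prob
    and that the retransmission timer doubles after every timeout.
    Connections that fail after max_retries losses are not counted.
    """
    delay = 0.0
    waited = 0.0
    rto = initial_rto
    for lost in range(1, max_retries + 1):
        waited += rto  # cumulative time spent waiting for timers so far
        # Probability that exactly `lost` SYNs were dropped and the next got through
        p = (drop_prob ** lost) * (1.0 - drop_prob)
        delay += p * waited
        rto *= 2.0
    return delay
```

With a 1% drop probability the mean penalty is small, but it is concentrated in the roughly 1% of connections that each wait a second or more, which is what hurts short flows.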
Not using ECN on control packets can be particularly detrimental to
performance in environments where the ECN marking level is high. For
example, it has been shown that in a controlled private
data centre (DC) environment where ECN is used (in conjunction with
DCTCP), the probability of being able to
establish a new connection using a non-ECN SYN packet drops to close
to zero even when there are only 16 ongoing TCP flows transmitting at
full speed. The issue is that DCTCP exhibits a much more aggressive
response to packet marking (which is why it is only applicable in
controlled environments). This leads to a high marking probability for
ECN-capable packets, and in turn a high drop probability for non-ECN
packets. Therefore non-ECN SYNs are dropped aggressively, rendering it
nearly impossible to establish a new connection in the presence of
even mild traffic load.
Finally, there are ongoing experimental efforts to promote the
adoption of a slightly modified variant of DCTCP (and similar
congestion controls) over the Internet to achieve low latency, low
loss and scalable throughput (L4S) for all communications. In such an
approach, L4S packets identify themselves using an ECN codepoint. With
L4S, preventing TCP control
packets from obtaining the benefits of ECN would not only expose them
to the prevailing level of congestion loss, but it would also classify
them into a different queue. Then only L4S data packets would be
classified into the L4S queue that is expected to have lower latency,
while the packets controlling and retransmitting these data packets
would still get stuck behind the queue induced by non-L4S-enabled TCP
traffic.
The goal of the experimental modifications defined in this document
is to allow the use of ECN on all TCP packets. Experiments are
expected in the public Internet as well as in controlled environments
to understand the following issues:
*  How SYNs, Window probes, pure ACKs, FINs, RSTs and
   retransmissions that carry the ECT(0), ECT(1) or CE codepoints are
   processed by the TCP endpoints and the network (including routers,
   firewalls and other middleboxes). In particular we would like to
   learn if these packets are frequently blocked or if these packets
   are usually forwarded and processed.
*  The scale of deployment of the different flavours of ECN.
*  How much the performance of TCP communications is improved by
   allowing ECN marking of each packet type.
*  To identify any issues (including security issues) raised by
   enabling ECN marking of these packets.
*  To conduct the specific experiments identified in the text by
   the strings "EXPERIMENTATION NEEDED" or "MEASUREMENTS NEEDED".
The data gathered through the experiments described in this
document, particularly under the first 2 bullets above, will help in
the redesign of the final mechanism (if needed) for adding ECN support
to the different packet types considered in this document.
Success criteria: The experiment will be a success if we obtain
enough data to have a clearer view of the deployability and benefits
of enabling ECN on all TCP packets, as well as any issues. If the
results of the experiment show that it is feasible to deploy such
changes; that there are gains to be achieved through the changes
described in this specification; and that no other major issues may
interfere with the deployment of the proposed changes; then it would
be reasonable to adopt the proposed changes in a standards track
specification that would update RFC 3168.
The remainder of this document is structured as follows. We first
present the terminology used in the rest of the document. We then
specify the modifications to
provide ECN support to TCP SYNs, pure ACKs, Window probes, FINs, RSTs
and retransmissions. We describe both the network behaviour and the
endpoint behaviour. A later section discusses
variations of the specification that will be necessary to interwork
with a number of popular variants or derivatives of TCP. RFC 3168
provides a number of specific reasons why ECN support is not
appropriate for each packet type. Finally, we
revisit each of these arguments for each packet type to justify why it
is reasonable to conduct this experiment.
The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL in this
document are to be interpreted as described in BCP 14 when, and only
when, they appear in all capitals.
Pure ACK: A TCP segment with the ACK flag set and no data
payload.
SYN: A TCP segment with the SYN (synchronize) flag set.
Window probe: A window probe is
a TCP segment with only one byte of data, sent to learn if the receive
window is still zero.
FIN: A TCP segment with the FIN (finish) flag set.
RST: A TCP segment with the RST (reset) flag set.
Retransmission: A TCP segment that has been retransmitted by the TCP
sender.
TCP client: The initiating end of a TCP connection. Also called the
initiator.
TCP server: The responding end of a TCP connection. Also called the
responder.
ECT: ECN-Capable Transport. One of the two codepoints ECT(0) or
ECT(1) in the ECN field of the IP header (v4 or
v6). An ECN-capable sender sets one of these to indicate that both
transport end-points support ECN. When this specification says the
sender sets an ECT codepoint, by default it means ECT(0). Optionally, it
could mean ECT(1), which is in the process of being redefined for use by
L4S experiments.
Not-ECT: The ECN codepoint set by senders that indicates that the
transport is not ECN-capable.
CE: Congestion Experienced. The ECN codepoint that an intermediate
node sets to indicate congestion. A node sets
an increasing proportion of ECT packets to CE as the level of congestion
increases.
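For readers implementing or measuring the codepoints defined above, the following minimal sketch shows how the two-bit ECN field sits in the low-order bits of the IPv4 TOS or IPv6 Traffic Class octet, using the values assigned in RFC 3168. The class and helper names are illustrative assumptions.

```python
from enum import IntEnum


class ECN(IntEnum):
    """Two-bit ECN field values assigned by RFC 3168."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def ecn_from_tos(tos: int) -> ECN:
    """Extract the ECN field from the IPv4 TOS / IPv6 Traffic Class octet."""
    return ECN(tos & 0b11)


def with_ecn(tos: int, ecn: ECN) -> int:
    """Return the octet with its ECN field replaced and its DSCP bits untouched."""
    return (tos & 0xFC) | int(ecn)
```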
The experimental ECN++ changes to the specification of TCP over ECN
defined here primarily alter the behaviour of
the sending host for each half-connection. However, there are
subsections for forwarding elements and receivers below, which recommend
that they accept the new packets (they should do so already, but might
not). This will allow implementers to check the receive-side code while
they are altering the send-side code. All changes can be deployed at
each end-point independently of others and independently of any network
behaviour.
The feedback behaviour at the receiver depends on whether classic ECN
TCP feedback or Accurate ECN (AccECN) TCP
feedback has been
negotiated. Nonetheless, neither receiver feedback behaviour is altered
by the present specification.
Previously the specification of ECN for TCP required the sender to
set not-ECT on TCP control
packets and retransmissions. Some readers of RFC 3168 might have
erroneously interpreted this as a requirement for firewalls, intrusion
detection systems, etc. to check and enforce this behaviour. Section
4.3 of RFC 8311 updates RFC 3168 to remove this
ambiguity. It requires firewalls or any intermediate nodes not to
treat certain types of ECN-capable TCP segment differently (except
potentially in one attack scenario). This is likely to only involve a
firewall rule change in a fraction of cases (at most 0.4% of paths
according to reported tests).
In case a TCP sender encounters a middlebox blocking ECT on certain
TCP segments, the specification below includes behaviour to fall back
to non-ECN. However, this loses the benefit of ECN on control packets.
So operators are RECOMMENDED to alter their firewall rules to comply
with the requirement referred to above (Section 4.3 of RFC 8311).
For each type of control packet or retransmission, the following
sections detail changes to the sender's behaviour in two respects: i)
whether it sets ECT; and ii) its response to congestion feedback.
The table below summarises these two behaviours
for each type of packet, but the relevant subsection below should be
referred to for the detailed behaviour. The subsection on the SYN is
more complex than the others, because it has to include fall-back
behaviour if the ECT packet appears not to have got through, and
caching of the outcome to detect persistent failures.
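+----------+--------------+--------------+-----------------------------+
| TCP      | ECN field if | ECN field if | Congestion Response         |
| packet   | AccECN f/b   | RFC3168 f/b  |                             |
| type     | negotiated*  | negotiated*  |                             |
+----------+--------------+--------------+-----------------------------+
| SYN      | ECT          | not-ECT      | If AccECN, reduce IW        |
| SYN-ACK  | ECT          | ECT          | Reduce IW                   |
| Pure ACK | ECT          | not-ECT      | If AccECN, usual cwnd       |
|          |              |              | response and optionally     |
|          |              |              | pure ACK rate regulation    |
| W Probe  | ECT          | ECT          | Usual cwnd response         |
| FIN      | ECT          | ECT          | None or optionally pure ACK |
|          |              |              | rate regulation             |
| RST      | ECT          | ECT          | N/A                         |
| Re-XMT   | ECT          | ECT          | Usual cwnd response         |
+----------+--------------+--------------+-----------------------------+

Window probe and retransmission are abbreviated to W Probe and Re-XMT.
* For a SYN, "negotiated" means "requested".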
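The per-packet-type rules summarised above can be sketched as a simple lookup. This is only an illustration of the summary, not a substitute for the detailed subsections; the dictionary and function names are hypothetical.

```python
# Packet type -> (ECN field with AccECN feedback, ECN field with RFC 3168 feedback).
# For a SYN, "with AccECN feedback" means the SYN is requesting AccECN.
ECN_FIELD_BY_PACKET_TYPE = {
    "SYN":      ("ECT", "not-ECT"),
    "SYN-ACK":  ("ECT", "ECT"),
    "Pure ACK": ("ECT", "not-ECT"),
    "W Probe":  ("ECT", "ECT"),
    "FIN":      ("ECT", "ECT"),
    "RST":      ("ECT", "ECT"),
    "Re-XMT":   ("ECT", "ECT"),
}


def ecn_field_for(packet_type: str, accecn: bool) -> str:
    """Return the ECN field an ECN++ sender sets on the given packet type."""
    with_accecn, with_rfc3168 = ECN_FIELD_BY_PACKET_TYPE[packet_type]
    return with_accecn if accecn else with_rfc3168
```

Only the SYN and the pure ACK depend on which feedback mechanism is in use; every other packet type is sent as ECT in both cases.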
It can be seen that we recommend against the sender setting ECT on
the SYN if it is not requesting AccECN feedback. Therefore it is
RECOMMENDED that the experimental AccECN specification is implemented, along with the
ECN++ experiment, because it is expected that ECT on the SYN will give
the most significant performance gain, particularly for short
flows.
Nonetheless, this specification also caters for the case where an
ECN++ TCP sender is not using AccECN. This could be because it does
not support AccECN or because the other end of the TCP connection does
not (AccECN can only be used for a connection if both ends support
it).
With classic ECN feedback, the SYN was
not expected to be ECN-capable, so the flag provided to feed back
congestion was put to another use (it is used in combination with
other flags to indicate that the responder supports ECN). In
contrast, Accurate ECN (AccECN) feedback provides a codepoint in the
SYN-ACK for the responder to feed back whether the SYN arrived
marked CE. Therefore the setting of the IP/ECN field on the SYN is
specified separately for each case in the following two
subsections.
For the ECN++ experiment, if the SYN is requesting AccECN
feedback, the TCP sender will also set ECT on the SYN. It can
ignore the prohibition in section 6.1.1 of RFC 3168 against
setting ECT on such a SYN, as per Section 4.3 of RFC 8311.
If the SYN sent by a TCP initiator does not attempt to
negotiate Accurate ECN feedback, or does not use an equivalent
safety mechanism, it MUST still comply with RFC 3168, which says
that a TCP initiator "MUST NOT set ECT on a SYN".
The only envisaged examples of "equivalent safety mechanisms"
are: a) some future TCP ECN feedback protocol, perhaps evolved
from AccECN, that feeds back CE marking on a SYN; b) setting the
initial window to 1 SMSS. IW=1 is NOT RECOMMENDED because it
could degrade performance, but might be appropriate for certain
lightweight TCP implementations.
See the rationale section of this document for discussion.
If the TCP initiator does not set ECT on the SYN, the rest of
this section on SYNs does not apply.
This subsection only applies if the ECN++ TCP client sets ECT
on the SYN and supports AccECN.
Until AccECN servers become widely deployed, a TCP initiator
that sets ECT on a SYN (which typically implies the same SYN also
requests AccECN, as above) SHOULD also maintain a cache entry per
server to record servers that it is not worth sending an ECT SYN
to, e.g. because they do not support AccECN and therefore have no
logic for congestion markings on the SYN. Mobile hosts MAY
maintain a cache entry per access network to record 'non-ECT SYN'
entries against proxies. This cache can be
implemented as part of the shared state across multiple TCP
connections.
Subsequently the initiator will not set ECT on a SYN to such a
server or proxy, but it can still always request AccECN support
(because the response will state any earlier stage of ECN
evolution that the server supports with no performance penalty).
If a server subsequently upgrades to support AccECN, the initiator
will discover this as soon as it next connects, then it can remove
the server from its cache and subsequently always set ECT for that
server.
The client can limit the size of its cache of 'non-ECT SYN'
servers. Then, while AccECN is not widely deployed, it will only
cache the 'non-ECT SYN' servers that are most used and most
recently used by the client. As the client accesses servers that
have been expelled from its cache, it will simply use ECT on the
SYN by default.
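The caching behaviour described above could be sketched as a small bounded cache with least-recently-used eviction. This is a minimal sketch under stated assumptions: the class name, method names and default size are hypothetical, and a real implementation would key entries on whatever identifies a server or proxy in its shared TCP state.

```python
from collections import OrderedDict


class NonEctSynCache:
    """Bounded record of servers to which an ECT SYN is not worth sending."""

    def __init__(self, max_entries: int = 256):
        self._entries = OrderedDict()  # server -> True, ordered by recency of use
        self._max = max_entries

    def record_no_accecn(self, server: str) -> None:
        """The server's response showed it does not support AccECN."""
        self._entries[server] = True
        self._entries.move_to_end(server)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # expel the least recently used

    def record_accecn(self, server: str) -> None:
        """The server now supports AccECN, so forget any earlier entry."""
        self._entries.pop(server, None)

    def use_ect_on_syn(self, server: str) -> bool:
        """Optimistic default: set ECT unless the server is cached."""
        if server in self._entries:
            self._entries.move_to_end(server)  # mark as recently used
            return False
        return True
```

Because the client still requests AccECN on every SYN, a cached server that upgrades is discovered on the next connection, at which point `record_accecn` removes it.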
Servers that do not support ECN as a whole do not need to be
recorded separately from non-support of AccECN because the
response to a request for AccECN immediately states which stage in
the evolution of ECN the server supports (AccECN, classic ECN or no ECN).
The above strategy is named "optimistic ECT and cache
failures". It is believed to be sufficient based on three
measurement studies and their associated assumptions. However, two
other strategies are possible, and the choice between them depends on
the implementer's goals and the deployment prevalence of ECN variants
in the network and on servers, not to mention the prevalence of
some significant bugs.
If the initiator times out without seeing a SYN-ACK, it will
separately cache this fact (see the fall-back behaviour below for
details).
As explained above, this subsection only applies if the ECN++
TCP client sets ECT on the initial SYN.
If the SYN-ACK returned to the TCP initiator confirms that the
server supports AccECN, it will also be able to indicate whether
or not the SYN was CE-marked. If the SYN was CE-marked, and if the
initial window is greater than 1 MSS, then the initiator MUST
reduce its Initial Window (IW) and SHOULD reduce it to 1 SMSS
(sender maximum segment size). The rationale is the same as that
for the response to CE on a SYN-ACK.
If the initiator has set ECT on the SYN and if the SYN-ACK
shows that the server does not support feedback of a CE on the SYN
(e.g. it does not support AccECN) and if the initial congestion
window of the initiator is greater than 1 MSS, then the TCP
initiator MUST conservatively reduce its Initial Window and SHOULD
reduce it to 1 SMSS. A reduction to greater than 1 SMSS MAY be
appropriate.
Conservatism is necessary because the SYN-ACK cannot show whether
the SYN was CE-marked.
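The two cases above can be sketched as a single decision for a client that set ECT on its SYN. This is an illustrative sketch only: the function and parameter names are assumptions, and it always takes the SHOULD option of reducing to 1 SMSS, whereas the text permits a smaller reduction.

```python
def initial_window_after_syn_ack(iw_segments: int,
                                 server_supports_accecn: bool,
                                 syn_ce_marked: bool) -> int:
    """Initial window (in SMSS) for a client that sent an ECT SYN."""
    if iw_segments <= 1:
        return iw_segments  # nothing to reduce
    if not server_supports_accecn:
        # The SYN-ACK cannot show whether the SYN was CE-marked,
        # so the client has to be conservative.
        return 1
    if syn_ce_marked:
        return 1  # explicit congestion feedback on the SYN
    return iw_segments
```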
If the TCP initiator (host A) receives a SYN from the remote
end (host B) after it has sent a SYN to B, it indicates the
(unusual) case of a simultaneous open. Host A will respond with a
SYN-ACK. Host A will probably then receive a SYN-ACK in response
to its own SYN, after which it can follow the appropriate one of
the two paragraphs above.
In all the above cases, the initiator does not have to back off
its retransmission timer as it would in response to a timeout
following no response to its SYN, because
both the SYN and the SYN-ACK have been successfully delivered
through the network. Also, the initiator does not need to exit
slow start or reduce ssthresh, which is not even required when a
SYN is lost.
If an initial window of more than 3 segments is implemented
(e.g. IW10), additional recommendations apply.
As explained above, this subsection only applies if the ECN++
TCP client also sets ECT on the initial SYN.
An ECT SYN might be lost due to an over-zealous path element
(or server) blocking ECT packets that do not conform to RFC 3168.
Some evidence of this was found in a 2014 study, but in a more
recent study using 2017 data,
extensive measurements found no case
where ECT on TCP control packets was treated any differently from
ECT on TCP data packets. Loss is commonplace for numerous other
reasons, e.g. congestion loss at a non-ECN queue on the forward or
reverse path, transmission errors, etc. Alternatively, the cause
of the loss might be the associated attempt to negotiate AccECN,
or possibly other unrelated options on the SYN.
Therefore, if the timer expires after the TCP initiator has
sent the first ECT SYN, it SHOULD make one more attempt to
retransmit the SYN with ECT set (backing off the timer as usual).
If the retransmission timer expires again, it SHOULD retransmit
the SYN with the not-ECT codepoint in the IP header, to expedite
connection set-up. If other experimental fields or options were on
the SYN, it will also be necessary to follow their specifications
for fall-back too. It would make sense to coordinate all the
strategies for fall-back in order to isolate the specific cause of
the problem.
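The SYN fall-back sequence described above can be sketched as a schedule of attempts. The 1 second initial timeout, the doubling back-off and the function name are assumptions made for illustration; only the ECT, ECT, not-ECT ordering comes from the text.

```python
def syn_fallback_schedule(max_attempts: int = 4, initial_rto: float = 1.0):
    """Return (attempt, ecn_field, rto_seconds) for each SYN transmission.

    Attempt 0 is the initial SYN. ECT is kept for the initial SYN and the
    first retransmission; later retransmissions fall back to not-ECT.
    """
    schedule = []
    rto = initial_rto
    for attempt in range(max_attempts):
        ecn_field = "ECT" if attempt < 2 else "not-ECT"
        schedule.append((attempt, ecn_field, rto))
        rto *= 2  # back off the timer as usual
    return schedule
```

The same shape applies to the SYN-ACK fall-back at the responder, which also retries once with ECT before falling back to not-ECT.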
If the TCP initiator is caching failed connection attempts, it
SHOULD NOT give up using ECT on the first SYN of subsequent
connection attempts until it is clear that a blockage persistently
and specifically affects ECT on SYNs. This is because loss is so
commonplace for other reasons. Even if it does eventually decide
to give up setting ECT on the SYN, it will probably not need to
give up on AccECN on the SYN. In any case, if a cache is used, it
SHOULD be arranged to expire so that the initiator will
infrequently attempt to check whether the problem has been
resolved.
Other fall-back strategies MAY be adopted where applicable.
For the ECN++ experiment, the TCP implementation will set ECT
on SYN-ACKs. It can ignore the requirement in section 6.1.1 of RFC
3168 to set not-ECT on a SYN-ACK, as per Section 4.3 of RFC 8311.
A host that sets ECT on SYN-ACKs MUST reduce its initial window
in response to any congestion feedback, whether using classic ECN
or AccECN. It
SHOULD reduce it to 1 SMSS. This is different to the behaviour
specified in an earlier experiment that set ECT on the SYN-ACK
(RFC 5562). The difference is justified in the rationale section of
this document.
The responder does not have to back off its retransmission
timer because the ECN feedback proves that the network is
delivering packets successfully and is not severely overloaded.
Also the responder does not have to leave slow start or reduce
ssthresh, which is not even required when a SYN-ACK has been
lost.
The congestion response to CE-marking on a SYN-ACK for a server
that implements either the TCP Fast Open experiment (TFO) or
experimentation with an initial window of
more than 3 segments (e.g. IW10) is
discussed separately.
After the responder sends a SYN-ACK with ECT set, if its
retransmission timer expires it SHOULD retransmit one more SYN-ACK
with ECT set (and back-off its timer as usual). If the timer
expires again, it SHOULD retransmit the SYN-ACK with not-ECT in
the IP header. If other experimental fields or options were on the
initial SYN-ACK, it will also be necessary to follow their
specifications for fall-back. It would make sense to co-ordinate
all the strategies for fall-back in order to isolate the specific
cause of the problem.
This fall-back strategy attempts to use ECT one more time than
the strategy for ECT SYN-ACKs in RFC 5562 (which
is made obsolete, being superseded by the present specification).
Other fall-back strategies MAY be adopted if found to be more
effective, e.g. fall-back to not-ECT on the first retransmission
attempt.
The server MAY cache failed connection attempts, e.g. per
client access network. A client-based alternative to caching at
the server is also possible. If the TCP server is
caching failed connection attempts, it SHOULD NOT give up using
ECT on the first SYN-ACK of subsequent connection attempts until
it is clear that the blockage persistently and specifically
affects ECT on SYN-ACKs. This is because loss is so commonplace
for other reasons.
If a cache is used, it SHOULD be arranged to expire so that the
server will infrequently attempt to check whether the problem has
been resolved.
A Pure ACK is an ACK packet that does not carry data, which
includes the Pure ACK at the end of TCP's 3-way handshake.
For the ECN++ experiment, whether a TCP implementation sets ECT
on a Pure ACK depends on whether or not Accurate ECN TCP feedback
has been successfully
negotiated for a particular TCP connection, as specified in the
following two subsections.
If AccECN has not been successfully negotiated for a
connection, ECT MUST NOT be set on Pure ACKs by either end.
For the ECN++ experiment, if AccECN has been successfully
negotiated, either end of the connection will set ECT on Pure
ACKs. They can ignore the requirement in section 6.1.4 of RFC 3168
to set not-ECT on a pure ACK, as per Section 4.3 of RFC 8311.
MEASUREMENTS NEEDED: Measurements are needed to learn how
the deployed base of network elements and RFC 3168 servers
react to pure ACKs marked with the ECT(0)/ECT(1)/CE
codepoints, i.e. whether they are dropped, codepoint cleared
or processed and the congestion indication fed back on a
subsequent packet.
See the receiver behaviour below for the implications if
a host receives a CE-marked Pure ACK.
As explained above, this subsection only applies if AccECN
has been successfully negotiated for the TCP connection.
A host that sets ECT on pure ACKs SHOULD respond to the
congestion signal resulting from pure ACKs being marked with the
CE codepoint. The specific response will need to be defined as
an update to each congestion control specification. Possible
responses to congestion feedback include reducing the congestion
window (CWND) and/or regulating the pure ACK rate.
Note that, in comparison, TCP Congestion Control does not require a TCP to detect or respond
to loss of pure ACKs at all; it requires no reduction in
congestion window or ACK rate.
For the ECN++ experiment, the TCP sender will set ECT on window
probes. It can ignore the prohibition in section 6.1.6 of RFC 3168
against setting ECT on a window probe, as per Section 4.3 of RFC 8311.
A window probe contains a single octet, so it is no different
from a regular TCP data segment. Therefore a TCP receiver will feed
back any CE marking on a window probe as normal (either using
classic ECN feedback or AccECN feedback). The sender of the probe
will then reduce its congestion window as normal.
A receive window of zero indicates that the application is not
consuming data fast enough and does not imply anything about network
congestion. Once the receive window opens, the congestion window
might become the limiting factor, so it is correct that CE-marked
probes reduce the congestion window. This complements cwnd
validation, which reduces cwnd as more time
elapses without having used available capacity. However, CE-marking
on window probes does not reduce the rate of the probes themselves.
This is unlikely to present a problem, given the duration between
window probes doubles as long as the
receiver is advertising a zero window (currently minimum 1 second,
maximum at least 1 minute).
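To illustrate why the unreduced probe rate is unlikely to matter, the following sketch lists the gaps between successive window probes under the doubling rule above. The 60 second cap is an assumption chosen for the example (the text only says the maximum is at least 1 minute), and the function name is hypothetical.

```python
def probe_intervals(count: int, minimum: float = 1.0, maximum: float = 60.0):
    """Return the gaps (seconds) between successive window probes."""
    intervals = []
    interval = minimum
    for _ in range(count):
        intervals.append(interval)
        interval = min(interval * 2, maximum)  # double, up to the cap
    return intervals
```

Under these assumptions, eight probes are spread over roughly three minutes, so the probes themselves contribute negligible load.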
MEASUREMENTS NEEDED: Measurements are needed to learn how the
deployed base of network elements and servers react to Window
probes marked with the ECT(0)/ECT(1)/CE codepoints, i.e. whether
they are dropped, codepoint cleared or processed.
A TCP implementation can set ECT on a FIN.
See the receiver behaviour below for the implications if a
host receives a CE-marked FIN.
A congestion response to a CE-marking on a FIN is not
required.
After sending a FIN, the endpoint will not send any more data in
the connection. Therefore, even if the FIN-ACK indicates that the
FIN was CE-marked (whether using classic or AccECN feedback),
reducing the congestion window will not affect anything.
After sending a FIN, a host might send one or more pure ACKs. If
it is using a technique to
regulate the delayed ACK ratio for pure ACKs, it could equally be
applied after a FIN. But this is not required.
MEASUREMENTS NEEDED: Measurements are needed to learn how the
deployed base of network elements and servers react to FIN
packets marked with the ECT(0)/ECT(1)/CE codepoints, i.e.
whether they are dropped, codepoint cleared or processed.
A TCP implementation can set ECT on a RST.
See the receiver behaviour below for the implications if a
host receives a CE-marked RST.
A congestion response to a CE-marking on a RST is not required
(and actually not possible).
MEASUREMENTS NEEDED: Measurements are needed to learn how the
deployed base of network elements and servers react to RST
packets marked with the ECT(0)/ECT(1)/CE codepoints, i.e.
whether they are dropped, codepoint cleared or processed.
Implementers SHOULD ensure that RST packets (and control packets
generally) are always sent out with the same ECN field regardless of
the TCP state machine. Otherwise the ECN field could reveal internal
TCP state. For instance, the ECN field on a RST ought not to reveal
any distinction between a non-listening port, a recently in-use
port, and a closed session port.
For the ECN++ experiment, the TCP sender will set ECT on
retransmitted segments. It can ignore the prohibition in section
6.1.5 of RFC 3168 against setting ECT on retransmissions, as per
Section 4.3 of RFC 8311.
See the receiver behaviour below for the implications if
a host receives a CE-marked retransmission.
If the TCP sender receives feedback that a retransmitted packet
was CE-marked, it will react as it would to any feedback of
CE-marking on a data packet.
MEASUREMENTS NEEDED: Measurements are needed to learn how the
deployed base of network elements and servers react to
retransmissions marked with the ECT(0)/ECT(1)/CE codepoints,
i.e. whether they are dropped, codepoint cleared or
processed.
Extensive measurements in fixed and mobile networks have found no evidence of blockages due to
ECT being set on any type of TCP control packet.
In case traversal problems arise in future, fall-back measures
have been specified above, but only for the cases where ECT on the
initial packet of a half-connection (SYN or SYN-ACK) is persistently
failing to get through.
Fall-back measures for blockage of ECT on other TCP control
packets MAY be implemented. However they are not specified here
given the lack of any evidence they will be needed.
The present ECN++ specification primarily concerns the behaviour
for sending TCP control packets or retransmissions. Below are a few
changes to the receive side of an implementation that are recommended
while updating its send side. Nonetheless, where deployment is
concerned, ECN++ is still a sender-only deployment, because it does
not depend on receivers complying with any of these
recommendations.
RFC 8311 is a standards track update to RFC 3168 in order to
(amongst other things) "...allow the use of ECT codepoints on SYN
packets, pure acknowledgement packets, window probe packets, and
retransmissions of packets..., provided that the changes from RFC
3168 are documented in an Experimental RFC in the IETF document
stream."
Section 4.3 of RFC 8311 amends every statement in RFC 3168 that
precludes the use of ECT on control packets and retransmissions to
add "unless otherwise specified by an Experimental RFC in the IETF
document stream". The present specification is such an Experimental
RFC. Therefore, in order for the present RFC 8311 experiment to be
useful, TCP receivers will need to satisfy the following
requirements:
Any TCP implementation SHOULD accept receipt of any valid TCP
control packet or retransmission irrespective of its IP/ECN
field. If any existing implementation does not, it SHOULD be
updated to do so.
A TCP implementation taking part in the experiments proposed
here MUST accept receipt of any valid TCP control packet or
retransmission irrespective of its IP/ECN field.
The following sections give further requirements specific
to each type of control packet.
These measures are derived from the robustness principle of "...
be liberal in what you accept from others", not only to ensure
compatibility with the present experimental specification, but also
any future protocol changes that allow ECT on any TCP packet.
RFC 3168 negotiates the use of ECN for the connection end-to-end
using the ECN flags in the TCP header. RFC 3168 originally said that
"A host MUST NOT set ECT on SYN ... packets." but it was silent as
to what a TCP server ought to do if it receives a SYN packet with a
non-zero IP/ECN field anyway.
For the avoidance of doubt, the normative statements for all TCP
control packets above are
interpreted for the specific case when a SYN is received as
follows:
Any TCP server implementation SHOULD accept receipt of a
valid SYN that requests ECN support for the connection,
irrespective of the IP/ECN field of the SYN. If any existing
implementation does not, it SHOULD be updated to do so.
A TCP implementation taking part in the ECN++ experiment MUST
accept receipt of a valid SYN, irrespective of its IP/ECN
field.
If the SYN is CE-marked and the server has no logic to feed
back a CE mark on a SYN-ACK (e.g. it does not support AccECN),
it has to ignore the CE-mark (the client detects this case and
behaves conservatively in mitigation, as specified above).
Rationale: At the time of writing, some implementations of
TCP servers
assume that, if a host receives a SYN with a non-zero IP/ECN field,
it must be due to network mangling, and they disable ECN for the
rest of the connection. A measurement
study run in 2017 found no occurrence of this type of network
mangling. However, a year earlier, when ECN was enabled on
connections from Apple clients, there was a case of a whole network
that re-marked the ECN field of every packet to CE (it was rapidly
fixed).
When ECN was not allowed on SYNs, it made sense to look for a
non-zero ECN field on the SYN to detect this type of network
mangling. But now that ECN is being allowed on a SYN, detection
needs to be more nuanced. A server needs to disable the test on the
SYN alone for AccECN SYNs (which was done for Linux RFC 3168 servers
in 2019) and for RFC 3168 SYNs it
needs to watch for three or four packets all set to CE at the start
of a flow. If such mangling is indeed now so rare, it would also be
preferable to log each case detected and manually report it to the
responsible network, so that the problem will eventually be
eliminated.
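The more nuanced detection described above might be sketched as
follows. This is only an illustration, not part of the specification:
the function name, the string representation of the IP/ECN field and
the default threshold of three packets are assumptions made for the
example, and the sketch assumes that no test at all is applied when the
SYN requested AccECN.

```python
def suspect_blanket_ce_marking(first_ecn_fields, syn_requests_accecn, threshold=3):
    """Return True if the start of a flow looks like blanket CE re-marking.

    first_ecn_fields: IP/ECN codepoints of the first packets received on
    the flow, SYN first, e.g. ["CE", "CE", "CE"] (hypothetical encoding).
    threshold: how many initial packets must all be CE before mangling is
    suspected (the text suggests three or four).
    """
    if syn_requests_accecn:
        # Assumption: the test on the SYN is disabled for AccECN SYNs,
        # and this sketch applies no other server-side test to them.
        return False
    if len(first_ecn_fields) < threshold:
        # A single non-zero field (e.g. on the SYN alone) is not enough.
        return False
    return all(field == "CE" for field in first_ecn_fields[:threshold])
```

A server using such a check would log a positive result for later
reporting, rather than silently disabling ECN on the basis of the SYN
alone.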
For the avoidance of doubt, the normative statements for all TCP
control packets in are
interpreted for the specific case when a Pure ACK is received as
follows:
Any TCP implementation SHOULD accept receipt of a pure ACK
with a non-zero ECN field, despite current RFCs precluding the
sending of such packets.
A TCP implementation taking part in the ECN++ experiment MUST
accept receipt of a pure ACK with a non-zero ECN field.
The question of whether and how the receiver of pure ACKs is
required to feed back any CE marks on them is outside the scope of
the present specification because it is a matter for the relevant
feedback specification ( or ). AccECN feedback is required
to count CE marking of any control packet including pure ACKs.
In contrast, RFC 3168 is silent on this point, so feedback of
CE-markings might be implementation specific (see ).
The TCP data receiver MUST ignore the CE codepoint on incoming
FINs that fail any validity check. The validity check in section 5.2
of is RECOMMENDED.
The "challenge ACK" approach to checking the validity of RSTs
(section 3.2 of ) is RECOMMENDED at the data
receiver.
The TCP data receiver MUST ignore the CE codepoint on incoming
segments that fail any validity check. The validity check in section
5.2 of is RECOMMENDED. This will
effectively mitigate an attack that uses spoofed data packets to
fool the receiver into feeding back spoofed congestion indications
to the sender, which in turn would be fooled into continually
reducing its congestion window.
This section is informative, not normative. It presents
counter-arguments against the justifications in the RFC series for
disabling ECN on TCP control segments and retransmissions. It also gives
rationale for why ECT is safe on control segments that have not, so far,
been mentioned in the RFC series. First it addresses over-arching
arguments used for most packet types, then it addresses the specific
arguments for each packet type in turn.
Section 5.2 of RFC 3168 states:
"To ensure the reliable delivery of the congestion indication
of the CE codepoint, an ECT codepoint MUST NOT be set in a packet
unless the loss of that packet [at a subsequent node] in the
network would be detected by the end nodes and interpreted as an
indication of congestion."
We believe this argument is misplaced. TCP does not deliver most
control packets reliably. So it is more important to allow control
packets to be ECN-capable, which greatly improves reliable delivery of
the control packets themselves (see motivation in ). ECN also improves the reliability
and latency of delivery of any congestion notification on control
packets, particularly because TCP does not detect the loss of most
types of control packet anyway. Both these points outweigh by far the
concern that a CE marking applied to a control packet by one node
might subsequently be dropped by another node.
The principle to determine whether a packet can be ECN-capable
ought to be "do no extra harm", meaning that the reliability of a
congestion signal's delivery ought to be no worse with ECN than
without. In particular, setting the CE codepoint on the very same
packet that would otherwise have been dropped fulfills this criterion,
since either the packet is delivered and the CE signal is delivered to
the endpoint, or the packet is dropped and the original congestion
signal (packet loss) is delivered to the endpoint.
The concern about a CE marking being dropped at a subsequent node
might be motivated by the idea that ECN-marking a packet at the first
node does not remove the packet, so it could go on to worsen
congestion at a subsequent node. However, it is not useful to reason
about congestion by considering single packets. The departure rate
from the first node will generally be the same (fully utilized) with
or without ECN, so this argument does not apply.
RFC 5562 presents two arguments against ECT marking of SYN packets
(quoted verbatim):
"First, when the TCP SYN packet is sent, there are no
guarantees that the other TCP endpoint (node B in Figure 2) is
ECN-Capable, or that it would be able to understand and react if
the ECN CE codepoint was set by a congested router.
Second, the ECN-Capable codepoint in TCP SYN packets could be
misused by malicious clients to "improve" the well-known TCP SYN
attack. By setting an ECN-Capable codepoint in TCP SYN packets, a
malicious host might be able to inject a large number of TCP SYN
packets through a potentially congested ECN-enabled router,
congesting it even further."
The first point actually describes two subtly different
issues. So below three arguments are countered in turn.
This argument certainly applied at the time RFC 5562 was written,
when no ECN responder mechanism had any logic to recognize a CE
marking on a SYN and, even if logic were added, there was no field
in the SYN-ACK to feed it back. The problem was that, during the
3WHS, the flag in the TCP header for ECN feedback (called Echo
Congestion Experienced) had been overloaded to negotiate the use of
ECN itself.
The accurate ECN (AccECN) protocol has since been designed to
solve this problem. Two features are important here:
An AccECN server uses the 3 'ECN' bits in the TCP header of
the SYN-ACK to respond to the client. 4 of the possible 8
codepoints provide enough space for the server to feed back
which of the 4 IP/ECN codepoints was on the incoming SYN
(including CE of course).
If any of these 4 codepoints are in the SYN-ACK, it confirms
that the server supports AccECN and, if another codepoint is
returned, it confirms that the server doesn't support
AccECN.
This still does not seem to allow a client to set ECT on a SYN,
it only finds out whether the server would have supported it
afterwards. The trick the client uses for ECN++ is to set ECT on the
SYN optimistically then, if the SYN-ACK reveals that the server
wouldn't have understood CE on the SYN, the client responds
conservatively as if the SYN was marked with CE.
The recommended conservative congestion response is to reduce the
initial window, which does not affect the performance of very
popular protocols such as HTTP, since it is extremely rare for an
HTTP client to send more than one packet as its initial request
anyway (for data on HTTP/1 & HTTP/2 request sizes see Fig 3 in
). Any clients that do frequently use a
larger initial window for their first message to the server can
cache which servers will not understand ECT on a SYN (see below). If caching is not
practical, such clients could reduce the initial window to say IW2
or IW3.
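The client behaviour described above can be sketched as follows. This
is a minimal illustration under stated assumptions: the SYN-ACK flag
patterns in the table are recalled from the AccECN specification and
ought to be checked against it, the default initial window of 10
segments and the reduced window of 1 segment are assumed values, and
the function and parameter names are hypothetical.

```python
# (AE, CWR, ECE) flags on the SYN-ACK -> IP/ECN codepoint seen on the SYN.
# Assumption: values as recalled from the AccECN specification.
ACCECN_SYNACK_FEEDBACK = {
    (0, 1, 0): "Not-ECT",
    (0, 1, 1): "ECT1",
    (1, 0, 0): "ECT0",
    (1, 1, 0): "CE",
}


def client_initial_window(synack_flags, sent_ect_on_syn, default_iw=10, reduced_iw=1):
    """Return the client's initial window (in segments) after the SYN-ACK."""
    fed_back = ACCECN_SYNACK_FEEDBACK.get(synack_flags)
    if fed_back is None:
        # Any other codepoint: the server has no AccECN logic, so it could
        # not have fed back a CE mark. If ECT was set on the SYN, respond
        # conservatively as if the SYN had been CE-marked.
        return reduced_iw if sent_ect_on_syn else default_iw
    # AccECN server: respond only if it reports CE on the SYN.
    return reduced_iw if fed_back == "CE" else default_iw
```

A client that frequently needs a larger first flight could pass
reduced_iw=2 or reduced_iw=3, in line with the IW2 or IW3 suggestion
above.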
EXPERIMENTATION NEEDED: Experiments will be needed to
determine any better strategy for reducing IW in response to
congestion on a SYN, when the server does not support congestion
feedback on the SYN-ACK (whether cached or discovered
explicitly).
Given that, until now, ECT-marked SYN packets have been prohibited, it
cannot be assumed that they will be accepted by TCP middleboxes or
servers.
According to a study using 2014 data
from a limited range of fixed vantage points, for the top 1M Alexa
web sites, adding the ECN capability to SYNs was increasing
connection establishment failures by about 0.4%.
From a wider range of fixed and mobile vantage points, a more
recent study in Jan-May 2017 found no
occurrences of blocking of ECT on SYNs. However, in more than half
the mobile networks tested it found wiping of the ECN codepoint at
the first hop.
MEASUREMENTS NEEDED: As wiping at the first hop is
remedied, measurements will be needed to check whether SYNs
with ECT are sometimes blocked deeper into the path.
Silent failures introduce a retransmission timeout delay
(default 1 second) at the initiator before it attempts any fall
back strategy (whereas explicit RSTs can be dealt with
immediately). Ironically, making SYNs ECN-capable is intended to
avoid the timeout when a SYN is lost due to congestion.
Fortunately, if there is any discard of ECN-capable SYNs due to
policy, it will occur predictably, not randomly like congestion.
So the initiator should be able to avoid it by caching those sites
that do not support ECN-capable SYNs (see the last paragraph of
).
A study conducted in Nov 2017
found that, of the 82% of the Alexa top 50k web servers that
supported ECN, 84% disabled ECN if the IP/ECN field on the SYN was
ECT0, CE or either. Given most web servers use Linux, this
behaviour can most likely be traced to a patch contributed in May
2012 that was first distributed in v3.5 of the Linux kernel. The
comment says "RFC3168 : 6.1.1 SYN
packets must not have ECT/ECN bits set. If we receive a SYN packet
with these bits set, it means a network is playing bad games with
TOS bits. In order to avoid possible false congestion
notifications, we disable TCP ECN negociation." Of course, some of
the 84% might be due to similar code in other OSs.
For brevity we shall call this the "over-strict" ECN test,
because it is over-conservative with what it accepts, contrary to
Postel's robustness principle. A robust protocol will not usually
assume network mangling without comparing with the value
originally sent, and one packet is not sufficient to make an
assumption with such irreversible consequences anyway.
Ironically, networks rarely seem to alter the IP/ECN field on a
SYN from zero to non-zero anyway. In a study conducted in Jan-May
2017 over millions of paths from vantage points in a few dozen
mobile and fixed networks, no such
transition was observed. With such a small or non-existent
incidence of this sort of network mangling, it would be preferable
to report any residual problem paths so that they can be
fixed.
In any case, the widespread presence of this 'over-strict' test
proves that RFC 5562 was correct to expect that ECT would be
considered invalid on SYNs. Nonetheless, it is not an
insurmountable problem - the over-strict test in Linux was patched
in Apr 2019 and caching can work
round it where previous versions of Linux are running. The
prevalence of these "over-strict" ECN servers makes it challenging
to cache them all. However, below explains how a
cache of limited size can alleviate this problem for a client's
most popular sites.
For the future, updates RFC 3168 to
clarify that the IP/ECN field does not have to be zero on a SYN if
documented in an experimental RFC such as the present ECN++
specification.
Given the server handling of ECN on SYNs outlined in above, an initiator
might combine AccECN with three candidate caching strategies for
setting ECT on a SYN:
(S1) Pessimistic ECT and cache successes: The initiator always
requests AccECN, but by default without ECT on the SYN. Then it
caches those servers that confirm that they support AccECN as
'ECT SYN OK'. On a subsequent connection to any server that
supports AccECN, the initiator can then set ECT on the SYN. When
connecting to other servers (non-ECN or classic ECN) it will not
set ECT on the SYN, so it will not fail the 'over-strict' ECN
test. Longer term, as servers upgrade to
AccECN, the initiator is still requesting AccECN, so it will add
them to the cache and use ECT on subsequent SYNs to those
servers. However, assuming it has to cap the size of the cache,
the client will not have the benefit of ECT SYNs to those less
frequently used AccECN servers expelled from its cache.
(S2) Optimistic ECT: The initiator always requests AccECN and by
default sets ECT on the SYN. Then, if the server response shows
it has no AccECN logic (so it cannot feed back a CE mark), the
initiator conservatively behaves as if the SYN was CE-marked, by
reducing its initial window.
(S2A) No cache.
(S2B) Cache failures: The optimistic ECT strategy can be
improved by caching solely those servers that do not support
AccECN as 'ECT SYN NOK'. This would include non-ECN servers
and all Classic ECN servers whether 'over-strict' or not. On
subsequent connections to these non-AccECN servers, the
initiator will still request AccECN but not set ECT on the
SYN. Then, the connection can still fall back to Classic
ECN, if the server supports it, and the initiator can use
its full initial window (if it has enough request data to
need it). Longer term, as servers
upgrade to AccECN, the initiator will remove them from the
cache and use ECT on subsequent SYNs to that server.
Where an access network operator mediates
Internet access via a proxy that does not support AccECN,
the optimistic ECT strategy will always fail. This scenario
is more likely in mobile networks. Therefore, a mobile host
could cache lack of AccECN support per attached access
network operator. Whenever it attached to a new operator, it
could check a well-known AccECN test server and, if it found
no AccECN support, it would add a cache entry for the
attached operator. It would only use ECT when neither
network nor server were cached. It would only populate its
per server cache when not attached to a non-AccECN
proxy.
(S3) ECT by configuration: In a controlled environment, the
administrator can make sure that servers support ECN-capable SYN
packets. Examples of controlled environments are single-tenant
DCs, and possibly multi-tenant DCs if it is assumed that each
tenant mostly communicates with its own VMs.
For unmanaged environments like the public Internet,
pragmatically the choice is between strategies (S1), (S2A) and
(S2B). The normative specification for ECT on a SYN in recommends the "optimistic ECT and cache
failures" strategy (S2B) but the choice depends on the implementer's
motivation for using ECN++, and the deployment prevalence of
different technologies and bug-fixes.
The "pessimistic ECT and cache successes" strategy (S1)
suffers from exposing the initial SYN to the prevailing loss
level, even if the server supports ECT on SYNs, but only on the
first connection to each AccECN server. If AccECN becomes widely
deployed on servers, SYNs to those AccECN servers that are less
frequently used by the client and therefore don't fit in the
cache will not benefit from ECN protection at all.
The "optimistic ECT without a cache" strategy (S2A) is the
simplest. It would satisfy the goal of an implementer who is
solely interested in low latency using AccECN and ECN++ and is
not concerned about fall-back to Classic ECN.
The "optimistic ECT and cache failures" strategy (S2B)
exploits ECT on SYNs from the very first attempt. However, if the
server turns out to be 'over-strict' it will disable ECN for the
connection, although only for the first connection if it is one of
the client's more popular servers that fits in the cache. If the
server turns out not to support AccECN, the initiator has to
conservatively limit its initial window, but again only for the
first connection if it's one of the client's more popular
servers (and anyway this rarely makes any difference when most
client requests fit in a single packet).
Note that, if AccECN deployment grows, caching successes (S1)
starts off small then grows, while caching failures (S2B) becomes
large at first, then shrinks. At half-way, the size of the cache has
to be capped with either approach, so the default behaviour for all
the servers that do not fit in the cache is as important as the
behaviour for the popular servers that do fit.
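The "optimistic ECT and cache failures" strategy (S2B) with a capped
cache might be sketched as follows. This is an illustrative sketch, not
a definitive implementation: the class and method names, the cache key
(a server identifier string), the least-recently-used eviction policy
and the default cache size are all assumptions made for the example.

```python
from collections import OrderedDict


class EctSynFailureCache:
    """Bounded LRU set of servers cached as 'ECT SYN NOK' (no AccECN)."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._nok = OrderedDict()

    def should_set_ect_on_syn(self, server):
        """Optimistic default: set ECT unless the server is cached as NOK."""
        if server in self._nok:
            self._nok.move_to_end(server)  # keep popular servers cached
            return False
        return True

    def record_handshake(self, server, server_supports_accecn):
        """Update the cache from the outcome of a completed handshake."""
        if server_supports_accecn:
            # The server has (been upgraded to) AccECN: forget any failure.
            self._nok.pop(server, None)
            return
        self._nok[server] = True
        self._nok.move_to_end(server)
        if len(self._nok) > self.max_entries:
            self._nok.popitem(last=False)  # evict least recently used
```

Servers evicted from the capped cache fall back to the optimistic
default, which is why the default behaviour matters as much as the
cached behaviour.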
MEASUREMENTS NEEDED: Measurements are needed to determine
which strategy would be sufficient for any particular client,
whether a particular client would need different strategies in
different circumstances and how many occurrences of problems
would be masked by how few cache entries.
Another strategy would be to send a not-ECT SYN a short delay
(below the typical lowest RTT) after an ECT SYN and only accept the
non-ECT connection if it returned first. This would reduce the
performance penalty for those deploying ECT SYN support. However,
this 'happy eyeballs' approach becomes complex when multiple
optional features are all tried on the first SYN (or on multiple
SYNs), so it is not recommended.
RFC 5562 says that ECT SYN packets could be
misused by malicious clients to augment "the well-known TCP SYN
attack". It goes on to say "a malicious host might be able to inject
a large number of TCP SYN packets through a potentially congested
ECN-enabled router, congesting it even further."
We assume this is a reference to the TCP SYN flood attack (see
https://en.wikipedia.org/wiki/SYN_flood), which is an attack against
a responder end point. We assume the idea of this attack is to use
ECT to get more packets through an ECN-enabled router in preference
to other non-ECN traffic so that they can go on to use the SYN
flooding attack to inflict more damage on the responder end point.
This argument could apply to flooding with any type of packet, but
we assume SYNs are singled out because their source address is
easier to spoof, whereas floods of other types of packets are easier
to block.
Mandating Not-ECT in an RFC does not stop attackers using ECT for
flooding. Nonetheless, if a standard says SYNs are not meant to be
ECT it would make it legitimate for firewalls to discard them.
However this would negate the considerable benefit of ECT SYNs for
compliant transports and seems unnecessary because RFC 3168 already
provides the means to address this concern. In section 7, RFC 3168
says "During periods where ... the potential packet marking rate
would be high, our recommendation is that routers drop packets
rather then set the CE codepoint..." and this advice is repeated in
RFC 7567 (section 4.2.1). This makes it harder for
flooding packets to gain from ECT.
showed that ECT can only slightly
augment flooding attacks relative to a non-ECT attack. It was hard
to overload the link without causing the queue to grow, which in
turn caused the AQM to disable ECN and switch to drop, thus negating
any advantage of using ECT. This was true even with the switch-over
point set to 25% drop probability (i.e. the arrival rate was 133% of
the link rate).
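The correspondence between a 25% drop probability and an arrival rate
of 133% of the link rate can be checked with a short calculation. The
sketch below assumes unresponsive load and a steady state in which the
AQM drops exactly enough traffic for the departure rate to equal the
link rate, so that arrival rate x (1 - p) = link rate. The function
name is hypothetical.

```python
def arrival_to_link_ratio(drop_probability):
    """Arrival rate as a multiple of link rate for a given drop probability.

    Assumes steady state: arrival_rate * (1 - p) == link_rate.
    """
    if not 0.0 <= drop_probability < 1.0:
        raise ValueError("drop probability must be in [0, 1)")
    return 1.0 / (1.0 - drop_probability)
```

For p = 0.25 this gives 1 / 0.75, i.e. about 1.33 times the link rate.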
The proposed approach in for
experimenting with ECN-capable SYN-ACKs is effectively identical to
the scheme called ECN+. In 2005, the ECN+
paper demonstrated that it could reduce the average Web response time
by an order of magnitude. It also argued that adding ECT to SYN-ACKs
did not raise any new security vulnerabilities.
The feedback behaviour by the initiator in response to a
CE-marked SYN-ACK from the responder depends on whether classic ECN
feedback or AccECN feedback has been negotiated. In either
case no change is required to RFC 3168 or the AccECN
specification.
Some classic ECN client implementations might ignore a CE-mark on
a SYN-ACK, or even ignore a SYN-ACK packet entirely if it is set to
ECT or CE. This is a possibility because an RFC 3168 implementation
would not necessarily expect a SYN-ACK to be ECN-capable. This issue
already came up when the IETF first decided to experiment with ECN
on SYN-ACKs and it was decided to go ahead
without any extra precautionary measures. This was because the
probability of encountering the problem was believed to be low and
the harm if the problem arose was also low (see Appendix B of RFC
5562).
The IETF has already specified an experiment with ECN-capable
SYN-ACK packets. It was inspired by the
ECN+ paper, but it specified a much more conservative congestion
response to a CE-marked SYN-ACK, called ECN+/TryOnce. This required
the server to reduce its initial window to 1 segment (like ECN+),
but then the server had to send a second SYN-ACK and wait for its
ACK before it could continue with its initial window of 1 SMSS. The
second SYN-ACK of this 5-way handshake had to carry no data, and had
to disable ECN, but no justification was given for these last two
aspects.
The present ECN++ experimental specification obsoletes RFC 5562
because it uses the ECN+ congestion response, not ECN+/TryOnce.
First we argue against the rationale for ECN+/TryOnce given in
sections 4.4 and 6.2 of . It starts with a
rather too literal interpretation of the requirement in RFC 3168
that says TCP's response to a single CE mark has to be "essentially
the same as the congestion control response to a *single* dropped
packet." TCP's response to a dropped initial (SYN or SYN-ACK) packet
is to wait for the retransmission timer to expire (currently 1s).
However, this long delay assumes the worst case between two possible
causes of the loss: a) heavy overload; or b) the normal
capacity-seeking behaviour of other TCP flows. When the network is
still delivering CE-marked packets, it implies that there is an AQM
at the bottleneck and that it is not overloaded. This is because an
AQM under overload will disable ECN (as recommended in section 7 of
RFC 3168 and repeated in section 4.2.1 of RFC 7567). So scenario (a)
can be ruled out. Therefore, TCP's response to a CE-marked SYN-ACK
can be similar to its response to the loss of any
packet, rather than backing off as if the special initial packet of a flow has been lost.
How TCP responds to the loss of any single packet depends on what it
has just been doing. But there is not really a precedent for TCP's
response when it experiences a CE mark having sent only one (small)
packet. If TCP had been adding one segment per RTT, it would have
halved its congestion window, but it hasn't established a congestion
window yet. If it had been exponentially increasing it would have
exited slow start, but it hasn't started exponentially increasing
yet so it hasn't established a slow-start threshold.
Therefore, we have to work out a reasoned argument for what to
do. If an AQM is CE-marking packets, it implies there is already a
queue and it is probably already somewhere around the AQM's
operating point - it is unlikely to be well below and it might be
well above. So, the more data packets that the client sends in its
IW, the more likely at least one will be CE marked, leading it to
exit slow-start early. On the other hand, it is highly unlikely that
the SYN-ACK itself pushed the AQM into congestion, so it will be
safe to introduce another single segment immediately (1 RTT after
the SYN-ACK). Therefore, starting to probe for capacity with a slow
start from an initial window of 1 segment seems appropriate to the
circumstances. This is the approach adopted in .
EXPERIMENTATION NEEDED: Experiments will be needed to check
the above reasoning and determine any better strategy for
reducing IW in response to congestion on a SYN-ACK (or a
SYN).
An alternative to the server caching failed connection attempts
would be for the server to rely on the client caching failed
attempts (on the basis that the client would cache a failure whether
ECT was blocked on the SYN or the SYN-ACK). This strategy cannot be
used if the SYN does not request AccECN support. It works as
follows: if the server receives a SYN that requests AccECN support
but is set to not-ECT, it replies with a SYN-ACK also set to
not-ECT. If a middlebox only blocks ECT on SYNs, not SYN-ACKs, this
strategy might disable ECN on a SYN-ACK when it did not need to, but
at least it saves the server from maintaining a cache.
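The alternative just described, in which the server relies on the
client's cache, might be sketched as follows. The function name, the
string encoding of the IP/ECN field and the choice of ECT(0) as the
ECN-capable codepoint are assumptions made for the example. A return
value of None indicates that the strategy does not apply.

```python
def synack_ip_ecn(syn_requests_accecn, syn_ip_ecn, ect_codepoint="ECT0"):
    """Choose the IP/ECN field of the SYN-ACK without a server-side cache."""
    if not syn_requests_accecn:
        # The strategy cannot be used if the SYN does not request AccECN.
        return None
    if syn_ip_ecn == "Not-ECT":
        # The client chose (or was forced by its cache) not to use ECT,
        # so mirror that choice on the SYN-ACK.
        return "Not-ECT"
    # The SYN arrived as ECT or CE: the path and client appear to cope.
    return ect_codepoint
```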
Section 5.2 of RFC 3168 gives the following arguments for not
allowing the ECT marking of pure ACKs (ACKs not piggy-backed on data):
"To ensure the reliable delivery of the congestion indication
of the CE codepoint, an ECT codepoint MUST NOT be set in a packet
unless the loss of that packet in the network would be detected by
the end nodes and interpreted as an indication of congestion.
Transport protocols such as TCP do not necessarily detect all
packet drops, such as the drop of a "pure" ACK packet; for
example, TCP does not reduce the arrival rate of subsequent ACK
packets in response to an earlier dropped ACK packet. Any proposal
for extending ECN-Capability to such packets would have to address
issues such as the case of an ACK packet that was marked with the
CE codepoint but was later dropped in the network. We believe that
this aspect is still the subject of research, so this document
specifies that at this time, "pure" ACK packets MUST NOT indicate
ECN-Capability."
Later on, in section 6.1.4 it reads:
"For the current generation of TCP congestion control
algorithms, pure acknowledgement packets (e.g., packets that do
not contain any accompanying data) MUST be sent with the not-ECT
codepoint. Current TCP receivers have no mechanisms for reducing
traffic on the ACK-path in response to congestion notification.
Mechanisms for responding to congestion on the ACK-path are areas
for current and future research. (One simple possibility would be
for the sender to reduce its congestion window when it receives a
pure ACK packet with the CE codepoint set). For current TCP
implementations, a single dropped ACK generally has only a very
small effect on the TCP's sending rate."
We next address each of the arguments presented above.
The first argument is a specific instance of the reliability
argument for the case of pure ACKs. This has already been addressed by
countering the general reliability argument in .
The second argument says that ECN ought not to be enabled unless
there is a mechanism to respond to it. This argument actually
comprises three sub-arguments:
If ECN is enabled on Pure
ACKs, are there, or could there be, suitable mechanisms to detect,
feed back and respond to ECN-marked Pure ACKs?
There has never been a mechanism
to respond to loss of non-ECN Pure ACKs. So it seems that adding
ECN without a response mechanism will do no extra harm to others,
while improving a connection's own performance (because loss of an
ACK holds back new data). However, if the end systems have no
response mechanism, ECN Pure ACKs do slightly more harm than
non-ECN, because the AQM does not remove ECT packets from the
queue unless it reaches overload and disables ECN.
Even if there were no harm to
others, does it set an undesirable precedent to allow a flow to
use ECN to protect its Pure ACKs from loss, when there is no
mechanism to respond to ECN-marking?
The last two arguments involve value judgements, but they both
depend on the concrete technical question of mechanism feasibility,
which will therefore be addressed first in below. Then draws conclusions by addressing
the value judgements in the other two questions.
The question of whether the receiver of pure ACKs is required to
detect and feed back any CE-marking is outside the scope of the
present specification - it is a matter for the relevant feedback
specification (classic ECN and AccECN). The response to congestion
feedback is also out of scope, because it would be defined in the
base TCP congestion control specification
or its variants.
Nonetheless, in order to decide whether the present ECN++
experimental specification should require a host to set ECT on pure
ACKs, we only need to know whether a response mechanism would be
feasible - we do not have to standardize it. So the bullets below
assess, for each type of feedback, whether the three stages of the
congestion response mechanism could all work.
Can the receiver of a pure ACK detect a
CE marking on it?:
Classic feedback: RFC 3168 is silent on this point. The
implementer of the receiver would not expect CE marks on
pure ACKs, but the implementation might happen to check for
CE marks before it looks for the data. So detection will be
implementation-dependent.
AccECN feedback: the AccECN specification requires the
receiver of any TCP packets to count any CE marks on them
(whether or not it sends ECN-capable control packets
itself).
As a general rule, TCP does not ACK a
pure ACK. However, even if the receiver of a CE-mark on a pure
ACK does not feed it back immediately, it could still include it
within subsequent feedback, for instance when it later sends a
data segment (if it ever does):
Classic feedback: RFC 3168 is silent on this point, so
feedback of CE-markings might be implementation specific. If
the receiver (of the pure ACKs) did generate feedback, it
would set the echo congestion experienced (ECE) flag in the
TCP header of subsequent packets in the round, as it would
to feed back CE on data packets.
AccECN feedback: the receiver continually feeds back a
count of the number of CE-marked packets that it has
received and, optionally, a count of CE-marked bytes. For
either metric, AccECN takes into account all types of
packets, including pure ACKs. CE-marked pure ACKs will
solely increment the packet counter; not any byte counter,
because by definition they contain no bytes of data.
In either case (classic or
AccECN feedback), if the TCP sender does receive feedback about
CE-markings on pure ACKs, it will be able to reduce the
congestion window (cwnd) and/or the ACK rate.
Therefore a congestion response mechanism is clearly
feasible if AccECN has been negotiated, but the position is unknown
for the installed base of classic ECN feedback.
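The way AccECN-style counting treats pure ACKs can be illustrated with
a deliberately simplified sketch. It is not the AccECN wire format: the
real counters are of limited width and wrap, and the class, field and
method names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CeCounters:
    """Simplified receiver-side counters of CE-marked packets and bytes."""

    ce_packets: int = 0
    ce_bytes: int = 0

    def on_packet(self, ip_ecn, payload_len):
        """Count an arriving TCP packet of any type, including pure ACKs."""
        if ip_ecn == "CE":
            self.ce_packets += 1
            # A pure ACK has payload_len == 0, so it increments only the
            # packet counter, never the byte counter.
            self.ce_bytes += payload_len
```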
This subsection explores issues that congestion control
designers will need to consider when defining a cwnd response to
CE-marked Pure ACKs.
A CE-mark on a Pure ACK does not mean that only Pure ACKs are
causing congestion. It only means that the marked Pure ACK is part
of an aggregate that is collectively causing a bottleneck queue to
randomly CE-mark a fraction of the packets. A CE-mark on a Pure
ACK might be due to data packets in other flows through the same
bottleneck, due to data packets interspersed between Pure ACKs in
the same half-connection, or just due to the rate of Pure ACKs
alone. (RFC 3168 only considered the last possibility, which led
to the argument that ECN-enabled Pure ACKs had to be deferred,
because ACK congestion control was a research issue.)
If a host has been sending a mix of Pure ACKs and data, it
doesn't need to work out whether a particular CE mark was on a
Pure ACK or not; it just needs to respond to congestion feedback
as a whole by reducing its congestion window (cwnd), which limits
the data it can launch into flight through the congested
bottleneck. If it is purely receiving data and sending only Pure
ACKs, reducing cwnd will have caused it no harm, having no effect
on its ACK rate (the next subsection addresses that).
However, when a host is sending data as well as Pure ACKs, it
would not be right for CE-marks on Pure ACKs and on data packets
to induce the same reduction in cwnd. A possible way to address
this issue would be to weight the response by the size of the
marked packets (assuming the congestion control supports a
weighted response, e.g. ). For instance,
one could calculate the fraction of CE-marked bytes (headers and
data) over each round trip (say) as follows:
(CE-marked header bytes + CE-marked data bytes) / (all
header bytes + all data bytes)
Header bytes can be calculated by multiplying a packet
count by a nominal header size, which is possible with AccECN
feedback, because it gives a count of CE-marked packets (as well
as CE-marked bytes). The above simple aggregate calculation caters
for the full range of scenarios; from all Pure ACKs to just a few
interspersed with data packets.
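The aggregate calculation above might be sketched as follows. The
nominal header size of 40 bytes and the function and parameter names
are assumptions made for the example; the inputs are the packet and
byte counts that a sender could derive over one round trip.

```python
def ce_byte_fraction(ce_packets, ce_data_bytes, all_packets, all_data_bytes,
                     nominal_header=40):
    """Fraction of CE-marked bytes (headers plus data) over a round trip."""
    total = all_packets * nominal_header + all_data_bytes
    if total == 0:
        return 0.0  # nothing was sent, so nothing was marked
    marked = ce_packets * nominal_header + ce_data_bytes
    return marked / total
```

With ten pure ACKs of which one is marked, the fraction is 0.1. With
one marked pure ACK among nine unmarked 1460-byte data packets, the
fraction is 40 / 13540, far smaller than if one of the data packets had
been marked instead.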
Note that any mechanism that reduces cwnd due to CE-marked Pure
ACKs would need to be integrated with the congestion window
validation mechanism, which already
conservatively reduces cwnd over time because cwnd becomes stale
if it is not used to fill the pipe.
Reducing the congestion window will have no effect on the rate
of pure ACKs. The worst case here is if the bottleneck is
congested solely with pure ACKs, but it could also be problematic
if a large fraction of the load was from unresponsive ACKs,
leaving little or no capacity for the load from responsive
data.
Since RFC 3168 was published, experimental Acknowledgement
Congestion Control (AckCC) techniques have been documented in
RFC 5690 (informational). So any pair of TCP
end-points can choose to agree to regulate the delayed ACK ratio
in response to lost or CE-marked pure ACKs. However, the protocol
has a number of open issues concerning deployment (e.g. it
requires support from both ends, it relies on two new TCP options,
one of which is required on the SYN where option space is at a
premium and, if either option is blocked by a middlebox, no
fall-back behaviour is specified).
The new TCP options address two problems, namely that TCP had:
i) no mechanism to allow ECT to be set on pure ACKs; and ii) no
mechanism to feed back loss or CE-marking of pure ACKs. A
combination of the present specification and AccECN addresses both
these problems, at least for CE-marking. So it might now be
possible to design an ECN-specific ACK congestion control scheme
without the extra TCP options proposed in RFC 5690. However, such
a mechanism is out of scope of the present document.
Setting aside the practicality of RFC 5690, the need for AckCC
has not been conclusively demonstrated. It has been argued that
the Internet has survived so far with no mechanism to even detect
loss of pure ACKs. However, it has also been argued that ECN is
not the same as loss. Packet discard can naturally thin the ACK
load to whatever the bottleneck can support, whereas ECN marking
does not (it queues the ACKs instead). Nonetheless, RFC 3168
(section 7) recommends that an AQM switches over from ECN marking
to discard when the marking probability becomes high. Therefore
discard can still be relied on to thin out ECN-enabled pure ACKs
as a last resort.
In the case when AccECN has been negotiated, it provides a
feasible congestion response mechanism, so the arguments for ECT on
pure ACKs heavily outweigh those against. ECN is always more and
never less reliable for delivery of congestion notification. A cwnd
reduction needs to be considered by congestion control designers as
a response to congestion on pure ACKs. Separately, AckCC (or an
improved variant exploiting AccECN) could optionally be used to
regulate the spacing between pure ACKs. However, it is not clear
whether AckCC is justified. If it is not, packet discard will still
act as the "congestion response of last resort" by thinning out the
traffic. In contrast, not setting ECT on pure ACKs is certainly
detrimental to performance, because when a pure ACK is lost it can
prevent the release of new data.
In the case when Classic ECN has been negotiated, the argument
for ECT on pure ACKs is less clear-cut. Some of the installed base
of RFC 3168 implementations might happen to (unintentionally)
provide a feedback mechanism to support a cwnd response. For those
that did not, setting ECT on pure ACKs would be better for the
flow's own performance than not setting it. However, where there was
no feedback mechanism, setting ECT could do slightly more harm than
not setting it. AckCC could provide a complementary response
mechanism, because it is designed to work with RFC 3168 ECN, but it
has deployment challenges. In summary, a congestion response
mechanism is unlikely to be feasible with the installed base of
classic ECN.
This specification uses a safe approach. Allowing hosts to set
ECT on Pure ACKs without a feasible response mechanism would carry
some risk. It would certainly improve the flow's own performance, but
it would slightly increase potential harm to others. Moreover, it
would set an undesirable precedent for setting ECT on packets with
no mechanism to respond to any resulting congestion signals.
Therefore, this specification allows ECT on Pure ACKs if AccECN
feedback has been negotiated, but not with classic RFC 3168 ECN
feedback.
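The resulting rule for Pure ACKs is simple enough to state as a predicate. The sketch below is illustrative only; the `Feedback` enumeration and the function name are hypothetical and do not come from any implementation.

```python
from enum import Enum


class Feedback(Enum):
    """ECN feedback mode negotiated for the connection (hypothetical names)."""
    NONE = "not ECN-capable"
    CLASSIC = "classic RFC 3168 ECN feedback"
    ACCECN = "AccECN feedback"


def may_set_ect_on_pure_ack(negotiated):
    """ECT on a Pure ACK is allowed only if AccECN feedback was negotiated."""
    return negotiated is Feedback.ACCECN
```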
Section 6.1.6 of RFC 3168 presents only the reliability argument
for prohibiting ECT on Window probes:
"If a window probe packet is dropped in the network, this loss
is not detected by the receiver. Therefore, the TCP data sender
MUST NOT set either an ECT codepoint or the CWR bit on window
probe packets.
However, because window probes use exact sequence numbers, they
cannot be easily spoofed in denial-of-service attacks. Therefore,
if a window probe arrives with the CE codepoint set, then the
receiver SHOULD respond to the ECN indications."
The reliability argument has already been addressed earlier in this document.
Allowing ECT on window probes could considerably improve
performance because, once the receive window has reopened, if a window
probe is lost the sender will stall until the next window probe
reaches the receiver, which might be after the maximum retransmission
timeout (at least 1 minute).
On the bright side, RFC 3168 at least specifies the receiver
behaviour if a CE-marked window probe arrives, so changing the
behaviour ought to be less painful than for other packet types.
RFC 3168 is silent on whether a TCP sender can set ECT on a FIN. A
FIN is considered as part of the sequence of data, and the rate of
pure ACKs sent after a FIN could be controlled by a CE marking on the
FIN. Therefore there is no reason not to set ECT on a FIN.
RFC 3168 is silent on whether a TCP sender can set ECT on a RST.
The host generating the RST message does not have an open connection
after sending it (either because there was no such connection when the
packet that triggered the RST message was received or because the
packet that triggered the RST message also triggered the closure of
the connection).
Moreover, the receiver of a CE-marked RST message can either: i)
accept the RST message and close the connection; ii) emit a so-called
challenge ACK in response (with suitable throttling) and otherwise ignore the RST (e.g. because the
sequence number is in-window but not the precise number expected
next); or iii) discard the RST message (e.g. because the sequence
number is out-of-window). In the first two cases there is no point in
echoing any CE mark received because the sender closed its connection
when it sent the RST. In the third case it makes sense to discard the
CE signal as well as the RST.
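The three cases can be sketched as follows. This is a simplified illustration of the sequence-number test only; the function name and return values are hypothetical, throttling of challenge ACKs is not modelled, and a zero receive window is not treated specially. In every case the CE mark is not echoed.

```python
MOD = 1 << 32  # TCP sequence numbers wrap modulo 2**32


def classify_rst(seg_seq, rcv_nxt, rcv_wnd):
    """Return (action, echo_ce) for an incoming RST segment.

    action is "accept", "challenge_ack" or "discard", following the
    three cases in the text. echo_ce is always False because no
    congestion response to a CE-marked RST is useful.
    """
    offset = (seg_seq - rcv_nxt) % MOD  # distance ahead of RCV.NXT, with wrap
    if offset == 0:
        action = "accept"          # exact match: close the connection
    elif offset < rcv_wnd:
        action = "challenge_ack"   # in-window but not the expected number
    else:
        action = "discard"         # out-of-window
    return action, False
```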
Although a congestion response following a CE-marking on a RST does
not appear to make sense, the following factors have been considered
before deciding whether the sender ought to set ECT on a RST
message:
As explained above, a congestion response by the sender of a
CE-marked RST message is not possible;
So the only reason for the sender setting ECT on a RST would be
to improve the reliability of the message's delivery;
RST messages are used to both mount and mitigate attacks:
Spoofed RST messages are used by attackers to terminate
ongoing connections, although the mitigations in RFC 5961 have
considerably raised the bar against off-path RST attacks;
Legitimate RST messages allow endpoints to inform their
peers to eliminate existing state that corresponds to
non-existent connections, liberating resources, e.g. in DoS attack
scenarios;
AQMs are advised to disable ECN marking during persistent
overload, so:
it is harder for an attacker to exploit ECN to intensify an
attack;
it is harder for a legitimate user to exploit ECN to more
reliably mitigate an attack;
Prohibiting ECT on a RST would deny the benefit of ECN to
legitimate RST messages, but not to attackers who can disregard
RFCs;
If ECT were prohibited on RSTs
it would be easy for security middleboxes to discard all
ECN-capable RSTs;
However, unlike a SYN flood, it is already easy for a
security middlebox (or host) to distinguish a RST flood from
legitimate traffic, and even if
some legitimate RSTs are accidentally removed as well,
legitimate connections still function.
So, on balance, it has been decided that it is worth
experimenting with ECT on RSTs. During experiments, if the ECN
capability on RSTs is found to open a vulnerability that is hard to
close, this decision can be reversed, before it is specified for the
standards track.
RFC 3168 says the sender "MUST NOT" set ECT on retransmitted
packets. The rationale for this takes up nearly 2 pages of RFC 3168,
so, rather than quoting it all here, the reader is referred to
section 6.1.5 of RFC 3168. There are essentially three arguments, namely:
reliability; DoS attacks; and over-reaction to congestion. We address
them in order below.
The reliability argument has already been addressed earlier in this document.
Protection against DoS attacks is not afforded by prohibiting ECT
on retransmitted packets. An attacker can set CE on spoofed
retransmissions whether or not it is prohibited by an RFC. Protection
against the DoS attack described in section 6.1.5 of RFC 3168 is
solely afforded by the requirement that "the TCP data receiver SHOULD
ignore the CE codepoint on out-of-window packets". Therefore, under
this specification, the sender is allowed to set ECT on
retransmitted packets, in order to reduce the chance of them being
dropped. We also strengthen the receiver's requirement from "SHOULD
ignore" to "MUST ignore". And we generalize the receiver's requirement
to include failure of any validity check, not just out-of-window
checks, in order to include the more stringent validity checks in RFC
5961 that have been developed since RFC 3168.
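As a rough sketch of the receiver-side rule, the Python below feeds back a CE mark only if the segment passes the classic sequence-number acceptability test. It covers the window check alone; the additional RFC 5961 validity checks are not modelled, and all function names are hypothetical.

```python
MOD = 1 << 32  # TCP sequence numbers wrap modulo 2**32


def _in_window(seq, rcv_nxt, rcv_wnd):
    """True if RCV.NXT <= seq < RCV.NXT + RCV.WND, allowing for wrap."""
    return (seq - rcv_nxt) % MOD < rcv_wnd


def segment_acceptable(seg_seq, seg_len, rcv_nxt, rcv_wnd):
    """Sequence-number acceptability test for an arriving segment."""
    if seg_len == 0:
        if rcv_wnd == 0:
            return seg_seq == rcv_nxt
        return _in_window(seg_seq, rcv_nxt, rcv_wnd)
    if rcv_wnd == 0:
        return False  # data cannot be accepted into a zero window
    last = (seg_seq + seg_len - 1) % MOD
    return (_in_window(seg_seq, rcv_nxt, rcv_wnd)
            or _in_window(last, rcv_nxt, rcv_wnd))


def ce_to_feed_back(ce_marked, seg_seq, seg_len, rcv_nxt, rcv_wnd):
    """CE on a segment that fails the validity check MUST be ignored."""
    return ce_marked and segment_acceptable(seg_seq, seg_len,
                                            rcv_nxt, rcv_wnd)
```

Note that a retransmission lying wholly below RCV.NXT (data that has already been received) fails this test, so any CE mark on it is ignored.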
A consequence is that, for those retransmitted packets that arrive
at the receiver after the original packet has been properly received
(so-called spurious retransmissions), any CE marking will be ignored.
There is no problem with that because the fact that the original
packet has been delivered implies that the sender's original
congestion response (when it deemed the packet lost and retransmitted
it) was unnecessary.
Finally, the third argument is about over-reacting to congestion.
The argument goes that, if a retransmitted packet is dropped, the
sender will not detect it, so it will not react again to congestion
(it would have reduced its congestion window already when it
retransmitted the packet). In contrast, if retransmitted packets can be
CE-marked instead of dropped, senders could potentially react more than
once to congestion. However, we argue that it is legitimate to respond
again to congestion if it still persists in subsequent round
trip(s).
Therefore, none of the three arguments stands in the way of setting
ECT on retransmissions.
Extensive experiments have found no evidence of any traversal
problems with ECT on any TCP control packet. Nonetheless, this
specification defines fall-back measures for when
ECT on the first packet of each half-connection (SYN or SYN-ACK)
appears to be blocking progress. Here, the question of fall-back
measures for ECT on other control packets is explored. The conclusion
follows the usual advice: until there is evidence that something is
broken, do not fix it.
If an implementation has had to disable ECT to ensure the first
packet of a flow (SYN or SYN-ACK) gets through, the question arises
whether it ought to disable ECT on all subsequent control packets
within the same TCP connection. Without evidence of any such problems,
this seems unnecessarily cautious, particularly given that it would be
hard to detect loss of most other types of TCP control packets that are
not ACK'd. Moreover, unnecessarily removing ECT from
other control packets could lead to performance problems, e.g. by
directing them into another queue or over a different path, because
some broken multipath equipment (erroneously) routes based on all 8
bits of the Diffserv field.
In the case where a connection starts without ECT on the SYN
(perhaps because problems with previous connections had been cached),
there will have been no test for ECT traversal in the client-server
direction until the pure ACK that completes the handshake. It is
possible that some middlebox might block ECT on this pure ACK or on
later retransmissions of lost packets. Similarly, after a route
change, the new path might include some middlebox that blocks ECT on
some or all TCP control packets. However, without evidence of such
problems, the complexity of a fix does not seem worthwhile.
MORE MEASUREMENTS NEEDED (?): If further two-ended measurements
do find evidence for these traversal problems, measurements would
be needed to check for correlation of ECT traversal problems
between different control packets. It might then be necessary to
introduce a catch-all fall-back rule that disables ECT on certain
subsequent TCP control packets based on some criteria developed
from these measurements.
The following subsections discuss any interactions between setting
ECT on all packets and using the following popular variants of TCP: IW10
and TFO. They also briefly note the possibility that the principles
applied here should translate to protocols derived from TCP. This
section is informative not normative, because no interactions have been
identified that require any change to specifications. The subsection on
IW10 discusses potential changes to specifications but recommends that
no changes are needed.
The designs of the following TCP variants have also been assessed and
found not to interact adversely with ECT on TCP control packets: SYN
cookies (see Appendix A of and section 3.1 of
), TCP Fast Open (TFO )
and L4S .
IW10 is an experiment to determine whether it is safe for TCP to
use an initial window of 10 SMSS.
This subsection does not recommend any additions to the present
specification in order to interwork with IW10. The specifications as
they stand are safe, and there is only a corner-case with ECT on the
SYN where performance could be occasionally improved, as explained
below.
As specified in this document, a TCP
initiator will typically only set ECT on the SYN if it requests AccECN
support. If, however, the SYN-ACK tells the initiator that the
responder does not support AccECN, the specification
advises the initiator to conservatively reduce its initial window,
preferably to 1 SMSS because, if the SYN was CE-marked, the SYN-ACK
has no way to feed that back.
If the initiator implements IW10, it seems rather over-conservative
to reduce IW from 10 to 1 just in case a congestion marking was
missed. Nonetheless, a reduction to 1 SMSS will rarely harm
performance, because:
as long as the initiator is caching failures to negotiate
AccECN, subsequent attempts to access the same server will not use
ECT on the SYN anyway, so there will no longer be any need to
conservatively reduce IW;
currently, at least for web sessions, it is extremely rare for
a TCP initiator (client) to have more than one data segment to
send at the start of a TCP connection (see Fig 3 in ) - IW10 is primarily exploited by TCP
servers.
If a responder receives feedback that the SYN-ACK was CE-marked,
this specification recommends that it
reduces its initial window, preferably to 1 SMSS. When the responder
also implements IW10, it might again seem rather over-conservative to
reduce IW from 10 to 1. But in this case the rationale is somewhat
different:
Feedback that the SYN-ACK was CE-marked is an explicit
indication that the queue has been building, not just uncertainty
due to absence of feedback;
Given it is now likely that a queue already exists, the more
data packets that the server sends in its IW, the more likely at
least one will be CE marked, leading it to exit slow-start
early.
Experimentation will be needed to determine the best
strategy. It should be noted that experience from recent congestion
avoidance experiments where the window is reduced by less than half is
not necessarily applicable to a flow start scenario. Reducing cwnd by
less is one thing. Reducing an increase in cwnd by less is
another.
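The two initial window reductions discussed in this subsection can be summarised in a short sketch. It is illustrative only: the function and parameter names are hypothetical, the window is expressed in units of SMSS, and the reduction to exactly 1 SMSS reflects the "preferably to 1 SMSS" wording rather than a fixed requirement.

```python
def initiator_initial_window(default_iw, ect_on_syn,
                             responder_supports_accecn):
    """Initial window (in SMSS) for the TCP initiator.

    If ECT was set on the SYN but the responder does not support
    AccECN, the SYN-ACK cannot feed back a CE mark on the SYN, so the
    initiator conservatively falls back to 1 SMSS.
    """
    if ect_on_syn and not responder_supports_accecn:
        return 1
    return default_iw


def responder_initial_window(default_iw, synack_ce_feedback):
    """Initial window (in SMSS) for the TCP responder.

    Feedback that the SYN-ACK was CE-marked is an explicit sign that a
    queue has been building, so the responder reduces to 1 SMSS.
    """
    if synack_ce_feedback:
        return 1
    return default_iw
```

With IW10, `initiator_initial_window(10, True, False)` gives 1, whereas an initiator that sent a Not-ECT SYN keeps its window of 10.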
TCP Fast Open (TFO) is an experiment to
remove the round trip delay of TCP's 3-way hand-shake (3WHS). A TFO
initiator caches a cookie from a previous connection with a
TFO-enabled server. Then, for subsequent connections to the same
server, any data included on the SYN can be passed directly to the
server application, which can then return up to an initial window of
response data on the SYN-ACK and on data segments straight after it,
without waiting for the ACK that completes the 3WHS.
The TFO experiment and the present experiment to add ECN-support
for TCP control packets can be combined without altering either
specification, which is justified as follows:
The handling of ECN marking on a SYN is no different whether or
not it carries data.
In response to any CE-marking on the SYN-ACK, the responder
adopts the normal response to congestion, as discussed in Section
7.2 of .
A Low Latency Low Loss Scalable throughput (L4S) variant of TCP
such as TCP Prague is mandated to
negotiate AccECN feedback, and strongly recommended to use ECN++.
The L4S experiment and the present ECN++ experiment can be combined
without altering any of the specifications. The only difference would
be in the recommendation of the best SYN cache strategy.
The normative specification for ECT on a SYN recommends the
"optimistic ECT and cache failures" strategy (S2B) for the general
Internet. However, if a user's Internet access bottleneck supported
L4S ECN but not Classic ECN, the "optimistic ECT without a cache"
strategy (S2A) would make most sense, because there would be little
point trying to avoid the 'over-strict' test and negotiate Classic ECN
when Classic ECN is not available on that user's access link (as is the
case with Low Latency DOCSIS).
Strategy (S2A) is the simplest, because it requires no cache. It
would satisfy the goal of an implementer who is solely interested in
ultra-low latency using AccECN and ECN++ (e.g. accessing L4S servers)
and is not concerned about fall-back to Classic ECN (e.g. when
accessing other servers).
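The difference between the two strategies can be sketched as a small policy object. This is a hypothetical illustration: the class and method names are invented for the example, and what counts as a "failure" (and how long a cache entry lives) is left to the normative specification.

```python
class SynEctPolicy:
    """Decide whether to set ECT on a SYN to a given server.

    use_cache=True  sketches "optimistic ECT and cache failures" (S2B).
    use_cache=False sketches "optimistic ECT without a cache" (S2A).
    """

    def __init__(self, use_cache=True):
        self.use_cache = use_cache
        self._failed = set()  # servers for which ECT on the SYN failed

    def set_ect_on_syn(self, server):
        # Optimistic: set ECT unless a cached failure says otherwise.
        return not (self.use_cache and server in self._failed)

    def record_failure(self, server):
        # Without a cache (S2A) there is nothing to remember.
        if self.use_cache:
            self._failed.add(server)
```

With S2B, a recorded failure for a server suppresses ECT on later SYNs to that server only. With S2A, every SYN carries ECT regardless of history, which is why it needs no state.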
Experience from experiments on adding ECN support to all TCP
packets ought to be directly transferable between TCP and other
transport protocols, like SCTP or QUIC.
Stream Control Transmission Protocol (SCTP) is a standards track
transport protocol derived
from TCP. SCTP currently does not include ECN support, but Appendix A
of RFC 4960 broadly describes how it would be supported and a
(long-expired) draft on the addition of ECN to SCTP has been produced.
This draft avoided setting
ECT on control packets and retransmissions, closely following the
arguments in RFC 3168.
QUIC is another standards track transport
protocol offering similar services to TCP but intended to exploit some
of the benefits of running over UDP. Building on the arguments in the
current draft, a QUIC sender sets ECT(0) on all packets.
The discussion of RSTs above considers the question of whether
ECT on RSTs will allow RST attacks to be intensified. There are several
security arguments presented in RFC 3168 for preventing the ECN marking
of TCP control packets and retransmitted segments. We believe all of
them have been properly addressed in the preceding sections,
particularly those on DoS attacks using spoofed ECT-marked SYNs
and spoofed CE-marked retransmissions.
The section on sending TCP RSTs points out
that implementers need to take care to ensure that the ECN field on a
RST does not depend on TCP's state machine. Otherwise the internal
information revealed could be of use to potential attackers. This point
applies more generally to all control packets, not just RSTs.
There are no IANA considerations in this memo.
Thanks to Mirja Kühlewind, David Black, Padma Bhooma, Gorry
Fairhurst, Michael Scharf, Yuchung Cheng and Christophe Paasch for their
useful reviews. Richard Scheffenegger provided useful advice gained from
implementing ECN++ for FreeBSD.
The work of Marcelo Bagnulo has been performed in the framework of
the H2020-ICT-2014-2 project 5G NORMA. His contribution reflects the
consortium's view, but the consortium is not liable for any use that may
be made of any of the information contained therein.
Bob Briscoe's contribution was partly funded by the Research Council
of Norway through the TimeIn project, partly by CableLabs and partly by
the Comcast Innovation Fund. The views expressed here are solely those
of the authors.
Attaining the promise and avoiding the pitfalls of TCP in the Datacenter
Enabling Internet-Wide Deployment of Explicit Congestion Notification (ETHZ)
The Power of Explicit Congestion Notification
Measuring ECN++: Good News for ++, Bad News for ECN over Mobile (UC3M; Simula)
How HTTP/2 is changing Web traffic and how to detect it (Université Catholique de Louvain; Politecnico di Torino)
Tracing Internet Path Transparency (ETHZ; University of Aberdeen)
tcp: be more strict before accepting ECN negociation
tcp: Accept ECT on SYN in the presence of RFC8311
Destruction Testing: Ultra-Low Delay using Dual Queue Coupled Active Queue Management (Department of Informatics, University of Oslo)
Implementing the `TCP Prague' Requirements for Low Latency Low Loss Scalable Throughput (L4S) (Independent; Nokia Bell Labs; Simula Research Lab; ETH Zurich)
MAC and Upper Layer Protocols Interface (MULPI) Specification, CM-SP-MULPIv3.1 (CableLabs)