Network Working Group K. Nielsen
Internet-Draft Ericsson
Intended status: Experimental November 11, 2014
Expires: May 15, 2015

SCTP Tail Loss Recovery Enhancements
draft-nielsen-tsvwg-sctp-tlr-01.txt

Abstract

Loss Recovery by means of T3-Retransmission has a significant detrimental impact on the delays experienced through an SCTP association. The throughput achievable over an SCTP association is also negatively impacted by the occurrence of T3-Retransmissions. Loss Recovery by Fast Retransmission is in most situations superior to T3-Retransmission from both a latency and a throughput perspective. The present SCTP Fast Recovery algorithms, as specified by [RFC4960], are not able to recover losses adequately or in a timely manner in certain situations, and thus resort to loss recovery by lengthy T3-Retransmissions or by non-timely activation of Fast Recovery. In this document we propose a number of enhancements to the SCTP Loss Recovery algorithms aimed at amending some of these deficiencies, with a particular focus on Loss Recovery for drops in Traffic Tails. The enhancements supplement the existing algorithms of [RFC4960] with proactive probing and timer driven activation of the Fast Retransmission algorithm, and also include a number of enhancements of the Fast Retransmission algorithm itself. The enhancements are proposed as supplements to the Loss Recovery algorithms of [RFC4960] and as such they do not deprecate or replace any of the mechanisms defined by [RFC4960].

The solution proposed draws on prior art in the area of SCTP and TCP Loss Recovery improvements. The mechanisms proposed include the adaptation to SCTP Fast Retransmission of certain improvements specified for TCP Fast Retransmission by [RFC6675], and the proposal embeds SCTP Early Retransmit [RFC5827] in a delayed variant. The proposal draws heavily on the ideas put forward for TCP by [DUKKIPATI01] for proactive probing and timer driven entering of Fast Recovery. The proposal embeds certain aspects from [HURTIG] when applicable. The procedures proposed are sender-side only and do not impact the SCTP receiver.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on May 15, 2015.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


1. Introduction

Loss Recovery by means of T3-Retransmission has a significant impact on the delays experienced through, as well as the throughput achievable over, an SCTP association. Loss Recovery by Fast Retransmission (FR) is in most situations superior to T3-Retransmission from both a latency and a throughput perspective.

The present SCTP Fast Retransmission algorithm, as specified by [RFC4960], is driven solely by the exceeding of a dupthresh number of mis indication counts stemming from returned SACKs. As such it is not able to recover losses adequately or in a timely manner in traffic tails, where a sufficient number of such SACKs may not be generated, and it thereby resorts to loss recovery by T3-Retransmissions or by "non-timely" activation of Fast Recovery.

By drop in traffic tails we refer not only to "pure" tail drops, i.e., drop of all packets in the end of the communication on an SCTP association from a certain point onwards, but more generally and specifically to the following situations:

  1. Pure tail drops of the last SCTP packets of an SCTP association or, more generally, drops of packets at the end of an SCTP association which are not followed by more than a dupthresh number of packets which are not dropped. Drops of either type we will generally refer to as Tail Drops.
  2. Tail Drops among packets sent at the end of bursts spaced by pauses of time approximately equal to or greater than the T3-timeout. It is noted that such bursts (pauses in between bursts) may result from application limitations, from congestion control limitations or from receiver side limitations.
  3. Drops among packets sent so sparsely that each dropped packet constitutes a tail drop, in that a dupthresh number of packets would not be sent (would not be available for sending) prior to expiry of the T3-timeout.

It shall be noted that while the above traffic drop criteria describe drops among the forward data packets only, drops among forward data packets combined with drops of the returned SACKs may together result in an insufficient number of SACKs being returned to the traffic sender for the Fast Retransmission algorithm to be activated prior to the T3-timeout occurring. The tail traffic situations for which SCTP FR is not able to recover the losses are thus in general broader than the exact situations listed above. The improvements proposed include an enhancement of SCTP to deduce the mis indication counts from an enhanced SACK scoreboard, thus removing some of the vulnerability of the present SCTP mis indication counting to loss of SACKs.
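As a rough illustration of the tail drop situations listed above, and of how SACK loss broadens them, the following sketch counts the mis indications a dropped packet can collect under plain [RFC4960] counting. The function name and the simplification of one SACK (and one mis indication) per later packet received are assumptions made for illustration only; they are not part of the proposal.

```python
DUPTHRESH = 3  # mis indication threshold of [RFC4960]


def fast_retransmit_possible(packets_in_burst, lost_index, sacks_lost=()):
    """Return True if the dropped packet can collect DUPTHRESH mis indications.

    Simplifying assumptions (hypothetical, for illustration): every packet
    sent after the dropped one arrives, each arrival triggers one SACK, and
    each SACK that reaches the sender yields exactly one mis indication.
    sacks_lost holds the indices of later packets whose SACK was dropped.
    """
    later_packets = range(lost_index + 1, packets_in_burst)
    indications = sum(1 for i in later_packets if i not in sacks_lost)
    return indications >= DUPTHRESH
```

With a burst of 10 packets, a drop of the packet with index 2 can be recovered by Fast Retransmission, a drop of the packet with index 8 cannot, and a drop of the packet with index 6 can be recovered only as long as none of the three returning SACKs is lost.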

It is noted that the Early Retransmit algorithm, [RFC5827], addresses activation of Fast Recovery for a particular subset of the above tail drop situations. The solution proposed embeds (as a special case) the Early Retransmit algorithm in the delayed variant experimented with for TCP in [DUKKIPATI02], in which Early Retransmission is only activated provided a certain time has elapsed since the lowest outstanding TSN was transmitted. The delay adds robustness towards spurious retransmissions caused by "mild" packet re-ordering as documented for TCP in [DUKKIPATI02].
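The delayed Early Retransmit condition may be sketched as below. The lowered threshold (outstanding packets minus one, when fewer than 4 packets are outstanding and no new data can be sent) follows the packet-based variant of [RFC5827]. The function name, the parameter names and the value of the delay are assumptions for illustration; this document does not specify them here.

```python
def delayed_early_retransmit(mis_indications, outstanding_packets,
                             new_data_can_be_sent, now, lowest_tsn_sent_at,
                             delay, dupthresh=3):
    """Return True if the lowest outstanding TSN may be fast retransmitted.

    All times are in the same unit (e.g. seconds). 'delay' is a hypothetical
    tuning parameter, not a value taken from this document.
    """
    # Ordinary Fast Retransmit: the full threshold has been reached.
    if mis_indications >= dupthresh:
        return True
    # Early Retransmit only applies to small amounts of outstanding data
    # when no new data can be sent to trigger further SACKs.
    if outstanding_packets >= 4 or new_data_can_be_sent:
        return False
    er_thresh = outstanding_packets - 1
    if mis_indications < 1 or mis_indications < er_thresh:
        return False
    # Delayed variant: act only once the lowest outstanding TSN has been
    # outstanding for at least 'delay'.
    return now - lowest_tsn_sent_at >= delay
```

With two packets outstanding and one mis indication received, the sketch retransmits only after the delay has elapsed, which gives a mildly re-ordered packet time to be acknowledged first.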

1.1. SCTP TLR Function

The function proposed for enhancements of the SCTP Loss Recovery operation for Traffic Tail Losses is divided in two parts:

It is noted that, depending on the exact situation (e.g., drop pattern, congestion window and amount of data in flight), T3-retransmission procedures need not be inferior to Fast Retransmission procedures. Rather, in some situations T3-retransmission will indeed be superior, as T3-retransmission allows for ramp up of the congestion window during the recovery process and as it, by its nature of declaring all outstanding data as lost, never risks being blocked by congestion window limitations. The changes proposed in this document focus on improving the Loss Recovery operation of SCTP by enforcing timely activation of (improved) Fast Retransmission algorithms. With the purpose of reducing the latency of the TCP and SCTP Loss Recovery operation, [HURTIG] has taken the alternative approach of accelerating the activation of T3-retransmission processes when Fast Recovery is not able to kick in to recover the loss. [HURTIG] only addresses a subset of the tail loss scenarios in scope of the work presented here. The ideas of [HURTIG] for accurate RTO restart are drawn on in the solution proposed here for accurate restart of the new tail loss probe timer (PTO-timer) as well as for accurate setting of the T3-timer under certain conditions, thus harvesting some of the same latency optimizations as [HURTIG].

OPEN ISSUE: It is to be determined if [HURTIG], or plain T3-retransmission of [RFC4960], is preferable to the solution proposed here in certain situations. Speculated situations include situations where the Fast Retransmission algorithm (when activated via the new proactive approach) is blocked by congestion control (CC) limitations. If the issue is significant, the remedy may be to look for special purpose amendments, such as amending the CC operation during SCTP FR or redesigning the solution to promote proactive T3-retransmission operation rather than Fast Retransmission in certain situations. Yet another remedy may be to look to improve the CC operation of SCTP in general.

The SCTP TLR procedures proposed apply as add-on supplements to any SCTP implementation based on [RFC4960]. The procedures are sender-side only and do not impact the SCTP receiver.

1.2. TCP applicability

SCTP Loss Recovery operation is at its core based on the design of Loss Recovery for TCP with SACK enabled. The enhancements of SCTP Tail Loss Recovery proposed here are readily applicable to TCP.

It is noted that while the SCTP TLR algorithms and the SCTP TLR state machine defined here are inspired by the timer driven tail loss probe approach specified in [DUKKIPATI01] for TCP, the solution defined here differs in the approach taken. The approach here is a clean slate approach defining a new comprehensive SCTP TLR state machine on top of (in addition to) the existing Fast Recovery and T3-Recovery states, covering all tail loss patterns, whereas the approach of [DUKKIPATI01] relies on a number of experimental mechanisms ([DUKKIPATI02], [MATHIS], [RFC5827]) defined for TCP in the IETF or in research, with ad hoc extensions to support selected tail loss patterns by addition of the tail loss probe mechanism and the activation of those mechanisms driven by it.

1.3. Packet Re-ordering

The solution proposed is an enhancement of the existing mis indication counting based Fast Recovery operation of SCTP, [RFC4960], and as such the solution inherits the fundamental vulnerability to packet re-ordering that the SCTP Fast Recovery algorithm of [RFC4960] embeds.

The solution does not increase the vulnerability of Loss Recovery to packet-reordering as demonstrated by (to be filled in).

1.4. Congestion Control

It shall be noted that, by its very nature of prompting for activation of Fast Recovery instead of T3-Recovery, the benefit of the solution proposed versus the existing solution of [RFC4960] will depend on the CC operation not only during the recovery process but also after exit of the recovery process. In this context it is noted that the prior approach taken for TCP, [DUKKIPATI01], has been documented for a TCP implementation running CUBIC, whereas SCTP runs a CC algorithm more similar to TCP Reno CC as defined by [RFC5681].

The solution at present is defined within the constraints of the existing Congestion Control principles of SCTP as defined by [RFC4960]. It is anticipated that Congestion Control improvements are desirable for SCTP in general as well as for the functions defined here in particular.

2. Conventions and Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3. Description of Algorithms

3.1. SCTP Scoreboard and Mis Indication Counting Enhancements

3.1.1. Highest TSN Newly Acknowledged Extension

Entering of Fast Recovery in SCTP, as specified by [RFC4960], is driven by mis indication counts. When a TSN has received dupthresh=3 mis indication counts, the TSN is declared lost and becomes eligible for fast retransmission via the Fast Recovery procedure.

Mis indication counts are in RFC4960 SCTP driven entirely by receipt of SACKs in accordance with the Highest TSN Newly Acknowledged algorithm (section 7.2.4 of [RFC4960]):

An evident issue with the HTNA algorithm is that it is vulnerable to loss of SACKs. In many situations loss of SACKs will result only in a slightly delayed entering of Fast Recovery for a dropped TSN, but in general, by relying on the HTNA algorithm only, loss of SACKs will further broaden the traffic tail situations where Fast Recovery either will not be activated in a timely manner or will not be activated at all because only an insufficient number of SACKs is received.

In order to make SCTP Fast Recovery more robust towards drop of SACKs we describe the following extension of the HTNA algorithm to be supported by an SCTP implementation:

The solution is robust towards split SACKs. The solution requires the SCTP implementation to keep track of the relationship between chunks and packets. One solution is for the SCTP implementation to maintain a monotonically incrementing packet sequence number to map chunks to packets and, for each outstanding chunk, to keep state of the packet id that the chunk was sent in as well as (incrementally updated) the packet ids of up to dupthresh-1 (=2) packets ahead of line for which chunks have been SACKed.

As an alternative to the above accurate packet counting, an SCTP implementation MAY instead support the following byte counting based extension of the RFC4960 HTNA algorithm:

For both solutions (NAPhol, HBNA) it is noted that an SCTP implementation only needs to keep count of the mis indications up to the dupthresh=3 threshold level. Equally, an implementation need not track the exact number of packets ahead of line or the exact number of bytes ahead of line of a certain missing TSN once this number surpasses the dupthresh=3 threshold.
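The chunk-to-packet bookkeeping described above may be sketched as follows. For brevity the sketch recomputes the count of SACKed packets ahead of line from the scoreboard instead of maintaining it incrementally, and it ignores TSN wrap-around; the class and method names are hypothetical.

```python
DUPTHRESH = 3


class PacketScoreboard:
    """Derives mis indication counts from SACKed packets ahead of line."""

    def __init__(self):
        self.next_packet_id = 0  # monotonically incrementing packet number
        self.packet_of = {}      # TSN -> id of the packet the chunk was sent in
        self.sacked = set()      # TSNs reported as received in any SACK so far

    def send_packet(self, tsns):
        """Record that the chunks with the given TSNs were sent in one packet."""
        packet_id = self.next_packet_id
        self.next_packet_id += 1
        for tsn in tsns:
            self.packet_of[tsn] = packet_id
        return packet_id

    def on_sack(self, sacked_tsns):
        """Merge the TSNs acknowledged by a SACK into the scoreboard."""
        self.sacked.update(t for t in sacked_tsns if t in self.packet_of)

    def mis_indications(self, tsn):
        """Count distinct later packets with SACKed chunks, capped at DUPTHRESH."""
        if tsn in self.sacked:
            return 0
        own_packet = self.packet_of[tsn]
        ahead = {self.packet_of[t] for t in self.sacked
                 if self.packet_of[t] > own_packet}
        return min(len(ahead), DUPTHRESH)
```

Because the count is derived from what the scoreboard holds rather than from the number of SACKs received, a single surviving SACK that covers three later packets yields three mis indications, while two chunks SACKed from the same packet count only once.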

This last byte based approach follows the approach taken for TCP, IsLost(), in [RFC6675]. It is noted, however, that due to the message based approach of SCTP, a byte based approach will generally be less accurate as a measure of the number of packets received ahead of line than it is for byte stream based TCP.

OPEN ISSUE: Check alignment with the algorithms defined in [HURTIG]. If relevant, align.
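A byte based loss test in the style of IsLost() of [RFC6675] may be sketched as below. The use of the PMTU in place of the TCP SMSS, the function name and the argument layout are assumptions for illustration, and TSN wrap-around is ignored.

```python
def is_lost_by_bytes(tsn, chunk_sizes, sacked, pmtu, dupthresh=3):
    """Return True if more than (dupthresh - 1) * pmtu bytes are SACKed ahead.

    chunk_sizes maps each outstanding or SACKed TSN to its size in bytes and
    sacked is the set of TSNs that have been SACKed. Using the PMTU as the
    per-packet size is an assumption, mirroring the SMSS of [RFC6675].
    """
    bytes_ahead = sum(size for t, size in chunk_sizes.items()
                      if t > tsn and t in sacked)
    return bytes_ahead > (dupthresh - 1) * pmtu
```

With a PMTU of 1000 bytes, three SACKed chunks of 1000 bytes ahead of a missing TSN declare it lost, whereas three SACKed chunks of 100 bytes, each sent in its own packet, do not. This is the inaccuracy for message based traffic noted above.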

3.2. RFC6675 nextseg() Tail Loss Enhancements for SCTP FR

The Fast Recovery algorithm for TCP as specified in [RFC6675] differs in some respects from the fast retransmission algorithm specified for SCTP by [RFC4960]. Of particular significance for recovery of losses in traffic tail scenarios is the fact that the [RFC6675] algorithm, once Fast Recovery has been activated, takes two "last resort" retransmission measures, step 3) and step 4) of NextSeg() of [RFC6675], that facilitate the recovery of losses in situations where the number of SACKs that can be generated is insufficient to complete the Fast Recovery process without resorting to T3-timeout. For SCTP Fast Recovery we formulate the equivalent measures as follows:

Last Resort Retransmission:
If the following conditions are met:

then an outstanding TSN less than or equal to the Fast Recovery Exit Point, for which there exist SACKs of chunks ahead of line of the TSN, may be retransmitted provided the CWND allows. The bytes of a TSN which is retransmitted in this manner are not subtracted from the flight size, neither prior to this action being taken nor as a result of this action. If the mis indication count of the TSN subsequently reaches the dupthresh value, the bytes of the TSN shall be subtracted from the flight size. Once the TSN is acknowledged, the remaining contribution of this TSN to the flight size (whether it is counted there once or twice at this point in time) is subtracted. A TSN which is retransmitted in this manner will be marked as ineligible for a subsequent fast retransmit.

Rescue:
If all of the following conditions are met:

and there exist non-SACKed, non fast retransmitted TSNs within the Fast Recovery Exit Point, then for this entry of Fast Recovery, on condition that the CWND allows, we allow for fast retransmission of one packet of consecutive outstanding non fast retransmitted TSNs up to PMTU size, the highest TSN of which MUST be the highest outstanding TSN within the Fast Recovery Exit Point. The bytes of a TSN which is retransmitted in this manner are not subtracted from the flight size, neither prior to this action being taken nor as a result of this action. If the mis indication count of the TSN subsequently reaches the dupthresh value, the bytes of the TSN shall be subtracted from the flight size. Once the TSN is acknowledged, the remaining contribution of this TSN to the flight size (whether it is counted there once or twice at this point in time) is subtracted. A TSN which is retransmitted in this manner will be marked as ineligible for a subsequent fast retransmit.

An implementation of the Rescue operation may be accomplished by maintaining a RescueRTX parameter as described for TCP in [RFC6675].

DISCUSSION: In addition to the HTNA algorithm, [RFC4960] demands that additional mis indication counting be performed during Fast Recovery according to the following prescription (section 7.2.4 of [RFC4960]):
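The flight size accounting prescribed for TSNs retransmitted by the Last Resort Retransmission and Rescue operations may be sketched as below. The sketch assumes that the retransmitted copy is itself added to the flight size, which is how the remark that a TSN may be counted "once or twice" is read here; the class and method names are hypothetical.

```python
class FlightAccounting:
    """Tracks how many copies of each TSN are counted in the flight size."""

    def __init__(self):
        self.flight_size = 0
        self.size = {}            # TSN -> chunk size in bytes
        self.copies = {}          # TSN -> copies currently counted in flight
        self.no_fast_rtx = set()  # TSNs ineligible for a later fast retransmit

    def send(self, tsn, nbytes):
        self.size[tsn] = nbytes
        self.copies[tsn] = 1
        self.flight_size += nbytes

    def special_retransmit(self, tsn):
        """Last Resort or Rescue retransmission of an outstanding TSN."""
        # The original copy stays in flight; the new copy is added
        # (assumption, see the text preceding this sketch).
        self.copies[tsn] += 1
        self.flight_size += self.size[tsn]
        self.no_fast_rtx.add(tsn)

    def reached_dupthresh(self, tsn):
        """The mis indication count of the TSN reached dupthresh."""
        if self.copies[tsn] > 0:
            self.copies[tsn] -= 1
            self.flight_size -= self.size[tsn]

    def acknowledged(self, tsn):
        """Remove whatever contribution of the TSN remains in flight."""
        self.flight_size -= self.copies.pop(tsn) * self.size.pop(tsn)
```

Whether or not the TSN reaches dupthresh before it is acknowledged, its contribution to the flight size ends at zero.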

(#)
If an endpoint is in Fast Recovery and a SACK arrives that advances the Cumulative TSN Ack Point, the miss indications are incremented for all TSNs reported missing in the SACK.

It is noted that under special circumstances (#) makes SCTP Fast Recovery complete in situations where TCP Fast Recovery would only complete by virtue of measure 3) or 4) of [RFC6675], and as such these measures are more critically needed for TCP Fast Recovery operation than for SCTP Fast Recovery operation. However, as documented by (to be filled in), the Last Resort Retransmission operation and the Rescue operation significantly improve the Loss Recovery operation for SCTP as well, both with respect to the latency of the individual loss recovery operation and with respect to the ability of the operation to complete without resort to T3-timeout. Consequently this document prescribes that Enhanced SCTP Tail Loss Recovery implement these procedures.

As the algorithm extension is limited by the existing congestion control algorithm of SCTP, these extensions of SCTP Fast Recovery do not compromise the TCP fairness of the SCTP Fast Recovery operation.

3.3. SCTP-TLR Description

3.3.1. Principles

The Tail Loss Recovery function for SCTP is based on the following principles:

3.3.2. SCTP - TLR Statemachine

In addition to the Fast Recovery state and the T3-Recovery state, the SCTP Tail Loss Recovery function defines 3 states: the SCTP TLR OPEN state, the SCTP TLR PROBE WAIT state and the SCTP TLR DELAY WAIT state. At any given time the SCTP transmission logic will be in one of the 5 states.

Figure 1 illustrates the states and the state transitions.

(to be inserted)

 

Figure 1, Enhanced Loss Recovery State Machine Diagram

In the following we describe the states and the actions taken.

3.3.2.1. SCTP TLR OPEN STATE

In this state SCTP is performing neither Fast Recovery nor T3-recovery. This is the state entered when SCTP sends the first data after idle. In this state SCTP has outstanding data, a PTO timer is running on the lowest outstanding TSN and the SACK scoreboard has no gaps, i.e., the highest SACK'ed TSN is cumulatively acked.

The PTO set on a new lowest outstanding TSN in this state will follow [PTO1] when less than 2 packets are outstanding at the time when the timer is set and follow [PTO2] when 2 or more packets are outstanding when the PTO timer is set.

In this state the following may happen:

3.3.2.2. SCTP TLR PROBE WAIT STATE

In this state the lowest outstanding TSN has remained un-SACK'ed for more than the PTO time and no indication of network responsiveness has been received (i.e., no SACK of higher outstanding TSNs has been received), thus resulting in the transmission of a TLPP to probe for the network responsiveness.

The MAX(PTO, RTO-PTO) T3-value that is set on the lowest outstanding TSN when sending the TLPP and entering this state shall be computed as MAX(PTO1, (RTO-PTO)_previous), where (RTO-PTO)_previous is the value of RTO-PTO at the time the PTO timer was previously set on the lowest outstanding TSN.
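The T3 value rule above amounts to the following computation. The function and parameter names are hypothetical and all values are in the same time unit (milliseconds in the example); the formula for PTO1 itself is not given in this section and is therefore passed in as a parameter.

```python
def t3_value_on_probe(pto1, rto_previous, pto_previous):
    """T3 value set on the lowest outstanding TSN when the TLPP is sent.

    rto_previous and pto_previous are the RTO and PTO values that applied
    when the PTO timer was previously set on the lowest outstanding TSN.
    """
    return max(pto1, rto_previous - pto_previous)
```

As an example, with PTO1 = 200 ms, a previous RTO of 1000 ms and a previous PTO of 300 ms, the T3 value becomes 700 ms, so that PTO plus T3 equals the previous RTO. When the previous PTO was 800 ms, the floor of PTO1 = 500 ms applies instead of 200 ms.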

In this state then the following may happen:

3.3.2.3. SCTP TLR DELAY WAIT STATE

In this state an indication of network responsiveness has been received (in the form of a SACK of a higher TSN than the lowest outstanding TSN) and the PTO timer on the lowest outstanding TSN is running for potential entering of SCTP TLP driven Fast Recovery.

The PTO set on a new lowest outstanding TSN in this state will be [PTO2].
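The PTO selection rules stated for the SCTP TLR OPEN state and the SCTP TLR DELAY WAIT state may be summarized by the following sketch. The enumeration and function names are hypothetical, and the values of [PTO1] and [PTO2] are passed in as parameters since their formulas are not reproduced here.

```python
from enum import Enum


class TlrState(Enum):
    OPEN = 1
    PROBE_WAIT = 2
    DELAY_WAIT = 3
    FAST_RECOVERY = 4
    T3_RECOVERY = 5


def select_pto(state, packets_outstanding, pto1, pto2):
    """Pick the PTO to set on a new lowest outstanding TSN."""
    if state is TlrState.OPEN:
        # [PTO1] with fewer than 2 packets outstanding, [PTO2] otherwise.
        return pto1 if packets_outstanding < 2 else pto2
    if state is TlrState.DELAY_WAIT:
        return pto2
    # Only the OPEN and DELAY WAIT rules are covered by this sketch.
    raise ValueError("no PTO rule sketched for state %s" % state.name)
```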

In this state then the following may happen:

3.3.2.4. Exit of Loss Recovery

After exit of Fast Recovery or T3-Recovery, if data is outstanding, a PTO timer is started on the lowest outstanding TSN and the state transits to either the SCTP TLR OPEN state or the SCTP TLR DELAY WAIT state depending on the status of the SACK scoreboard (i.e., whether gaps exist or not). The PTO timer set will follow the rules described above.
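The state selection at exit of loss recovery may be sketched as below. The function name and the return convention are hypothetical. Returning no state when no data is outstanding is an assumption, based on the SCTP TLR OPEN state being entered only when data is sent after idle.

```python
def state_after_recovery(data_outstanding, scoreboard_has_gaps):
    """Return (next_state, start_pto_timer) at exit of loss recovery."""
    if not data_outstanding:
        # Idle: no TLR state applies and no PTO timer is started (assumption).
        return (None, False)
    if scoreboard_has_gaps:
        # A TSN above the lowest outstanding TSN has been SACKed.
        return ("SCTP TLR DELAY WAIT", True)
    return ("SCTP TLR OPEN", True)
```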

3.3.3. TLPP Transmission Rules

The transmission of a Tail Loss Probe Packet (TLPP), done when entering the SCTP TLR PROBE WAIT state, is governed by the following details:

The motivation for sending TLPP of retransmission in form of one chunk only is that demasking of loss recovery by the TLPP (see Section 3.3.4) is more simple when only one TSN has been used as a probe.

TLPP Transmission conditions:

The above rules are defined to support detection of TLPP recovered losses by the algorithm described in Section 3.3.4.

3.3.4. TLPP Recovered Losses

If a single SCTP packet is lost, there is a risk that the TLPP itself might repair the loss if that particular lost packet is used as probe. The masking problem is only present if the TLPP is based on retransmission data (i.e., not if the TLPP is based on new data). The TLPP might mask the loss and thus interfere with the congestion control principle that requires CWND halving when a loss is detected.

At present the solution in this document operates with the algorithm defined for this purpose in [DUKKIPATI01] with a slight adjustment to SCTP to rely on the D-SACK (duplicate TSN received) information available from SCTP SACK. The solution operates with a conceptual TLPP Retransmission Episode, as follows:

OPEN ISSUE: The above solution is vulnerable to spurious CWND halving when a TLPP is re-ordered compared to a subsequent new data chunk sent. A possible solution, contemplated for a number of reasons for SCTP, is to extend SCTP to distinguish retransmitted chunks from original chunks.
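The detection idea may be illustrated by the following sketch, which is a loose adaptation of the episode logic of [DUKKIPATI01] to the duplicate TSN reports of the SCTP SACK. It is an illustration under stated assumptions only and not the normative procedure: the class name, the method names and the rule used to end the episode (cumulative acknowledgement of a TSN sent after the probe) are assumptions, and TSN wrap-around is ignored.

```python
class TlppEpisode:
    """Tracks one TLPP Retransmission Episode for a single probe TSN."""

    def __init__(self, probe_tsn, highest_tsn_sent):
        self.probe_tsn = probe_tsn            # TSN retransmitted as probe
        self.episode_end = highest_tsn_sent   # highest TSN sent at probe time
        self.pending = True                   # outcome not yet determined

    def on_sack(self, cum_tsn_ack, duplicate_tsns):
        """Return True if the probe repaired a loss and CWND should be halved."""
        if not self.pending:
            return False
        if self.probe_tsn in duplicate_tsns:
            # The receiver got both copies: the original was not lost, so
            # the probe masked nothing.
            self.pending = False
            return False
        if cum_tsn_ack > self.episode_end:
            # Data sent after the probe is acknowledged without a duplicate
            # report for the probe: the probe repaired a loss.
            self.pending = False
            return True
        return False
```

In this sketch a SACK that reports the probe TSN as a duplicate ends the episode without congestion response, while a SACK that cumulatively acknowledges data sent after the probe, without any duplicate report having been seen, ends the episode with CWND halving.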

3.4. SCTP MH Considerations

The functions defined have been implemented for SCTP MH. MH aspects to be filled in.

4. Evaluation of function

Experiments in progress. Details to be filled in.

5. Socket API Considerations

This section will describe how the socket API defined in [RFC6458] is extended to provide a way for the application to control the retransmission algorithms in operation in the SCTP layer.

Socket option for control of the features is yet to be defined.

Please note that this section is informational only.

6. Security Considerations

There are no new security considerations introduced by the functions defined in this document.

7. Acknowledgements

The author acknowledges Henrik Jensen for his very significant contribution to the definition of, the implementation of and the experiments with the function.

The work draws heavily on prior art for TCP, [DUKKIPATI01] in particular. The contributors of that work should be credited for many of the ideas put forward here for SCTP.

8. IANA Considerations

This document does not create any new registries or modify the rules for any existing registries managed by IANA.

9. References

9.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC4960] Stewart, R., "Stream Control Transmission Protocol", RFC 4960, September 2007.
[RFC5062] Stewart, R., Tuexen, M. and G. Camarillo, "Security Attacks Found Against the Stream Control Transmission Protocol (SCTP) and Current Countermeasures", RFC 5062, September 2007.
[RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M. and Y. Nishida, "A Conservative Loss Recovery Algorithm Based on Selective Acknowledgment (SACK) for TCP", RFC 6675, August 2012.

9.2. Informative References

[DUKKIPATI01] Dukkipati, N., Cardwell, N., Cheng, Y. and M. Mathis, "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of Tail", Work Expired, February 2013.
[DUKKIPATI02] Dukkipati, N., Mathis, M., Cheng, Y. and M. Ghobadi, "Proportional Rate Reduction for TCP", Proceedings of the 11th ACM SIGCOMM Conference on Internet Measurement, November 2011.
[HURTIG] Hurtig, P., et al., "TCP and SCTP RTO Restart", draft-ietf-tcpm-rtorestart-03, IETF Work In Progress, July 2014.
[MATHIS] Mathis, M., "FACK", ACM SIGCOMM Computer Communication Review 26,4, October 1996.
[RFC5681] Allman, M., Paxson, V. and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009.
[RFC5827] Allman, M., Avrachenkov, K., Ayesta, U., Blanton, J. and P. Hurtig, "Early Retransmit for TCP and Stream Control Transmission Protocol (SCTP)", RFC 5827, May 2010.
[RFC6458] Stewart, R., Tuexen, M., Poon, K., Lei, P. and V. Yasevich, "Sockets API Extensions for the Stream Control Transmission Protocol (SCTP)", RFC 6458, December 2011.

Author's Address

Karen E. E. Nielsen
Ericsson
Kistavaegen 25
Stockholm, 164 80
Sweden

EMail: karen.nielsen@tieto.com