Internet Engineering Task Force                           Urtzi Ayesta
Internet Draft                                      France Telecom R&D
Document: draft-ayesta-to-short-tcp-00.txt      Konstantin Avrachenkov
Expires: April 2003                                              INRIA
                                                          October 2002

        On reducing the number of TimeOuts for short-lived TCP
                             connections

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document shows that short TCP sessions are prone to timeouts.
   In particular, a single segment loss will force TCP to time out if
   the document size is below a certain threshold.  This document
   analyzes the benefit of TCP modifications such as the Limited
   Transmit Algorithm [RFC3042] and Increasing Initial Window
   [RFC2414] in the context of short-lived TCP transfers.  Even with
   these modifications, however, TCP remains vulnerable to losses at
   the very end of the transmission.  Therefore we suggest
   modifications, complementary to the Limited Transmit Algorithm, to
   recover effectively from losses at the end of a TCP transfer.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119 [2].

1. Introduction

   A TCP sender requires the reception of three duplicate
   acknowledgements (ACKs) to recover from a segment loss without
   timing out.  Consequently, losses at the very end of the
   transmission will inevitably provoke a timeout.  This can
   particularly degrade the TCP performance of short-lived sessions.
   This document analyzes possible modifications to reduce the timeout
   probability.

   RFC 2988 [RFC2988] defines the standard algorithm to compute the
   retransmission timeout (RTO).  In particular, RFC 2988 recommends
   rounding this timer up to 1 second to avoid retransmitting segments
   that are only delayed and not lost.  Because of this conservative
   RTO definition, it is important for TCP senders to detect and
   recover from as many losses as possible without a timeout.

   The TCP loss recovery mechanism has undergone several modifications
   over recent years.  The fast retransmit algorithm, which was
   developed in Tahoe TCP [Jac88], retransmits an unacknowledged
   segment upon reception of three duplicate ACKs, sets the slow-start
   threshold to half of the current congestion window, sets the
   congestion window to one segment and begins slow start.  In the
   fast recovery algorithm proposed in the Reno TCP version [FF96],
   after three duplicate ACKs are received, the congestion window is
   halved and congestion avoidance replaces slow start.  TCP's
   selective acknowledgement (SACK) option [RFC2018] permits the
   receiver to inform the sender about the data blocks that were
   successfully received.

   Recently two new modifications have been proposed: Increasing the
   Initial Window (IW) [RFC2414] and the Limited Transmit Algorithm
   (LT) [RFC3042].  According to the IW proposal, the initial size of
   the congestion window is increased from one or two segments to
   roughly 4K bytes (never more than four segments).  This
   modification benefits the individual connection in several ways
   [RFC2414].
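The RFC 2988 timer computation mentioned above can be illustrated with a minimal sketch.  The class and parameter names below are ours, and the clock granularity G is an assumed value; the constants (alpha = 1/8, beta = 1/4, the factor 4, and the 1-second minimum) follow RFC 2988:

```python
class RtoEstimator:
    """Minimal sketch of the RFC 2988 retransmission timer computation."""

    def __init__(self, granularity=0.1):
        self.g = granularity      # clock granularity G, seconds (assumed)
        self.srtt = None          # smoothed round-trip time
        self.rttvar = None        # round-trip time variation
        self.rto = 3.0            # initial RTO before any RTT sample

    def update(self, r):
        """Feed one RTT measurement r (seconds); return the new RTO."""
        if self.srtt is None:
            # First RTT measurement
            self.srtt = r
            self.rttvar = r / 2.0
        else:
            # Subsequent measurements, with beta = 1/4 and alpha = 1/8
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - r)
            self.srtt = 0.875 * self.srtt + 0.125 * r
        self.rto = self.srtt + max(self.g, 4.0 * self.rttvar)
        # The conservative rounding up to 1 second recommended by
        # RFC 2988; this is the minimum discussed in this document.
        self.rto = max(self.rto, 1.0)
        return self.rto
```

Note how, for a connection with an RTT of 100 ms, the computed RTO is clamped to 1 second, i.e., roughly ten round trip times: this is why a timeout is so costly for a short transfer.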
   In the particular case of short-lived TCP we note that IW reduces
   the transfer time by several round trip times (RTTs) and makes TCP
   more robust against segments lost at the very beginning of the
   connection.

   With LT, the TCP sender sends a new data segment in response to
   each of the first two duplicate ACKs.  Eventually it will receive a
   third duplicate ACK, which will trigger the fast retransmit and
   fast recovery phases.  Clearly, transmitting these new data
   segments increases the probability that TCP can recover from lost
   segments without timing out (see [Flo01] for simulation examples of
   LT).

   In the literature it has been reported that many timeouts are due
   to fast recovery not being triggered.  In [LK98] the authors
   analyzed part of the traces collected by Paxson [Pax97] and found
   that 85% of the timeouts were due to this reason.  [BPS+97] found
   that almost 50% of the losses required a timeout to recover, and
   that only 4% of them could have been avoided with the TCP selective
   acknowledgement (SACK) mechanism and 25% with LT.  Unfortunately,
   to the best of our knowledge some important questions remain open:
   Why do TCP senders not receive enough duplicate ACKs?  Is this
   because of the small size of the congestion window or because of
   the burstiness of the segment loss process?

   So far the same TCP algorithm is used regardless of the size of the
   file to be transmitted.  It is known (see, e.g., [FBP+01]) that a
   TCP session typically belongs to one of two kinds: "mice" or
   "elephants".  Most TCP sessions are "mice" of small size, but a
   small number of "elephants" (in terms of flows) is responsible for
   the largest share of the transferred data (in terms of bytes),
   approximately 80% according to [GM01].  In [TMW97] the authors,
   based on backbone measurements, found that the average flow size
   was 10 Kbytes.
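The Limited Transmit behavior described above can be sketched as a fragment of a sender's duplicate-ACK handler.  The sender object and its method names are hypothetical illustrations, not an actual implementation:

```python
def on_dup_ack(state):
    """Sketch of duplicate-ACK handling with Limited Transmit
    (RFC 3042).  `state` is a hypothetical sender object."""
    state.dupacks += 1
    if state.dupacks <= 2:
        # Limited Transmit: send one NEW (previously unsent) segment
        # for each of the first two duplicate ACKs, provided new data
        # is available and the receiver's advertised window allows it.
        if state.unsent_data > 0 and state.flightsize() < state.rwnd:
            state.send_new_segment()
    elif state.dupacks == 3:
        # Third duplicate ACK: standard fast retransmit/fast recovery.
        state.fast_retransmit()
```

The condition `state.unsent_data > 0` is exactly where LT stops helping: at the end of a transfer there is no new data left to send, which motivates the modification proposed later in this document.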
   More recent measurements on the Sprint IP backbone network [FMD+01]
   show that around 70% of the flows carry less than 1 Kbyte and 90%
   of the flows carry less than 10 Kbytes.

   Note: The values of [FMD+01] do not correspond only to TCP, but to
   all transport protocols.  Still, the authors report that over 90%
   of the traffic is transmitted with TCP, even on links with a
   significant percentage of streaming media.  In [TMW97] the authors
   reported that TCP carries 95% of the bytes, 90% of the segments and
   80% of the flows on the link.

   Therefore, it seems feasible to modify TCP and improve the
   performance of short-lived TCP flows without a significant increase
   of the overall network load.

   The rest of the document is organized as follows.  Section 2
   analyzes in detail the performance of TCP, focusing on short-lived
   TCP sessions.  In Section 3 some possible TCP modifications are
   discussed and simulation results are presented.  The last section
   concludes.

2. What causes short-lived TCP to time out?

   When a loss occurs, the congestion window of the sender will
   continue sliding forward until the lost segment reaches the
   leftmost position.  If the value of the congestion window is less
   than four segments (two with LT), the TCP session will time out.
   There is yet another situation in which the sender will inevitably
   time out.  Namely, if a loss occurs when the remaining amount of
   data is less than three segments, then no matter what the actual
   value of the congestion window is, the sender will not receive
   three duplicate ACKs and will have to rely on a timeout to detect
   the loss.

   As a consequence, one can identify three situations where TCP
   sessions are prone to timeouts.  The first case corresponds to the
   beginning of the session, when the congestion window is below 4 (2
   with LT) segments.  The second case corresponds to the middle part
   of the transfer, when the congestion window is small.
   For example, the congestion window may be limited by a small
   receiver advertised window or by the small bandwidth-delay product
   of the link, or it may have just been reduced by a loss recovery
   phase.  The third case corresponds to the very end of the
   transmission.  Namely, if any of the last three segments is lost,
   the sender will not receive three duplicate ACKs and will
   inevitably time out.  IW helps in the first case; LT helps in the
   first and second cases.  However, neither of them helps at the end
   of the transmission.

   Note: At the end of the transmission the use of LT does not make
   any difference, since LT only sends new data upon the reception of
   the first two duplicate ACKs and does not make the decision to
   retransmit a segment until three duplicate ACKs are received.  To
   the best of our knowledge this case was first observed in [AA02].

   The third case might not be of crucial importance for long-lived
   TCP flows, but it may have a significant effect on the transfer
   time of short-lived TCP.  One can define a threshold on the file
   size (TO-THRESH) such that if the file size is less than this
   threshold, a single loss will inevitably lead to a timeout.
   TO-THRESH is given by the sum of the number of segments that have
   to be transmitted to reach a congestion window of size 4 (2 with
   LT) and the three segments corresponding to the end of the file.
   In the case where the receiver does not employ delayed ACKs we get
   the following values (the contribution of the two intervals is
   shown in brackets):

      Initial          TCP            TCP with
      Window                      Limited Transmit
         1           6 (3+3)          4 (1+3)
         2           5 (2+3)          3 (0+3)
         3           4 (1+3)          3 (0+3)
         4           3 (0+3)          3 (0+3)

   In the case of a receiver employing delayed ACKs we get the
   following values:

      Initial          TCP            TCP with
      Window                      Limited Transmit
         1           7 (4+3)          5 (2+3)
         2           6 (3+3)          4 (1+3)
         3           4 (1+3)          3 (0+3)
         4           3 (0+3)          3 (0+3)

   The values presented in the tables above, along with the reported
   statistics on the file size of TCP flows [GM01,TMW97,FMD+01],
   suggest that the value of TO-THRESH is of the order of the size of
   a major portion of TCP flows.
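The two tables above can be encoded as a simple lookup.  The values are copied from the tables; the function names and the strict "less than" comparison follow this document's definition of TO-THRESH:

```python
# TO_THRESH[(delayed_ack, limited_transmit)][initial_window] =
#   (segments needed to reach the fast-retransmit-capable window,
#    segments corresponding to the end of the file)
TO_THRESH = {
    (False, False): {1: (3, 3), 2: (2, 3), 3: (1, 3), 4: (0, 3)},
    (False, True):  {1: (1, 3), 2: (0, 3), 3: (0, 3), 4: (0, 3)},
    (True,  False): {1: (4, 3), 2: (3, 3), 3: (1, 3), 4: (0, 3)},
    (True,  True):  {1: (2, 3), 2: (1, 3), 3: (0, 3), 4: (0, 3)},
}

def to_thresh(iw, delayed_ack=False, limited_transmit=False):
    """File-size threshold (in segments) below which a single loss
    inevitably leads to a timeout, per the tables above."""
    head, tail = TO_THRESH[(delayed_ack, limited_transmit)][iw]
    return head + tail

def single_loss_forces_timeout(file_segments, iw, delayed_ack=False,
                               limited_transmit=False):
    """True if a file of `file_segments` segments is below TO-THRESH."""
    return file_segments < to_thresh(iw, delayed_ack, limited_transmit)
```

For example, with an initial window of 1 segment and no LT, any file shorter than 6 segments cannot recover from a single loss without a retransmission timeout.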
   Clearly, this implies that TCP's loss recovery mechanism does not
   work well for "mice" type TCP flows.  Balakrishnan et al. [BPS+97]
   concluded from measurements on a busy Internet server that TCP's
   loss recovery performance is poor for short Web transfers.  We
   presume that the end-of-file effect might have had an impact on
   their measurements, although it was not identified by the authors.

   In [AA02] we look at the expected TCP transfer time conditioned on
   the number of losses.  Via simulations and a theoretical model we
   observed a very interesting phenomenon: the non-monotonicity of the
   expected conditional transfer time.  That is, given that certain
   segments are lost, it turns out that on average it may take less
   time to transmit a larger file.  For instance, in the case of one
   loss, the plot of transfer time vs. file size shows a unique peak
   at TO-THRESH.  First the transfer time increases as long as the
   file size is below TO-THRESH; then the transfer time starts to
   decrease, and only beyond some file size does it start to increase
   again.  This behavior is due to the conservative duration of the
   retransmission timer, typically several times greater than an
   average round trip time (RTT).

3. TCP modifications to improve the performance of short-lived TCP
   transfers

   From the previous section we know that short-lived TCP flows are
   particularly vulnerable to segment losses, since in most situations
   they will have to rely on an RTO to recover from them.  The use of
   the LT algorithm reduces the value of TO-THRESH and hence
   alleviates the outlined problem.  However, if there is no new data
   to send (at the end of the file), LT does not help.  Thus, at the
   end of the TCP transfer it might be useful to retransmit early.
   Paxson [Pax97] affirms that the TCP fast retransmission threshold
   could be safely lowered from 3 duplicate ACKs to 2 by introducing a
   20 msec waiting time before retransmitting.
   This strategy could also be adopted in the case where one duplicate
   ACK is received and no further data is queued to send: that is,
   waiting for some time before deciding to retransmit a segment.
   With early retransmission, only the loss of the last segment will
   force the sender to time out.  To overcome this, one can consider
   having TCP send an extra segment at the end of the session
   (containing no data, of course).  This segment would not be sent
   reliably, and its only goal would be to avoid a timeout when the
   last segment is dropped.

   On the other hand, this modification may degrade the performance of
   the network, because early retransmission of segments that were
   only reordered and not lost may lead to an increase of the load and
   loss rate.  Several authors have studied the phenomenon of segment
   reordering.  Paxson [Pax97] transmitted 100 Kbytes between 35
   computers and measured that 0.1%-2% of all segments (data and ACK)
   experienced reordering, and that 12%-35% of the flows (depending on
   the data set) experienced at least one reordered segment.  Bennett
   et al. [BPS99] sent ICMP probes to the MAE-East Internet exchange
   point and found that the probability of a session experiencing
   reordering was over 90%.  They conjecture that reordering is a
   function of network load, and they consider reordering to be a
   result of the use of parallelism in network devices.  Iannaccone et
   al. [JID+02] measured out-of-sequence packets in a Tier-1 IP
   backbone.  In [BS02] the authors develop three techniques to
   measure one-way segment reordering and perform a test over a 20-day
   period.  They establish that over 40% of the paths tested
   experience some reordering during the test.

   Note: ACKs that acknowledge new data are the only ones that make
   the sender increase the congestion window.  If we consider a TCP
   receiver that implements a delayed ACK algorithm with more than
   50 ms idle time, a reordered segment with a segment lag of 1 and a
   time lag of less than 50 ms would not affect the number and the
   rate of ACKs acknowledging new sequence numbers sent by the
   receiver to the sender.
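The early-retransmission strategy proposed above for the end of a transfer could be sketched as follows.  This is our illustrative reading of the proposal, not a specified algorithm: the object names are hypothetical, and the short waiting time is an assumed value in the spirit of the 20 msec wait suggested by Paxson [Pax97] to tolerate mild reordering:

```python
DUP_ACK_THRESH = 3      # standard fast-retransmit threshold
EARLY_RTX_WAIT = 0.02   # assumed short wait (seconds) before an early
                        # retransmission, to tolerate mild reordering

def on_dup_ack_at_eof(state, schedule):
    """Sketch of duplicate-ACK handling with end-of-transfer early
    retransmission.  `schedule(delay, fn)` arms a timer; `state` is
    a hypothetical sender object."""
    state.dupacks += 1
    if state.dupacks >= DUP_ACK_THRESH:
        # Enough duplicate ACKs: standard fast retransmit.
        state.fast_retransmit()
    elif state.unsent_data == 0:
        # End of transfer: fewer than three duplicate ACKs can ever
        # arrive, so instead of waiting for the conservative (>= 1 s)
        # retransmission timeout, retransmit after a short wait.
        schedule(EARLY_RTX_WAIT, state.fast_retransmit)
```

A real implementation would additionally have to avoid arming the timer more than once and cancel it if a cumulative ACK arrives during the wait; those details are omitted here.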
   Clearly, there is variability in the reported values of reordering,
   and it is not possible to conclude whether this variability comes
   from differences in the procedures used to collect and analyze the
   data or from changes in the network (for example, different degrees
   of parallelism in the switches).  [JID+02] is in our opinion the
   most comprehensive study carried out until now, but its results
   have to be confirmed by other studies before concluding that
   reordering is not significant in today's Internet.  To explain the
   differences among the cited papers, it is worth noting that [Pax97]
   focuses on long-lived TCP flows, while in [JID+02] the authors deal
   with the usual mix of "elephant" and "mice" type flows.  The study
   of Bennett et al. [BPS99] is based on measurements taken at a
   particular switch that is known to induce a high level of
   reordering, while [JID+02] is based on flows from a large and
   diverse range of sources and destinations.

   We have implemented this modification in the ns simulator [NS]
   (early retransmission at the end of the transfer on top of LT) to
   evaluate the reduction of the transfer time.  We have not
   investigated the impact of segment reordering because of the
   absence of appropriate models.  We compute the conditional expected
   transfer time given that the flow experiences at least one loss.
   We compare the values of the conditional expected transfer time for
   TCP without LT, TCP with LT, and TCP with LT and file-end early
   retransmission.  In the particular case of RTT=100 ms we obtained
   that LT decreases the conditional expected transfer time of a
   6-segment file by 10%, and our proposal decreases it by 45%.  The
   reduction in conditional expected transfer time decreases as the
   file size increases.  This demonstrates that our modification
   benefits short-lived TCP transfers.  One may expect that the
   increase of the network load due to spurious retransmissions is
   proportional to the number of spurious retransmissions induced by
   the modification.
   If the measurements of the flow size distribution
   [FBP+01,GM01,TMW97,FMD+01] and of the segment reordering rates
   [Pax97,BPS99,JID+02] are representative of the real Internet, one
   expects that our modification will not lead to a significant
   increase of the load and loss rates, particularly if we note that
   some of the reorderings are invisible to the sender due to their
   small time lag [JID+02].

4. Conclusion

   This document analyzes the impact of timeouts on the performance of
   short-lived TCP flows.  The document proposes a modification of TCP
   on top of the LT algorithm to avoid timeouts and hence to reduce
   the transfer time.

Security Considerations

   This document proposes a modification of TCP on top of the Limited
   Transmit Algorithm.  Security considerations concerning the Limited
   Transmit Algorithm are discussed in RFC 3042, and they apply to
   this algorithm as well.  Secondly, when duplicate ACKs are received
   and there is no more data to send, this document proposes that TCP
   retransmit early to avoid timeouts.  This modification does not
   raise any known security issue.

References

   [AA02]    Urtzi Ayesta, Konstantin Avrachenkov, "The Effect of the
             Initial Window Size and Limited Transmit Algorithm on the
             Transient Behavior of TCP Transfers", in Proc. of the
             15th ITC Internet Specialist Seminar, Wurzburg, July
             2002.

   [BPS+97]  Hari Balakrishnan, Venkata Padmanabhan, Srinivasan
             Seshan, Mark Stemm and Randy Katz, "TCP Behavior of a
             Busy Web Server: Analysis and Improvements", in Proc.
             IEEE INFOCOM, San Francisco, CA, March 1998.

   [BPS99]   J.C.R. Bennett, C. Partridge and N. Shectman, "Packet
             Reordering is Not Pathological Network Behavior",
             IEEE/ACM Transactions on Networking, Vol. 7, No. 6,
             December 1999.

   [BS02]    John Bellardo, Stefan Savage, "Measuring Packet
             Reordering", ACM SIGCOMM Internet Measurement Workshop
             2002, Marseille, France, November 2002.

   [CSA00]   Neal Cardwell, Stefan Savage, Thomas Anderson, "Modeling
             TCP latency", in Proc.
             IEEE INFOCOM 2000, Tel-Aviv, Israel, March 2000.

   [CMT98]   K. Claffy, Greg Miller and Kevin Thompson, "The nature of
             the beast: recent traffic measurements from an Internet
             backbone", in Proc. of INET '98, July 1998.

   [FF96]    Kevin Fall, Sally Floyd, "Simulation-based Comparisons of
             Tahoe, Reno and SACK TCP", Computer Communication Review,
             July 1996.

   [Flo01]   Floyd, S., "A Report on Some Recent Developments in TCP
             Congestion Control", IEEE Communications Magazine, April
             2001.

   [FMD+01]  C. Fraleigh, S. Moon, C. Diot, B. Lyles, F. Tobagi,
             "Packet-Level Traffic Measurements from a Tier-1 IP
             Backbone", Sprint ATL Technical Report TR01-ATL-110101,
             November 2001.

   [FBP+01]  S. Ben Fredj, T. Bonald, A. Proutiere, G. Regnie,
             J. Roberts, "Statistical Bandwidth Sharing: A Study of
             Congestion at Flow Level", SIGCOMM 2001.

   [GM01]    Liang Guo, Ibrahim Matta, "The War Between Mice and
             Elephants", in Proc. 9th IEEE International Conference on
             Network Protocols (ICNP'01), Riverside, CA, November
             2001.

   [Jac88]   Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
             1988, Stanford, CA, August 1988.

   [JID+02]  S. Jaiswal, G. Iannaccone, C. Diot, J. Kurose,
             D. Towsley, "Measurement and Classification of
             Out-of-Sequence Packets in a Tier-1 IP Backbone", ACM
             SIGCOMM Internet Measurement Workshop 2002, Marseille,
             France, November 2002.  Extended version available as
             UMass CMPSCI Technical Report TR 02-17.

   [NS]      Ns network simulator.  URL: http://www.isi.edu/nsnam/.

   [LK98]    Lin, D. and Kung, H.T., "TCP Fast Recovery Strategies:
             Analysis and Improvements", in Proc. of INFOCOM '98, San
             Francisco, CA, March 1998.

   [Pax97]   Vern Paxson, "End-to-End Internet Packet Dynamics", ACM
             SIGCOMM, Cannes, France, September 1997.

   [RFC1122] Braden, R., "Requirements for Internet Hosts --
             Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
             Selective Acknowledgement Options", RFC 2018.

   [RFC2414] M. Allman, S. Floyd, C. Partridge, "Increasing TCP's
             Initial Window", RFC 2414, September 1998.  A small
             modification of RFC 2414 has been approved by the IESG to
             go to Proposed Standard on August 28, 2002.

   [RFC2581] M. Allman, V. Paxson, W. Stevens, "TCP Congestion
             Control", RFC 2581, April 1999.

   [RFC2988] Vern Paxson, Mark Allman, "Computing TCP's Retransmission
             Timer", RFC 2988, November 2000.

   [RFC3042] Mark Allman, Hari Balakrishnan, Sally Floyd, "Enhancing
             TCP's Loss Recovery Using Limited Transmit", RFC 3042,
             January 2001.

   [TMW97]   Kevin Thompson, Gregory J. Miller and Rick Wilder,
             "Wide-area Internet traffic patterns and
             characteristics", IEEE Network, 11(6), November 1997.

   [1]       Bradner, S., "The Internet Standards Process -- Revision
             3", BCP 9, RFC 2026, October 1996.

   [2]       Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

Acknowledgments

Author's Addresses

   Urtzi Ayesta
   France Telecom R&D
   905 rue Albert Einstein
   06921 Sophia Antipolis
   France
   Email: Urtzi.Ayesta@francetelecom.com

   Konstantin Avrachenkov
   INRIA
   2004 route des Lucioles, B.P. 93
   06902 Sophia Antipolis
   France
   Phone: 00 33 492 38 7751
   Email: k.avrachenkov@inria.fr