Network Working Group                                   Noritoshi Demizu
INTERNET-DRAFT                                                      NICT
Expires: March 8, 2005                                 September 8, 2004


        A Modification to Make PAWS Robust to Segment Reordering

Status of this Memo

   By submitting this Internet-Draft, I certify that any applicable
   patent or other IPR claims of which I am aware have been disclosed,
   or will be disclosed, and any of which I become aware will be
   disclosed, in accordance with RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright Notice

   Copyright (C) The Internet Society (2004). All Rights Reserved.

Abstract

   The purpose of PAWS (Protect Against Wrapped Sequence numbers),
   defined in RFC1323, is to protect a TCP connection against old
   duplicate segments from the same connection. There is a possibility,
   however, that PAWS may discard valid reordered segments. This memo
   proposes a modification to the TCP Timestamps option to solve this
   problem.

Demizu                     Expires March 2005                   [Page 1]

Internet-Draft                                            September 2004

1. Introduction

   The purpose of PAWS (Protect Against Wrapped Sequence numbers)
   [RFC1323] is to protect a TCP connection against old duplicate
   segments from the same connection. The PAWS mechanism uses the TCP
   Timestamps option [RFC1323] to detect old duplicate segments. PAWS
   could, however, falsely discard valid segments that are delayed due
   to reordering.
   For example, if a retransmitted segment sent by Fast Retransmit
   [RFC2581] is reordered and arrives at the receiver earlier than some
   delayed segments sent prior to the retransmitted segment, and if the
   timestamp on the retransmitted segment is newer than the timestamps
   on the delayed segments, the delayed segments will be discarded by
   PAWS at the receiver.

   In the case where Limited Transmit [RFC3042] is used as well, a
   retransmitted segment sent by Fast Retransmit might follow two new
   data segments sent by Limited Transmit in a row. In this case, the
   problem described in the previous paragraph would be more likely to
   occur.

   In the case where NewReno [RFC3782] is used, the problem could occur
   if a retransmitted segment sent by NewReno is reordered and arrives
   at the receiver earlier than some new data segments sent prior to
   the retransmitted segment.

   [Pax97] and [BPS99] show that segment reordering is not a rare
   event. If valid segments were falsely discarded by PAWS due to
   reordering, there would be negative effects on TCP performance.

   This memo proposes a modification to the TCP Timestamps option to
   solve this problem.

   The memo is organized as follows. Section 2 describes the problem by
   showing examples. Section 3 discusses possible methods of solving
   the problem. Section 4 proposes a solution.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. Problem Description

   There is a possibility that valid segments could be discarded by
   PAWS when those segments are delayed because of reordering. This
   section shows some examples of this problem, then describes a
   generic scenario and some possible negative effects.

2.1 Example 1: Reordering and Fast Retransmit with Limited Transmit

   In this example, suppose TCP A is sending data to TCP B.
   Assume that TCP A supports the TCP Timestamps option [RFC1323], TCP
   Congestion Control [RFC2581], and Limited Transmit [RFC3042], and
   that TCP B supports the TCP Timestamps option with PAWS [RFC1323].

   Suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 is sent by
   TCP A, where the letter indicates the sequence number and the digit
   represents the timestamp (TSval). In this data segment sequence,
   suppose W.1 and X.2 are sent in the Congestion Avoidance phase, Y.3
   and Z.4 are sent by Limited Transmit, and A.5 is sent by Fast
   Retransmit.

   Figure 1 illustrates the data segment sequence observed at TCP A.
   The x-axis represents time, and the y-axis represents the sequence
   number. W.1 through Z.4 and A.5 indicate the data segments sent.
   Each 'o' mark indicates a received ACK segment. Lines are drawn with
   symbol characters between data segments and between ACK segments.

      Sequence number
          A                       Z.4
          |                  Y.3~~   \
          |             X.2~~         \
          |        W.1~~               \
          |      ~~                     \
          |                              A.5
          |               o____o____o____o
          |          o~~~~     1    2    3!!  <-- dup-ACK count
          |     o~~~~
          +--------------------------------> Time

           Figure 1: Time vs. sequence number at TCP A

   Now, suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 sent
   by TCP A is reordered as W.1, X.2, Y.3, A.5, Z.4 (i.e., Z.4 and A.5
   are exchanged) on the path to TCP B. Figure 2 illustrates the
   resulting data segment sequence observed at TCP B. What happens at
   TCP B is described below.

      0. Assume TS.Recent is valid and TS.Recent == 0.
         Assume RCV.NXT == A.

      1. W.1 is received.
         PAWS accepts it because TS.Recent < 1.
         TS.Recent is not updated because RCV.NXT < W.

      2. X.2 is received.
         PAWS accepts it because TS.Recent < 2.
         TS.Recent is not updated because RCV.NXT < X.

      3. Y.3 is received.
         PAWS accepts it because TS.Recent < 3.
         TS.Recent is not updated because RCV.NXT < Y.

      4. A.5 is received.
         PAWS accepts it because TS.Recent < 5.
         TS.Recent is updated because RCV.NXT == A and A.5 has data.
         Now, TS.Recent == 5 and RCV.NXT >= A + the data length of A.5.
         (The actual new value of RCV.NXT depends on the out-of-order
         data queue in TCP B.)

      5. Z.4 is received.
         PAWS discards it because TS.Recent > 4.

   In this example, the valid segment Z.4 is discarded by PAWS in
   step 5. Figure 2 illustrates this scenario.

      Sequence number
          A                                Z.4
          |                  Y.3          /
          |             X.2~~    \       /
          |        W.1~~          \     /
          |      ~~                \   /
          |                         A.5
          |
          +--------------------------------> Time

      +---------+-------------------------------+
      |Segment  |(prev) W.1  X.2  Y.3  A.5  Z.4 |
      +---------+-------------------------------+
      |PAWS     |  -    Pass Pass Pass Pass Fail|
      |TS.Recent|  0     0    0    0    5    5  |
      |RCV.NXT  |  A     A    A    A   >A   >A  |
      +---------+-------------------------------+

           Figure 2: Time vs. sequence number at TCP B

   Even in the case where TCP A does not support Limited Transmit
   (i.e., in the case where Y.3 and Z.4 are not sent in the example
   above), if the data segment sequence W.1, X.2, A.5 sent by TCP A is
   reordered as W.1, A.5, X.2 (i.e., X.2 and A.5 are exchanged) on the
   path to TCP B, X.2 could be discarded by PAWS. Since there would be
   a gap, though a small one, between the time when X.2 is sent and the
   time when A.5 is sent, this problem would be less likely to occur
   than in the example above.

2.2 Example 2: Reordering and NewReno

   In this example, suppose TCP A is sending data to TCP B. Assume that
   TCP A supports the TCP Timestamps option [RFC1323], TCP Congestion
   Control [RFC2581], and NewReno [RFC3782], and that TCP B supports
   the TCP Timestamps option with PAWS [RFC1323].

   Suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 is sent by
   TCP A, where the letter indicates the sequence number and the digit
   represents the timestamp (TSval). In the data segment sequence,
   suppose W.1 through Z.4 are sent by Fast Recovery, one each time a
   duplicate ACK segment is received, and A.5 is sent by NewReno.

   Figure 3 illustrates the data segment sequence observed at TCP A.
   This figure uses the same notation as Figure 1.

      Sequence number
          A                       Z.4
          |                  Y.3~~   \
          |             X.2~~         \
          |        W.1~~               \
          |      ~~                     \
          |                              A.5
          |                              o
          |                             /
          |                            /
          |                           /
          |                          /
          |        ..o____o____o____o
          |
          +--------------------------------> Time

           Figure 3: Time vs. sequence number at TCP A

   Now, suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 sent
   by TCP A is reordered as W.1, X.2, Y.3, A.5, Z.4 (i.e., Z.4 and A.5
   are exchanged) on the path to TCP B. The resulting data segment
   sequence observed at TCP B is the same as in Figure 2, and what
   happens at TCP B is also the same as in the example described in
   section 2.1. Consequently, the valid segment Z.4 is discarded by
   PAWS.

2.3 Generic Scenario

   In general, this problem occurs in the following scenario. Suppose
   TCP A is sending data to TCP B, and consider the following steps.

      1. Data segment Z.4 is sent by the sender (TCP A).

      2. Data segment A.5 is sent by the sender (TCP A).
         The sequence number of segment A.5 is lower than that of
         segment Z.4. The value of the TSval on segment A.5 is newer
         than that on segment Z.4.

         Note: Segment A.5 would be a retransmitted segment sent by
         Fast Retransmit, NewReno, SACK [RFC2018][RFC3517], or another
         mechanism that infers a segment loss and retransmits the lost
         data quickly. The sequence number of segment A.5 would be
         less than SND.NXT.

      3. Segment A.5 arrives at the receiver earlier than segment Z.4.
         Suppose that segment A.5 satisfies
         SEG.SEQ <= RCV.NXT < SEG.SEQ + SEG.LEN, and the value of the
         TSval on segment A.5 is not older than TS.Recent at the
         receiver (TCP B). Segment A.5 is accepted by PAWS at the
         receiver. TS.Recent at the receiver is updated with the value
         of the TSval on segment A.5. RCV.NXT is also updated.

      4. Segment Z.4 arrives at the receiver (TCP B).
         Segment Z.4 is discarded by PAWS because the value of the
         TSval on segment Z.4 is older than TS.Recent at the receiver.
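
   The four steps above can be replayed with a small receiver model.
   The following Python sketch is illustrative only: the class
   Receiver, the helpers ts_older() and replay(), the concrete sequence
   numbers (A = 1000, 100 bytes per segment), and the simplified
   out-of-order queue are assumptions made for this example and are not
   taken from [RFC1323]. It applies the PAWS check and the TS.Recent
   update condition described in step 3.

```python
from dataclasses import dataclass, field

MOD = 1 << 32  # TSval is a 32-bit value


def ts_older(a: int, b: int) -> bool:
    """Return True if timestamp a is older than b (modulo 2**32)."""
    return 0 < (b - a) % MOD < (1 << 31)


@dataclass
class Receiver:
    rcv_nxt: int                              # RCV.NXT
    ts_recent: int                            # TS.Recent
    ooo: dict = field(default_factory=dict)   # out-of-order queue: seq -> len

    def receive(self, seq: int, length: int, tsval: int) -> str:
        if ts_older(tsval, self.ts_recent):   # PAWS: SEG.TSval < TS.Recent
            return "Fail"
        if seq <= self.rcv_nxt < seq + length:
            # The segment advances RCV.NXT, so TS.Recent is updated.
            self.ts_recent = tsval
            self.rcv_nxt = seq + length
            while self.rcv_nxt in self.ooo:   # drain queued segments
                self.rcv_nxt += self.ooo.pop(self.rcv_nxt)
        elif seq > self.rcv_nxt:
            self.ooo[seq] = length            # hold until the gap is filled
        return "Pass"


def replay(arrivals, rcv_nxt=1000, ts_recent=0):
    """Feed (seq, length, tsval) tuples to a fresh receiver."""
    rx = Receiver(rcv_nxt=rcv_nxt, ts_recent=ts_recent)
    return [rx.receive(*seg) for seg in arrivals], rx


# Hypothetical numbering: A=1000, W=1100, X=1200, Y=1300, Z=1400.
W1, X2, Y3 = (1100, 100, 1), (1200, 100, 2), (1300, 100, 3)
Z4, A5 = (1400, 100, 4), (1000, 100, 5)

verdicts, rx = replay([W1, X2, Y3, A5, Z4])   # A.5 overtakes Z.4
print(verdicts, rx.ts_recent)
```

   Replaying the arrival order W.1, X.2, Y.3, A.5, Z.4 yields "Pass"
   for the first four segments and "Fail" for Z.4, with TS.Recent left
   at 5. Replaying the segments in the order in which they were sent
   yields "Pass" for all five.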

   In this scenario, the gap between the time when segment Z.4 is sent
   and the time when segment A.5 is sent should be small so that
   reordering could exchange segments Z.4 and A.5.

2.4 Negative effects

   This problem would cause some negative effects on TCP performance. A
   sender would spend additional time detecting a loss and recovering
   from it. Moreover, the sender would consider the loss to be a
   congestion indication, and the congestion window would needlessly be
   further reduced. In addition, discarding valid acceptable segments
   at a receiver is a waste of bandwidth.

3. Discussion

   This section discusses possible methods of solving the problem
   described in section 2. First, two modifications are compared: a
   receiver-side modification and a sender-side modification. Then, the
   better method is discussed in more depth. The purpose of this
   section is to give the technical reasoning behind the specification
   described in section 4.

3.1 Receiver-side Modification

   A straightforward way to solve the problem would be to modify the
   rules of PAWS so that valid delayed segments will be accepted. The
   new rule would be like the following:

      - Change the inequality in R1) in section 4.2.1 of [RFC1323] as
        follows:

           Current:  SEG.TSval < TS.Recent
           Proposal: SEG.TSval < TS.Recent - T1, where T1 = RTT.

      - In addition, to keep TS.Recent monotonically non-decreasing,
        in R3) in section 4.2.1 of [RFC1323], TS.Recent is updated
        only when SEG.TSval >= TS.Recent.

   With this new rule, it would be very important to choose the value
   of T1 appropriately. If T1 was too large, old duplicate segments
   would be accepted, and PAWS would become useless. On the other hand,
   if T1 was too small, valid segments would still be discarded, and
   this new rule would become useless.
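
   The trade-off in choosing T1 can be seen in a minimal sketch of the
   relaxed check. The function names below are hypothetical, and the
   sketch assumes that T1 has already been converted into ticks of the
   sender's timestamp clock, which is the very information a receiver
   does not normally have.

```python
MOD = 1 << 32  # TSval is a 32-bit value


def ts_older(a: int, b: int) -> bool:
    """Return True if timestamp a is older than b (modulo 2**32)."""
    return 0 < (b - a) % MOD < (1 << 31)


def paws_accepts(seg_tsval: int, ts_recent: int, t1_ticks: int = 0) -> bool:
    """Relaxed R1: reject only if SEG.TSval < TS.Recent - T1.

    With t1_ticks == 0 this is the current PAWS test.
    """
    threshold = (ts_recent - t1_ticks) % MOD
    return not ts_older(seg_tsval, threshold)


def update_ts_recent(seg_tsval: int, ts_recent: int) -> int:
    """Guard for R3: TS.Recent only changes when SEG.TSval >= TS.Recent."""
    return ts_recent if ts_older(seg_tsval, ts_recent) else seg_tsval


# With TS.Recent == 5, the delayed segment Z.4 (TSval 4) is rejected by
# the current rule and accepted once T1 covers one tick.
print(paws_accepts(4, 5, t1_ticks=0), paws_accepts(4, 5, t1_ticks=1))
```

   Under these assumptions, T1 = 1 tick is enough to accept Z.4 in the
   examples of section 2, but the same setting would equally accept an
   old duplicate segment whose TSval is 4, and a segment with TSval 3
   would still be rejected.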

   It would be difficult, however, for a receiver to determine the
   value of T1 appropriately, because it would not have enough
   information on the frequency of the timestamp clock that generates
   the TSval values on segments sent by a sender.

      - [RFC1323] recommends a timestamp clock frequency in the range
        of 1 millisecond to 1 second per tick. This is too wide for a
        receiver to be able to choose a practicable value for T1 under
        any circumstances.

      - It might be possible to infer an approximation of the
        frequency by observing the TSval values on received segments.
        The calculation would be very complex, however.

      - It would be possible to introduce a new TCP option to exchange
        the frequency information. Unfortunately, this would take many
        years to deploy. Furthermore, the TCP option space is limited.

   Therefore, it would be difficult to solve the problem on the
   receiver side.

3.2 Sender-side Modification

   Another solution would be to change the rule determining the value
   of the TSval on a retransmitted segment sent by Fast Retransmit,
   NewReno, etc., so that the value of the TSval on such a segment will
   never be newer than the values of the TSvals on segments sent prior
   to the retransmitted segment. The new rule would be like the
   following:

      - When a segment is *NOT* sent by Fast Retransmit, NewReno, etc.,
        use the current timestamp clock for the TSval on the segment,
        and record the value.

      - When a segment *is* sent by Fast Retransmit, NewReno, etc., use
        the recorded value for the TSval on the segment.

   Or, the rule could be simplified, like the following:

      - Introduce a new timestamp variable TS.SndMax to the TCP
        per-connection state. Its initial value would be the current
        timestamp clock.

      - Before sending a segment *NOT* triggered by Fast Retransmit,
        NewReno, etc., update TS.SndMax with the current timestamp
        clock.

      - Use TS.SndMax for the TSval on any segment.

   Now, let us revisit the examples in section 2.
   In those examples, Z.4 is discarded by PAWS because it arrives at
   the receiver later than A.5 and the TSval on Z.4 is older than that
   on A.5. If the new rule above was introduced, however, the value of
   the TSval on A.5 would change from 5 to 4. Therefore, at step 4 in
   section 2.1, TS.Recent at the receiver would become 4 instead of 5.
   Hence, Z.4 would be accepted because its TSval would not be older
   than TS.Recent (TS.Recent == 4). Figure 4 illustrates the new
   scenario observed at TCP B. Thus, the new rule given above would
   solve the problem illustrated by the examples in section 2.

      Sequence number
          A                                Z.4
          |                  Y.3          /
          |             X.2~~    \       /
          |        W.1~~          \     /
          |      ~~                \   /
          |                         A.4
          |
          +--------------------------------> Time

      +---------+-------------------------------+
      |Segment  |(prev) W.1  X.2  Y.3  A.4  Z.4 |
      +---------+-------------------------------+
      |PAWS     |  -    Pass Pass Pass Pass Pass|
      |TS.Recent|  0     0    0    0    4    4  |
      |RCV.NXT  |  A     A    A    A   >A   >A  |
      +---------+-------------------------------+

           Figure 4: Time vs. sequence number at TCP B

   Comparing the two modifications introduced in sections 3.1 and 3.2,
   this memo recommends the sender-side modification in section 3.2.

3.3 Limitation

   The rule introduced in section 3.2 has one limitation: If A.4 (which
   was A.5) arrived at the receiver earlier than both Y.3 and Z.4, Z.4
   would be accepted, but Y.3 would be discarded.

   One way to avoid discarding the last N segments would be as follows:
   Both the TSvals and the sequence numbers of the last N segments
   should be recorded. Then, the value of the TSval on a segment sent
   by Fast Retransmit, NewReno, etc. should be that of the oldest TSval
   among the recorded segments whose sequence number is not less than
   the sequence number plus the data length of the segment being sent.
   This idea would overcome the above limitation, but it would also
   introduce some complexity. Therefore, this memo does not recommend
   this idea.
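
   To make the bookkeeping behind this idea concrete, the following
   Python sketch records the last N new data segments and selects a
   TSval for a retransmission. It is only an illustration of the idea
   that this memo does not recommend: the class name LastNRecorder, its
   method names, the fallback argument used when no recorded segment
   qualifies, and the sequence numbers (A = 1000, 100 bytes per
   segment) are all assumptions. Because the timestamp clock does not
   run backwards, the oldest qualifying TSval is taken to be the one
   recorded earliest.

```python
from collections import deque


class LastNRecorder:
    """Remember (sequence number, TSval) of the last N new data segments."""

    def __init__(self, n: int):
        self.sent = deque(maxlen=n)   # oldest entry first

    def record(self, seq: int, tsval: int) -> None:
        """Call when a segment NOT sent by Fast Retransmit etc. is sent."""
        self.sent.append((seq, tsval))

    def tsval_for_retransmit(self, seq: int, length: int, fallback: int) -> int:
        """Oldest recorded TSval among segments with sequence >= seq + length."""
        end = seq + length
        for sent_seq, tsval in self.sent:   # scanned from oldest to newest
            if sent_seq >= end:
                return tsval
        return fallback                      # nothing recorded beyond this data


# Hypothetical numbering: A=1000, W=1100, X=1200, Y=1300, Z=1400.
rec = LastNRecorder(n=2)
for seq, tsval in [(1100, 1), (1200, 2), (1300, 3), (1400, 4)]:
    rec.record(seq, tsval)

# Retransmitting A with the current clock at 5: the segment is stamped 3,
# so neither Y.3 nor Z.4 would be older than TS.Recent at the receiver.
print(rec.tsval_for_retransmit(1000, 100, fallback=5))
```

   With N = 1 the sketch degenerates to the TS.SndMax rule of section
   3.2 (the retransmission is stamped 4), which shows where the extra
   per-connection state and the scan over recorded segments come from.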

   Another solution would be as follows: When a segment is sent by Fast
   Retransmit, NewReno, etc., the value of the TSval on the segment is
   set to the current timestamp clock - T2, instead of TS.SndMax. It
   would be difficult, however, to choose an appropriate constant value
   for T2. If T2 was too large, the segment would be considered an old
   duplicate segment and discarded by PAWS at a receiver. If T2 was too
   small, the problem described in section 2 might not be solved. An
   appropriate value for T2 would depend on when the current
   outstanding segments were sent. Hence, this idea would not work.

   As a result, this memo does not recommend solutions to the
   limitation described in section 3.3.

3.4 Side Effect

   Since the value of the TSval on a segment sent by Fast Retransmit,
   NewReno, etc. could be older than the current timestamp clock, the
   measured RTT calculated using the SEG.TSecr on the ACK segment that
   acknowledges the retransmitted segment could be longer than the
   actual RTT.

   In many cases, the gap between the time when the previous segment is
   sent and the time when the retransmitted segment is sent would be
   small, and the error of the measured RTT would be negligible.

   One case worth considering would be where the congestion window is
   2 MSS and Limited Transmit is used. In this case, if a segment is
   lost, Limited Transmit would keep sending new data segments every
   RTT, and each data segment would update TS.SndMax. When the third
   duplicate ACK segment is received, Fast Retransmit would retransmit
   the lost data with TSval = TS.SndMax, which indicates one RTT ago.
   In this case, the measured RTT would be one RTT longer than the
   actual RTT.

   The worst case would be where NewReno is retransmitting data
   segments upon ACK segments continuously. During that period,
   TS.SndMax would not be updated. Therefore, the errors of measured
   RTTs would be RTT times the number of successive retransmitted
   segments.
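
   The growth of the error in the worst case can be worked through
   with a short Python sketch. The model is an assumption made for
   illustration, not a description of any implementation: time is
   counted in timestamp clock ticks, the last new data segment is sent
   at time 0 (so TS.SndMax stays at 0), the first retransmission is
   sent `gap` ticks later, and each further retransmission is sent as
   soon as the ACK for the previous one arrives. The function name
   rtt_errors() is hypothetical.

```python
def rtt_errors(rtt: int, gap: int, retransmissions: int) -> list:
    """Return measured RTT minus actual RTT for each retransmission."""
    ts_sndmax = 0          # frozen: retransmissions do not update TS.SndMax
    send_time = gap        # time of the first retransmission
    errors = []
    for _ in range(retransmissions):
        tsval = ts_sndmax               # TSval is copied from TS.SndMax
        ack_time = send_time + rtt      # the ACK echoes tsval in TSecr
        measured = ack_time - tsval     # RTT sample taken by the sender
        errors.append(measured - rtt)
        send_time = ack_time            # next retransmission on this ACK
    return errors


# RTT of 100 ticks, first retransmission one RTT after the last new data.
print(rtt_errors(rtt=100, gap=100, retransmissions=3))
```

   With a gap of one RTT, the errors for three successive
   retransmissions are 100, 200, and 300 ticks, that is, RTT times the
   number of successive retransmitted segments. With a gap close to
   zero, the first sample is nearly exact and each later sample is off
   by one more RTT.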

   Although it would be better to mismeasure RTTs in limited cases than
   to discard valid segments in exceptional cases, mismeasurement
   should be avoided. One way would be as follows: Even if a segment is
   sent by Fast Retransmit, NewReno, etc., if the current timestamp
   clock - TS.SndMax is longer than a fraction of one RTT (e.g.,
   1/2 * RTT), update TS.SndMax with the current timestamp clock. The
   fraction of one RTT should be long enough to keep the modification
   effective.

3.5 How to Distinguish Segments

   To implement the rule described in section 3.2, it is important to
   precisely distinguish segments sent by Fast Retransmit, NewReno,
   etc. from other segments. How to do so is implementation-dependent.

   In some implementations, it would be easy to add a new flag to the
   arguments of the TCP output routine to indicate whether the output
   request was triggered by Fast Retransmit, NewReno, etc. or not.

   In other implementations, segments would be distinguishable by their
   sequence numbers. If the sequence number of a data segment being
   sent was less than SND.NXT, the data segment would be one sent by
   Fast Retransmit, NewReno, etc. Note that some implementations, such
   as BSD-derived implementations, temporarily lower SND.NXT on sending
   such a segment. Also note that the sequence number of a RST segment
   sent against an unacceptable segment received in unestablished
   states may be less than SND.NXT.

4. Specification

   This memo proposes the following modification to the processing of
   the TCP Timestamps option to solve the problem described in
   section 2. The technical reasoning behind this specification is
   discussed in section 3.

   The TCP per-connection state is augmented by a new timestamp:
   TS.SndMax.

   Initially:

      0. When a TCP per-connection state is created for a new TCP
         connection, TS.SndMax SHOULD be initialized with my.TSclock.

   On sending each segment:

      1. If the segment is *NOT* sent by Fast Retransmit, NewReno,
         etc., or if my.TSclock - TS.SndMax > k * RTT where k=1/2,
         TS.SndMax SHOULD be updated with my.TSclock. Otherwise,
         TS.SndMax SHOULD NOT be updated.

      2. The value of TSval on the segment MUST be copied from
         TS.SndMax.

   In the description above, my.TSclock is the "local source of 32-bit
   timestamp values." TSval is one of the two timestamp fields of the
   TCP Timestamps option. See [RFC1323] for more details.

   The method of distinguishing segments sent by Fast Retransmit,
   NewReno, etc. from other segments is implementation-dependent. See
   section 3.5.

5. Security Considerations

   Security issues are not discussed in this memo.

Acknowledgments

   Kacheong Poon gave the author a hint for the idea of tweaking TSval
   on a local node to control TS.Recent on a peer node.

Normative References

   [RFC793]   J. Postel, "Transmission Control Protocol", RFC793,
              STD 7, September 1981

   [RFC1323]  V. Jacobson, R. Braden, and D. Borman, "TCP Extensions
              for High Performance", RFC1323, May 1992

   [RFC2119]  S. Bradner, "Key words for use in RFCs to Indicate
              Requirement Levels", RFC2119, BCP14, March 1997

Informative References

   [RFC2018]  M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP
              Selective Acknowledgement Options", RFC2018, October 1996

   [RFC2581]  M. Allman, V. Paxson, and W. Stevens, "TCP Congestion
              Control", RFC2581, April 1999

   [RFC3042]  M. Allman, H. Balakrishnan, and S. Floyd, "Enhancing
              TCP's Loss Recovery Using Limited Transmit", RFC3042,
              January 2001

   [RFC3517]  E. Blanton, M. Allman, K. Fall, and L. Wang, "A
              Conservative Selective Acknowledgment (SACK)-based Loss
              Recovery Algorithm for TCP", RFC3517, April 2003

   [RFC3782]  S. Floyd, T. Henderson, and A. Gurtov, "The NewReno
              Modification to TCP's Fast Recovery Algorithm", RFC3782,
              April 2004

   [BPS99]    J. Bennett, C. Partridge, and N.
              Shectman, "Packet Reordering is Not Pathological Network
              Behavior", IEEE/ACM Transactions on Networking,
              December 1999

   [Pax97]    V. Paxson, "End-to-End Internet Packet Dynamics", In ACM
              SIGCOMM'97, September 1997

Author's Address

   Noritoshi Demizu
   National Institute of Information and Communications Technology
   (NICT)
   4-2-1 Nukui-Kitamachi, Koganei, Tokyo 184-8795, Japan
   Phone: +81-42-327-7432 (Ex. 5813)
   E-mail: demizu@nict.go.jp

Copyright Statement

   Copyright (C) The Internet Society (2004). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
   INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.