Network Working Group Noritoshi Demizu INTERNET-DRAFT NICT Expires: April 25, 2005 October 25, 2004 A Modification to Make PAWS Robust to Segment Reordering Status of this Memo By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than a "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html Copyright Notice Copyright (C) The Internet Society (2004). All Rights Reserved. Abstract The purpose of PAWS (Protect Against Wrapped Sequence numbers), defined in RFC1323, is to protect a TCP connection against old duplicate segments from the same connection. There is a possibility, however, that PAWS may discard valid reordered segments. This memo proposes a modification to PAWS to solve this problem. Demizu Expires April 2005 [Page 1] Internet-Draft October 2004 1. Introduction The purpose of PAWS (Protect Against Wrapped Sequence numbers) [RFC1323] is to protect a TCP connection against old duplicate segments from the same connection. The PAWS mechanism uses the TCP Timestamps option [RFC1323] to detect old duplicate segments. PAWS could, however, falsely discard valid segments that are delayed due to reordering. For example, if a retransmitted segment sent by Fast Retransmit [RFC2581] is reordered and arrives at the receiver earlier than some delayed segments sent prior to the retransmitted segment, and if the timestamp on the retransmitted segment is newer than the timestamps on the delayed segments, the delayed segments will be discarded by PAWS at the receiver. In the case where Limited Transmit [RFC3042] is used as well, a retransmitted segment sent by Fast Retransmit might follow two new data segments sent by Limited Transmit in a row. In this case, the problem described in the previous paragraph would be more likely to occur. In the case where NewReno [RFC3782] is used, the problem could occur if a retransmitted segment sent by NewReno is reordered and arrives at the receiver earlier than some new data segments sent before the retransmitted segment. [Pax97][BPS99] and [ZM04] show that segment reordering is not a rare event. If valid segments were falsely discarded by PAWS due to reordering, there would be negative effects on TCP performance. This memo proposes a modification to PAWS to solve this problem. The memo is organized as follows. Section 2 describes the problem by showing examples. Section 3 describes basic idea. And section 4 proposes a solution. Appendix A compares possible methods of solving the problem. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. 2. Problem Description There is a possibility that valid segments could be discarded by PAWS when those segments are delayed because of reordering. This section shows some examples of this problem, then describes a generic scenario and some possible negative effects. Demizu Expires April 2005 [Page 2] Internet-Draft October 2004 2.1 Example 1: Reordering and Fast Retransmit with Limited Transmit In this example, suppose TCP A is sending data to TCP B. Assume that TCP A supports the TCP Timestamps option [RFC1323], TCP Congestion Control [RFC2581], and Limited Transmit [RFC3042], and that TCP B supports the TCP Timestamps option with PAWS [RFC1323]. Suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 is sent by TCP A, where the letter indicates the sequence number and the digit represents the timestamp (TSval). In this data segment sequence, suppose W.1 and X.2 are sent in the Congestion Avoidance phase, Y.3 and Z.4 are sent by Limited Transmit, and A.5 is sent by Fast Retransmit. Figure 1 illustrates the data segment sequence observed at TCP A. The x-axis represents time, and the y-axis represents the sequence number. W.1 through Z.4 and A.5 indicate the data segments sent. Each 'o' mark indicates a received ACK segment. Lines are drawn with symbol characters between data segments and between ACK segments. Sequence number A Z.4 | Y.3~~ \ | X.2~~ \ | W.1~~ \ | ~~ \ | A.5 | o____o____o____o | o~~~~ 1 2 3!! <-- dup-ACK count | o~~~~ +--------------------------------> Time Figure 1: Time vs. sequence number at TCP A Now, suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 sent by TCP A is reordered as W.1, X.2, Y.3, A.5, Z.4 (i.e., Z.4 and A.5 are exchanged) on the path to TCP B. Figure 2 illustrates the resulting data segment sequence observed at TCP B. What happens at TCP B is described below. 0. Assume TS.Recent is valid and TS.Recent == 0. Assume RCV.NXT == A. 1. W.1 is received. PAWS accepts it because TS.Recent < 1. TS.Recent is not updated because RCV.NXT < W. Demizu Expires April 2005 [Page 3] Internet-Draft October 2004 2. X.2 is received. PAWS accepts it because TS.Recent < 2. TS.Recent is not updated because RCV.NXT < X. 3. Y.3 is received. PAWS accepts it because TS.Recent < 3. TS.Recent is not updated because RCV.NXT < Y. 4. A.5 is received. PAWS accepts it because TS.Recent < 5. TS.Recent is updated because RCV.NXT == A and A.5 has data. Now, TS.Recent == 5 and RCV.NXT >= A + the data length of A.5. (The actual new value of RCV.NXT depends on the out-of-order data queue in TCP B.) 5. Z.4 is received. PAWS discards it because TS.Recent > 4. In this example, the valid segment Z.4 is discarded by PAWS in step 5. Figure 2 illustrates this scenario. Sequence number A Z.4 | Y.3 / | X.2~~ \ / | W.1~~ \ / | ~~ \ / | A.5 | +--------------------------------> Time +---------+-------------------------------+ |Segment |(prev) W.1 X.2 Y.3 A.5 Z.4 | +---------+-------------------------------+ |PAWS | - Pass Pass Pass Pass Fail| |TS.Recent| 0 0 0 0 5 5 | |RCV.NXT | A A A A >A >A | +---------+-------------------------------+ Figure 2: Time vs. sequence number at TCP B Even in the case where TCP A does not support Limited Transmit (i.e., in the case where Y.3 and Z.4 are not sent in the example above), if the data segment sequence W.1, X.2, A.5 sent by TCP A is reordered as W.1, A.5, X.2 (i.e., X.2 and A.5 are exchanged) on the path to TCP B, X.2 could be discarded by PAWS. Since there would be a small gap between the time when X.2 is sent and the time when A.5 is sent, the possibility of this problem occurring would be less than in the example above. Demizu Expires April 2005 [Page 4] Internet-Draft October 2004 2.2 Example 2: Reordering and NewReno In this example, suppose TCP A is sending data to TCP B. Assume that TCP A supports the TCP Timestamps option [RFC1323], TCP Congestion Control [RFC2581], and NewReno [RFC3782], and that TCP B supports the TCP Timestamps option with PAWS [RFC1323]. Suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 is sent by TCP A, where the letter indicates the sequence number and the digit represents the timestamp (TSval). In the data segment sequence, suppose W.1 through Z.4 are sent by Fast Recovery at each time when a duplicate ACK segment is received, and A.5 is sent by NewReno. Figure 3 illustrates the data segment sequence observed at TCP A. This figure uses the same notation as Figure 1. Sequence number A Z.4 | Y.3~~ \ | X.2~~ \ | W.1~~ \ | ~~ \ | A.5 | o | / | / | / | / | ..o____o____o____o | +--------------------------------> Time Figure 3: Time vs. sequence number at TCP A Now, suppose the data segment sequence W.1, X.2, Y.3, Z.4, A.5 sent by TCP A is reordered as W.1, X.2, Y.3, A.5, Z.4 (i.e., Z.4 and A.5 are exchanged) on the path to TCP B. The resulting data segment sequence observed at TCP B are the same as Figure 2. And what happens at TCP B are also the same as those in the example described in section 2.1. Consequently, the valid segment Z.4 is discarded by PAWS. 2.3 Generic Scenario In general, this problem occurs in the following scenario. Suppose TCP A is sending data to TCP B, and consider the following steps. Demizu Expires April 2005 [Page 5] Internet-Draft October 2004 1. Data segment Z.4 is sent by the sender (TCP A). 2. Data segment A.5 is sent by the sender (TCP A). The sequence number of segment A.5 is lower than that of segment Z.4. The TSval value on segment A.5 is newer than that on segment Z.4. Note: Segment A.5 would be a retransmitted segment sent by Fast Retransmit, NewReno, SACK [RFC2018][RFC3517], or another mechanism that infers a segment loss and retransmits the lost data quickly. The sequence number of segment A.5 would be less than SND.NXT. 3. Segment A.5 arrives at the receiver earlier than segment Z.4. Suppose that segment A.5 satisfies SEG.SEQ <= RCV.NXT < SEG.SEQ + SEG.LEN, and the TSval value on segment A.5 is not older than the TS.Recent value at the receiver (TCB B). Segment A.5 is accepted by PAWS at the receiver. TS.Recent at the receiver is updated with the TSval value on segment A.5. RCV.NXT is also updated. 4. Segment Z.4 arrives at the receiver (TCP B). Segment Z.4 is discarded by PAWS because the TSval value on segment Z.4 is older than the TS.Recent value at the receiver. In this scenario, the gap between the time when segment Z.4 is sent and the time when segment A.5 is sent should be small so that reordering could exchange segments Z.4 and A.5. 2.4 Negative effects This problem would cause some negative effects on TCP performance. A sender would spend additional time detecting a loss and recovering from it. Moreover, the sender would consider the loss to be a congestion indication, and the congestion window would needlessly be further reduced. In addition, discarding valid acceptable segments at a receiver is a waste of bandwidth. Demizu Expires April 2005 [Page 6] Internet-Draft October 2004 3. Basic Idea Suppose the following tuple is recorded for every 2^30 bytes of data. If more than one segment contained a target byte, this tuple was created only for the first segment containing the target byte. In this tuple, the received time field keeps the time when the tuple was recorded. The sequence number field keeps the sequence number of the target byte. The timestamp field keeps the SEG.TSval value of the segment containing the target byte. Any incoming segment should be checked as follows: if the second latest tuple is not older than 24 days ago, and the SEG.TSval value on the received segments is older than the value of the timestamp field in the second latest tuple, the incoming segment would be an old duplicate segment. If an incoming segment was an old duplicate segment, it should be discarded and an ACK segment should be sent in reply. The algorithm updating TS.Recent should take care to keep TS.Recent monotonically non-decreasing, because this idea does not ensure it. Appendix A.3 describes this idea in more depth. Demizu Expires April 2005 [Page 7] Internet-Draft October 2004 4. Specification This memo proposes the following modification to the processing of the TCP Timestamps option to solve the problem described in section 2. - The TCP per-connection state is augmented by an array: TSRec. - TSRec is an array of a structure which consists of the following members. - rcv_time: when the target byte is received first time. - rcv_seqno: the sequence number of the target byte. - rcv_tsval: the SEG.TSval value of the segment. - The number of elements in the TSRec array is 2. - When an original SYN is received: /* Record it in the first tuple. */ TSRec[0].rcv_time = current_time; TSRec[0].rcv_seqno = IRS; /* Initial Receive Seq no */ TSRec[0].rcv_tsval = SEG.TSval; /* Copy the first tuple to the second tuple. */ TSRec[1] = TSRec[0]; - When other segment is received: /* Execute the test. */ if (current_time - TSRec[1].rcv_time <= 24days && SEG.TSval < TSRec[1].rcv_tsval) { Send an ACK segment in reply. Discard the incoming segment. } /* Update RCV.NXT */ /* Record it in the first tuple if necessary. */ if (SEG.SEQ > 0 && RCV.NXT - TSRec[0].rcv_seqno > 2^30) { /* Copy the first tuple to the second tuple. */ TSRec[1] = TSRec[0]; /* Record the received segment in the first tuple. */ TSRec[0].rcv_time = current_time; TSRec[0].rcv_seqno += 2^30; TSRec[0].rcv_tsval = SEG.TSval; } Demizu Expires April 2005 [Page 8] Internet-Draft October 2004 5. Security Considerations Security issues are not discussed in this memo. Acknowledgments Masaki Hirabaru (NICT) invented the idea in section 3. Normative References [RFC793] J. Postel, "Transmission Control Protocol", RFC793, STD 7, September 1981 [RFC1323] V. Jacobson, R. Braden, and D. Borman, "TCP Extensions for High Performance", RFC1323, May 1992 [RFC2119] S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", RFC2119, BCP14, March 1997 [RFC3668] S. Bradner, "Intellectual Property Rights in IETF Technology", RFC3668, BCP79, February 2004 Informative References [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP Selective Acknowledgement Options", RFC2018, October 1996. [RFC2581] M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control", RFC2581, April 1999 [RFC3042] M. Allman, H. Balakrishnan, and S. Floyd, "Enhancing TCP's Loss Recovery Using Limited Transmit", RFC3042, January 2001 [RFC3517] E. Blanton, M. Allman, K. Fall, and L. Wang, "A Conservative Selective Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP", RFC3517, April 2003 [RFC3522] R. Ludwig, M. Meyer, "The Eifel Detection Algorithm for TCP", RFC3522, April 2003 [RFC3782] S. Floyd, T. Henderson, and A. Gurtov, "The NewReno Modification to TCP's Fast Recovery Algorithm", RFC3782, April 2004 Demizu Expires April 2005 [Page 9] Internet-Draft October 2004 [BPS99] J. Bennett, C. Partridge, and N. Shectman, "Packet Reordering is Not Pathological Network Behavior", IEEE/ACM Transactions on Networking, December 1999 [Dem04] N. Demizu, "Protection Against Spoofing Attacks by Using the TCP Timestamps Option", (work in progress), Internet-Draft , October 2004. [Pax97] Vern Paxon, "End-to-End Internet Packet Dynamics", In ACM SIGCOMM'97, September 1997 [ZM04] Zhou, X. and P. Van Mieghem, "Reordering of IP Packets in Internet", In Proceedings of Passive and Active Measurement (PAM2004), April 2004, Appendix A: Comparison of Possible Methods Appendix A discusses possible methods of solving the problem described in section 2. Three methods are described and compared: a receiver-side modification, sender-side modifications, and a replacement of PAWS. In the description below, my.TSclock is the "local source of 32-bit timestamp values." TSval is one of the two timestamp fields of the TCP Timestamps option. See [RFC1323] for more details. The variables TS.SndMax and TS.SndMin are borrowed from [Dem04]. Acknowledgments Kacheong Poon (Sun) gave the author a hint for the idea tweaking TSval on a local node to control TS.Recent on a peer node. Appendix A.2 is based on his hint. Ronald Alleyne (S2IO) gave the author the idea using TS.SndMin. He also gave the author its limitation, its side effects, and measures against them. Appendix A.2.2 is based on his proposal. Masaki Hirabaru (NICT) gave the author an alternative idea to the current PAWS. Appendix A.3 is based on his idea. A.1 Receiver-side Modification A straightforward way to solve the problem would be to modify the rules of PAWS so that valid delayed segments will be accepted. Demizu Expires April 2005 [Page 10] Internet-Draft October 2004 The new rule would be as the following: - Change the inequality in R1) in section 4.2.1 of [RFC1323] as follows: Current: SEG.TSval < TS.Recent Proposal: SEG.TSval < TS.Recent - T1, where T1 = RTT. - In addition, to keep TS.Recent be monotonically non-decreasing, in R3) in section 4.2.1 of [RFC1323], TS.Recent is updated only when SEG.TSval >= TS.Recent. With this new rule, it would be very important to choose the value of T1 appropriately. If T1 was too large, old duplicate segments would be accepted, and PAWS would become useless. On the other hand, if T1 was too small, valid segments would still be discarded, and this new rule would become useless. It would be difficult, however, for a receiver to determine the value of T1 appropriately, because it would not have enough information on the frequency of the values of TSval's on segments sent by a sender. - [RFC1323] recommends that the range of the frequency is 1 millisecond to 1 second. This is too wide for a receiver to be able to choose a practicable value for T1 which would work under any circumstances. - It might be possible to infer an approximation of the frequency by observing the TSval values on received segments. The calculation would be very complex, however. - It would be possible to introduce a new TCP option to exchange the frequency information. Unfortunately, this would take many years to deploy. Furthermore, the TCP option space is limited. - It might be possible to specify the frequency, e.g. 1ms, as a standard. It would take many years to deploy, however. Therefore, it would be difficult to solve the problem by the method above. A.2 Sender-side Modification Another solution would be to change the rule for the TSval value on a retransmitted segment sent by Fast Retransmit, NewReno, etc., so that the TSval value on such a segment will never be newer than the TSval Demizu Expires April 2005 [Page 11] Internet-Draft October 2004 values on segments sent prior to the retransmitted segment. A.2.1 Sender-side Modification Using the Maximum TSval Sent A.2.1.1 Proposal An intuitive way of a sender-side modification would be as follows: - When a segment is *NOT* sent by Fast Retransmit, NewReno, etc., use the current timestamp clock for the TSval value on the segment, and record the value. - When a segment *is* sent by Fast Retransmit, NewReno, etc., use the recorded value for the TSval value on the segment. Since TSval is monotonically non-decreasing, the recorded value is equal to the maximum value of TSval that has been sent. The variable to record the value is called TS.SndMax in this memo. Now, let us revisit the examples in section 2 with this rule. In those examples, Z.4 is discarded by PAWS because it arrives at the receiver later than A.5 while its TSval is older than that on A.5. If the rule above was introduced, however, the TSval value on A.5 would change from 5 to 4. Therefore, at step 4 in section 2.1, TS.Recent at the receiver would become 4 instead of 5. Hence, Z.4 would be accepted. Thus, the rule given above solves the problem. Figure A-1 illustrates the scenario observed at TCB B. Sequence number A Z.4 | Y.3 / | X.2~~ \ / | W.1~~ \ / | ~~ \ / | A.4 | +--------------------------------> Time +---------+-------------------------------+ |Segment |(prev) W.1 X.2 Y.3 A.4 Z.4 | +---------+-------------------------------+ |PAWS | - Pass Pass Pass Pass Pass| |TS.Recent| 0 0 0 0 4 4 | |RCV.NXT | A A A A >A >A | +---------+-------------------------------+ Figure A-1: Time vs. sequence number at TCP B Demizu Expires April 2005 [Page 12] Internet-Draft October 2004 A.2.1.2 Limitation The rule above has an obvious limitation: If A.4 (which was A.5 in section 2) arrived at the receiver earlier than both Y.3 and Z.4, Z.4 would be accepted, but Y.3 would be discarded. In other words, the rule avoids discarding only the last data segment. It is an open issue whether this limitation is acceptable or not. One way to avoid discarding the last N segments would be as follows: Both the TSval values and the sequence numbers of the last N segments should be recorded. Then, the TSval value on a segment sent by Fast Retransmit, NewReno, etc. should be that of the oldest TSval value among the recorded segments whose sequence number is not less than the sequence number plus the data length of the segment being sent. This idea would overcome the limitation, but it would introduce some complexity. Therefore, this memo does not recommend this idea. A.2.1.3 Side Effects With the rule above, the TSval value on a segment sent by Fast Retransmit, NewReno, etc. may be older than the current timestamp clock. Hence, the measured RTT calculated using the SEG.TSecr value on the ACK segment that acknowledges the retransmitted segment may be longer than the actual RTT. In many cases, the gap between the time when the new data segment (e.g., Z.4) was sent and the time when the retransmitted segment (e.g., A.4) is sent would be small. In such cases, the error of the measured RTT would be negligible. One considerable case would be where the congestion window is 2 MSS and Limited Transmit is used. In this case, if only the first data segment is lost, Limited Transmit would keep sending new data segments every RTT, and each data segment would update TS.SndMax. When the third duplicate ACK segment is received, Fast Retransmit retransmits the lost data with TSval = TS.SndMax, which indicates one RTT ago. If an ACK segment for the retransmitted segment was returned, the measured RTT would be one RTT longer than the actual RTT. The worst case would be where NewReno is retransmitting data segments upon each ACK segment which is returned for every RTT. Since data segments sent by NewReno would not update TS.SndMax with the rule above, the errors of measured RTTs would be RTT times the number of successive retransmitted segments. Although it would be better to mismeasure RTTs in limited cases than Demizu Expires April 2005 [Page 13] Internet-Draft October 2004 to discard valid segments in exceptional cases, mismeasurement should be avoided. One way would be as follows: if a segment is sent by Fast Retransmit, NewReno, etc., but the current timestamp clock - TS.SndMax is longer than a fraction of one RTT (e.g. 1/2 * RTT), update TS.SndMax with the current timestamp clock. The fraction of one RTT should be long enough to keep the modification effective. A.2.1.4 How to Distinguish Segments To implement the rule above, it is important to precisely distinguish segments sent by Fast Retransmit, NewReno, etc. from other segments. It is implementation-dependent. In some implementations, it would be easy to add a new flag to the arguments of the TCP output routine to indicate whether the output request was triggered by Fast Retransmit, NewReno, etc. or not. In other implementations, segments would be distinguishable by their sequence numbers. If the sequence number of a data segment being sent was less than SND.NXT, the data segment would be sent by Fast Retransmit, NewReno, etc. Note that some implementations, such as BSD-derived implementations, temporarily lower SND.NXT on sending such a segment. Also note that the sequence number of a RST segment sent in reply to an unacceptable segment received in unestablished states may be less than SND.NXT. A.2.1.5 Formal Description The rule above can be formalized as follows: The TCP per-connection state is augmented by a new timestamp: TS.SndMax. Initially: TS.SndMax = my.TSclock On sending each segment: If (the segment is NOT sent by Fast Retransmit, NewReno, etc. or my.TSclock - TS.SndMax > k * RTT where k=1/2) then TS.SndMax = my.TSclock. SEG.TSval = TS.SndMax The method of distinguishing segments sent by Fast Retransmit, Demizu Expires April 2005 [Page 14] Internet-Draft October 2004 NewReno, etc. from other segments is implementation-dependent. See appendix A.2.1.4. A.2.2 Sender-side Modification Using the Maximum TSecr Received A.2.2.1 Proposal The rule described in appendix A.2.1 avoids discarding only the last data segment. If this limitation is unacceptable in real networks, the TSval value on a retransmitted segment sent by Fast Retransmit, NewReno, etc. needs to be changed to some older value. The oldest candidate for such value would be the maximum value of TSecr that has been received. The variable to record the value is called TS.SndMin in this memo. Note that the TS.SndMin value is older than or equal to the TS.Recent value of the peer node. In the case where TS.SndMin is used for TSval on such a retransmitted segment, during recover phase, TS.SndMin would be unchanged and equal to the TS.Recent value at the peer node. Therefore, the TSval value on any new data segments sent would be newer than or equal to the TS.Recent value at the peer node. Thus, new data segments would never be discarded by PAWS. A.2.2.2 Side Effects Since the TSval value on a segment sent by Fast Retransmit, NewReno, etc. would be older than the current timestamp clock, the measured RTT would be longer than the actual RTT, and errors would be worse than the rule described in appendix A.2.1. An error of each measured RTT would be equal to the difference between the time when the recovery phase starts and the current time, because the TS.SndMin value would not be changed during recovery phase. For example, when a retransmitted segment was sent by Fast Retransmit, the error would be one RTT. When a retransmitted segment was sent by NewReno, the error would be several RTTs. To avoid falsely increasing the estimated RTT, measured RTTs need to be ignored during recovery phase. In addition, segments retransmitted by SACK to fill holes could be discarded by PAWS due to reordering if an original segment whose TSval is newer than the TS.SndMin value is delayed but arrives earlier than the retransmitted segment, and advances RCV.NXT of the receiver. Fortunately, this scenario would be unlikely to occur. Demizu Expires April 2005 [Page 15] Internet-Draft October 2004 Furthermore, Eifel [RFC3522] would not work with this rule. A.2.2.3 Formal Description The rule described in appendix A.2.2 can be formalized as follows: The TCP per-connection state is augmented by a new timestamp: TS.SndMin. Initially: TS.SndMin = my.TSclock On receiving a segment: /* After the segment passes all TCP tests */ if (TS.SndMin < SEG.TSecr) then TS.SndMin = SEG.TSecr On sending a segment: If (SEG.SEQ < SND.NXT) then SEG.TSval = TS.SndMin else SEG.TSval = my.TSclock A.2.3 Tradeoffs between Sender-side Modifications The difference between proposals in appendices A.2.1 and A.2.2 is in the TSval value on retransmitted segments sent by Fast Retransmit, NewReno, etc. Appendix A.2.1 proposes to use TS.SndMax, while appendix A.2.2 proposes to use TS.SndMin. There would be many other candidates between the TS.SndMin value and the TS.SndMax value, such as the idea described in A.2.1.2. In general, sender-side modification can be described as follows: - When a segment *is* sent by Fast Retransmit, NewReno, etc., use a specific value between the TS.SndMin value and the TS.SndMax value for the TSval on the segment. - Otherwise, use the current timestamp clock for the TSval on the segment. The key to sender-side modification is the choise of the TSval value. There are tradeoffs, though. Demizu Expires April 2005 [Page 16] Internet-Draft October 2004 - If the TSval value was equal to TS.SndMax, - New data segments except the last one might be discarded by PAWS. - RTTM would mismeasure RTTs in some limited cases. - If the value was near TS.SndMin, - New data segments would not be discarded by PAWS. Segments retransmitted by SACK might be discarded by PAWS. - RTTM would not work. Measured RTTs should be ignored. - Eifel would not work. - If some other value between the TS.SndMax value and the TS.SndMin value is used for TSval, both characteristics above would be mixed. As seen above, tweaking TSval on a retransmitted segment sent by Fast Retransmit, NewReno, etc. would break various assumptions of existing mechanisms utilizing the TCP Timestamps option, and would cause number of side effects. Some of them are listed above, however, there would be unknown side effects. As a result, sender-side modifications are half measures. It is impossible to choose a perfect TSval value. A.3 Replacement of PAWS A.3.1 Outline of the Idea The third idea would replace PAWS with a new algorithm that is more tolerant to reordering. For example, suppose the following tuple is recorded for every CHUNK_SIZE bytes of data. If more than one segment contained a target byte, this tuple was created only for the first segment containing the target byte. In this tuple, the received time field keeps the time when the tuple was recorded. If its value becomes older than LIFE_TIME ago, the tuple should be invalidated. The sequence number field keeps the sequence number of the target byte. The timestamp field keeps the SEG.TSval value of the segment containing the target byte. Demizu Expires April 2005 [Page 17] Internet-Draft October 2004 Any incoming segment should be checked as follows: if the second latest tuple is not older than LIFE_TIME ago, and the SEG.TSval value on the received segments is older than the value of the timestamp field in the second latest tuple, the incoming segment would be an old duplicate segment. If an incoming segment was an old duplicate segment, it should be discarded and an ACK segment should be sent in reply. PAWS forces that incoming SEG.TSval values is monotonically non-decreasing. If the SEG.TSval value on an incoming segment is older than TS.Recent, such a segment is discarded by PAWS. Contrarily, the SEG.TSval values on segments that pass the test above may not be monotonically non-decreasing. Hence, the algorithm updating TS.Recent should take care by itself to keep TS.Recent monotonically non-decreasing. A.3.2 Formal Description The idea outlined in appendix A.3.1 can be formalized as follows. - There are three parameters: NELEM, CHUNK_SIZE, and LIFE_TIME. - The TCP per-connection state is augmented by an array: TSRec. - TSRec is an array of a structure which consists of the following members. - rcv_time: when the target byte is received first time. - rcv_seqno: the sequence number of the target byte. - rcv_tsval: the SEG.TSval value of the segment. - The number of elements in the TSRec array is NELEM. - When an original SYN is received: /* Record it in the first tuple. */ TSRec[0].rcv_time = current_time; TSRec[0].rcv_seqno = IRS; /* Initial Receive Seq-no */ TSRec[0].rcv_tsval = SEG.TSval; /* Copy the first tuple to others as their initial values. */ for (i = 1; i < NELEM; i++) { TSRec[i] = TSRec[0]; } - When other segment is received: Demizu Expires April 2005 [Page 18] Internet-Draft October 2004 /* Execute the test. */ if (current_time - TSRec[NELEM-1].rcv_time <= LIFE_TIME && SEG.TSval < TSRec[NELEM-1].rcv_tsval) { Send an ACK segment in reply. Discard the incoming segment. } /* Update RCV.NXT */ /* Record it in a new tuple if necessary. */ if (SEG.SEQ > 0 && RCV.NXT - TSRec[0].rcv_seqno > CHUNK_SIZE) { /* Move elements in the TSRec array backward. */ for (i = NELEM - 1; i > 0; i++) { TSRec[i] = TSRec[i-1]; } /* Record the received segment in a new tuple. */ TSRec[0].rcv_time = current_time; TSRec[0].rcv_seqno += CHUNK_SIZE; TSRec[0].rcv_tsval = SEG.TSval; } A.3.3 Parameters The values of the parameters would be as follows. - The idea outlined in appendix A.3.1 assumes NELEM == 2. - The maximum value of CHUNK_SIZE would be 2^32 / 2 / NELEM. If CHUNK_SIZE was larger than this value, the rcv_seqno field of TSRec[NELEM-1] would be wrapped. The minimum value of CHUNK_SIZE would be the current receive window size. The maximum size of the current receive window size is determined by the TCP Window Scale option [RFC1323]: (2^16 - 1) * 2^14 == 2^30 - 2^14. Hence, the maximum possible values of NELEM, CHUNK_SIZE, and receive window size depend on each other. If NELEM is 2, the maximum possible value of CHUNK_SIZE is 2^30, which is larger than the maximum size of receive window size. - The best value for LIFE_TIME would be 24 days, as the same as that in section 4.2.3 of [RFC1323]. As a result, this memo recommends NELEM = 2, CHUNK_SIZE = 2^30, and LIFE_TIME = 24 days. Demizu Expires April 2005 [Page 19] Internet-Draft October 2004 A.4 Comparison of Ideas The ideas described in appendices A.1 and A.2 have problems to be solved, while the idea described in Appendix A.3 does not have any known issues. Hence, this memo recommends the modification in appendix A.3. Author's Address Noritoshi Demizu National Institute of Information and Communications Technology(NICT) 4-2-1 Nukui-Kitamachi, Koganei, Tokyo 184-8795, Japan Phone: +81-42-327-7432 (Ex. 5813) E-mail: demizu@nict.go.jp Copyright Statement Copyright (C) The Internet Society (2004). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Intellectual Property The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. Demizu Expires April 2005 [Page 20] Internet-Draft October 2004 The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org. Acknowledgment Funding for the RFC Editor function is currently provided by the Internet Society. Demizu Expires April 2005 [Page 21]