INTERNET-DRAFT                                            C. Sapuntzakis
Expires July 2000                                          Cisco Systems
                                                             D. Cheriton
                                                           Cisco Systems
                                                           February 2000

                            TCP RDMA option
                     draft-csapuntz-tcprdma-00.txt

Status of this Memo

   This document is an Internet-Draft and is NOT offered in accordance
   with Section 10 of RFC2026, and the author does not provide the IETF
   with any rights other than to publish as an Internet-Draft.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) Cisco Systems (1999-2000). All Rights Reserved.

Abstract

   The TCP option introduced in this draft reduces the overhead of
   receiving data with TCP-based protocols such as NFS and HTTP. It
   enables the construction of a simple hardware accelerator that
   copies data directly from the incoming packet into application
   buffers, avoiding expensive copies in the protocol stack. Even
   without hardware acceleration, the option enables the protocol stack
   to decrease the number of copies it must do.

Sapuntzakis, Cheriton                                           [Page 1]

Internet-Draft              TCP RDMA option             22 February 2000

   The TCP RDMA option is an annotation and requires no modifications
   to overlying protocols. It can be used with popular protocols such
   as HTTP, NFS, and CIFS, along with new protocols.

   The TCP option also provides a bit to indicate application-level
   message boundaries.
   The bit enables out-of-order processing of the TCP receive queue,
   potentially decreasing service times in the presence of packet drops
   and improving performance on parallel systems.

Table of Contents

   1. Glossary
   2. Introduction
   3. RDMA option
      3.1. Usage
         3.1.1. RID
         3.1.2. Data Offset
         3.1.3. Data Length
         3.1.4. Total RDMA Length
         3.1.5. Buffer Offset
         3.1.6. Message Aligned (A) bit
         3.1.7. Unsolicited (U) bit
         3.1.8. Other constraints
      3.2. Negotiating use of the option
      3.3. Multiple options
      3.4. Interactions with TCP congestion control
   4. Examples
   5. RID Formats
      5.1. NFS
         5.1.1. NFS RID Format
            5.1.1.1. RPC XID
            5.1.1.2. Operation
            5.1.1.3. Zeroes
         5.1.2. READ RPC replies
         5.1.3. WRITE RPC requests
         5.1.4. Message Aligned (A) bit
      5.2. HTTP
         5.2.1. RID format
         5.2.2. GET responses
         5.2.3. POST or PUT requests
      5.3. Common Internet File System (CIFS)
         5.3.1. Tag Format
            5.3.1.1. Pid and Mid
            5.3.1.2. Operation Index
            5.3.1.3. Zeros
         5.3.2. Unsolicited Bit
         5.3.3. Message Aligned (A) bit
      5.4. SCSI
   6. Security considerations
      6.1. Receiver security considerations
   7. Authors' Addresses
   8. References

1. Glossary

   remote DMA (RDMA) - the transfer of application data from a remote
   buffer into a contiguous, usually aligned, local buffer

   RDMA data - the application data being transferred via RDMA

   unsolicited data - data that a receiver did not request

2. Introduction

   Currently, doing remote DMA (RDMA) between processors over TCP
   protocols such as HTTP and NFS requires substantial processing on
   the client and server machines, especially at speeds of a gigabit
   per second or higher.

   To see where this overhead comes from, it is instructive to look at
   an example. Consider the problem of an 8 kilobyte NFS transfer
   coming in from an Ethernet and eventually ending up in an
   application's memory.
   Ethernet's MTU is around 1500 bytes, so the sender sends at least
   six packets across the Ethernet.

   At the receiver, the six packets arrive at the network interface.
   For each of the six packets, the network interface card on the
   receiver copies the entire packet to the host's memory. The network
   interface notifies the host software of the arrival of the packets.
   The host software then does IP and TCP processing, which eventually
   results in the software copying the TCP payload into a TCP receive
   buffer. NFS parses the data in the TCP receive buffer to find the
   file pages. NFS copies the file pages to the buffer cache. Once the
   pages are in the buffer cache, the operating system maps them into
   the application's address space.

   These memory-to-memory copies cost valuable main memory bandwidth at
   clients and servers. To improve performance, it is necessary to
   reduce the number of such copies. One way to do this is to have the
   network interface card write the file data into the final location
   (e.g. the buffer cache) the first time. This requires that the
   network interface card recognize file data in incoming packets.

   For NFS and HTTP, the problem of recognizing file data involves
   parsing the protocol headers. This is complex and does not lend
   itself to a simple hardware realization.

   This memo defines a new TCP option, the RDMA option, which
   circumvents the parsing of complex protocol headers. The sender
   places the option on TCP segments containing RDMA data. The RDMA
   option describes to the receiver the location of the RDMA data in
   the TCP payload.

   An RDMA identifier (RID) in the option allows multiple outstanding
   RDMA transfers on a TCP connection by allowing the sender and
   receiver to uniquely tag the RDMAs. The layout of the RID depends on
   the specific higher layer protocol (e.g. NFS).

   The TCP RDMA option is an annotation and requires no modifications
   to overlying protocols.
   This memo specifies the RDMA option in detail in section 3. The use
   of this option with NFS, HTTP, SCSI, and CIFS is specified in
   section 5.

3. RDMA option

3.1. Usage

   Kind: 25 (decimal)

   Length: 2, 4, 16, or 20 bytes

   Byte/     0       |       1       |       2       |       3       |
      /              |               |               |               |
     |7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
     +---------------+---------------+-+-+---------------------------+
   0 |      25       |    Length     |A|U|          RDMA ID          |
     +---------------+---------------+-+-+---------------------------+
   4 |                         RDMA ID (RID)                         |
     +---------------------------------------------------------------+
   8 |                         Buffer Offset                         |
     +-------------------------------+-------------------------------+
   12|          Data Offset          |          Data Length          |
     +-------------------------------+-------------------------------+
   16|                       Total RDMA Length                       |
     +---------------------------------------------------------------+

3.1.1. RID

   All segments in a single RDMA transfer carry the same 46-bit RDMA ID
   (RID). The RID is an application-level identifier that the receiver
   can use to map the transfer to an application buffer. The exact
   value of the RID depends on the overlying protocol. RID formats for
   several popular protocols are given in section 5.

   The RDMA ID is stored in network byte order. That is, bits 40-45 of
   the RID are placed in bits 0-5 of byte 2, and bits 0-7 of the RID
   are placed in bits 0-7 of byte 7.

3.1.2. Data Offset

   The data offset specifies the number of bytes from the beginning of
   the TCP payload to the RDMA transfer data.

   The data offset MUST NOT exceed the length of the TCP payload.

3.1.3. Data Length

   The data length specifies the number of bytes of RDMA transfer data
   in this segment, starting at the data offset.

   The data length MUST NOT cause the option to describe bytes past the
   end of the TCP segment.

   A data length of zero is valid.

3.1.4. Total RDMA Length

   The total RDMA length is the number of bytes that will be
   transferred using this RID.

   If the sender does not know the length of the RDMA when the segment
   is sent, the sender should send the 16-byte version of this option,
   leaving the total RDMA length field off.

   The total RDMA length, when present, MUST be the same for all
   segments in the RDMA transfer.

   A total RDMA length of zero is valid.

3.1.5. Buffer Offset

   If this RDMA transfer is going into a separate buffer on the
   receiver, the buffer offset field specifies the offset in that
   buffer. At that offset, the receiver should write the RDMA data
   demarcated by the data offset and data length fields.

3.1.6. Message Aligned (A) bit

   The message aligned bit, when 1, indicates that byte 0 of the TCP
   payload corresponds to the start of a new application-layer message.

   The message aligned (A) bit is bit 7 of byte 2.

   The four-byte version of the option may be sent if the sender wishes
   to communicate only the message-aligned state.

3.1.7. Unsolicited (U) bit

   In NFS and other RPC-based protocols, transfers from the server to
   the client (e.g. reads) occur in response to an explicit request by
   the client. The explicit request by the client indicates that the
   client has an allocated buffer waiting for the data from the
   transfer (or at least has had the opportunity to allocate one).

   The client can use the explicit request to communicate some
   identifier to the server that the server places in the response. In
   the response, that identifier, embedded in the RID, can be used to
   associate the data with a client buffer.

   However, transfers from the client to the server (e.g. writes) often
   occur on the request. There is usually no opportunity in these
   protocols for a client to obtain any kind of identifier for the
   server's application buffer. Indeed, the server may not even have an
   application buffer allocated for the client request.
   To indicate this special situation, the unsolicited bit is used.

   The unsolicited bit (U) is bit 6 of byte 2.

   The sender SHOULD set the unsolicited bit (U) to one if the RID is
   not expected by the receiver.

3.1.8. Other constraints

   The RDMA option MUST appear on every segment containing data that is
   part of the RDMA transfer.

   The sender MUST align the RDMA option on a 4 octet boundary relative
   to the TCP header.

3.2. Negotiating use of the option

   For the purpose of options negotiation, the length field MAY be set
   to 2 to prevent any accidental RDMA transfers.

3.3. Multiple options

   Correct implementations MAY look at only the first RDMA option in a
   segment.

   The TCP segments MUST conform to the rules laid out in section 3
   when all RDMA options but the first in the segment are stripped. The
   most important of these requirements is that the RDMA option MUST
   appear on every segment that contains data that is part of the RDMA
   transfer.

3.4. Interactions with TCP congestion control

   The RDMA option may result in segments smaller than the maximum
   segment size (MSS) being sent. This may slow the opening of
   congestion windows on systems that open the window based on the
   number of MSS-sized segments received.

4. Examples

   The figure below is a representation of a TCP stream. It has a
   single RDMA transfer that occupies two contiguous sections of the
   TCP stream (section 1 and section 2).

        Sequence number
                        +----------------+
                      0 | Header         |
                        |                |
                        |                |
                        +----------------+
                    100 | Transfer       |
                        | Section 1      |
                        /                /
                        /                /
                        |                |
                        +----------------+
                   2100 | Trailer        |
                        +----------------+
                   2200 | Header         |
                        |                |
                        +----------------+
                   2300 | Transfer       |
                        | Section 2      |
                        /                /
                        /                /
                        |                |
                        +----------------+
                   4300

   The table below illustrates how this section of the TCP stream will
   be turned into 6 TCP segments with the RDMA option. The TCP maximum
   segment size for this stream is 1000 bytes.
   The sequence number comes from the TCP header.

   +------------------------------------------------------+
   | Segment | Sequence |  Buffer  |   Data   |   Data    |
   | Number  |  Number  |  Offset  |  Offset  |  Length   |
   |         |          |          |          |           |
   +------------------------------------------------------+
   |    1    |      0   |      0   |    100   |    900    |
   |    2    |   1000   |    900   |      0   |   1000    |
   |    3    |   2000   |   1900   |      0   |    100    |
   |    4    |   2200   |   2000   |    100   |    900    |
   |    5    |   3200   |   2900   |      0   |   1000    |
   |    6    |   4200   |   3900   |      0   |    100    |
   +------------------------------------------------------+

   Segment 3 is only 200 bytes: 100 bytes of data and 100 bytes of
   trailer. If the next header had been available to the TCP stack at
   the time, the stack could have sent it as part of that segment,
   making segment 3 300 bytes long. Below is such a segmentation.

   +------------------------------------------------------+
   | Segment | Sequence |  Buffer  |   Data   |   Data    |
   | Number  |  Number  |  Offset  |  Offset  |  Length   |
   |         |          |          |          |           |
   +------------------------------------------------------+
   |    1    |      0   |      0   |    100   |    900    |
   |    2    |   1000   |    900   |      0   |   1000    |
   |    3    |   2000   |   1900   |      0   |    100    |
   |    4    |   2300   |   2000   |      0   |   1000    |
   |    5    |   3300   |   3000   |      0   |   1000    |
   +------------------------------------------------------+

   Note: not putting application headers at the front of a TCP segment
   may cause decreased performance with some receivers.

   In either segmentation, segment 3 cannot include any of Transfer
   Section 2, since the RDMA option can describe only one contiguous
   run of RDMA data per segment. Thus, segment 3 will always be smaller
   than the MSS, even if the stack has more to send.

5. RID Formats

5.1. NFS

   In NFS, file pages are transferred using the NFS READ and WRITE
   RPCs. When issuing a READ, the NFS client presumably has an
   application buffer (e.g. block cache buffer) waiting to absorb it.
   When receiving a WRITE, the NFS server may not have a waiting
   application buffer to absorb the write.

5.1.1. NFS RID Format

   RID format for NFS protocol:

    4       4 3        3 3
    5       0 9        2 1                        0
   +---------+----------+--------------------------+
   |  Zero   | Operation|         RPC XID          |
   +---------+----------+--------------------------+

5.1.1.1. RPC XID

   The NFS protocols work on top of ONC RPC, which associates with each
   RPC a 32-bit transaction ID (XID). The XID occupies bits 0-31 of the
   RID.

5.1.1.2. Operation

   NFS version 4 allows multiple read and write "operations" per RPC.
   These operations share the same XID since they are part of the same
   RPC. To disambiguate the RDMAs resulting from these operations, the
   RID contains an Operation Index in bits 32-39 of the RID.

   The operation index is zero for the first operation, one for the
   second, and so on.

   Note that the operation index is independent of whether the
   operation results in an RDMA. If only the third operation in an RPC
   results in an RDMA, then the RID for that RDMA will have a 2 in the
   operation index field.

   The operation index MUST be zero for NFS versions 2 and 3.

5.1.1.3. Zeroes

   Bits 40-45 MUST be set to zero by the sender and received as zeros
   by the receiver.

5.1.2. READ RPC replies

   For the file pages in NFS READ responses, the server MUST NOT set
   the unsolicited bit to 1.

   If the READ RPC fails and no data is returned, the server SHOULD
   indicate a zero-length RDMA transfer.

5.1.3. WRITE RPC requests

   For NFS WRITE calls, the client SHOULD set the unsolicited bit to
   one, since the server is not expecting the WRITE.

5.1.4. Message Aligned (A) bit

   The message aligned bit, when used on an NFS connection, indicates
   the start of an ONC RPC message at byte 0 of a payload. For the
   purposes of this specification, the start of an ONC RPC message is
   the four-byte length field that is defined for the tunneling of RPC
   over TCP.

5.2. HTTP

5.2.1. RID format

    4                 3 3
    5                 2 1                       0
   +-------------------+-------------------------+
   |       Zero        |       Request idx       |
   +-------------------+-------------------------+

   On an HTTP/1.1 connection, the server sends back responses in the
   order it received requests. Thus, the index of the request, where
   the first request is index 0, is sufficient to disambiguate the
   RDMAs.

5.2.2. GET responses

   The unsolicited bit SHOULD be set to zero.

   Note that the HTTP server may not know the length of the response,
   so clients should be prepared to receive the 16-byte option.

5.2.3. POST or PUT requests

   In POST or PUT requests, the client sends data to the server. The
   unsolicited bit SHOULD be set to one.

5.3. Common Internet File System (CIFS)

   The Common Internet File System (CIFS) is built on top of an RPC
   system known as Server Message Block (SMB).

5.3.1. Tag Format

    4     4 3          3 3         1 1
    5     0 9          2 1         6 5           0
   +-------+------------+-----------+-------------+
   | Zero  | Operation  |    PID    |     MID     |
   +-------+------------+-----------+-------------+

5.3.1.1. Pid and Mid

   In SMB, a request is uniquely identified by a 64-bit quantity that
   includes four 16-bit fields: Tree Id, User Id, Process Id (PID), and
   Multiplex Id (MID). There is insufficient room in the RDMA tag to
   include all four fields.

   However, the PID and MID originate from the client and are
   uninterpreted by the server. The client can assign PIDs and MIDs so
   as to disambiguate concurrent requests. Thus, a CIFS client using
   the RDMA option MUST ensure that two concurrent SMB requests do not
   share the same PID and MID fields.

5.3.1.2. Operation Index

   CIFS supports compound requests that can result in multiple
   transfers per SMB. The operation index, in bits 32-39, corresponds
   to the index of the operation in the SMB that caused the RDMA. The
   first operation is given index zero and so on.
   Operations are logically assigned indexes whether or not they cause
   an RDMA.

5.3.1.3. Zeros

   Bits 40-45 MUST be set to zero by the sender and received as zeros
   by the receiver.

5.3.2. Unsolicited Bit

   For CIFS operations that return data from the server, the
   unsolicited bit SHOULD be set to zero.

   For CIFS operations that send data from the client, the unsolicited
   bit SHOULD be set to one.

5.3.3. Message Aligned (A) bit

   The message aligned bit, when used on a CIFS connection, indicates
   the start of a NetBIOS message at byte 0 of a payload. For the
   purposes of this specification, the start of a NetBIOS message is
   the four-byte length field that is defined for the tunneling of
   NetBIOS over TCP.

5.4. SCSI

   The SCSI Architecture model [SAM, SAM2] lays out the requirements
   for SCSI transports. [SCSI/TCP] is just such a transport. The
   [SCSI/TCP] document defines the RID structure for SCSI.

6. Security considerations

   The RDMA option potentially leaks information about an encrypted TCP
   stream. The presence or absence of the option, the size and position
   of the RDMA, and the RID may all leak information to a passive
   listener.

   The TCP RDMA option is not protected by SSL or TLS, which only
   protect the TCP payload. It is, however, protected by the IPsec AH
   and ESP headers.

6.1. Receiver security considerations

   A malicious sender may attempt an RDMA transfer larger than the
   receiving DMA buffer. A secure receiver MUST do bounds checking on
   the offsets to avoid buffer overruns.

   When mapping from RIDs to buffers, a receiver should take into
   account the TCP connection to decrease the opportunity for malicious
   senders to interfere with RDMAs taking place on other connections.

   Some receivers may set aside buffers for unsolicited transfers.
   A malicious sender can monopolize those buffers, potentially causing
   performance degradation to the rest of the system, by doing a series
   of small, unsolicited transfers. The receiver may wish to place
   quotas on the size and number of outstanding unsolicited transfers
   on a single connection.

7. Authors' Addresses

   Constantine Sapuntzakis
   Cisco Systems, Inc.
   170 W. Tasman Drive
   San Jose, CA 95134
   USA
   Phone: +1 408 525 5497
   Email: csapuntz@cisco.com

   David Cheriton
   Cisco Systems, Inc.
   170 W. Tasman Drive
   San Jose, CA 95134
   USA
   Phone: +1 408 527 8207
   Email: cheriton@cisco.com

8. References

   [CIFS] Leach, P., "A Common Internet File System (CIFS/1.0) Protocol
   Preliminary Draft",
   http://www.cifs.com/specs/draft-leach-cifs-v1-spec-01.txt, December
   1997

   [HTTP] Gettys, J., et al., "Hypertext Transfer Protocol - HTTP/1.1",
   RFC 2616, June 1999

   [NFSv3] Callaghan, B., "NFS Version 3 Protocol Specification", RFC
   1813, June 1995

   [RPC] Srinivasan, R., "RPC: Remote Procedure Call Protocol
   Specification Version 2", RFC 1831, August 1995

   [SCSI/TCP] Satran, J., et al., "SCSI/TCP",
   ftp://ftp.ietf.org/internet-drafts/draft-satran-scot-00.txt

   [SAM] "SCSI-3 Architecture Model", ANSI X3.270:1996,
   http://www.t10.org/

   [SAM2] "SCSI Architecture Model - 2 Draft", ANSI T101157-D,
   http://www.t10.org/

   [TCP] Postel, J., "Transmission Control Protocol - DARPA Internet
   Program Protocol Specification", RFC 793, September 1981
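
   The option layout in section 3.1 and the bounds checks called for in
   section 6.1 can be illustrated with a short, non-normative sketch.
   It is not part of the specification. The names RdmaOption,
   parse_rdma_option and check_bounds are hypothetical, as are the
   sample XID and the choice to leave the RID unset for the 2-byte and
   4-byte forms (an interpretation of sections 3.1.6 and 3.2, which do
   not define RID handling for those lengths).

```python
import struct
from dataclasses import dataclass
from typing import Optional

RDMA_OPTION_KIND = 25            # option kind from section 3.1
VALID_LENGTHS = (2, 4, 16, 20)   # option lengths listed in section 3.1


@dataclass
class RdmaOption:
    """Hypothetical in-memory form of one decoded RDMA option."""
    length: int
    aligned: bool = False                # A bit (bit 7 of byte 2)
    unsolicited: bool = False            # U bit (bit 6 of byte 2)
    rid: Optional[int] = None            # 46-bit RID, 16/20-byte forms only
    buffer_offset: Optional[int] = None
    data_offset: Optional[int] = None
    data_length: Optional[int] = None
    total_length: Optional[int] = None   # present only in the 20-byte form


def parse_rdma_option(opt: bytes) -> RdmaOption:
    """Decode one RDMA option, starting at its kind byte."""
    if len(opt) < 2 or opt[0] != RDMA_OPTION_KIND:
        raise ValueError("not an RDMA option")
    length = opt[1]
    if length not in VALID_LENGTHS or len(opt) < length:
        raise ValueError("bad RDMA option length")
    parsed = RdmaOption(length=length)
    if length == 2:                      # negotiation-only form (section 3.2)
        return parsed
    parsed.aligned = bool(opt[2] & 0x80)
    parsed.unsolicited = bool(opt[2] & 0x40)
    if length == 4:                      # assumption: only the flags matter
        return parsed
    # RID bits 40-45 sit in bits 0-5 of byte 2; bits 0-39 fill bytes 3-7.
    parsed.rid = ((opt[2] & 0x3F) << 40) | int.from_bytes(opt[3:8], "big")
    (parsed.buffer_offset, parsed.data_offset,
     parsed.data_length) = struct.unpack_from("!IHH", opt, 8)
    if length == 20:
        (parsed.total_length,) = struct.unpack_from("!I", opt, 16)
    return parsed


def check_bounds(o: RdmaOption, payload_len: int, buffer_len: int) -> bool:
    """Apply the limits of sections 3.1.2, 3.1.3 and 6.1 to one segment."""
    if o.rid is None:
        return True                      # short forms place no data
    if o.data_offset + o.data_length > payload_len:
        return False                     # describes bytes past the payload
    if o.buffer_offset + o.data_length > buffer_len:
        return False                     # would overrun the receive buffer
    return True


# Hypothetical option for segment 4 of the first table in section 4,
# tagged with an NFS-style RID (operation index 0, made-up XID).
rid = 0x12345678
example = (bytes([RDMA_OPTION_KIND, 20, 0x80 | (rid >> 40)])
           + (rid & 0xFFFFFFFFFF).to_bytes(5, "big")
           + struct.pack("!IHHI", 2000, 100, 900, 4000))
opt = parse_rdma_option(example)
print(opt)
print(check_bounds(opt, payload_len=1000, buffer_len=4000))
```

   A receiver would run a check of this kind before copying any RDMA
   data. The sketch rejects a segment whose data would land past the
   end of the receive buffer, which is the overrun case described in
   section 6.1. How a real receiver maps a RID to a buffer and its
   length is left open here.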