Network Working Group                                         S. DiCecco
Internet-Draft                                               J. Williams
                                                           GigaNet, Inc.
Expires January 2001                                       July 14, 2000


                         VI / TCP (Internet VI)

Status of this memo

This document is an Internet-Draft and is offered in full accordance
with all provisions of Section 10 of RFC 2026. Internet-Drafts are
working documents of the Internet Engineering Task Force (IETF), its
areas, and its working groups. Note that other groups may also
distribute working documents as Internet-Drafts.

Internet-Draft documents are valid for a maximum of six months and may
be updated, replaced, or rendered obsolete by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress".

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/lid-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
memo are to be interpreted as described in RFC2119.

Table of Contents

1  Abstract
2  Overview
   2.1  VI Architectural Components
   2.2  VI/TCP
        2.2.1  Extensions to VI
        2.2.2  VI/TCP Overview
               2.2.2.1  Basic VI Components
               2.2.2.2  Introduction to VI/TCP
                        2.2.2.2.1  VI/TCP Addressing
                        2.2.2.2.2  VI/TCP Connection Management
                        2.2.2.2.3  VI/TCP Protocol Messaging
                        2.2.2.2.4  TCP/IP Options and VI/TCP
                        2.2.2.2.5  VI/TCP Retransmissions
                        2.2.2.2.6  Note on Outstanding RDMA Reads
3  The VI/TCP Protocol
   3.1  VI/TCP Segment Header
   3.2  VI/TCP Connection Establishment (CE) Header
   3.3  VI/TCP RDMA Header
4  VI/TCP Connection Establishment
   4.1  Basic Connection Establishment Timeline
   4.2  Connection Establishment - Active
   4.3  Connection Establishment - Passive
5  Security Considerations
6  References
7  Author's Addresses

1.
Abstract

The Virtual Interface (VI) architecture [VIAR] describes a high
performance design for interfacing distributed applications to
accelerated protocol processing. VI seeks to improve the performance of
such applications by reducing the latency and overheads associated with
standard communications protocol stack processing. VI greatly reduces
the processing overhead associated with traditional network
architectures by providing applications a protected, directly
accessible interface to network hardware - a Virtual Interface.

This memo describes extensions to the VI Architecture designed to
facilitate operation over TCP/IP. These extensions take the form of
enhancements to the VI Provider Library API defined in the VI
Architecture Developer's Guide [VIDG], and a "VI Protocol" which
supports VI functionality during operation over TCP/IP.

The extensions to the VI Architecture which support operation over
TCP/IP are intended to be fully compliant with the VI Architecture
[VIAR] and its associated Developer's Guide [VIDG].

2. Overview

This section contains a brief overview of VI components and a
functional overview of VI operation over TCP/IP.

2.1. VI Architectural Components

VI comprises four architectural components - Virtual Interfaces,
Completion Queues, VI Providers, and VI Consumers.

Virtual Interfaces (VIs) are the mechanisms that allow VI Consumers
direct access to the data transfer services of VI Providers. VI
Consumers post data transfer requests, in the form of Descriptors,
directly to the VI Provider. Descriptors are structures that contain
the information necessary for the VI Provider to process the data
transfer (e.g., data location). Descriptors are posted to Work Queues
(send and receive) associated with the VI. Facilities are provided to
signal VI Descriptor postings to the network adapter.
Processing of posted Descriptors is asynchronous, and Descriptors are
marked when processing completes. VI Consumers remove completed
Descriptors from Work Queues for reuse in subsequent requests.

Completion Queues provide a facility whereby VI Consumers can create a
single point of notification for processing completed Descriptors. Once
a Work Queue is associated with a Completion Queue, all completions for
that Work Queue are handled via that Completion Queue.

The VI Provider consists of a physical network interface (NIC) and
driver functionality. The VI NIC implements the Virtual Interfaces and
Completion Queues, and directly performs data transfers. VI NIC drivers
provide the control and resource management functions to maintain the
VI between consumers and VI NICs.

VI Consumers are typically application programs and their supporting
operating system functions. VI Consumers represent the users of a
Virtual Interface. Access to the Virtual Interface is through a library
referred to as the VI Provider Library [VIDG].

The VI Provider Library provides an application programming interface
for hardware connection, endpoint creation and destruction, connection
management, memory handling, data transfer, queue management,
informational queries, name services, and error handling.

2.2. VI/TCP

This section introduces the fundamentals of VI operation over TCP.

2.2.1. Extensions to VI

The proposed protocol supports the VI Architecture as currently
defined. In addition, the protocol supports certain enhancements to VI.
Extensions to the API defined in [VIDG] would be required to exploit
such enhancements. Proposed enhancements are as follows:

- Descriptor Flow Control: Transmit Descriptors may be posted in
  advance of the corresponding receive Descriptors. The VI Provider
  will supply flow control.

- Security Field: A security field is contained in the Connect Request
  and Connect Accept PDUs.
  Use of this field is to be defined.

- Attribute Negotiation: The VI Architecture requires that incoming
  connection establishment attempts be rejected unless the calling and
  called VI Attributes match (e.g., Maximum Transfer Unit size). The
  protocol permits downward negotiation of MTU sizes.

2.2.2. VI/TCP Overview

This section provides an overview of how the components of a Virtual
Interface are created, managed, and destroyed, and also introduces the
data transfer models.

2.2.2.1. Basic VI Components

Operations on basic VI architectural components remain largely
unchanged with VI/TCP. VI functionality is invoked by a VI Consumer
through the API defined in [VIDG].

Access to a VI NIC is achieved by opening a handle to the driver
representing the NIC. This handle is used in subsequent operations. All
memory used in data transfer is "registered" with the VI Provider.
Memory handles are used to identify the region and to qualify virtual
memory addresses.

VIs are created by the VI Provider upon request by the VI Consumer.
Creating a VI does not establish a connection, and no data transfer can
occur until the VI is connected to another VI. VI Work Queues may be
associated with Completion Queues to provide a single handling point
for completed VI Descriptors.

VI provides a connection-oriented data transfer service. Newly created
VIs are not pre-associated with other VIs; a VI must be explicitly
connected to another to enter its data transfer phase.

VI provides two types of data transfers - traditional Send/Receive, and
Remote Direct Memory Access (RDMA).

2.2.2.2. Introduction to VI/TCP

This section serves as an introduction to VI operation over TCP/IP.

2.2.2.2.1. VI/TCP Addressing

The VI Architecture defines a generic "VI Network Address" format
consisting of an "address" portion and a "discriminator" portion.
With VI/TCP, the address portion contains an IP address and the
discriminator is as defined by the VI Architecture [VIAR].

One transport layer port is reserved for passive connection
establishment. All incoming VI connections arrive through this port,
and VI applications distinguish themselves by the VI Network Address
discriminator. For active connection establishment, multiple transport
layer ports are used.

2.2.2.2.2. VI/TCP Connection Management

With VI/TCP, all VI connections are implemented over an underlying TCP
connection. The VI/TCP connection establishment process requires an
underlying TCP connection over which VI/TCP protocol messages may be
exchanged.

VI connections have a one-to-one correspondence with TCP connections.
This pairing is referred to as the VI/TCP connection. When a VI
connection is closed, the underlying TCP connection must be closed.
Similarly, when a TCP connection is closed, the associated VI
connection must be closed.

VI Providers handling VIP_Connect Request primitives [VIDG] first
request that TCP establish its connection and then perform VI/TCP
protocol messaging over this underlying connection. VI Providers must
have accepted an underlying TCP connection before the associated VI
connection is accepted. VI/TCP Providers MUST check that
address/handles are valid for the underlying connection.

From the perspective of a VI/TCP Provider, TCP connection setup is an
atomic operation that either succeeds or fails. If the operation
succeeds, VI connection establishment is initiated; otherwise, the VI
connection is rejected.

2.2.2.2.3. VI/TCP Protocol Messaging

VI/TCP functionality is invoked by a VI Consumer through the API
defined in [VIDG]. The VI Provider supplies this functionality through
use of the VI Protocol. The VI Protocol defines "messages" to implement
these VI functions (e.g., connection establishment).
Typically, there is one message per Transmit Descriptor. Each message
has a type (e.g., RDMA Write).

VI messages are divided into "segments". These segments are sent, in
order, over the associated TCP connection. It is recommended, but not
required, that each TCP segment carry at most one VI segment and that
VI segments not be fragmented to span multiple TCP segments.

All segments for one VI message will be transmitted before the next
message is started. An exception is provided in that RDMA Read Response
segments may be interleaved with segments of any message type other
than another RDMA Read Response.

2.2.2.2.4. TCP/IP Options and VI/TCP

It is strongly recommended that TCP connections supporting VI/TCP
implement the timestamp option for PAWS (protection against wrapped
sequence numbers) as defined in RFC1323, TCP Extensions for High
Performance [PAWS].

2.2.2.2.5. VI/TCP Retransmissions

VI/TCP will retransmit dropped segments, as required. It is recommended
that retransmitted segments contain the same data as the original
dropped segment. In certain circumstances, this will not be possible
without undue burden on an implementation. The following exceptions are
noted:

- The "RX Descriptors Posted" field of the protocol's "VI Segment
  Header" may contain the current value instead of the value at the
  time of its original posting. Senders must not assume that this value
  has actually been received, since the acknowledgement provides no way
  to determine which of the two values was received.

- Retransmission is required, but data access results in an access
  violation and retransmission cannot occur.

- Retransmission is required, but cannot occur because the VI/TCP
  connection has been closed.

If a VI NIC is unable to retransmit the original data, it may pad the
segment and should set the "Transmit Error" bit in the "Type" field of
the "VI Segment Header".

2.2.2.2.6.
Note on Outstanding RDMA Reads

For each RDMA Read Request received, memory allocated for the request
must be held until the response is acknowledged. The number of
outstanding RDMA Reads must be limited to control resource exhaustion.
Discarding excessive RDMA Reads pending completion of outstanding
requests does not seem viable in the absence of a deadlock avoidance
mechanism.

It is proposed that the VI Protocol be extended to provide negotiation
of the number of outstanding RDMA Reads during connection
establishment. This number would represent a per VI limit and the
negotiated value would remain for the lifetime of the VI/TCP
connection.

3. The VI/TCP Protocol

This section provides the VI/TCP protocol data unit formats. All
multibyte formats are to be represented in network byte order (i.e.,
big-endian).

Each VI/TCP PDU contains a VI Segment Header. Optionally, an RDMA
Header or CE (connection establishment) Header may be present. The
VI/TCP Segment Header provides sufficient features to support non-RDMA
send/receives. The RDMA Header must be included for RDMA transfers. The
CE Header must be included for connection establishment.

Encapsulations are as follows:

   |          Lower Layer Headers           |
   +----------------------------------------+
   |               IP Header                |
   +----------------------------------------+
   |               TCP Header               |
   +----------------------------------------+
   |         VI/TCP Segment Header          |
   +----------------------------------------+
   |  (optional) VI/TCP RDMA or CE Header   |
   +----------------------------------------+
   |         VI/TCP Consumer's Data         |

3.1. VI/TCP Segment Header

The VI/TCP Segment Header is defined as follows.
|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|    Version    |  Type/Flags   |        Segment Length         |
+---------------+---------------+---------------+---------------+
|                          Data Offset                          |
+---------------+---------------+---------------+---------------+
|                        Immediate Data                         |
+---------------+---------------+---------------+---------------+
|                        Message Number                         |
+---------------+---------------+---------------+---------------+
|                          Message ACK                          |
+---------------+---------------+---------------+---------------+
|     Rx Descriptors Posted     |       Remote Error Code       |
+---------------+---------------+---------------+---------------+
|                       VI/TCP User Data                        |
|                                                               |

Version

   This is an 8-bit field indicating the VI/TCP version.

Type/Flags

   Type is a 5-bit field indicating the packet type. Valid message
   types are:

   - Send
   - RDMA Read Request
   - RDMA Read Response
   - RDMA Write
   - NOP
   - Connect Request
   - Connect Accept
   - Connect Reject
   - Connect No Match

   Flags is a 3-bit field defined as follows:

   - BIT 1 : Immediate Data Valid
             As defined by the VI Architecture [VIAR]
   - BIT 2 : End of Message
             Indicates the current segment is the last of a message
   - BIT 3 : Transmit Error
             Indicates either transmit length or protection error

Segment Length

   Segment Length is a 16-bit field containing the length of the VI/TCP
   segment, including the VI/TCP Segment Header.

Data Offset

   For the initial segment of any message, this 32-bit field will
   contain zero. For subsequent segments, it will contain the number of
   bytes already transferred for this message in prior segments. Only
   segment payload is included in this count; headers are specifically
   not included.

Immediate Data

   May hold 32 bits of optional user data as described by the VI
   Architecture [VIAR].
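The Segment Length, Data Offset, and End of Message rules above can be illustrated with a minimal sketch. It assumes a plain Send message that carries only the VI/TCP Segment Header (no RDMA or CE Header), and it takes the header size of 24 bytes from the six 32-bit rows drawn in the diagram. The function name plan_segments, the SegmentPlan tuple, and the max_payload parameter are hypothetical; the memo does not define a per-segment payload limit, so the caller picks one here.

```python
from typing import List, NamedTuple

# Size of the VI/TCP Segment Header as drawn in the diagram: six 32-bit words.
SEGMENT_HEADER_LEN = 24


class SegmentPlan(NamedTuple):
    data_offset: int      # payload bytes already sent in prior segments
    segment_length: int   # header plus payload, as carried in Segment Length
    end_of_message: bool  # True only for the last segment of the message


def plan_segments(message_len: int, max_payload: int) -> List[SegmentPlan]:
    """Split one message of message_len payload bytes into segments.

    max_payload is a hypothetical per-segment payload limit chosen by the
    caller; it is an assumption of this sketch, not a protocol field.
    """
    if message_len < 0 or max_payload <= 0:
        raise ValueError("message_len must be >= 0 and max_payload > 0")
    if SEGMENT_HEADER_LEN + max_payload > 0xFFFF:
        raise ValueError("segment would not fit the 16-bit Segment Length")

    plans: List[SegmentPlan] = []
    offset = 0  # Data Offset is zero for the initial segment
    while True:
        payload = min(max_payload, message_len - offset)
        last = offset + payload == message_len
        # Segment Length counts the header; Data Offset counts payload only.
        plans.append(SegmentPlan(offset, SEGMENT_HEADER_LEN + payload, last))
        offset += payload
        if last:
            return plans
```

For example, plan_segments(3000, 1400) yields three segments with Data Offsets 0, 1400, and 2800, and only the third has End of Message set. A zero-length message still produces one header-only segment, which is one possible reading of the rules rather than something the memo states.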
Message Number

   Messages are sequentially numbered by the VI/TCP Provider. The
   initial Message Number may be varied by an implementation. For RDMA
   Read Responses, Message Number carries the message number of the
   corresponding RDMA Read Request. Any two segments with the same type
   are part of the same message if and only if their message numbers
   are equal. Segments within a message may be processed out of order.
   Reordering messages is not supported.

Rx Descriptors Posted

   Indicates the number of receive Descriptors, modulo 2^16, that have
   been posted during the lifetime of the VI/TCP connection. If
   Descriptor flow control is in effect, the VI/TCP Provider must delay
   any transmission which would consume receive Descriptors until
   receive Descriptors complete and become available.

Notes on NOP

   NOPs are used to send a Message ACK when a VI has no data to
   transmit.

   If a VI has data to transmit, the Rx Descriptors Posted number is
   included in the transferred segments. However, if a transmitter is
   idle, a NOP is used to convey Rx Descriptors Posted in the absence
   of user data message segments. Idle VI/TCP connections that have an
   updated Rx Descriptors Posted value must use NOP to convey this
   information.

   NOPs should be sent whenever Rx Descriptors have been posted since
   the last segment was sent and the number of unused credits falls
   below some threshold. Unused credits are the number of Rx
   Descriptors of which the remote VI has been notified but which it
   has not yet used. The threshold value should be set at the time of
   connection establishment.

Message ACK

   Message ACK is valid only for VI/TCP connections at the Reliable
   Delivery Level [VIAR].

   Message ACK is used in conjunction with Remote Error Code to provide
   information relating to memory protection or VI Descriptor errors
   and also to provide facilities for implementation specific error
   handling.
   If the VI Error subfield of the Remote Error Code (Remote Error
   Codes, next section) indicates "No Error", then Message ACK contains
   the Message Number from the last VI Message received without error.
   If VI Error is OTHER THAN "No Error", then Message ACK contains the
   Message Number of the message segment in error.

   Message ACK should not be indicated for a message until the message
   data has been written to host memory and the doorbell has been rung.

   Message ACKs may be included in any VI Message Segment, including
   that of a NOP message.

   When a VI/TCP connection is supporting level Reliable Reception, the
   Message ACK field must be valid and will be used to determine when
   transmit Descriptors will be completed. RDMA Reads are completed
   upon receipt of a valid response.

   Message ACKs are also indicated for messages received in error. In
   this case, the VI Error Type field of Remote Error Code is set to
   reflect the appropriate VI error. Remote Error Code is defined in
   the following paragraph. Message ACK is invalid on subsequent
   messages.

   When a VI/TCP connection is supporting level Reliable Delivery or
   Unreliable Delivery, the contents of Message ACK are undefined and
   must be ignored by a receiver.

Remote Error Code

   Remote Error Code comprises two subfields - the VI Error Type and
   the IS Error Code. Both VI Error Type and IS Error Code apply to the
   VI message identified by the Message ACK field. These are defined in
   the following paragraphs.

IS Error Code

   IS Error Code is an Implementation Specific error code; its
   semantics are implementation dependent and considered outside the
   scope of this document. If the IS Error Code is set, Message ACK
   must be set to indicate the VI Message on which the error occurred.
   Note that VI errors (see next paragraph) and local errors need not
   be mutually exclusive, and this field may be used to provide
   supplemental status information.
VI Error Type

   VI Error Type indicates one of the following:

   - No Error
   - RDMA Memory Protection Error
   - VI Descriptor Error

   When a VI/TCP connection is supporting level Reliable Reception, the
   VI Error Type field must be valid and is used to update the Status
   of the VI Descriptor's Control Segment.

   When a VI/TCP connection is supporting level Reliable Delivery or
   Unreliable Delivery, VI Error Type shall indicate No Error and be
   ignored by a receiver.

3.2. VI/TCP Connection Establishment (CE) Header

The VI/TCP CE Header is defined as follows.

|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|      Calling Attributes       | Calling Discriminator Length  |
+---------------+---------------+---------------+---------------+
|                           MTU Size                            |
+---------------+---------------+---------------+---------------+
|                                                               |
+-                                                             -+
|                                                               |
+-                    Calling Discriminator                    -+
|                                                               |
+-                                                             -+
|                                                               |
+---------------+---------------+---------------+---------------+
|   Calling RDMA Read Window    |  Called Discriminator Length  |
+---------------+---------------+---------------+---------------+
|                                                               |
+-                                                             -+
|                                                               |
+-                    Called Discriminator                     -+
|                                                               |
+-                                                             -+
|                                                               |
+---------------+---------------+---------------+---------------+
|                     Security Information                      |

Calling Attributes

   The Calling Attributes field contains the following flag bits:

   - Reliable : Reliable reception and delivery are merged
   - RDMA Write Enable
   - RDMA Read Enable
   - Descriptor Flow Control Enabled
   - Peer-to-peer Connection Establishment

Calling/Called Discriminator and Discriminator Lengths

   These fields are as defined by the VI Architecture [VIAR].

MTU Size

   MTU Size is "proposed" in Connect Request PDUs and is considered an
   "agreed"
   value in a Connect Accept. The agreed value must be the lesser of
   the calling and called VI/TCP Providers' MTU capabilities.

Security Information

   To be supplied. A VI/TCP Consumer may use this feature as an
   additional basis for accepting or rejecting calls. This feature is
   currently unsupported by the VI Architecture [VIAR] and the
   programming API [VIDG]. Extensions to the API to expose this feature
   are to be supplied.

3.3. VI/TCP RDMA Header

The VI/TCP RDMA Header is defined as follows.

|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|                                                               |
+-                        RDMA Address                         -+
|                                                               |
+---------------+---------------+---------------+---------------+
|                   Registered Memory Handle                    |
+---------------+---------------+---------------+---------------+
|                          RDMA Length                          |
+---------------+---------------+---------------+---------------+
|                                                               |
|                    VI/TCP Consumer's Data                     |
|                                                               |

RDMA Address

   The RDMA Address field contains the 64-bit data address of the first
   data segment from the VI Descriptor.

Registered Memory Handle

   The Registered Memory Handle field contains the Memory Handle
   returned when the region of memory containing the data segment was
   registered with the VI Provider. This is the same memory handle
   required by the VI Descriptor.

RDMA Length

   The RDMA Length field contains the length field from the VI
   Descriptor that indicates the total number of bytes to be
   transferred across all segments of a message.

4. VI/TCP Connection Establishment

This section contains the state machines governing VI/TCP connection
establishment. Both active and passive (e.g., listens) scenarios are
presented.

For peer-to-peer connection establishment, conflicts are resolved
lexicographically on IP address.
"Lower" (in the lexicographic sense) IP addressed hosts concede to
"higher" addressed hosts during peer connection establishment, and the
process reverts to that of the active/passive case. Receiving a Connect
No Match VI/TCP packet type during peer connection establishment
results in repeated attempts for a period specified by the VI
Consumer's connection timeout value.

4.1. Basic Connection Establishment Timeline

VIPL API [VIDG]     |   VI/TCP Protocol    |   VIPL API
------------------------------------------------------------------
                    |                      |
                    |                      | VipConnectWait
VipConnectRequest   |                      | <-----------------
----------------->  |                      |
                    | setup TCP connection |
                    |                      |
                    | Connect Request      |
                    | -------------------> | VipConnectWait(ret)
                    |                      | ----------------->
                    |                      | VipPostRecv
                    |                      | <-----------------
                    |                      | VipConnectAccept
                    | Connect Accept       | <----------------
VipConnectReq (ret) | <------------------- |
<-----------------  | or Connect Reject    |
                    | or Connect No Match  |

4.2. Connection Establishment - Active

The state machine governing active VI/TCP connection establishment is
as follows:

        +----------------+      (Legend: event - action)
        |  Disconnected  | <--------------------------------<+
        +----------------+                                   ^
                | VipConnectRequest                          |
                | - Setup TCP connection                     |
                |                                            |
               \|/                                           |
        +----------------+ TCP setup fail                    ^
+------>|   Connecting   +>--------------------------------->+
|       +----------------+                                   ^
| TCP Closes    | TCP connection established                 |
| -             | - ConnectRequest                           |
| Reestablish  \|/                                           |
|       +----------------+ ConnectReject or Timeout          ^
+------<| Pending Accept |>--------------------------------->+
        +----------------+ - close TCP connect.              ^
                |                                            |
                | ConnectAccept                              |
               \|/                                           |
        +----------------+ Vip or TCP disconnect             ^
        |   Connected    |>--------------------------------->+
        +----------------+ - close TCP connection

4.3.
Connection Establishment - Passive

The state machine governing passive VI/TCP connection establishment is
as follows:

+----------------+                                    +--------------+
|  Listening on  | ( Listen is implied by             + Disconnected +
|  VI/TCP Port   |   1st VipCreateVI )                +--------------+
+----------------+                                               ^
        |                                                        |
        | incoming TCP connection - Accept TCP connection        |
        | (Legend: event - action)                               |
       \|/                                                       |
+----------------+ Timeout - close TCP connection                ^
|    Incoming    |---------------------------------------------->+
+----------------+ TCP connection closes                         ^
        |                                                        |
        | incoming Connect Request                               |
       \|/                                                       |
+----------------+ No Matching Discriminator                     ^
|    Matching    +---------------------------------------------->+
+----------------+ - Send Connect NoMatch; close TCP connection  ^
        |                                                        |
        | Discriminator match                                    |
       \|/                                                       |
+----------------+ ConnectReject - VipConnectReject, close TCP   ^
| Pending Accept |---------------------------------------------->+
+----------------+                                               ^
        |                                                        |
        | VipConnectAccept - ConnectAccept                       |
       \|/                                                       |
+----------------+ Vip or TCP disconnect - close TCP connection  ^
|   Connected    |---------------------------------------------->+
+----------------+

5. Security Considerations

No special security considerations exist at this time.

6. References

[VIAR]  "Virtual Interface Architecture Specification", Compaq
        Computer Corp., Intel Corporation, Microsoft Corporation, 1997.

[VIDG]  "Intel Virtual Interface (VI) Architecture Developer's Guide",
        Intel Corporation, September 1998.

[PAWS]  Jacobson, Braden, Borman, "TCP Extensions for High
        Performance", RFC 1323, May 1992.

7. Author's Addresses

Stephen DiCecco
James Williams
GigaNet, Inc.
Concord Office Center
2352 Main Street
Concord, Massachusetts 01742

978.461.0402 (tel)
978.461.0430 (fax)
www.giganet.com

Email: sdicecco@giganet.com
       jimw@giganet.com