RDMA Connection Manager Private Data For RPC-Over-RDMA Version 1

Oracle Corporation

1015 Granger Avenue Ann Arbor MI 48104 United States of America +1 248 816 6463 chuck.lever@oracle.com

Transport; Network File System Version 4; NFS-Over-RDMA

This document specifies the format of RDMA-CM Private Data exchanged between RPC-over-RDMA version 1 peers as a transport connection is established. Such private data is used to indicate peer support for remote invalidation and larger-than-default inline thresholds. The data format is extensible.

The RPC-over-RDMA version 1 transport protocol enables the use of RDMA data transfer for upper layer protocols based on RPC. The terms "Remote Direct Memory Access" (RDMA) and "Direct Data Placement" (DDP) are introduced in "A Remote Direct Memory Access Protocol Specification".

The two most immediate shortcomings of RPC-over-RDMA version 1 are:

*  Setting up an RDMA data transfer (via RDMA Read or Write) can be costly. The small default size of messages transmitted using RDMA Send forces the use of RDMA Read or Write operations even for relatively small messages and data payloads. The original specification of RPC-over-RDMA version 1 provided an out-of-band protocol for passing inline threshold values between connected peers. However, the revised specification that obsoletes RFC 5666 eliminated support for this protocol, making it unavailable for this purpose.

*  Unlike most other contemporary RDMA-enabled storage protocols, RPC-over-RDMA version 1 has no facility that enables the use of Remote Invalidation.

RPC-over-RDMA version 1 has no means of extending its XDR definition in such a way that interoperability with existing implementations is preserved. As a result, an out-of-band mechanism is needed to help relieve these constraints for existing RPC-over-RDMA version 1 implementations.

This document specifies a simple, non-XDR-based message format designed to be passed between RPC-over-RDMA version 1 peers at the time each RDMA transport connection is first established. The purpose of this message format is two-fold:

*  To provide immediate relief from certain performance constraints inherent in RPC-over-RDMA version 1

*  To enable experimentation with parameters of the base RDMA transport over which RPC-over-RDMA runs

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

Section 3.3.2 of the RPC-over-RDMA version 1 specification defines the term "inline threshold." An inline threshold is the maximum number of bytes that can be transmitted using only one RDMA Send and one RDMA Receive. There is a pair of inline thresholds per transport connection, one for each direction of message flow.

If an incoming message exceeds the size of a receiver's inline threshold, the receive operation fails and the connection is typically terminated. To convey a message larger than a receiver's inline threshold, an NFS client uses explicit RDMA data transfer operations, which are more expensive to use than RDMA Send.

The default value of inline thresholds for RPC-over-RDMA version 1 connections is 1024 bytes in both directions (see Section 3.3.3 of the RPC-over-RDMA version 1 specification). This value is adequate for nearly all NFS version 3 procedures.

NFS version 4 COMPOUND operations are larger on average than NFS version 3 procedures, forcing clients to use explicit RDMA operations for frequently-issued requests such as LOOKUP and GETATTR. The use of RPCSEC_GSS security also increases the average size of RPC messages, due to the larger size of RPCSEC_GSS credential material included in RPC headers.

If a sender and receiver can agree on larger inline thresholds, frequently-used RPC transactions avoid the cost of explicit RDMA operations.
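As a rough illustration of the rule described above, the following Python sketch decides whether a message of a given size can be conveyed with a single RDMA Send or needs explicit RDMA data transfer. The function name and the returned strings are hypothetical and are not defined by this document; only the 1024-byte default comes from the text.

```python
DEFAULT_INLINE_THRESHOLD = 1024  # bytes; the RPC-over-RDMA version 1 default


def transfer_method(message_len: int,
                    receiver_threshold: int = DEFAULT_INLINE_THRESHOLD) -> str:
    """Return "send" if the message fits within the receiver's inline
    threshold, otherwise "explicit-rdma" (RDMA Read or Write is needed).

    Hypothetical helper for illustration only.
    """
    if message_len <= receiver_threshold:
        return "send"
    return "explicit-rdma"
```

With the default threshold, a 1025-byte message requires explicit RDMA; if the receiver's threshold were 4096 bytes, the same message could be sent inline.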

After an RDMA data transfer operation completes, an RDMA peer can use Remote Invalidation to request that the remote peer RNIC invalidate an STag associated with the data transfer.

An RDMA consumer requests Remote Invalidation by posting an RDMA Send With Invalidate Work Request in place of an RDMA Send Work Request. Each RDMA Send With Invalidate carries one STag to invalidate. The receiver of an RDMA Send With Invalidate performs the requested invalidation, and then reports that invalidation as part of the completion of a waiting Receive Work Request.

An RPC-over-RDMA responder can use Remote Invalidation when replying to an RPC request that provided Read or Write chunks. The requester thus avoids dispatching an extra Work Request, the resulting context switch, and the invalidation completion interrupt as part of completing an RPC transaction that uses chunks. The upshot is faster completion of RPC transactions that involve RDMA data transfer.

There are some important caveats which contraindicate the blanket use of Remote Invalidation:

*  Remote Invalidation is not supported by all RNICs.

*  Not all RPC-over-RDMA requester implementations can recognize when Remote Invalidation has occurred.

*  Not all RPC-over-RDMA responder implementations can generate RDMA Send With Invalidate Work Requests.

*  On one connection in different RPC-over-RDMA transactions, or in a single RPC-over-RDMA transaction, an RPC-over-RDMA requester can expose a mixture of STags, some that may be invalidated remotely and some that must not. No indication is provided at the RDMA layer as to which is which.

A responder therefore must not employ Remote Invalidation unless it is aware of support for it in its own RDMA stack, and on the requester. And, without altering the XDR structure of RPC-over-RDMA version 1 messages, it is not possible to support Remote Invalidation with requesters that mix STags that may and must not be invalidated remotely in a single RPC or on the same connection.

However, it is possible to provide a simple signaling mechanism for a requester to indicate it can deal with Remote Invalidation of any STag it has presented to a responder. There are some NFS/RDMA client implementations that can successfully make use of such a signaling mechanism.

With an InfiniBand lower layer, for example, RDMA connection setup uses the InfiniBand Connection Manager to establish a Reliable Connection. When an RPC-over-RDMA version 1 transport connection is established, the client (which actively establishes connections) and the server (which passively accepts connections) SHOULD populate the CM Private Data field exchanged as part of CM connection establishment.

The transport properties exchanged via this mechanism are fixed for the life of the connection. Each new connection presents an opportunity for a fresh exchange.

For RPC-over-RDMA version 1, the CM Private Data field is formatted as described in the following subsection. RPC clients and servers use the same format. If the capacity of the Private Data field is too small to contain this message format, the underlying RDMA transport is not managed by a Connection Manager, or the underlying RDMA transport uses Private Data for its own purposes, the CM Private Data field cannot be used on behalf of RPC-over-RDMA version 1.

The first 8 octets of the CM Private Data field MUST be formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Protocol Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |     Flags     |   Send Size   | Receive Size  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Protocol Number: This field contains a fixed 32-bit value that identifies the content of the Private Data field as an RPC-over-RDMA version 1 CM Private Data message. The value of this field MUST be 0xf6ab0e18, in network byte order. The use of this field is discussed further later in this document.

Version: This 8-bit field contains a message format version number. The value "1" in this field indicates that exactly eight octets are present, that they appear in the order described in this section, and that each has the meaning defined in this section. Further considerations about the use of this field are discussed later in this document.

Flags: This 8-bit field contains eight bit flags that indicate the support status of optional features, such as Remote Invalidation. The meaning of these flags is defined later in this document.

Send Size: This 8-bit field contains an encoded value corresponding to the maximum number of bytes this peer will transmit in a single RDMA Send. The encoding of this value is described later in this document.

Receive Size: This 8-bit field contains an encoded value corresponding to the maximum number of bytes this peer can receive with a single RDMA Receive. The encoding of this value is described later in this document.
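The eight-octet layout described above can be sketched with Python's standard struct module. This is a minimal illustration, not a normative encoder: the constant and function names are hypothetical, the Flags value is passed through as an opaque octet, and the Size arguments are assumed to be already-encoded 8-bit values.

```python
import struct

RPCRDMA_CMP_MAGIC = 0xF6AB0E18   # Protocol Number value given in the text
RPCRDMA_CMP_VERSION = 1          # message format version described here

# "!" selects network byte order: one 32-bit unsigned integer followed by
# four unsigned octets (Version, Flags, Send Size, Receive Size).
_LAYOUT = "!IBBBB"


def pack_private_data(flags: int, send_size: int, receive_size: int) -> bytes:
    """Build the first 8 octets of the CM Private Data field.

    All three arguments are raw 8-bit field values; struct.pack raises
    struct.error if any of them does not fit in one octet.
    """
    return struct.pack(_LAYOUT, RPCRDMA_CMP_MAGIC, RPCRDMA_CMP_VERSION,
                       flags, send_size, receive_size)
```

For example, pack_private_data(0, 3, 3) produces the octets f6 ab 0e 18 01 00 03 03.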

RPC-over-RDMA version 1 implementations that support the extension described in this document are intended to interoperate fully with RPC-over-RDMA version 1 implementations that do not recognize the exchange of CM Private Data. When a peer does not receive a CM Private Data message that conforms to the format described above, it MUST act as if the remote peer supports only the default RPC-over-RDMA version 1 settings, as defined in the RPC-over-RDMA version 1 specification. In other words, the peer is to behave as if a Private Data message was received in which bit 8 of the Flags field is zero, and both Size fields contain the value zero.

The Protocol Number field is provided in order to distinguish RPC-over-RDMA version 1 Private Data from private data inserted by layers below or above RPC-over-RDMA version 1. During connection establishment, RPC-over-RDMA version 1 implementations check for this protocol number before decoding subsequent fields. If this protocol number is not present as the first 4 octets, an RPC-over-RDMA receiver MUST ignore the CM Private Data (i.e., behave as if no RPC-over-RDMA version 1 Private Data has been provided).
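The receiver-side fallback described above might look like the following Python sketch. The function name, the dictionary keys, and the choice to represent settings as a dictionary are all hypothetical. The sketch returns the Version field unmodified and does not attempt to handle versions other than 1, since that behavior is not defined here.

```python
import struct

RPCRDMA_CMP_MAGIC = 0xF6AB0E18  # Protocol Number value given in the text

# Settings to assume when no conforming message is received: Flags zero
# and both Size fields zero, as the text requires.
DEFAULT_SETTINGS = {"version": 1, "flags": 0, "send_size": 0, "receive_size": 0}


def peer_settings(private_data):
    """Decode a peer's CM Private Data, falling back to the defaults.

    private_data is a bytes object, or None if the transport delivered
    no Private Data at all.
    """
    if private_data is None or len(private_data) < 8:
        return dict(DEFAULT_SETTINGS)
    magic, version, flags, send_size, receive_size = struct.unpack_from(
        "!IBBBB", private_data)
    if magic != RPCRDMA_CMP_MAGIC:
        # Some other layer's private data: ignore it entirely.
        return dict(DEFAULT_SETTINGS)
    return {"version": version, "flags": flags,
            "send_size": send_size, "receive_size": receive_size}
```

Checking the protocol number before touching any other field mirrors the order of operations the text prescribes.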

Because the first 8 octets of the Private Data format described above contain a Version field, subsequent versions of this data structure MUST also start with these 8 octets exactly as they appear here. However, the Private Data format described here can be extended by adding additional fields which follow the first eight octets, or by making use of one of the bits in the Flags field that is marked reserved in this document. To introduce such changes while preserving interoperability, a new Version number is to be allocated, and new fields and bit flags are to be defined. A description of how receivers should behave if they do not recognize the new format is to be provided as well. Such situations may be addressed by specifying the new format in a document updating this one.

The bits in the Flags field are labeled from bit 8 to bit 15, as shown in the diagram above. When the Version field contains the value "1", the bits in the Flags field have the following meaning:

Bit 8: When both connection peers have set this flag in their CM Private Data, the responder MAY use RDMA Send With Invalidate when transmitting RPC Replies. Each RDMA Send With Invalidate MUST invalidate an STag associated only with the XID in the rdma_xid field of the RPC-over-RDMA Transport Header it carries. When either peer on a connection clears this flag, the responder MUST use only RDMA Send when transmitting RPC Replies.

Bits 9 - 15: These bits are reserved and MUST be zero.
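The Remote Invalidation rule above can be sketched as a small decision function for a responder. This is illustrative only: the function name and return values are hypothetical, the two flag arguments are assumed to have already been extracted from each peer's Flags field as booleans (the mapping of the flag to a bit mask is deliberately left out), and picking the first STag is an arbitrary choice made for the sketch.

```python
from typing import Optional, Sequence, Tuple


def choose_reply_send(local_flag: bool,
                      remote_flag: bool,
                      request_stags: Sequence[int]) -> Tuple[str, Optional[int]]:
    """Pick the Work Request type for an RPC Reply.

    request_stags holds the STags the requester supplied for the RPC
    (XID) being replied to; only those may be invalidated remotely.
    Returns (operation, stag_to_invalidate).
    """
    if local_flag and remote_flag and request_stags:
        # Permitted (MAY), never required; the first STag is an
        # arbitrary pick for illustration.
        return ("SEND_WITH_INVALIDATE", request_stags[0])
    # Either peer cleared the flag, or the request carried no chunks.
    return ("SEND", None)
```

Because the use of RDMA Send With Invalidate is optional even when both flags are set, returning a plain Send is always a valid choice.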

Inline threshold sizes from 1KB to 256KB can be represented in the Send Size and Receive Size fields. A sender computes the encoded value by dividing the actual value by 1024 and subtracting one from the result. A receiver decodes this value by performing a complementary set of operations. The requester MUST use the smaller of its own send size and the responder's reported receive size as the requester-to-responder inline threshold. The responder MUST use the smaller of its own send size and the requester's reported receive size as the responder-to-requester inline threshold.
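The size encoding and the threshold selection rules above can be sketched as follows. The function names are hypothetical. The sketch also assumes that actual sizes are exact multiples of 1024 bytes and rejects other values, because the text does not say how a non-multiple would be rounded.

```python
def encode_size(nbytes: int) -> int:
    """Encode an inline threshold in bytes as an 8-bit field value."""
    if nbytes % 1024 != 0 or not 1024 <= nbytes <= 256 * 1024:
        raise ValueError("size must be a multiple of 1024 from 1KB to 256KB")
    return nbytes // 1024 - 1


def decode_size(encoded: int) -> int:
    """Complementary operation: add one, then multiply by 1024."""
    if not 0 <= encoded <= 255:
        raise ValueError("encoded size must fit in one octet")
    return (encoded + 1) * 1024


def inline_thresholds(req_send: int, req_recv: int,
                      resp_send: int, resp_recv: int):
    """Return (requester_to_responder, responder_to_requester) in bytes.

    Arguments are decoded sizes in bytes: each peer's own send size and
    the receive size it reported.
    """
    return (min(req_send, resp_recv), min(resp_send, req_recv))
```

For example, an encoded value of 0 corresponds to the 1024-byte default, 3 corresponds to 4096 bytes, and 255 corresponds to 256KB. A requester that sends up to 4096 bytes to a responder that can receive only 2048 bytes ends up with a 2048-byte requester-to-responder threshold.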

In accordance with the guidelines for writing an IANA Considerations section in RFCs, the author requests that IANA create a new registry in the "Remote Direct Data Placement" Protocol Category Group. The new registry is to be called the "RDMA-CM Private Data Identifier Registry". This is a registry of 32-bit numbers that identify the Upper Layer protocol associated with data that appears in the RDMA-CM Private Data area.

To add an entry to this registry, an IESG-approved Standards Track specification must be provided that defines the semantics and interoperability requirements of the proposed new value and the fields to be recorded in the registry. The fields in this registry include: Protocol number, Protocol name, RFC Reference.

The initial contents of this registry are a single entry:

   Protocol number: 0xf6ab0e18
   Protocol name: RPC-over-RDMA version 1 CM Private Data
   RFC Reference: this specification

All other values are available to IANA for assignment. New protocol numbers can be assigned at random as long as they do not conflict with existing entries in this registry.

Allocation Policy: Standards Action

RDMA-CM Private Data typically traverses the link layer in the clear. A man-in-the-middle attack could alter the settings exchanged at connect time such that one or both peers might perform operations that result in premature termination of the connection.

Key words for use in RFCs to Indicate Requirement Levels. In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.

A Remote Direct Memory Access Protocol Specification. This document defines a Remote Direct Memory Access Protocol (RDMAP) that operates over the Direct Data Placement Protocol (DDP protocol). RDMAP provides read and write services directly to applications and enables data to be transferred directly into Upper Layer Protocol (ULP) Buffers without intermediate data copies. It also enables a kernel bypass implementation. [STANDARDS-TRACK]

Direct Data Placement Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security. This document analyzes security issues around implementation and use of the Direct Data Placement Protocol (DDP) and Remote Direct Memory Access Protocol (RDMAP). It first defines an architectural model for an RDMA Network Interface Card (RNIC), which can implement DDP or RDMAP and DDP. The document reviews various attacks against the resources defined in the architectural model and the countermeasures that can be used to protect the system. Attacks are grouped into those that can be mitigated by using secure communication channels across the network, attacks from Remote Peers, and attacks from Local Peers. Attack categories include spoofing, tampering, information disclosure, denial of service, and elevation of privilege. [STANDARDS-TRACK]

Remote Direct Memory Access Transport for Remote Procedure Call Version 1. This document specifies a protocol for conveying Remote Procedure Call (RPC) messages on physical transports capable of Remote Direct Memory Access (RDMA). This protocol is referred to as the RPC-over-RDMA version 1 protocol in this document. It requires no revision to application RPC protocols or the RPC protocol itself. This document obsoletes RFC 5666.

Guidelines for Writing an IANA Considerations Section in RFCs. Many protocols make use of points of extensibility that use constants to identify various protocol parameters. To ensure that the values in these fields do not have conflicting uses and to promote interoperability, their allocations are often coordinated by a central record keeper. For IETF protocols, that role is filled by the Internet Assigned Numbers Authority (IANA). To make assignments in a given registry prudently, guidance describing the conditions under which new values should be assigned, as well as when and how modifications to existing values can be made, is needed. This document defines a framework for the documentation of these guidelines by specification authors, in order to assure that the provided guidance for the IANA Considerations is clear and addresses the various issues that are likely in the operation of a registry. This is the third edition of this document; it obsoletes RFC 5226.

Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. RFC 2119 specifies common key words that may be used in protocol specifications. This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the defined special meanings.

InfiniBand Architecture Specification Volume 1. InfiniBand Trade Association.

NFS Version 3 Protocol Specification. This paper describes the NFS version 3 protocol. This paper is provided so that people can write compatible implementations. This memo provides information for the Internet community. This memo does not specify an Internet standard of any kind.

Remote Direct Memory Access Transport for Remote Procedure Call. This document describes a protocol providing Remote Direct Memory Access (RDMA) as a new transport for Remote Procedure Call (RPC). The RDMA transport binding conveys the benefits of efficient, bulk-data transport over high-speed networks, while providing for minimal change to RPC applications and with no required revision of the application RPC protocol, or the RPC protocol itself. [STANDARDS-TRACK]

Network File System (NFS) Version 4 Protocol. The Network File System (NFS) version 4 protocol is a distributed file system protocol that builds on the heritage of NFS protocol version 2 (RFC 1094) and version 3 (RFC 1813). Unlike earlier versions, the NFS version 4 protocol supports traditional file access while integrating support for file locking and the MOUNT protocol. In addition, support for strong security (and its negotiation), COMPOUND operations, client caching, and internationalization has been added. Of course, attention has been applied to making NFS version 4 operate well in an Internet environment. This document, together with the companion External Data Representation (XDR) description document, RFC 7531, obsoletes RFC 3530 as the definition of the NFS version 4 protocol.

Remote Procedure Call (RPC) Security Version 3. This document specifies version 3 of the Remote Procedure Call (RPC) security protocol (RPCSEC_GSS). This protocol provides support for multi-principal authentication of client hosts and user principals to a server (constructed by generic composition), security label assertions for multi-level security and type enforcement, structured privilege assertions, and channel bindings. This document updates RFC 5403.

Thanks to Christoph Hellwig and Devesh Sharma for suggesting this approach, and to Tom Talpey for his comments and review. The author also wishes to thank Bill Baker and Greg Marsden for their support of this work. Special thanks go to Transport Area Director Spencer Dawkins, NFSV4 Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4 Working Group Secretary Thomas Haynes.