RDMA Connection Manager Private Data For RPC-Over-RDMA Version 1
Oracle Corporation
United States of America
chuck.lever@oracle.com
Transport
Network File System Version 4
NFS-Over-RDMA
This document specifies the format of RDMA-CM Private Data
exchanged between RPC-over-RDMA version 1 peers
as part of establishing a connection.
Such private data is used to indicate peer support
for remote invalidation
and
larger-than-default inline thresholds.
The RPC-over-RDMA version 1 transport protocol
enables payload data transfer using
Remote Direct Memory Access (RDMA)
for upper layer protocols based on Remote Procedure Calls (RPC).
The terms "Remote Direct Memory Access" (RDMA) and
"Direct Data Placement" (DDP) are introduced in
.
The two most immediate shortcomings
of RPC-over-RDMA version 1 are:
Setting up an RDMA data transfer (via RDMA Read or Write) can be costly.
The small default size of messages transmitted using RDMA Send
forces the use of RDMA Read or Write operations
even for relatively small messages and data payloads.
The original specification of RPC-over-RDMA version 1 provided
an out-of-band protocol for passing inline threshold values
between connected peers.
However,
eliminated support for this protocol making it unavailable for this purpose.
Unlike most other contemporary RDMA-enabled storage protocols,
there is no facility in RPC-over-RDMA version 1
that enables the use of remote invalidation.
RPC-over-RDMA version 1 has no means of extending its XDR definition
in such a way that interoperability with existing implementations is preserved.
As a result, an out-of-band mechanism is needed
to help relieve these constraints
for existing RPC-over-RDMA version 1 implementations.
This document specifies a simple, non-XDR-based message format
designed to be passed between RPC-over-RDMA version 1 peers
at the time each RDMA transport connection is first established.
The purpose of such a message exchange is to
enable the connecting peers to indicate support for transport
properties that are not defined in the base RPC-over-RDMA
version 1 protocol defined in
.
The message format can be extended as needed.
In addition, interoperation between
implementations of RPC-over-RDMA version 1 that present this message format to peers
and those that do not recognize this message format is guaranteed.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
"MAY", and "OPTIONAL" in this document
are to be interpreted as described in BCP 14
when, and only when, they appear in all capitals, as shown here.
Section 3.3.2 of
defines the term "inline threshold."
An inline threshold is the maximum number of bytes that
can be transmitted using one RDMA Send and one RDMA Receive.
There is a pair of inline thresholds for a connection:
a client-to-server threshold and a server-to-client threshold.
If an incoming message exceeds the size of a receiver's inline threshold,
the receive operation fails and the connection is typically terminated.
To convey a message larger than a receiver's inline threshold,
an NFS client uses explicit RDMA data transfer operations,
which are more expensive to use than RDMA Send.
The default value of inline thresholds for RPC-over-RDMA version 1
connections is 1024 bytes (see Section 3.3.3 of
).
This value is adequate for nearly all NFS version 3 procedures.
NFS version 4 COMPOUND operations
are larger on average
than NFS version 3 procedures,
forcing clients to use explicit RDMA operations
for frequently-issued requests such as LOOKUP and GETATTR.
The use of RPCSEC_GSS security also increases the average size
of RPC messages,
due to the larger size of RPCSEC_GSS credential material
included in RPC headers.
If a sender and receiver could somehow agree on larger inline thresholds,
frequently-used RPC transactions could avoid the cost of explicit RDMA operations.
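The effect of the inline threshold on a sender can be sketched as follows. This is an illustrative Python sketch only: the function name `fits_inline` and the 3000-byte example message are assumptions made for illustration and are not part of the protocol.

```python
DEFAULT_INLINE_THRESHOLD = 1024  # bytes; the RPC-over-RDMA version 1 default


def fits_inline(message_len: int,
                receiver_threshold: int = DEFAULT_INLINE_THRESHOLD) -> bool:
    """Return True if the message can be conveyed with one RDMA Send.

    A message larger than the receiver's inline threshold would instead
    have to be conveyed using explicit RDMA data transfer operations.
    """
    return message_len <= receiver_threshold


# A hypothetical 3000-byte request does not fit under the default
# threshold, but it would fit if the peers agreed on 4096 bytes.
with_default = fits_inline(3000)
with_larger = fits_inline(3000, receiver_threshold=4096)
```

Under these assumptions, `with_default` is False and `with_larger` is True, which is the cost difference that larger inline thresholds are meant to address.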
After an RDMA data transfer operation completes,
an RDMA peer can use remote invalidation
to request that the remote peer RNIC
invalidate an STag
associated with the data transfer.
An RDMA consumer requests remote invalidation by posting
an RDMA Send With Invalidate Work Request
in place of an RDMA Send Work Request.
Each RDMA Send With Invalidate carries one STag to invalidate.
The receiver of an RDMA Send With Invalidate performs the
requested invalidation and then reports that invalidation
as part of the completion of a waiting Receive Work Request.
An RPC-over-RDMA responder can use remote invalidation
when replying to an RPC request
that provided Read or Write chunks.
The requester thus avoids dispatching an extra Work Request,
the resulting context switch, and the invalidation completion interrupt
as part of completing an RPC transaction that uses chunks.
The upshot is faster completion of RPC transactions
that involve RDMA data transfer.
There are some important caveats which contraindicate
the blanket use of remote invalidation:
Remote invalidation is not supported by all RNICs.
Not all RPC-over-RDMA responder implementations can generate
RDMA Send With Invalidate Work Requests.
Not all RPC-over-RDMA requester implementations can recognize
when remote invalidation has occurred.
On one connection in different RPC-over-RDMA transactions,
or in a single RPC-over-RDMA transaction,
an RPC-over-RDMA requester can expose a mixture of STags
that may be invalidated remotely
and some that must not be.
No indication is provided at the RDMA layer as to which is which.
A responder therefore must not employ remote invalidation unless it is
aware of support for it in its own RDMA stack, and on the requester.
And, without altering the XDR structure of RPC-over-RDMA version 1 messages,
it is not possible to support remote invalidation with requesters
that mix STags that may and must not be invalidated remotely
in a single RPC or on the same connection.
However, it is possible to provide a simple signaling mechanism
for a requester to indicate it can deal with remote invalidation
of any STag it has presented to a responder.
There are some NFS/RDMA client implementations that
can successfully make use of such a signaling mechanism.
With an InfiniBand lower layer, for example,
RDMA connection setup uses a Connection Manager
when establishing a Reliable Connection.
When an RPC-over-RDMA version 1 transport connection is established,
the client (which actively establishes connections)
and the server (which passively accepts connections)
populate the CM Private Data field exchanged
as part of CM connection establishment.
The transport properties exchanged via this mechanism
are fixed for the life of the connection.
Each new connection presents an opportunity
for a fresh exchange.
For RPC-over-RDMA version 1, the CM Private Data field
is formatted as described in the following subsection.
RPC clients and servers use the same format.
If the capacity of the Private Data field is too small
to contain this message format,
the underlying RDMA transport is not managed by a Connection Manager,
or the underlying RDMA transport uses Private Data for its own purposes,
the CM Private Data field cannot be used on behalf of RPC-over-RDMA version 1.
The first 8 octets of the CM Private Data field
are to be formatted as follows:
Format Identifier: This field contains a fixed 32-bit value that identifies
the content of the Private Data field as an RPC-over-RDMA
version 1 CM Private Data message.
The value of this field is always 0xf6ab0e18, in network byte order.
The use of this field is further expanded upon in
.
Version: This 8-bit field contains a message format version number.
The value "1" in this field indicates that exactly eight octets are present,
that they appear in the order described in this section,
and that each has the meaning defined in this section.
Further considerations about the use of this field are discussed in
.
Flags: This 8-bit field contains bit flags that indicate the support
status of optional features, such as remote invalidation.
The meaning of these flags is defined in
.
Send Size: This 8-bit field contains an encoded value
corresponding to the maximum number of bytes
this peer is prepared to transmit in a single RDMA Send
on this connection.
The value is encoded as described in
.
Receive Size: This 8-bit field contains an encoded value
corresponding to the maximum number of bytes
this peer is prepared to receive with a single RDMA Receive
on this connection.
The value is encoded as described in
.
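The eight-octet layout described above can be sketched with Python's `struct` module. This is a minimal sketch, not a reference implementation: the helper names are hypothetical, and the order of the four one-octet fields (Version, Flags, Send Size, Receive Size) is assumed from the order in which this section describes them.

```python
import struct

FORMAT_IDENTIFIER = 0xF6AB0E18  # fixed value given in this section
FORMAT_VERSION = 1

# "!" selects network byte order: one 32-bit word followed by four octets.
# Assumed field order: identifier, version, flags, send size, receive size.
_LAYOUT = struct.Struct("!IBBBB")


def pack_private_data(flags: int, send_size_code: int,
                      recv_size_code: int) -> bytes:
    """Build the 8-octet message from already-encoded field values."""
    return _LAYOUT.pack(FORMAT_IDENTIFIER, FORMAT_VERSION,
                        flags, send_size_code, recv_size_code)


def unpack_private_data(data: bytes) -> tuple:
    """Split the first eight octets into their five fields."""
    return _LAYOUT.unpack_from(data)


message = pack_private_data(flags=0, send_size_code=3, recv_size_code=3)
```

Here `message` is eight octets long and begins with the octets f6 ab 0e 18. The size arguments are the encoded one-octet values, not byte counts.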
The extension described in this document is designed to allow
RPC-over-RDMA version 1 implementations that use this extension
to interoperate fully with
RPC-over-RDMA version 1 implementations that do not exchange this information.
Realizing this goal requires that implementations of this extension
follow the practices described in the rest of this section.
RPC-over-RDMA version 1 implementations that support
the extension described in this document
are intended to interoperate fully with
RPC-over-RDMA version 1 implementations
that do not recognize the exchange of CM Private Data.
When a peer does not receive a CM Private Data message
which conforms to
,
it needs to act as if the remote peer supports only the
default RPC-over-RDMA version 1 settings,
as defined in
.
In other words, the peer is to behave as if a Private Data
message was received in which bit 8 of the Flags field is zero,
and both Size fields contain the value zero.
The Format Identifier field is provided in order to distinguish
RPC-over-RDMA version 1 Private Data from private data
inserted by layers below or above RPC-over-RDMA version 1.
During connection establishment,
RPC-over-RDMA version 1 implementations
check for this protocol number before decoding subsequent fields.
If this protocol number is not present as the first 4 octets,
an RPC-over-RDMA receiver needs to ignore the CM Private Data
(i.e., behave as if no RPC-over-RDMA version 1 Private Data
has been provided).
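The receiver-side checks described in this section can be sketched as follows. The names `PeerSettings` and `peer_settings` are hypothetical, the field order is assumed from the format description, and treating any Version value other than 1 as non-conforming is a conservative assumption of this sketch rather than something stated in the text.

```python
import struct
from typing import NamedTuple

FORMAT_IDENTIFIER = 0xF6AB0E18


class PeerSettings(NamedTuple):
    flags: int
    send_size_code: int
    recv_size_code: int


# Settings assumed when no conforming message is received:
# all flags clear and both Size fields zero.
DEFAULT_SETTINGS = PeerSettings(flags=0, send_size_code=0, recv_size_code=0)


def peer_settings(private_data: bytes) -> PeerSettings:
    """Return the peer's advertised settings, or the defaults."""
    if len(private_data) < 8:
        return DEFAULT_SETTINGS
    ident, version, flags, send, recv = struct.unpack_from(
        "!IBBBB", private_data)
    # Check the Format Identifier before trusting any other field.
    if ident != FORMAT_IDENTIFIER:
        return DEFAULT_SETTINGS
    # Assumption of this sketch: only version 1 is understood.
    if version != 1:
        return DEFAULT_SETTINGS
    return PeerSettings(flags, send, recv)
```

Private data that is absent, too short, or that carries some other identifier yields `DEFAULT_SETTINGS`, so a peer that does not implement the extension is treated exactly as the base protocol requires.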
Although the message format described in this document
provides the ability for the client and server
to exchange particular information about
the local RPC-over-RDMA implementation,
it is possible that there will be a future need
to exchange additional properties.
This would make it necessary to extend or otherwise modify
the format described in this document.
Any modification faces the problem of interoperating properly
with implementations of RPC-over-RDMA version 1
that are unaware of the existence of the new format.
These include implementations that do not recognize
the exchange of CM Private Data
as well as
those that recognize only the format described in this document.
Given the message format described in this document,
these interoperability constraints could be met by the following
sorts of new message formats:
A format which uses a different value for the first four bytes of the format,
as provided for in the registry described in
.
A format which uses the same value for the Format Identifier field
and a value other than one (1) in the Version field.
Although it is possible to reorganize
the last three of the eight bytes in the existing format,
extended formats are unlikely to do so.
New formats would take the form of extensions
of the format described in this document with added fields
starting at byte eight of the format
and changes to the definition of previously reserved flags.
The bits in the Flags field are labeled from bit 8 to bit 15,
as shown in the diagram above.
When the Version field contains the value "1",
the bits in the Flags field are to be set as follows:
Bit 8: When both connection peers have set this flag in their CM Private Data,
the responder MAY use RDMA Send With Invalidate
when transmitting RPC Replies.
Each RDMA Send With Invalidate MUST invalidate an STag
associated only with the XID in the rdma_xid field
of the RPC-over-RDMA Transport Header it carries.
When either peer on a connection clears this flag,
the responder MUST use only RDMA Send when transmitting RPC Replies.
Bits 9-15: These bits are reserved and are always zero.
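The rule for enabling remote invalidation can be sketched as a simple test on both peers' Flags octets. The mask value below is an assumption: it takes "bit 8" to follow the usual IETF diagram numbering, in which bit 0 is the most significant bit of a 32-bit word, so that bit 8 is the most significant bit of the Flags octet. The function name is hypothetical.

```python
# Assumed mask for the remote invalidation flag (bit 8 of the second
# 32-bit word, i.e. the most significant bit of the Flags octet).
REMOTE_INVALIDATION_FLAG = 0x80


def responder_may_send_with_invalidate(local_flags: int,
                                       remote_flags: int) -> bool:
    """True only when both peers set the remote invalidation flag."""
    return bool(local_flags & remote_flags & REMOTE_INVALIDATION_FLAG)
```

If either peer clears the flag the function returns False, and the responder is limited to plain RDMA Send for RPC Replies.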
Inline threshold sizes from 1KB to 256KB
can be represented in the Send Size and Receive Size fields.
A sender computes the encoded value by dividing
the actual value by 1024 and subtracting one from the result.
A receiver decodes this value by performing
a complementary set of operations.
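The encoding and its inverse can be written out as a short sketch. The function names are hypothetical, and rejecting sizes that are not a multiple of 1024 is an assumption of this sketch, since the text does not say how such values would be rounded.

```python
def encode_inline_size(size_bytes: int) -> int:
    """Encode a threshold in bytes as the one-octet field value."""
    if size_bytes % 1024 != 0 or not 1024 <= size_bytes <= 256 * 1024:
        raise ValueError("size must be a multiple of 1024 from 1 KB to 256 KB")
    return size_bytes // 1024 - 1


def decode_inline_size(code: int) -> int:
    """Decode the one-octet field value back into a size in bytes."""
    if not 0 <= code <= 255:
        raise ValueError("encoded size must fit in one octet")
    return (code + 1) * 1024
```

For example, a 4096-byte threshold encodes as 3, and an encoded value of zero decodes to 1024 bytes, which matches the RPC-over-RDMA version 1 default.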
The client uses the smaller of its own send size and
the server's reported receive size
as the client-to-server inline threshold.
The server uses the smaller of its own send size and
the client's reported receive size
as the server-to-client inline threshold.
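The selection of the two thresholds can be sketched as follows, with all sizes given in bytes after decoding. The function name and the example values are assumptions made for illustration.

```python
def negotiated_thresholds(client_send: int, client_recv: int,
                          server_send: int, server_recv: int) -> tuple:
    """Return (client_to_server, server_to_client) inline thresholds."""
    client_to_server = min(client_send, server_recv)
    server_to_client = min(server_send, client_recv)
    return client_to_server, server_to_client


# Hypothetical peers: the client sends up to 4096 and receives up to 8192;
# the server sends up to 16384 and receives up to 2048.
c2s, s2c = negotiated_thresholds(4096, 8192, 16384, 2048)
```

With these example values the client-to-server threshold is 2048 bytes and the server-to-client threshold is 8192 bytes, so each direction is limited by the smaller of what one side will send and the other will receive.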
In accordance with
,
the author requests that IANA create a new registry in the
"Remote Direct Data Placement"
Protocol Category Group.
The new registry is to be called the
"RDMA-CM Private Data Identifier Registry".
This is a registry of 32-bit numbers that identify
the Upper Layer protocol associated with data
that appears in the RDMA-CM Private Data area.
The information that must be provided to add an entry to this registry will be
an IESG-approved Standards Track specification
defining the semantics and interoperability requirements
of the proposed new value and the fields to be recorded in the registry.
The fields in this registry include:
Field Identifier,
Format Description,
and
Reference.
The initial contents of this registry are a single entry:
Field Identifier   Format Description                        Reference
----------------   ---------------------------------------   ---------
0xf6ab0e18         RPC-over-RDMA version 1 CM Private Data   [RFC-TBD]
The Expert Review policy, as defined in Section 4.5 of
is to be used to handle requests to add new entries to
the "RDMA-CM Private Data Identifier Registry".
New protocol numbers can be assigned at random
as long as they do not conflict with existing entries in this registry.
RDMA-CM Private Data typically traverses the link layer in the clear.
A man-in-the-middle attack could alter the settings exchanged at
connect time such that one or both peers might perform operations
that result in premature termination of the connection.
InfiniBand Architecture Specification Volume 1, InfiniBand Trade Association
Thanks to
Christoph Hellwig
and
Devesh Sharma
for suggesting this approach,
and to
Tom Talpey
and
Dave Noveck
for their expert comments and review.
The author also wishes to thank
Bill Baker
and
Greg Marsden
for their support of this work.
Special thanks go to
Transport Area Director
Magnus Westerlund,
NFSV4 Working Group Chairs
Spencer Shepler
and
Brian Pawlowski,
and
NFSV4 Working Group Secretary Thomas Haynes.