The Use of Remote Invalidation with RPC-Over-RDMA Transport Protocols
draft-cel-nfsv4-reminv-design-00

Abstract

Remote Invalidation relieves requesters/initiators of some of the burden of preparing memory to be accessed remotely, thus reducing the latency of transactions that require the use of explicit RDMA operations. This document considers how to introduce Remote Invalidation to RPC-over-RDMA transport protocols.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on February 4, 2017.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction
2. Remote Invalidation In Operation
3. Protocol Elements
4. Recommendations
5. IANA Considerations
6. Security Considerations
7. Acknowledgments
8. References
Author's Address

1. Introduction

Like other RDMA-enabled storage protocols, RPC-over-RDMA Version Two [I-D.cel-nfsv4-rpcrdma-version-two] employs a Read-Write transfer model when using explicit RDMA operations to transfer data. This means an RPC-over-RDMA requester exposes regions of its memory to an RPC-over-RDMA responder, which then uses RDMA Read and Write operations to transfer bulk data payloads.

In preparation for such a transfer, a requester asks its RNIC to assign a steering tag, or STag, to a region of memory containing the data to be moved. At this time, access rights are granted that allow the RNIC to access or update that memory. This act is referred to as memory "registration." RNICs use STags to steer data to and from registered memory regions.

When data movement is complete, each STag is dissociated from its memory region. This act is referred to as memory "invalidation." It prevents further responder access to that memory region by revoking its remote access rights. Invalidation must be done before upper layers (i.e., RPC consumers) on the requester are allowed access to memory that was involved in an explicit RDMA operation.

Remote Invalidation is a technique by which an RDMA peer can request that a remote RNIC invalidate an STag associated with memory on that remote peer [RFC5042]. An RDMA consumer requests Remote Invalidation by posting an RDMA Send With Invalidate Work Request in place of an RDMA Send Work Request. RDMA Send With Invalidate is similar to RDMA Send, but takes one additional argument: a single STag to be invalidated by the RNIC that receives the sent message. The resulting message is transmitted with additional header information that conveys the STag that is to be invalidated [RFC5040].

The benefit of Remote Invalidation is that an extra Work Request, context switch, and interrupt to perform memory invalidation are not required by the requester as part of handling the completion of an RPC transaction. STag invalidation begins before the Receive completes, thus invalidation is started (and completes) sooner. The upshot is faster completion of RPC transactions that involve registered memory.

The primary issues at the RPC-over-RDMA protocol level are to provide a mechanism to indicate when Remote Invalidation can be used by the transport, and to provide selection criteria for choosing which STag to invalidate remotely. To provide these, elements of the XDR definition of the RPC-over-RDMA protocol must be altered to some degree, depending on desired flexibility of operation, invasiveness of XDR changes, and broadness of hardware support.

The purpose of this document is to explore generally how Remote Invalidation can be introduced into the RPC-over-RDMA transport protocol. This document does not attempt to propose a detailed specification of any particular mechanism.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2. Remote Invalidation In Operation

When requester memory is registered for remote access, Remote Invalidation could be used as follows:

The requester (client) DMA-maps a memory region that will participate in an RPC transaction, and registers an STag.
The requester transmits the RPC Call, which also conveys the STag, to the responder (server).
The responder processes the RPC transaction. The peer RNICs use the STag to move RPC arguments or results.
The responder transmits the RPC Reply using an RDMA Send With Invalidate Work Request, specifying the same STag.
A Receive Work Request completes on the requester carrying this RPC reply and the STag, which is now invalid.
The requester skips invalidation of the STag, then DMA-unmaps the memory region associated with the STag.

2.1. Determining Remote Invalidation Support Status

An RDMA consumer (an Upper Layer Protocol implementation) that does not support Remote Invalidation might not tolerate the use of RDMA Send With Invalidate by the transport layer. Such a requester performs Local Invalidation on STags that already happen to be invalid, and in some cases this can result in protection errors or other issues.

Thus, to avoid spurious connection termination, a responder must not post an RDMA Send With Invalidate Work Request unless it is sure that the requester's RNIC is prepared to receive the additional header information associated with Remote Invalidation, and the requester itself is prepared to handle remotely invalidated STags properly.

When a requester is Remote Invalidation-enabled, the requester must report its support status to responders using some kind of Upper Layer Protocol mechanism. When a responder does not know the requester's Remote Invalidation support status, it cannot use Remote Invalidation without endangering the connection.
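
The rule above can be sketched as a small gating check on the responder. This is an illustrative sketch only: the enum reminv_status type and the function name are hypothetical and do not appear in any RPC-over-RDMA specification, and the means by which the requester reports its status is left open, as it is in the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection record of what the requester has reported. */
enum reminv_status {
    REMINV_UNKNOWN = 0,     /* no report received on this connection */
    REMINV_UNSUPPORTED,     /* requester reported no support */
    REMINV_SUPPORTED        /* requester reported support */
};

/* A zero-initialized connection record must default to "unknown". */
static_assert(REMINV_UNKNOWN == 0, "unknown must be the zero value");

/*
 * The responder may post RDMA Send With Invalidate only when the
 * requester has positively reported support. An unknown status is
 * treated the same as "unsupported" to avoid endangering the connection.
 */
bool responder_may_use_send_with_inv(enum reminv_status requester_status)
{
    return requester_status == REMINV_SUPPORTED;
}
```

Treating the unknown state as the zero value means a freshly allocated connection record is safe by default.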

2.2. Selection Of Which STag To Invalidate Remotely

The RDMA Send With Invalidate Work Request invalidates only one STag. RPC-over-RDMA requesters may register more than one STag to handle the movement of payloads for a single RPC. Either the client will have to specify which STag may be remotely invalidated, the protocol will have to specify a fixed way to select which STag to invalidate, or the responder will have to choose arbitrarily which STag to remotely invalidate.

In some circumstances, requesters may wish to utilize STags during transactions that are registered using a mechanism that does not tolerate Remote Invalidation. For example, an STag that is the requester's local DMA rkey should never be invalidated remotely. If a responder attempts to invalidate such an STag, the result is undefined, but the connection can be terminated or other failures can occur.

Even with Remote Invalidation enabled, requesters remain responsible for ensuring all STags are invalid before RPC transactions complete. To avoid leaving STags registered, a requester must be prepared for the responder or the requester's own RNIC to have not invalidated any of an RPC's STags. When there are multiple STags associated with a single RPC, a requester must be prepared for any number of STags to have been remotely invalidated, including zero.
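
As a rough illustration of this requirement, the following sketch shows one way a requester might finish an RPC: it first accounts for whatever STags were reported as remotely invalidated (possibly none), then locally invalidates every STag that is still valid. The struct rpc_stag type and both function names are hypothetical, and the Local Invalidate step is reduced to a flag update; a real requester would post a Work Request to its RNIC and wait for its completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one registered region of an RPC. */
struct rpc_stag {
    uint32_t stag;
    bool     valid;   /* true while the STag still grants remote access */
};

/* Was this STag reported as remotely invalidated? */
static bool was_remotely_invalidated(uint32_t stag,
                                     const uint32_t *inv, size_t n_inv)
{
    for (size_t i = 0; i < n_inv; i++)
        if (inv[i] == stag)
            return true;
    return false;
}

/*
 * Ensure every STag of the RPC is invalid before the RPC completes.
 * "inv" lists the STags reported as remotely invalidated; n_inv may be
 * zero. Returns the number of Local Invalidate operations needed.
 */
size_t finish_rpc_stags(struct rpc_stag *stags, size_t count,
                        const uint32_t *inv, size_t n_inv)
{
    size_t local = 0;

    assert(stags != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (was_remotely_invalidated(stags[i].stag, inv, n_inv))
            stags[i].valid = false;     /* already done by the RNIC */
        if (stags[i].valid) {
            /* Stand-in for posting a Local Invalidate Work Request. */
            stags[i].valid = false;
            local++;
        }
    }
    return local;
}
```

With RDMA Send With Invalidate, n_inv is at most one; the loop form simply makes the "any number, including zero" case explicit.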

2.3. Backward-Direction Operation

As of this writing, no current implementation supports direct data placement in the backward direction. However, existing protocol specifications do not forbid it [I-D.ietf-nfsv4-rfc5666bis] [I-D.ietf-nfsv4-rpcrdma-bidirection] [I-D.cel-nfsv4-rpcrdma-version-two].

When chunks are present in a backward-direction RPC request, Remote Invalidation allows the responder to trigger invalidation of a requester's STags as part of sending a reply, the same as in the forward direction.

However, in the backward direction, the server acts as the requester, and the client is the responder. The server's RNIC, therefore, must support receiving an Invalidate Extended Transport Header (IETH), and the server must have registered the STags using Fast Registration Work Requests (FRWR). Thus the server must indicate its Remote Invalidation support status to the client (the opposite of forward-direction Remote Invalidation).

2.4. Future Enhancements

There are two related enhancements that further reduce the effort needed to invalidate STags associated with complex RPCs:

The ability for one registered STag to represent a list of memory regions that are not contiguous
The ability to specify more than one remote STag in a single Work Request to be remotely invalidated

At this time, the first mechanism has been implemented in at least one RNIC on the market. The second is speculative.

Given support for registering non-contiguous memory regions with one STag, when a requester constructs an RPC that has both a Read list and a Write list, for example, the requester has a choice:

The requester can register a separate STag for each access mode (one STag for memory regions needing read access, and one STag for those needing write access) to provide good data security
The requester can register a single STag with read and write access enabled for the whole set of memory regions, to allow RDMA Send With Invalidate to work optimally

Having the ability to remotely invalidate multiple STags at once allows the combination of optimal performance and optimal security.

3. Protocol Elements

In this section, a number of individual protocol mechanisms are examined. These vary in functionality and invasiveness. Some may be appropriate to use in combination.

3.1. New Protocol Version Requires Support For Remote Invalidation

3.1.1. Description

When a higher protocol version number is negotiated, Remote Invalidation is always enabled. This new protocol version would then be usable only with RNICs that support Remote Invalidation. Both peers assume that Remote Invalidation may be used in either direction.

3.1.2. Similar Existing Implementations

SMB Direct [MS-SMBD]

3.1.3. Advantages

No XDR changes or protocol extensions are required.

Backward-direction use of Remote Invalidation is automatically supported.

3.1.4. Disadvantages

The requester is not in control of which STags in an RPC may be invalidated. Thus, a requester must not advertise STags which must never be invalidated.

Other features and benefits of the new protocol version would not be available when an implementation employs an RNIC that does not support Remote Invalidation. In particular, RNICs that do not support MEM_MGT_EXTENSIONS (i.e., FRWR) could not use the new protocol version.

An extension or an additional protocol version bump is required to indicate support for transport-level mechanisms that can invalidate multiple STags at once.

3.2. Remote Invalidation Enabled by Out-of-Band Negotiation

3.2.1. Description

At connection initiation time, messages are exchanged that indicate each peer's Remote Invalidation support status. Without these messages, peers assume Remote Invalidation is not supported.

3.2.2. Similar Existing Implementations

iSER [RFC7145]. Information is exchanged in RDMA-CM connection requests to report an implementation's Remote Invalidation support status.

3.2.3. Advantages

No changes to the base protocol XDR are required.

3.2.4. Disadvantages

Out-of-band messages are required to establish support status.

The requester is not in control of which STags in an RPC may be invalidated. Thus, a requester must not advertise STags which must never be invalidated.

To support backward-direction operation, the server must separately indicate that it supports Remote Invalidation.

To enable support for multiple STag invalidation, this negotiation protocol would have to be extended again to indicate when mechanisms other than RDMA Send With Invalidate are supported by the requester's RNIC.

3.3. Protocol Specifies Fixed Choice Of Which STag To Remotely Invalidate

3.3.1. Description

No new field is introduced to the transport header. Protocol specification determines how the responder chooses which STag is to be invalidated remotely. Some other means is used to determine whether Remote Invalidation can be used or not.

3.3.2. Similar Existing Implementations

iSER [RFC7145]. Two STag fields appear in each request: one advertises Read data and one advertises Write data. When only one STag is used in the request, it may be invalidated remotely. When both STags are used, only the Read STag may be invalidated remotely.

3.3.3. Advantages

No changes to the base protocol XDR are required.

3.3.4. Disadvantages

Out-of-band messages are required to establish support status.

The requester is not in control of which STags in an RPC may be invalidated. Thus, a requester must not advertise STags which must never be invalidated.

This mechanism may not work well for transport protocols that allow multiple read and write STags.

3.4. Requester Specifies One STag Per Transaction That May Be Remotely Invalidated

3.4.1. Description

A field is added to the transport header that contains an STag which may be invalidated by the responder. A special value can be chosen to mean "no STag may be invalidated" for use by requesters that have no support for Remote Invalidation.

3.4.2. Similar Existing Implementations

None.

3.4.3. Advantages

A requester may advertise STags that cannot be invalidated remotely, as long as they are never marked as "may invalidate."

No out-of-band support status negotiation is needed.

Backward-direction RPCs can each indicate whether a backward-direction requester desires or does not support Remote Invalidation.

The responder needs no special logic or assumptions to choose the STag to invalidate remotely.

3.4.4. Disadvantages

Either the base RPC-over-RDMA header XDR definition is altered, or a protocol extension is required.

Requesters transmit a little extra data per RPC, making RPC-over-RDMA messages slightly more costly to send and parse.

This mechanism cannot support the remote invalidation of multiple STags at once.

3.5. Requester Specifies Multiple STags Per Transaction That May Be Remotely Invalidated

3.5.1. Description

A new data structure is added to the transport header that indicates which STags may be invalidated by the responder.

This information might appear as a new field in the RDMA segment data structure, as each segment has its own STag field. The field indicates whether or not that STag may be invalidated by the responder. Perhaps that field is a boolean, though in XDR, a boolean is a full 32 bits.

Or, this information could appear in the header as an array of STags, to reduce the amount of extra data contained in the RPC-over-RDMA header. Zero array elements means the requester does not support Remote Invalidation.
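
To make the size trade-off between these two encodings concrete, the following sketch estimates the extra header bytes each would add. It assumes only standard XDR encoding rules: a boolean occupies one 4-byte unit, and a variable-length array is encoded as a 4-byte element count followed by its elements. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define XDR_UNIT 4u   /* XDR encodes data in 4-byte units */

/* Per-segment boolean: one extra XDR unit for every RDMA segment. */
size_t flag_overhead_bytes(size_t segments)
{
    return segments * XDR_UNIT;
}

/*
 * Separate array of STags: one unit for the element count, plus one
 * unit for each STag that may be invalidated remotely.
 */
size_t array_overhead_bytes(size_t segments, size_t invalidatable)
{
    assert(invalidatable <= segments);
    return XDR_UNIT + invalidatable * XDR_UNIT;
}
```

Under these assumptions, an RPC with three segments of which one may be invalidated costs 12 extra bytes with the per-segment flag and 8 with the array. The array is not always smaller: with a single segment that may be invalidated, the flag costs 4 bytes and the array 8.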

3.5.2. Similar Existing Implementations

NVMe/Fabrics [NVME]. Each STag in a request has an associated bit flag that indicates whether the responder is allowed to invalidate it remotely.

3.5.3. Advantages

A requester may advertise STags that cannot be invalidated remotely, as long as they are never marked as "may invalidate."

The mechanism allows a requester to request either invalidation of multiple STags at once, or to choose one STag to invalidate remotely.

No out-of-band support status negotiation is needed.

Each backward-direction RPC can indicate whether a backward-direction requester desires or does not support Remote Invalidation.

The responder needs no special logic or assumptions to choose the STag to invalidate remotely.

3.5.4. Disadvantages

The RPC-over-RDMA header XDR definition is possibly extensively altered.

Requesters transmit extra data per RPC. However, it is limited to only one or two 32-bit words in most cases.

3.6. Invalidation of Another RPC's STag

3.6.1. Description

As a subfeature of support for Remote Invalidation, it is possible that a responder can remotely invalidate an STag (using RDMA Send With Invalidate) that refers to registered memory being used in the Read chunk of a different RPC. Such Remote Invalidation would be requested only after the RDMA Read has already been completed.

This can be useful when a responder is replying to an RPC via an inline message, but notices there are other RPC replies pending that have multiple STags, some of which are Read chunks.

3.6.2. Similar Existing Implementations

None.

3.6.3. Advantages

This is one way to enable remote invalidation of multiple STags per RPC, using only RDMA Send With Invalidate.

3.6.4. Disadvantages

Additional requester and responder complexity would be required to keep track of STags.

4. Recommendations

When constructing a protocol to support Remote Invalidation, one of these designs, or some combination of them, can be chosen.

In no particular order, the design priorities are:

Broad hardware support: Do not prevent the efficient operation of RNICs that do not handle RDMA Send With Invalidate
Low additional protocol complexity: As little impact on header XDR and header length as possible, to keep collateral performance impact low
Support for explicit RDMA in the backward-direction: Allow the use of Remote Invalidation in both the forward and backward direction

An important question is whether the base RPC-over-RDMA Version Two protocol should support Remote Invalidation, whether Remote Invalidation support should be carried entirely on the shoulders of protocol extensions, or whether some combination of the two is best.

Upper Layer Protocols will likely always be responsible for some degree of signaling Remote Invalidation capabilities, as long as innovation continues at the transport layer (e.g., new RDMA operations that enable Remote Invalidation). Future hardware capabilities are perpetually hazy, limiting the ability to design long-lived protocol support for them. It is also not easy to estimate how long the industry must continue to support less capable devices.

The author's preference is to implement Section 3.4. The target STag can be added to the rpcrdma2_chunk_lists data structure as a single field. No further changes or extensions are needed.

The requester appears to be in the best position to determine which STag may be invalidated remotely. With this choice, the requester can choose based on which STags may be invalidated remotely, or may use criteria based on the strengths of its RNIC (for instance, choosing the largest registered memory region might be beneficial in some cases). Allowing the responder to select from among several choices does not seem to bring additional value, and burdens the responder with additional header parsing costs for each chunk-bearing RPC reply.

Furthermore, the ability to request Remote Invalidation of multiple STags in a single Work Request appears to be somewhat distant. It would require additional Upper Layer Protocol mechanisms to distinguish the new mechanism from using RDMA Send With Invalidate, which we are not in a position to design today. Thus it does not seem worth the extra implementation and protocol complexity of having the requester provide a list of STags for the responder to choose from.

The current RPC-over-RDMA Version Two extensibility model does not allow for changes to the base header XDR outside the addition of new optional RDMA message types. New optional message types would have to be defined to enable more generic forms of Remote Invalidation. Thus, the least amount of effort for RPC-over-RDMA Version Two implementers appears to be building Remote Invalidation signaling into the base RPC-over-RDMA Version Two protocol.

Allowing the feature described in Section 3.6 is likely to increase the complexity of responder and especially requester implementations, as they would have to remember invalidated STags independently of RPC completions. Because it does not require any XDR changes, it could easily be enabled in a future protocol extension. The author's preference is to forbid this behavior in the initial specification, but allow for a future extension to introduce it.

4.1. Example Protocol Change

As an example of how to proceed, the simplest approach would replace struct rpcrdma2_chunk_lists (as defined in [I-D.cel-nfsv4-rpcrdma-version-two]) with the following:

<CODE BEGINS>

 struct rpcrdma2_chunk_lists {
     enum msg_type               rdma_direction;
     u32                         rdma_inv_handle;
     struct rpcrdma2_read_list   *rdma_reads;
     struct rpcrdma2_write_list  *rdma_writes;
     struct rpcrdma2_write_chunk *rdma_reply;
 };

<CODE ENDS>

The following language describes how to utilize the new field:

The requester sets the value of the rdma_inv_handle field to the value of any one of the rdma_handle fields in the RPC-over-RDMA header of the RPC call that may be invalidated remotely. If the RPC-over-RDMA header of the RPC call contains no rdma_handles that may be invalidated remotely, the requester MUST set the value of the rdma_inv_handle field to zero. The requester MUST NOT set the value of the rdma_inv_handle field to the value of an rdma_handle that cannot be invalidated remotely.
As part of forming the RPC-over-RDMA header for the reply, the responder copies the value of the rdma_inv_handle field from the RPC-over-RDMA header of the matching RPC call. If the rdma_inv_handle field in the RPC-over-RDMA header of an RPC call contains zero, the responder MUST NOT use RDMA Send With Invalidate to transmit the matching RPC reply. Otherwise, the responder SHOULD use RDMA Send With Invalidate to transmit the reply to this RPC, specifying the value in the RPC-over-RDMA header's rdma_inv_handle field as the Work Request's inv_rkey. The responder MUST NOT specify any other value in the Work Request's inv_rkey field.
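
The two rules above might be implemented along the following lines. This is a sketch under stated assumptions, not a definitive implementation: struct adv_handle, its may_invalidate member (local requester knowledge that is never sent on the wire), the enum and struct describing the reply, and both function names are hypothetical. The sketch also always follows the SHOULD; a responder remains free to send a plain RDMA Send.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RDMA_INV_NONE 0u   /* rdma_inv_handle value meaning "do not invalidate" */

/* Hypothetical requester-side view of one advertised rdma_handle. */
struct adv_handle {
    uint32_t rdma_handle;
    bool     may_invalidate;   /* local knowledge; not part of the header */
};

/* Requester: choose the rdma_inv_handle value for an RPC call. */
uint32_t requester_pick_inv_handle(const struct adv_handle *h, size_t count)
{
    assert(h != NULL || count == 0);

    for (size_t i = 0; i < count; i++)
        if (h[i].may_invalidate)
            return h[i].rdma_handle;   /* any eligible handle will do */
    return RDMA_INV_NONE;              /* nothing may be invalidated */
}

enum send_opcode { SEND_PLAIN, SEND_WITH_INVALIDATE };

struct reply_send {
    enum send_opcode opcode;
    uint32_t         inv_rkey;   /* meaningful only with SEND_WITH_INVALIDATE */
};

/* Responder: derive the reply's Send from the call's rdma_inv_handle. */
struct reply_send responder_plan_reply(uint32_t call_inv_handle)
{
    struct reply_send plan = { SEND_PLAIN, 0 };

    if (call_inv_handle != RDMA_INV_NONE) {
        plan.opcode = SEND_WITH_INVALIDATE;
        plan.inv_rkey = call_inv_handle;   /* no other value is permitted */
    }
    return plan;
}
```

Here the requester simply picks the first eligible handle; as Section 4 notes, a requester could instead apply its own criteria, such as preferring the largest registered region.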

5. IANA Considerations

There are no IANA considerations for this document.

6. Security Considerations

Remote Invalidation metadata is conveyed in the clear in RPC-over-RDMA headers. This does not expose any new information to attackers.

A man-in-the-middle can alter Remote Invalidation metadata while it is in transit. Requesters are prepared to handle the case where responders have not invalidated any STags associated with an RPC. An attacker can cause other STags in flight to be invalidated before the responder is finished with the associated memory. Or an attacker can replace the "to-be invalidated" STag with an STag in the same RPC that should not be invalidated remotely. Any of these might cause loss of connection, or other failures.

A connection relationship is required to exist between a requester and a responder. The requester's RNIC has associated a Protection Domain with that connection. The STag on the requester to be invalidated is associated with that Protection Domain. This protects against arbitrary invalidation of STags by network nodes not part of the connection.

Further discussion appears in [RFC5042].

7. Acknowledgments

Thanks to Sagi Grimberg, Christoph Hellwig, and Tom Talpey. Special thanks go to nfsv4 Working Group Chair Spencer Shepler and nfsv4 Working Group Secretary Thomas Haynes for their support.

8. References

8.1. Normative References

[I-D.cel-nfsv4-rpcrdma-version-two]	Lever, C. and D. Noveck, "RPC-over-RDMA Version Two Protocol", Internet-Draft draft-cel-nfsv4-rpcrdma-version-two-01, June 2016.
[I-D.ietf-nfsv4-rfc5666bis]	Lever, C., Simpson, W. and T. Talpey, "Remote Direct Memory Access Transport for Remote Procedure Call, Version One", Internet-Draft draft-ietf-nfsv4-rfc5666bis-07, May 2016.
[I-D.ietf-nfsv4-rpcrdma-bidirection]	Lever, C., "Bi-directional Remote Procedure Call On RPC-over-RDMA Transports", Internet-Draft draft-ietf-nfsv4-rpcrdma-bidirection-05, June 2016.
[RFC2119]	Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC5040]	Recio, R., Metzler, B., Culley, P., Hilland, J. and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, DOI 10.17487/RFC5040, October 2007.
[RFC5042]	Pinkerton, J. and E. Deleganes, "Direct Data Placement Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October 2007.
[RFC7145]	Ko, M. and A. Nezhinsky, "Internet Small Computer System Interface (iSCSI) Extensions for the Remote Direct Memory Access (RDMA) Specification", RFC 7145, DOI 10.17487/RFC7145, April 2014.

8.2. Informative References

[MS-SMBD]	Microsoft Corporation, "SMB Remote Direct Memory Access (RDMA) Transport Protocol Specification", July 2016.
[NVME]	NVM Express, Inc., "NVM Express Revision 1.2.1", July 2016.

Author's Address

Charles Lever
Oracle Corporation
1015 Granger Avenue
Ann Arbor, MI 48104
USA
Phone: +1 734 274 2396
EMail: chuck.lever@oracle.com