NFSv4 C. Hellwig
Internet-Draft July 02, 2017
Intended status: Standards Track
Expires: January 3, 2018

Parallel NFS (pNFS) RDMA Layout


The Parallel Network File System (pNFS) allows a separation between the metadata (onto a metadata server) and data (onto a storage device) for a file. The RDMA Layout Type is defined in this document as an extension to pNFS to allow the use of RDMA Verbs operations to access remote storage, with a special focus on accessing byte addressable persistent memory.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 3, 2018.

Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction

Figure 1 shows the overall architecture of a Parallel NFS (pNFS) system:

    +-----------+
    |+-----------+                                 +-----------+
    ||+-----------+                                |           |
    |||           |       NFSv4.1 + pNFS           |           |
    +||  Clients  |<------------------------------>|   Server  |
     +|           |                                |           |
      +-----------+                                |           |
           |||                                     +-----------+
           |||                                           |
           |||                                           |
           ||| Storage        +-----------+              |
           ||| Protocol       |+-----------+             |
           ||+----------------||+-----------+  Control   |
           |+-----------------|||           |    Protocol|
           +------------------+||  Storage  |------------+
                               +|  Systems  |
                                +-----------+

Figure 1

The overall approach is that pNFS-enhanced clients obtain sufficient information from the server to enable them to access the underlying storage (on the storage systems) directly. See Section 12 of [RFC5661] for more details.

RDMA ([RFC5040] [RFC5041] [IBARCH]) is a technique for moving data efficiently between end nodes. By directing data into destination buffers as it is sent on a network, and placing it via direct memory access by hardware, the benefits of faster transfers and reduced host overhead are obtained. Unlike the RPC RDMA transport [RFC8166], the pNFS RDMA layout does not transfer remote procedure calls over RDMA networks, but instead uses raw RDMA READ and WRITE operations to access a memory region exposed on a storage device.

1.1. Conventions Used in This Document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

1.2. General Definitions

The following definitions provide context for the reader.

byte
This document defines a byte as an octet, i.e., a datum exactly 8 bits in length.
client
The "client" is the entity that accesses the NFS server's resources. The client may be an application that contains the logic to access the NFS server directly. The client may also be the traditional operating system client that provides remote file system services for a set of applications.
server
The "server" is the entity responsible for coordinating client access to a set of file systems and is identified by a server owner.
metadata server (MDS)
The metadata server is a pNFS server which provides metadata information for a file system object. It is also responsible for generating layouts for file system objects and for directory-based operations.

1.3. Code Components Licensing Notice

The external data representation (XDR) description and scripts for extracting the XDR description are Code Components as described in Section 4 of "Legal Provisions Relating to IETF Documents". These Code Components are licensed according to the terms of Section 4 of "Legal Provisions Relating to IETF Documents".

1.4. XDR Description

This document contains the XDR [RFC4506] description of the NFSv4.1 RDMA layout protocol. The XDR description is embedded in this document in a way that makes it simple for the reader to extract into a ready-to-compile form. The reader can feed this document into the following shell script to produce the machine readable XDR description of the NFSv4.1 RDMA layout:

grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

That is, if the above script is stored in a file called "", and this document is in a file called "spec.txt", then the reader can do:

sh < spec.txt > rdma_prot.x

The effect of the script is to remove leading white space from each line, plus a sentinel sequence of "///".
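For readers without a POSIX shell at hand, the same extraction can be sketched in Python. This is an illustrative equivalent of the grep/sed pipeline above and is not part of this specification; the function name extract_xdr is an assumption of this sketch.

```python
import re

def extract_xdr(text: str) -> str:
    """Return the XDR lines embedded in `text` behind the '///' sentinel."""
    out = []
    for line in text.splitlines():
        # grep '^ *///': keep only lines that start with the sentinel.
        if not re.match(r" *///", line):
            continue
        # sed 's?^ */// ??': drop leading blanks, the sentinel, one space.
        line = re.sub(r"^ */// ", "", line, count=1)
        # sed 's?^ *///$??': a bare sentinel becomes an empty line.
        line = re.sub(r"^ *///$", "", line, count=1)
        out.append(line)
    return "".join(l + "\n" for l in out)
```

Lines that do not carry the sentinel (ordinary prose) are dropped, so the result contains only the XDR description.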

The embedded XDR file header follows. Subsequent XDR descriptions, with the sentinel sequence, are embedded throughout the document.

Note that the XDR code contained in this document depends on types from the NFSv4.1 nfs4_prot.x file [RFC5662]. This includes both NFS types that end with a 4, such as offset4, length4, etc., as well as more generic types such as uint32_t and uint64_t.

   /// /*
   ///  * This code was derived from RFCTBD10
   ///  * Please reproduce this note if possible.
   ///  */
   /// /*
   ///  * Copyright (c) 2010,2015 IETF Trust and the persons
   ///  * identified as the document authors.  All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  */
   /// /*
   ///  *      nfs4_rdma_layout_prot.x
   ///  */
   /// %#include "nfsv41.h"

2. RDMA Layout Description

2.1. Background and Architecture

A pNFS RDMA layout is responsible for mapping from an NFS file (or portion of a file) to the memory regions that contain the file. These regions are expressed as extents with 64-bit offsets and lengths using the existing NFSv4 offset4 and length4 types. Each extent maps to a memory region that the server has registered and for which it exposes a handle (R_key or stag) that allows RDMA READ and RDMA WRITE operations from the client.

The pNFS operation for requesting a layout (LAYOUTGET) includes the "layoutiomode4 loga_iomode" argument, which indicates whether the requested layout is for read-only use or read-write use. A read-only layout may contain holes that are read as zero, whereas a read-write layout will contain allocated, but un-initialized storage in those holes (read as zero, can be written by client). This document also supports client participation in copy-on-write (e.g., for file systems with snapshots) by providing both read-only and un-initialized storage for the same extent in a layout. Reads are initially performed on the read-only storage, with writes going to the un-initialized storage. After the first write that initializes the un-initialized storage, all reads are performed to that now-initialized writable storage, and the corresponding read-only storage is no longer used.

2.2. layouttype4

The layouttype4 type defined in [RFC5662] is extended with a new value as follows:

    enum layouttype4 {
        LAYOUT4_NFSV4_1_FILES   = 1,
        LAYOUT4_OSD2_OBJECTS    = 2,
        LAYOUT4_BLOCK_VOLUME    = 3,
        LAYOUT4_SCSI            = 4,
        LAYOUT4_RDMA            = 0x80000006
    };

[[RFC Editor: please modify the LAYOUT4_RDMA
  to be the layouttype assigned by IANA]]

This document defines the structures associated with the layouttype4 value LAYOUT4_RDMA. [RFC5661] specifies the loc_body structure as an XDR type "opaque". The opaque layout is uninterpreted by the generic pNFS client layers, but must be interpreted by the Layout Type implementation.

2.3. Device Addressing and Discovery

Data operations to a storage device require the client to know the network address of the storage device. The NFSv4.1+ GETDEVICEINFO operation (Section 18.40 of [RFC5661]) is used by the client to retrieve that information.

2.3.1. pnfs_rdma_device_addr4

The "pnfs_rdma_device_addr4" data structure is returned by the server as the storage-protocol-specific opaque field da_addr_body in the "device_addr4" structure by a successful GETDEVICEINFO operation [RFC5661]. It contains the network address of the storage device. The RDMA Connection manager (RDMA/CM) shall be used to establish the queue pair for the RDMA READ and RDMA WRITE operations used by the layout. Details of connection establishment will be provided in future versions of this document.

 /// struct pnfs_rdma_device_addr4 {
 ///       struct netaddr4       addr; /* address of the device */
 /// };

2.4. Data Structures: Extents and Extent Lists

A pNFS RDMA layout is a list of extents within a flat array of data in a device. The RDMA layout describes the individual byte ranges (extents) on the device that make up the file. The offsets and lengths contained in an extent are specified in units of bytes.

 /// enum pnfs_rdma_extent_state4 {
 ///     PNFS_RDMA_READ_WRITE_DATA = 0, /* the data located by
 ///                                       this extent is valid
 ///                                       for reading and
 ///                                       writing. */
 ///     PNFS_RDMA_READ_DATA      = 1,  /* the data located by this
 ///                                       extent is valid for
 ///                                       reading only; it may not
 ///                                       be written. */
 ///     PNFS_RDMA_INVALID_DATA   = 2,  /* the location is valid; the
 ///                                       data is invalid.  It is a
 ///                                       newly (pre-) allocated
 ///                                       extent.  The client MUST
 ///                                       NOT read from this
 ///                                       space */
 ///     PNFS_RDMA_NONE_DATA      = 3   /* the location is invalid.
 ///                                       It is a hole in the file.
 ///                                       The client MUST NOT read
 ///                                       from or write to this
 ///                                       space */
 /// };

 /// struct pnfs_rdma_extent4 {
 ///     deviceid4    re_device_id;     /* id of the device on
 ///                                       which extent of file is
 ///                                       stored. */
 ///     offset4      re_file_offset;   /* starting byte offset
 ///                                       in the file */
 ///     uint32_t     re_handle;        /* registered memory
 ///                                       handle */
 ///     length4      re_length;        /* size in bytes of the
 ///                                       extent */
 ///     offset4      re_storage_offset;/* starting byte offset
 ///                                       in the volume */
 ///     pnfs_rdma_extent_state4 re_state;
 ///                                    /* state of this extent */
 /// };
 /// /* RDMA layout-specific type for loc_body */
 /// struct pnfs_rdma_layout4 {
 ///     pnfs_rdma_extent4 rl_extents<>;
 ///                                    /* extents which make up this
 ///                                       layout. */
 /// };
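As an illustration of the resulting wire format, the following Python sketch encodes and decodes the loc_body for this layout type. It is a sketch under stated assumptions rather than a normative encoder: it assumes deviceid4 is a 16-byte fixed-length opaque as in the NFSv4.1 XDR, that re_handle is a 32-bit unsigned integer, and the names Extent, encode_layout and decode_layout are hypothetical.

```python
import struct
from dataclasses import dataclass

NFS4_DEVICEID4_SIZE = 16  # assumed size of deviceid4 (NFSv4.1 XDR)

@dataclass
class Extent:
    device_id: bytes      # re_device_id (fixed-length opaque)
    file_offset: int      # re_file_offset
    handle: int           # re_handle
    length: int           # re_length
    storage_offset: int   # re_storage_offset
    state: int            # re_state (pnfs_rdma_extent_state4)

# XDR is big-endian: opaque[16], hyper, unsigned int, hyper, hyper, enum.
_EXTENT = struct.Struct(">16sQIQQi")

def encode_layout(extents):
    """Encode pnfs_rdma_layout4: a counted array of pnfs_rdma_extent4."""
    parts = [struct.pack(">I", len(extents))]
    for e in extents:
        if len(e.device_id) != NFS4_DEVICEID4_SIZE:
            raise ValueError("deviceid4 must be 16 bytes")
        parts.append(_EXTENT.pack(e.device_id, e.file_offset, e.handle,
                                  e.length, e.storage_offset, e.state))
    return b"".join(parts)

def decode_layout(buf):
    """Decode the bytes produced by encode_layout back into Extents."""
    (count,) = struct.unpack_from(">I", buf, 0)
    pos = 4
    extents = []
    for _ in range(count):
        extents.append(Extent(*_EXTENT.unpack_from(buf, pos)))
        pos += _EXTENT.size
    return extents
```

Every field is a multiple of four bytes, so no XDR padding is needed between fields in this sketch.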

The RDMA layout consists of a list of extents that map the regions of the file to locations on a device. The "re_storage_offset" field within each extent identifies a location on the device specified by the "re_device_id" field in the extent.
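The mapping from a file position to a device location can be sketched as follows. This is a minimal illustration that ignores re_state, which a real client must also honor; the names Extent and locate are assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Extent:
    device_id: bytes      # re_device_id
    file_offset: int      # re_file_offset
    handle: int           # re_handle
    length: int           # re_length
    storage_offset: int   # re_storage_offset

def locate(extents, file_pos):
    """Map a file byte position to (device_id, handle, device offset)."""
    for e in extents:
        if e.file_offset <= file_pos < e.file_offset + e.length:
            delta = file_pos - e.file_offset
            return e.device_id, e.handle, e.storage_offset + delta
    return None  # no extent in this layout covers file_pos
```

The returned handle and device offset are what the client would supply to an RDMA READ or RDMA WRITE against the device named by the device ID.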

Each extent maps a region of the file onto a portion of the specified device. The re_file_offset, re_length, and re_state fields for an extent returned from the server are valid for all extents. In contrast, the interpretation of the re_storage_offset field depends on the value of re_state as follows (in increasing order):

PNFS_RDMA_READ_WRITE_DATA means that re_storage_offset is valid and points to valid/initialized data that can be read and written.
PNFS_RDMA_READ_DATA means that re_storage_offset is valid and points to valid/initialized data that can only be read. Write operations are prohibited.
PNFS_RDMA_INVALID_DATA means that re_storage_offset is valid, but points to invalid un-initialized data. This data MUST NOT be read from the device until it has been initialized. A read request for a PNFS_RDMA_INVALID_DATA extent MUST fill the user buffer with zeros, unless the extent is covered by a PNFS_RDMA_READ_DATA extent of a copy-on-write file system. Write requests MUST write whole server-sized blocks to the device; bytes not initialized by the user MUST be set to zero. Any write to parts of a device covered by a PNFS_RDMA_INVALID_DATA extent changes the written portion of the extent to PNFS_RDMA_READ_WRITE_DATA; the pNFS client is responsible for reporting this change via LAYOUTCOMMIT.
PNFS_RDMA_NONE_DATA means that re_storage_offset is not valid, and this extent MUST NOT be used to satisfy write requests. Read requests MAY be satisfied by zero-filling as for PNFS_RDMA_INVALID_DATA. PNFS_RDMA_NONE_DATA extents MAY be returned by requests for readable extents; they are never returned if the request was for a writable extent.
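The read rules for the four extent states can be sketched as a small decision function. This is an illustrative reading of the rules above, not a normative algorithm; the names Extent and read_source are hypothetical, and the "no-layout" result (fall back to ordinary NFS I/O) is an assumption of this sketch.

```python
from dataclasses import dataclass

READ_WRITE_DATA, READ_DATA, INVALID_DATA, NONE_DATA = 0, 1, 2, 3

@dataclass
class Extent:
    file_offset: int      # re_file_offset
    length: int           # re_length
    storage_offset: int   # re_storage_offset
    state: int            # re_state

def read_source(extents, pos):
    """Decide how to satisfy a read of the byte at file position pos."""
    covering = [e for e in extents
                if e.file_offset <= pos < e.file_offset + e.length]
    # Prefer initialized data.  On a copy-on-write file system a
    # READ_DATA extent may overlap an INVALID_DATA extent.
    for state in (READ_WRITE_DATA, READ_DATA):
        for e in covering:
            if e.state == state:
                return "device", e.storage_offset + (pos - e.file_offset)
    if covering:
        return "zero", None    # INVALID_DATA or NONE_DATA: zero-fill
    return "no-layout", None   # assumption: not covered by this layout
```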

An extent list contains all relevant extents in increasing order of the re_file_offset of each extent; any ties are broken by increasing order of the extent state (re_state).
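A receiver could check this ordering rule with a sketch like the following, which treats each extent as a (re_file_offset, re_state) pair; the function name is an assumption of this sketch.

```python
def is_well_ordered(extents):
    """extents: list of (re_file_offset, re_state) pairs in wire order.

    Tuple comparison orders by file offset first and breaks ties by
    the numeric extent state, matching the rule in the text.
    """
    return all(a <= b for a, b in zip(extents, extents[1:]))
```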

2.4.1. Layout Requests and Extent Lists

Each request for a layout specifies at least three parameters: file offset, desired size, and minimum size. If the status of a request indicates success, the extent list returned MUST meet the following criteria:

The server shall ensure that it has registered handles for the memory regions that the extents in the layout refer to, so that RDMA READ and/or RDMA WRITE requests can be performed by the client. Multiple extents may refer to the same handle. The handle shall be invalidated on a LAYOUTRETURN operation, including implicit layout returns as part of CB_LAYOUTRECALL operations, or when a layout is revoked.

According to [RFC5661], if the minimum requested size, loga_minlength, is zero, this is an indication to the metadata server that the client desires any layout at offset loga_offset or less that the metadata server has "readily available". Given the lack of a clear definition of this phrase, in the context of the RDMA layout type, when loga_minlength is zero, the metadata server SHOULD:

2.4.2. Layout Commits

 /// /* RDMA layout-specific type for lou_body */
 /// struct pnfs_rdma_range4 {
 ///     offset4      rr_file_offset;   /* starting byte offset
 ///                                       in the file */
 ///     length4      rr_length;        /* size in bytes */
 /// };
 /// struct pnfs_rdma_layoutupdate4 {
 ///     pnfs_rdma_range4 rlu_commit_list<>;
 ///                                    /* list of extents which
 ///                                     * now contain valid data.
 ///                                     */
 /// };

The "pnfs_rdma_layoutupdate4" structure is used by the client as the RDMA layout-specific argument in a LAYOUTCOMMIT operation. The "rlu_commit_list" field is a list covering regions of the file layout that were previously in the PNFS_RDMA_INVALID_DATA state, but have been written by the client and SHOULD now be considered in the PNFS_RDMA_READ_WRITE_DATA state. The extents in the commit list MUST be disjoint and MUST be sorted by rr_file_offset. Implementors should be aware that a server MAY be unable to commit regions at a granularity smaller than a file-system block (typically 4 KB or 8 KB). As noted above, the block-size that the server uses is available as an NFSv4 attribute, and any extents included in the "rlu_commit_list" MUST be aligned to this granularity and have a size that is a multiple of this granularity. Since the block in question is in state PNFS_RDMA_INVALID_DATA, byte ranges not written SHOULD be filled with zeros. This applies even if it appears that the area being written is beyond what the client believes to be the end of file.
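One way a client might assemble a commit list that satisfies these constraints is sketched below. It is an illustration only: it assumes the client has already zero-filled the unwritten bytes of every block it touched, so rounding ranges outward to the block size is safe, and the name build_commit_list is hypothetical.

```python
def build_commit_list(written, block_size):
    """written: iterable of (offset, length) byte ranges the client wrote.

    Returns sorted, disjoint (rr_file_offset, rr_length) ranges, each
    aligned to block_size and a multiple of block_size in length.
    """
    rounded = []
    for off, length in written:
        if length <= 0:
            continue
        start = (off // block_size) * block_size
        end = -(-(off + length) // block_size) * block_size  # round up
        rounded.append((start, end))
    rounded.sort()
    merged = []
    for start, end in rounded:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous range.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]
```

For example, with a 4096-byte block size, writes at (5000, 100), (100, 10) and (4096, 10) collapse into the single range (0, 8192).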

2.4.3. Layout Returns

A LAYOUTRETURN operation represents an explicit release of resources by the client. This MAY be done in response to a CB_LAYOUTRECALL or before any recall, in order to avoid a future CB_LAYOUTRECALL. When the LAYOUTRETURN operation specifies a LAYOUTRETURN4_FILE return type, then the layoutreturn_file4 data structure specifies the region of the file layout that is no longer needed by the client.

The LAYOUTRETURN operation is done without any RDMA layout specific data. The opaque "lrf_body" field of the "layoutreturn_file4" data structure MUST have length zero.

2.4.4. Layout Revocation

Layouts MAY be unilaterally revoked by the server, due to the client's lease time expiring, or the client failing to return a layout which has been recalled in a timely manner. For the RDMA layout type this is accomplished by invalidating the handle for the remote memory region exposed to the client. Once the invalidation has completed, the host channel adapter (HCA) will reject all access from the client to the memory region.

2.4.5. Client Copy-on-Write Processing

Copy-on-write is a mechanism used to support file and/or file system snapshots. When writing to unaligned regions, or to regions smaller than a file system block, the writer MUST copy the portions of the original file data to a new location on disk. This behavior can either be implemented on the client or the server. The paragraphs below describe how a pNFS RDMA layout client implements access to a file that requires copy-on-write semantics.

Distinguishing the PNFS_RDMA_READ_WRITE_DATA and PNFS_RDMA_READ_DATA extent types in combination with the allowed overlap of PNFS_RDMA_READ_DATA extents with PNFS_RDMA_INVALID_DATA extents allows copy-on-write processing to be done by pNFS clients. In classic NFS, this operation would be done by the server. Since pNFS enables clients to do direct block access, it is useful for clients to participate in copy-on-write operations. All pNFS RDMA layout clients MUST support this copy-on-write processing.

When a client wishes to write data covered by a PNFS_RDMA_READ_DATA extent, it MUST have requested a writable layout from the server; that layout will contain PNFS_RDMA_INVALID_DATA extents to cover all the data ranges of that layout's PNFS_RDMA_READ_DATA extents. More precisely, for any re_file_offset range covered by one or more PNFS_RDMA_READ_DATA extents in a writable layout, the server MUST include one or more PNFS_RDMA_INVALID_DATA extents in the layout that cover the same re_file_offset range. When performing a write to such an area of a layout, the client MUST effectively copy the data from the PNFS_RDMA_READ_DATA extent for any partial blocks of re_file_offset and range, merge in the changes to be written, and write the result to the PNFS_RDMA_INVALID_DATA extent for the blocks for that re_file_offset and range. That is, if entire blocks of data are to be overwritten by an operation, the corresponding PNFS_RDMA_READ_DATA blocks need not be fetched, but any partial-block writes MUST be merged with data fetched via PNFS_RDMA_READ_DATA extents before storing the result via PNFS_RDMA_INVALID_DATA extents. For the purposes of this discussion, "entire blocks" and "partial blocks" refer to the server's file-system block size. Storing of data in a PNFS_RDMA_INVALID_DATA extent converts the written portion of the PNFS_RDMA_INVALID_DATA extent to a PNFS_RDMA_READ_WRITE_DATA extent; all subsequent reads MUST be performed from this extent; the corresponding portion of the PNFS_RDMA_READ_DATA extent MUST NOT be used after storing data in a PNFS_RDMA_INVALID_DATA extent. If a client writes only a portion of an extent, the extent MAY be split at block aligned boundaries.
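The merge step for a single block can be sketched as follows. This is a simplified illustration, not a prescribed implementation: it handles a write confined to one block, and read_block is a hypothetical callable standing in for an RDMA READ through the PNFS_RDMA_READ_DATA extent.

```python
def cow_merge(read_block, block_start, write_off, data, block_size):
    """Build the full block to store via the INVALID_DATA extent.

    read_block(block_start) returns the block's current bytes from the
    READ_DATA extent; it is only called for a partial-block write.
    """
    rel = write_off - block_start
    if rel < 0 or rel + len(data) > block_size:
        raise ValueError("write must fall inside one block")
    if len(data) == block_size:
        return bytes(data)           # whole block overwritten: no fetch
    block = bytearray(read_block(block_start))
    block[rel:rel + len(data)] = data  # merge the new bytes in place
    return bytes(block)
```

A write spanning several blocks would apply this step to the first and last (partial) blocks and pass the fully overwritten blocks in between through unchanged.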

When a client wishes to write data to a PNFS_RDMA_INVALID_DATA extent that is not covered by a PNFS_RDMA_READ_DATA extent, it MUST treat this write identically to a write to a file not involved with copy-on-write semantics. Thus, data MUST be written in at least block-sized increments, aligned to multiples of block-sized offsets, and unwritten portions of blocks MUST be zero filled.
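The zero-filling rule for this case can be sketched as below. The function name pad_to_blocks is an assumption of this sketch, and the block size is whatever the server advertises.

```python
def pad_to_blocks(write_off, data, block_size):
    """Expand a write to whole blocks, zero-filling unwritten bytes.

    Returns (aligned file offset, buffer whose length is a multiple of
    block_size).  An empty write produces an empty buffer.
    """
    start = (write_off // block_size) * block_size
    if not data:
        return start, b""
    end = -(-(write_off + len(data)) // block_size) * block_size
    buf = bytearray(end - start)      # bytearray starts out zeroed
    rel = write_off - start
    buf[rel:rel + len(data)] = data
    return start, bytes(buf)
```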

2.4.6. Extents are Permissions

Layout extents returned to pNFS clients grant permission to read or write: PNFS_RDMA_READ_DATA and PNFS_RDMA_NONE_DATA are read-only (PNFS_RDMA_NONE_DATA reads as zeros), and PNFS_RDMA_READ_WRITE_DATA and PNFS_RDMA_INVALID_DATA are read/write (PNFS_RDMA_INVALID_DATA reads as zeros; any write converts it to PNFS_RDMA_READ_WRITE_DATA). This is the only means a client has of obtaining permission to perform direct I/O to storage devices; a pNFS client MUST NOT perform direct I/O operations that are not permitted by an extent held by the client. Client adherence to this rule places the pNFS server in control of potentially conflicting storage device operations, enabling the server to determine what does conflict and how to avoid conflicts by granting and recalling extents to/from clients.
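The permission rules can be summarized as a lookup table. The sketch below is illustrative; the names PERMISSIONS, io_permitted and state_after_write are hypothetical, and "may read" means the client may satisfy a read, by zero-filling where the third column is true.

```python
READ_WRITE_DATA, READ_DATA, INVALID_DATA, NONE_DATA = 0, 1, 2, 3

# state -> (may read, may write, reads as zeros)
PERMISSIONS = {
    READ_WRITE_DATA: (True, True, False),
    READ_DATA:       (True, False, False),
    INVALID_DATA:    (True, True, True),
    NONE_DATA:       (True, False, True),
}

def io_permitted(state, is_write):
    """True if an extent in this state permits the requested I/O."""
    may_read, may_write, _ = PERMISSIONS[state]
    return may_write if is_write else may_read

def state_after_write(state):
    """State of the written portion of an extent after a write."""
    if not io_permitted(state, True):
        raise PermissionError("extent does not permit writes")
    return READ_WRITE_DATA
```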

If a client makes a layout request that conflicts with an existing layout delegation, the request will be rejected with the error NFS4ERR_LAYOUTTRYLATER. This client is then expected to retry the request after a short interval. During this interval, the server SHOULD recall the conflicting portion of the layout delegation from the client that currently holds it. This reject-and-retry approach does not prevent client starvation when there is contention for the layout of a particular file. For this reason, a pNFS server SHOULD implement a mechanism to prevent starvation. One possibility is that the server can maintain a queue of rejected layout requests. Each new layout request can be checked to see if it conflicts with a previous rejected request, and if so, the newer request can be rejected. Once the original requesting client retries its request, its entry in the rejected request queue can be cleared, or the entry in the rejected request queue can be removed when it reaches a certain age.
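The queue-based approach to starvation avoidance described above might look like the following sketch. It is one possible design under simplifying assumptions, not a required mechanism: requests are reduced to a client identifier and a byte range, iomode is ignored, and the class name, method names and default age limit are all hypothetical.

```python
import time

class RejectedQueue:
    """Remembers rejected layout requests so that later conflicting
    requests from other clients do not overtake them."""

    def __init__(self, max_age=5.0, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self.entries = []   # (client, start, end, time rejected)

    def _expire(self):
        now = self.clock()
        self.entries = [e for e in self.entries
                        if now - e[3] < self.max_age]

    def note_rejected(self, client, start, end):
        """Record a request that was rejected with a try-later error."""
        self._expire()
        if not any(e[:3] == (client, start, end) for e in self.entries):
            self.entries.append((client, start, end, self.clock()))

    def must_reject(self, client, start, end):
        """True if an older rejected request from another client overlaps."""
        self._expire()
        return any(c != client and s < end and start < e
                   for c, s, e, _ in self.entries)

    def retried(self, client, start, end):
        """The original requester came back: clear its entry."""
        self.entries = [x for x in self.entries
                        if x[:3] != (client, start, end)]
```

Injecting the clock makes the age-based expiry easy to exercise in tests without waiting.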

NFSv4 supports mandatory locks and share reservations. These are mechanisms that clients can use to restrict the set of I/O operations that are permissible to other clients. Since all I/O operations ultimately arrive at the NFSv4 server for processing, the server is in a position to enforce these restrictions. However, with pNFS layouts, I/Os will be issued from the clients that hold the layouts directly to the storage devices that host the data. These devices have no knowledge of files, mandatory locks, or share reservations, and are not in a position to enforce such restrictions. For this reason the NFSv4 server MUST NOT grant layouts that conflict with mandatory locks or share reservations. Further, if a conflicting mandatory lock request or a conflicting open request arrives at the server, the server MUST recall the part of the layout in conflict with the request before granting the request.

2.4.7. End-of-file Processing

The end-of-file location can be changed in two ways: implicitly as the result of a WRITE or LAYOUTCOMMIT beyond the current end-of-file, or explicitly as the result of a SETATTR request. Typically, when a file is truncated by an NFSv4 client via the SETATTR call, the server frees any disk blocks belonging to the file that are beyond the new end-of-file byte, and MUST write zeros to the portion of the new end-of-file block beyond the new end-of-file byte. These actions render any pNFS layouts that refer to the blocks that are freed or written semantically invalid. Therefore, the server MUST recall from clients the portions of any pNFS layouts that refer to blocks that will be freed or written by the server before effecting the file truncation. These recalls may take time to complete; as explained in [RFC5661], if the server cannot respond to the client SETATTR request in a reasonable amount of time, it SHOULD reply to the client with the error NFS4ERR_DELAY.

Blocks in the PNFS_RDMA_INVALID_DATA state that lie beyond the new end-of-file block present a special case. The server has reserved these blocks for use by a pNFS client with a writable layout for the file, but the client has yet to commit the blocks, and they are not yet a part of the file mapping on disk. The server MAY free these blocks while processing the SETATTR request. If so, the server MUST recall any layouts from pNFS clients that refer to the blocks before processing the truncate. If the server does not free the PNFS_RDMA_INVALID_DATA blocks while processing the SETATTR request, it need not recall layouts that refer only to the PNFS_RDMA_INVALID_DATA blocks.
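A server's choice of which extents to recall before a truncation could be sketched as follows. This is an illustrative interpretation of the two paragraphs above, with extents reduced to (file offset, length, state) tuples; the function name and the free_invalid flag (whether the server frees uncommitted PNFS_RDMA_INVALID_DATA blocks) are assumptions of this sketch.

```python
READ_WRITE_DATA, READ_DATA, INVALID_DATA, NONE_DATA = 0, 1, 2, 3

def extents_to_recall(extents, new_eof, block_size, free_invalid):
    """Return the extents touching blocks that will be freed or zeroed."""
    # Start of the block containing the new end-of-file byte; that
    # block may be partially zeroed, later blocks are freed.
    cut = (new_eof // block_size) * block_size
    # Start of the first block lying entirely beyond the new EOF.
    beyond = -(-new_eof // block_size) * block_size
    recall = []
    for off, length, state in extents:
        if off + length <= cut:
            continue                  # entirely below the affected blocks
        if state == INVALID_DATA and off >= beyond and not free_invalid:
            continue                  # reserved blocks are kept as-is
        recall.append((off, length, state))
    return recall
```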

When a file is extended implicitly by a WRITE or LAYOUTCOMMIT beyond the current end-of-file, or extended explicitly by a SETATTR request, the server need not recall any portions of any pNFS layouts.

2.4.8. Layout Hints

The layout hint attribute specified in [RFC5661] is not supported by the RDMA layout, and the pNFS server MUST reject setting a layout hint attribute with a loh_type value of LAYOUT4_RDMA during OPEN or SETATTR operations. On a file system only supporting the RDMA layout, a server MUST NOT report the layout_hint attribute in the supported_attrs attribute.

2.5. Crash Recovery Issues

A critical requirement in crash recovery is that both the client and the server know when the other has failed. Additionally, it is required that a client sees a consistent view of data across server restarts. These requirements and a full discussion of crash recovery issues are covered in the "Crash Recovery" section of the NFSv4.1 specification [RFC5661]. This document contains additional crash recovery material specific only to the RDMA layout.

When the server crashes while the client holds a writable layout, and the client has written data to blocks covered by the layout, and the blocks are still in the PNFS_RDMA_INVALID_DATA state, the client has two options for recovery. If the data that has been written to these blocks is still cached by the client, the client can simply re-write the data via NFSv4, once the server has come back online. However, if the data is no longer in the client's cache, the client MUST NOT attempt to source the data from the data servers. Instead, it SHOULD attempt to commit the blocks in question to the server during the server's recovery grace period, by sending a LAYOUTCOMMIT with the "loca_reclaim" flag set to true. This process is described in detail in Section 18.42.4 of [RFC5661].

2.6. Transient and Permanent Errors

The server may respond to LAYOUTGET with a variety of error statuses. These errors can convey transient conditions or more permanent conditions that are unlikely to be resolved soon.

The error NFS4ERR_RECALLCONFLICT indicates that the server has recently issued a CB_LAYOUTRECALL to the requesting client, making it necessary for the client to respond to the recall before processing the layout request. A client can wait for that recall to be received and processed, or it can retry as for NFS4ERR_LAYOUTTRYLATER, as described below.

The error NFS4ERR_LAYOUTTRYLATER is used to indicate that the server cannot immediately grant the layout to the client. This may be due to constraints on writable sharing of blocks by multiple clients or to a conflict with a recallable lock (e.g., a delegation). In either case, a reasonable approach for the client is to wait several milliseconds and retry the request. The client SHOULD track the number of retries, and if forward progress is not made, the client SHOULD abandon the attempt to get a layout and perform READ and WRITE operations by sending them to the server.

The error NFS4ERR_LAYOUTUNAVAILABLE MAY be returned by the server if layouts are not supported for the requested file or its containing file system. The server MAY also return this error code if the server is in the process of migrating the file from secondary storage, there is a conflicting lock that would prevent the layout from being granted, or for any other reason that causes the server to be unable to supply the layout. As a result of receiving NFS4ERR_LAYOUTUNAVAILABLE, the client SHOULD abandon the attempt to get a layout and perform READ and WRITE operations by sending them to the MDS. It is expected that a client will not cache the file's layoutunavailable state forever. In particular, when the file is closed or opened by the client, issuing a new LAYOUTGET is appropriate.
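The client-side handling of these errors can be sketched as a bounded retry loop. This is a minimal illustration, not a mandated policy: layoutget is a hypothetical callable returning a (status, layout) pair with the status given as a string, the try-later status is written as NFS4ERR_LAYOUTTRYLATER, and the retry count and 5 ms delay are assumed values.

```python
import time

def get_layout_or_fallback(layoutget, max_retries=5, delay=0.005,
                           sleep=time.sleep):
    """Return ("layout", layout) on success, or ("mds", None) when the
    client should give up and send READ/WRITE to the server instead."""
    for attempt in range(max_retries + 1):
        status, layout = layoutget()
        if status == "NFS4_OK":
            return "layout", layout
        if status == "NFS4ERR_LAYOUTUNAVAILABLE":
            break                      # not transient: do not retry
        if status not in ("NFS4ERR_LAYOUTTRYLATER",
                          "NFS4ERR_RECALLCONFLICT"):
            raise RuntimeError(status)  # outside the scope of this sketch
        if attempt < max_retries:
            sleep(delay)               # wait several milliseconds
    return "mds", None
```

A client that prefers to wait for the pending recall on NFS4ERR_RECALLCONFLICT, rather than polling, would replace the sleep for that status with its own recall-completion wait.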

3. Security Considerations

The pNFS extension partitions the NFSv4.1+ file system protocol into two parts, the control path and the data path (storage protocol). The control path contains all the new operations described by this extension; all existing NFSv4 security mechanisms and features apply to the control path. The combination of components in a pNFS system is required to preserve the security properties of NFSv4.1+ with respect to an entity accessing data via a client, including security countermeasures to defend against threats that NFSv4.1+ provides defenses for in environments where these threats are considered significant.

The metadata server enforces the file access-control policy at LAYOUTGET time. The client should use suitable authorization credentials for getting the layout for the requested iomode (READ or RW) and the server verifies the permissions and ACL for these credentials, possibly returning NFS4ERR_ACCESS if the client is not allowed the requested iomode. If the LAYOUTGET operation succeeds, the client receives, as part of the layout, a set of credentials allowing it I/O access to the specified data files corresponding to the requested iomode. When the client acts on I/O operations on behalf of its local users, it MUST authenticate and authorize the user by issuing respective OPEN and ACCESS calls to the metadata server, similar to having NFSv4 data delegations. If access is allowed, the client uses the corresponding (READ or RW) credentials to perform the I/O operations at the data file's storage devices. When the metadata server receives a request to change a file's permissions or ACL, it SHOULD recall all layouts for that file and it MUST fence off the clients holding outstanding layouts for the respective file by implicitly invalidating the outstanding credentials on all data files comprising the file before committing to the new permissions and ACL. Doing this will ensure that clients re-authorize their layouts according to the modified permissions and ACL by requesting new layouts. Recalling the layouts in this case is a courtesy of the server intended to prevent clients from getting an error on I/Os done after the client was fenced off.

4. IANA Considerations

IANA is requested to assign a new pNFS layout type in the pNFS Layout Types Registry as follows (the value 5 is suggested):

Layout Type Name: LAYOUT4_RDMA
Value: 0x00000006
RFC: RFCTBD10
How: L (new layout type)
Minor Versions: 1

5. References

5.1. Normative References

[LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents", November 2008.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC4506] Eisler, M., "XDR: External Data Representation Standard", STD 67, RFC 4506, May 2006.
[RFC5661] Shepler, S., Eisler, M. and D. Noveck, "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January 2010.
[RFC5662] Shepler, S., Eisler, M. and D. Noveck, "Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description", RFC 5662, January 2010.
[RFC8166] Lever, C., Simpson, W. and T. Talpey, "Remote Direct Memory Access Transport for Remote Procedure Call Version 1", RFC 8166, June 2017.

5.2. Informative References

[IBARCH] InfiniBand Trade Association, "InfiniBand Architecture Specification Volume 1 Release 1.3", March 2015.
[RFC5040] Recio, B., Metzler, B., Culley, P., Hilland, J. and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, October 2007.
[RFC5041] Shah, H., Pinkerton, J., Recio, B. and P. Culley, "Direct Data Placement over Reliable Transports", RFC 5041, October 2007.

Appendix A. RFC Editor Notes

[RFC Editor: please remove this section prior to publishing this document as an RFC]

[RFC Editor: prior to publishing this document as an RFC, please replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the RFC number of this document]

Author's Address

Christoph Hellwig EMail: