Network File System Version 4                                 S. Faibish
Internet-Draft                                                  D. Black
Intended status: Informational                                  Dell EMC
Expires: January 6, 2021                                      C. Hellwig
                                                            July 6, 2020


            Using the Parallel NFS (pNFS) SCSI/NVMe Layout
               draft-faibish-nfsv4-scsi-nvme-layout-00

Abstract

   This document explains how to use the Parallel Network File System
   (pNFS) SCSI Layout Type with transports that use the NVMe over
   Fabrics protocols.  It builds on the earlier SCSI-over-NVMe draft by
   C. Hellwig and extends it to cover all of the transport protocols
   supported by NVMe over Fabrics, in addition to the SCSI transport
   protocols introduced in the pNFS SCSI Layout.  The supported
   transports and fabrics include the RDMA transports.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of Internet-Draft Shadow Directories can be accessed at
   https://www.ietf.org/standards/ids/internet-draft-mirror-sites/.

   This Internet-Draft will expire on January 6, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Conventions Used in This Document
     1.2.  General Definitions
   2.  SCSI Layout mapping to NVMe
     2.1.  Volume Identification
     2.2.  Client Fencing
       2.2.1.  Reservation Key Generation
       2.2.2.  MDS Registration and Reservation
       2.2.3.  Client Registration
       2.2.4.  Fencing Action
       2.2.5.  Client Recovery after a Fence Action
     2.3.  Volatile write caches
   3.  Security Considerations
   4.  IANA Considerations
   5.  Normative References
   Authors' Addresses

1.  Introduction

   The pNFS Small Computer System Interface (SCSI) layout [RFC8154] is
   a layout type that allows NFS clients to perform I/O directly to
   block storage devices while bypassing the MDS (metadata server).
   It is specified by using concepts from the SCSI protocol family for
   the data path to the storage devices.  This document explains how
   to access devices attached via PCI Express, RDMA, or Fibre Channel
   with the NVM Express protocol [NVME] using the SCSI layout.  This
   document does not amend the pNFS SCSI layout document in any way;
   instead, it explains how to map the SCSI constructs used in the
   pNFS SCSI layout document to NVMe concepts using the NVMe SCSI
   translation reference.

1.1.  Conventions Used in This Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

1.2.  General Definitions

   The following definitions are provided for the purpose of providing
   an appropriate context for the reader.

   Client  The "client" is the entity that accesses the NFS server's
      resources.  The client may be an application that contains the
      logic to access the NFS server directly.  The client may also be
      the traditional operating system client that provides remote
      file system services for a set of applications.

   Server/Controller  The "server" is the entity responsible for
      coordinating client access to a set of file systems and is
      identified by a server owner.

2.  SCSI Layout mapping to NVMe

   The SCSI layout definition [RFC8154] references only a few SCSI-
   specific concepts directly.  The NVM Express [NVME] Base
   Specification revision 1.4 and prior revisions define a register-
   level interface for host software to communicate with a
   non-volatile memory subsystem over PCI Express (NVMe over PCIe).
   The NVMe over Fabrics specification [NVMEoF] defines extensions to
   NVMe that enable operation over other interconnects (NVMe over
   Fabrics).  The NVM Express Base Specification revision 1.4 is
   referred to below as the NVMe Base specification.

   The goal of this draft is to enable an implementer who is familiar
   with the pNFS SCSI layout [RFC8154] and the NVMe standards (both
   NVMe-oF 1.1 and NVMe 1.4) to implement the pNFS SCSI layout over
   NVMe-oF.

   The mapping of extensions defined in this document refers to a
   specific NVMe Transport defined in an NVMe Transport binding
   specification.  This document refers to the NVMe Transport binding
   specifications for FC, RDMA, and TCP [RFC7525].  The NVMe Transport
   binding specification for Fibre Channel is defined in INCITS 540
   Fibre Channel - Non-Volatile Memory Express [FC-NVMe].

   NVMe over Fabrics has the following differences from the NVMe Base
   specification:

   -  There is a one-to-one mapping between I/O Submission Queues and
      I/O Completion Queues.  NVMe over Fabrics does not support
      multiple I/O Submission Queues being mapped to a single I/O
      Completion Queue;

   -  NVMe over Fabrics does not define an interrupt mechanism that
      allows a controller to generate a host interrupt.  It is the
      responsibility of the host fabric interface (e.g., Host Bus
      Adapter) to generate host interrupts;

   -  NVMe over Fabrics does not use the Create I/O Completion Queue,
      Create I/O Submission Queue, Delete I/O Completion Queue, and
      Delete I/O Submission Queue commands;

   -  NVMe over Fabrics does not use the Admin Submission Queue Base
      Address (ASQ), Admin Completion Queue Base Address (ACQ), and
      Admin Queue Attributes (AQA) properties (i.e., registers in PCI
      Express).  Queues are created using the Connect command;

   -  NVMe over Fabrics uses the Disconnect command to delete an I/O
      Submission Queue and corresponding I/O Completion Queue;

   -  Metadata, if supported, shall be transferred as a contiguous
      part of the logical block.  NVMe over Fabrics does not support
      transferring metadata from a separate buffer;

   -  NVMe over Fabrics does not support PRPs but requires use of SGLs
      for Admin, I/O, and Fabrics commands.  This differs from NVMe
      over PCIe, where SGLs are not supported for Admin commands and
      are optional for I/O commands;

   -  NVMe over Fabrics does not support Completion Queue flow
      control.  This requires that the host ensure there are available
      Completion Queue slots before submitting new commands (see the
      sketch after this list); and

   -  NVMe over Fabrics allows Submission Queue flow control to be
      disabled if the host and controller agree to disable it.  If
      Submission Queue flow control is disabled, the host is required
      to ensure that there are available Submission Queue slots before
      submitting new commands.
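
   The last two items above shift queue-slot accounting to the host.
   The following is a minimal, hypothetical sketch (it is not part of
   any NVMe specification or existing implementation, and all names
   are illustrative) of how a host-side queue pair could track
   outstanding commands so that it never submits a command without a
   free Completion Queue slot.

      # Hypothetical host-side bookkeeping for an NVMe over Fabrics
      # queue pair.  NVMe over Fabrics has no Completion Queue flow
      # control, so the host must ensure a completion slot is free
      # before it submits another command.

      class QueuePair:
          def __init__(self, depth: int):
              self.depth = depth       # CQ slots (== SQ slots here)
              self.outstanding = 0     # submitted, not yet completed

          def can_submit(self) -> bool:
              # A command may be sent only if a CQ slot is free.
              return self.outstanding < self.depth

          def submit(self, command) -> bool:
              if not self.can_submit():
                  return False         # retry after a completion
              self.outstanding += 1
              self._send(command)      # placeholder for the send path
              return True

          def complete(self, completion) -> None:
              # Called when a completion arrives; frees one slot.
              self.outstanding -= 1

          def _send(self, command) -> None:
              pass                     # transport-specific (RDMA/TCP/FC)

      if __name__ == "__main__":
          qp = QueuePair(depth=2)
          assert qp.submit("cmd-1") and qp.submit("cmd-2")
          assert not qp.submit("cmd-3")  # no free CQ slot yet
          qp.complete("cmd-1 done")
          assert qp.submit("cmd-3")      # slot freed by the completion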

   NVMe over Fabrics requires the underlying NVMe Transport to provide
   reliable NVMe command and data delivery.  An NVMe Transport is an
   abstract protocol layer independent of any physical interconnect
   properties.  An NVMe Transport may expose a memory model, a message
   model, or a combination of the two.  A memory model is one in which
   commands, responses, and data are transferred between fabric nodes
   by performing explicit memory read and write operations, while a
   message model is one in which only messages containing command
   capsules, response capsules, and data are sent between fabric
   nodes.  The only memory model NVMe Transport supported by NVMe
   [NVME] is PCI Express, as defined in the NVMe Base specification.
   While differences exist between NVMe over Fabrics and NVMe over
   PCIe implementations, both implement the same architecture and
   command sets.  However, the NVMe SCSI translation reference used by
   this document applies only to NVMe over Fabrics (the message
   model), not to the memory model.

   NVMe over Fabrics utilizes the protocol layering shown in Figure 1.
   The native fabric communication services and the Fabric Protocol
   and Physical Fabric layers in Figure 1 are outside the scope of
   this specification.

                      +-------------------+
                      |  pNFS host SCSI   |
                      | layout over NVMe  |
                      +---------+---------+
                                |
                                v
                      +-------------------+
                      | NVMe over Fabrics |
                      +---------+---------+
                                |
                                v
                      +-------------------+
                      | Transport Binding |
                      +---------+---------+
                                |
                                v
                      +--------------------+
                      | NVMe Transport svc |
                      +---------+----------+
                                |
                                v
                      +-------------------+
                      |  NVMe Transport   |
                      +---------+---------+
                                |
                                v
                      +-------------------+
                      |  Fabric Protocol  |
                      +---------+---------+
                                |
                                v
                      +-------------------+
                      |  Physical Fabric  |
                      +---------+---------+
                                |
                                v
                    +------------------------+
                    |    pNFS SCSI layout    |
                    | server/NVMe controller |
                    +------------------------+

         Figure 1: pNFS SCSI over NVMe over Fabrics Layering

   An NVM subsystem port may support multiple NVMe Transports if more
   than one NVMe Transport binding specification exists for the
   underlying fabric (e.g., an NVM subsystem port identified by a Port
   ID may support both iWARP and RoCE).  This draft also defines an
   NVMe binding implementation that uses the RDMA Transport type.  The
   RDMA Transport is RDMA Provider agnostic.

   Figure 2 illustrates the layering of the RDMA Transport and common
   RDMA providers (iWARP, InfiniBand, and RoCE) within the host and
   the NVM subsystem.

          +--------------------------------------+
          |              NVMe Host               |
          +--------------------------------------+
          |            RDMA Transport            |
          +------------+------------+------------+
          |   iWARP    | InfiniBand |    RoCE    |
          +------------+-----++-----+------------+
                             ||
                        RDMA Fabric
                             vv
          +------------+-----++-----+------------+
          |   iWARP    | InfiniBand |    RoCE    |
          +------------+------------+------------+
          |            RDMA Transport            |
          +--------------------------------------+
          |            NVM Subsystem             |
          +--------------------------------------+

             Figure 2: RDMA Transport Protocol Layers

   NVMe over Fabrics allows multiple hosts to connect to different
   controllers in the NVM subsystem through the same port.  All other
   aspects of NVMe over Fabrics multi-path I/O and namespace sharing
   are equivalent to those defined in the NVMe Base specification.

   An association is established between a host and a controller when
   the host connects to a controller's Admin Queue using the Fabrics
   Connect command.  Within the Connect command, the host specifies
   the Host NQN, NVM Subsystem NQN, and Host Identifier, and may
   request a specific Controller ID or may request a connection to any
   available controller.  In this document, the host is the pNFS
   client and the controller is the NFSv4 server.  The pNFS clients
   connect to the server using different network protocols and
   transports, excluding direct PCIe connection.  While an association
   exists between a host and a controller, only that host may
   establish connections with I/O Queues of that controller.

   NVMe over Fabrics supports both fabric secure channel and NVMe
   in-band authentication.  An NVM subsystem may require a host to use
   fabric secure channel, NVMe in-band authentication, or both.  The
   Discovery Service indicates whether a fabric secure channel shall
   be used for an NVM subsystem.  The Connect response indicates
   whether NVMe in-band authentication shall be used with that
   controller.  For SCSI over NVMe over Fabrics, only the in-band
   authentication model is used, because fabric secure channel is only
   supported with the PCIe transport memory model, which is not
   supported by the SCSI layout protocol.

2.1.  Volume Identification

   The pNFS SCSI layout uses the Device Identification VPD page (page
   code 0x83) from [SPC4] to identify the devices used by a layout.
   There are several ways to build SCSI Device Identification
   descriptors from NVMe Identify data, including the controller
   attributes specific to NVMe over Fabrics specified in the Identify
   Controller fields in Section 4.1 of [NVMEoF].  This document uses a
   subset of this information to identify the LUs backing pNFS SCSI
   layouts.

   To be used as storage devices for the pNFS SCSI layout, NVMe
   devices MUST support the EUI-64 [RFC8154] value in the Identify
   Namespace data; the methods based on the Serial Number of legacy
   devices might not be suitable for unique addressing needs and thus
   MUST NOT be used.  UUID identification can be added by using an
   enum value large enough to avoid conflict with whatever T10 might
   do in a future version of the SCSI [SBC3] standard (the underlying
   SCSI field in SPC is 4 bits, so an enum value of 32 MUST be used in
   this draft).  For NVMe, these identifiers are obtained from the
   Namespace Identification Descriptors in NVMe 1.4 (returned by the
   Identify command with the CNS field set to 03h).
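
   As an illustration of the last point, the sketch below extracts the
   EUI-64 (and, when present, NGUID or UUID) identifiers from the
   Namespace Identification Descriptor list returned by an Identify
   command with CNS set to 03h.  The descriptor layout (NIDT, NIDL, a
   4-byte header, then the identifier) follows the NVMe Base
   specification; the function name and the way the raw 4096-byte
   buffer is obtained (ioctl, passthrough tool, vendor library) are
   assumptions left to the implementation.

      # Sketch: extract EUI-64/NGUID/UUID identifiers from the
      # Namespace Identification Descriptor list (Identify, CNS 03h).

      NIDT_NAMES = {
          0x01: "eui64",   # IEEE Extended Unique Identifier, 8 bytes
          0x02: "nguid",   # Namespace Globally Unique Id, 16 bytes
          0x03: "uuid",    # RFC 4122 UUID, 16 bytes
      }

      def parse_ns_id_descriptors(buf: bytes) -> dict:
          """Each descriptor is NIDT (1 byte), NIDL (1 byte), 2
          reserved bytes, then NIDL bytes of identifier; a NIDT of 0
          ends the list."""
          ids = {}
          off = 0
          while off + 4 <= len(buf):
              nidt, nidl = buf[off], buf[off + 1]
              if nidt == 0:
                  break                 # end of descriptor list
              nid = buf[off + 4:off + 4 + nidl]
              ids[NIDT_NAMES.get(nidt, "nidt_%02x" % nidt)] = nid.hex()
              off += 4 + nidl
          return ids

      # Example with a synthetic buffer carrying one EUI-64 descriptor.
      if __name__ == "__main__":
          desc = bytes([0x01, 0x08, 0, 0]) + \
                 bytes.fromhex("0025384100000001")
          buf = desc + bytes(4096 - len(desc))
          print(parse_ns_id_descriptors(buf))
          # -> {'eui64': '0025384100000001'}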

2.2.  Client Fencing

   The SCSI layout uses Persistent Reservations to provide client
   fencing.  For this to work, both the MDS and the clients have to
   register a key with the storage device, and the MDS has to create a
   reservation on the storage device.  The pNFS SCSI protocol
   implements fencing using persistent reservations (PRs), similar to
   the fencing method used by existing shared-disk file systems.  To
   allow fencing of individual systems, each system MUST use a unique
   persistent reservation key.  The following subsections give a full
   mapping of the required PERSISTENT RESERVE IN and PERSISTENT
   RESERVE OUT SCSI commands to NVMe commands, which MUST be used when
   NVMe devices serve as storage devices for the pNFS SCSI layout.

2.2.1.  Reservation Key Generation

   Prior to establishing a reservation on a namespace, a host shall
   become a registrant of that namespace by registering a reservation
   key.  This reservation key may be used by the host as a means of
   identifying the registrant (host), authenticating the registrant,
   and preempting a failed or uncooperative registrant.  This document
   assigns the burden of generating unique keys to the MDS, which MUST
   generate a key for itself before exporting a volume and a key for
   each client that accesses SCSI layout volumes.

   One important difference between SCSI Persistent Reservations and
   NVMe Reservations is that NVMe reservation keys always apply to all
   controllers used by a host (as indicated by the NVMe Host
   Identifier).  This behavior is somewhat similar to setting the
   ALL_TG_PT bit when registering a SCSI reservation key, but it is
   actually guaranteed to work reliably.

2.2.2.  MDS Registration and Reservation

   Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
   MDS needs to prepare the volume for fencing using NVMe
   Reservations.  Registering a reservation key with a namespace
   creates an association between a host and a namespace.  A host that
   is a registrant of a namespace may use any controller with which
   that host is associated (i.e., that has the same Host Identifier;
   refer to Section 5.21.1.26 of [NVME]) to access that namespace as a
   registrant.

2.2.3.  Client Registration

2.2.3.1.  SCSI Client

   Before performing the first I/O to a device returned from a
   GETDEVICEINFO operation, the client will register the reservation
   key returned by the MDS with the storage device by issuing a
   PERSISTENT RESERVE OUT command with a service action of REGISTER
   and the SERVICE ACTION RESERVATION KEY field set to the reservation
   key.

2.2.3.2.  NVMe Client

   A client registers a reservation key by executing a Reservation
   Register command (refer to Section 6.11 of [NVME]) on the namespace
   with the Reservation Register Action (RREGA) field cleared to 000b
   (i.e., Register Reservation Key) and supplying a reservation key in
   the New Reservation Key (NRKEY) field.  A client that is a
   registrant of a namespace may register the same reservation key
   value multiple times with the namespace on the same or different
   controllers.  There are no restrictions on the reservation key
   values used by hosts with different Host Identifiers.
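
   The following is a minimal sketch of how a client could build the
   payload for that Reservation Register command.  The 16-byte data
   buffer holding CRKEY and NRKEY and the RREGA bits in Command Dword
   10 reflect the layout described in [NVME]; the helper names, the
   example key value, and the omitted submission path (e.g., an
   operating-system passthrough interface) are assumptions made for
   illustration only.

      # Sketch: build the data buffer and Command Dword 10 for an NVMe
      # Reservation Register (Register Reservation Key) request.
      # Submitting the command to a controller is intentionally
      # omitted and left to the implementation.

      import struct

      RREGA_REGISTER = 0x0      # 000b: Register Reservation Key
      RREGA_UNREGISTER = 0x1    # 001b: Unregister Reservation Key
      RREGA_REPLACE = 0x2       # 010b: Replace Reservation Key

      def reservation_register(nrkey: int, crkey: int = 0,
                               rrega: int = RREGA_REGISTER):
          """Return (cdw10, data) for a Reservation Register command."""
          # Data buffer: Current Reservation Key (bytes 7:0) and
          # New Reservation Key (bytes 15:8), little-endian.
          data = struct.pack("<QQ", crkey, nrkey)
          cdw10 = rrega & 0x7   # RREGA in bits 2:0; IEKEY/CPTPL zero
          return cdw10, data

      # Example: register the key handed out by the MDS to this client.
      if __name__ == "__main__":
          mds_assigned_key = 0x1122334455667788   # hypothetical value
          cdw10, data = reservation_register(nrkey=mds_assigned_key)
          print(hex(cdw10), data.hex())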

2.2.4.  Fencing Action

2.2.4.1.  SCSI Client

   In the case of a non-responding client, the MDS fences the client
   by issuing a PERSISTENT RESERVE OUT command with the service action
   set to PREEMPT or PREEMPT AND ABORT, the RESERVATION KEY field set
   to the server's reservation key, the SERVICE ACTION RESERVATION KEY
   field set to the reservation key associated with the non-responding
   client, and the TYPE field set to 8h (Exclusive Access -
   Registrants Only).

2.2.4.2.  NVMe Client

   A host that is a registrant may preempt a reservation and/or
   registration by executing a Reservation Acquire command (refer to
   Section 6.10 of [NVME]), setting the Reservation Acquire Action
   (RACQA) field to 001b (Preempt), and supplying the current
   reservation key associated with the host in the Current Reservation
   Key (CRKEY) field.  The CRKEY value shall match that used by the
   registrant to register with the namespace.  If the CRKEY value does
   not match, the command is aborted with a status of Reservation
   Conflict.  If the PRKEY field value does not match that of the
   current reservation holder and is equal to 0h, the command is
   aborted with a status of Invalid Field in Command.

   A Reservation Preempted notification occurs on all controllers in
   the NVM subsystem that are associated with hosts that have their
   registrations removed as a result of the actions taken in this
   section, except those associated with the host that issued the
   preempting Reservation Acquire command.

   After the MDS preempts a client, all client I/O to the LU fails.
   The client SHOULD at this point return any layout that refers to
   the device ID that points to the LU.

2.2.5.  Client Recovery after a Fence Action

   A client that detects an NVMe status code (I/O error) on the
   storage device MUST commit all layouts that use the storage device
   through the MDS, return all outstanding layouts for the device,
   forget the device ID, and unregister the reservation key.  Future
   GETDEVICEINFO calls MAY refer to the storage device again, in which
   case the client will perform a new registration based on the key
   provided.

   If a reservation holder attempts to obtain a reservation of a
   different type on a namespace for which that host already is the
   reservation holder, then the command is aborted with a status of
   Reservation Conflict.  It is not an error if a reservation holder
   attempts to obtain a reservation of the same type on a namespace
   for which that host already is the reservation holder.

   NVMe over Fabrics [NVMEoF] utilizes the same controller
   architecture as that defined in the NVMe Base specification [NVME].
   This includes using Submission and Completion Queues to execute
   commands between a host and a controller.  Section 8.20 of the NVMe
   Base specification [NVME] describes the relationship between a
   controller (the MDS) and a namespace associated with the clients.
   In the static controller model used by the SCSI layout, controllers
   that may be allocated to a particular client may have different
   states at the time the association is established.
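
   As a companion to the SCSI PREEMPT action above, the sketch below
   shows how an MDS could build the Reservation Acquire payload used
   to fence a non-responding NVMe client.  The Command Dword 10 bit
   positions, the 16-byte CRKEY/PRKEY data layout, and the 4h encoding
   of the Exclusive Access - Registrants Only reservation type are
   assumptions based on [NVME]; the function name, the example key
   values, and the omitted submission path are illustrative only.

      # Sketch: build the data buffer and Command Dword 10 for an NVMe
      # Reservation Acquire (Preempt) request, the NVMe analogue of
      # the SCSI PREEMPT fencing action.  Command submission is
      # omitted and left to the implementation.

      import struct

      RACQA_ACQUIRE = 0x0            # 000b
      RACQA_PREEMPT = 0x1            # 001b
      RACQA_PREEMPT_AND_ABORT = 0x2  # 010b

      # Exclusive Access - Registrants Only (assumed 4h per [NVME])
      RTYPE_EXCL_ACCESS_REG_ONLY = 0x4

      def fence_client(mds_key: int, client_key: int,
                       racqa: int = RACQA_PREEMPT,
                       rtype: int = RTYPE_EXCL_ACCESS_REG_ONLY):
          """Return (cdw10, data) preempting client_key with the MDS
          key."""
          # Data buffer: CRKEY (bytes 7:0) = the MDS's own key,
          #              PRKEY (bytes 15:8) = key of the fenced client.
          data = struct.pack("<QQ", mds_key, client_key)
          cdw10 = (racqa & 0x7) | ((rtype & 0xFF) << 8)
          return cdw10, data

      if __name__ == "__main__":
          cdw10, data = fence_client(mds_key=0xAA,      # hypothetical
                                     client_key=0xBB)   # hypothetical
          print(hex(cdw10), data.hex())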

2.3.  Volatile write caches

   The Volatile Write Cache Enable (WCE) bit (i.e., bit 00) of the
   Volatile Write Cache feature (Feature Identifier 06h) is the write
   cache enable field accessed through the NVMe Get Features command;
   see Section 5.21.1.6 of [NVME].  If a volatile write cache is
   enabled on an NVMe device used as a storage device for the pNFS
   SCSI layout, the MDS MUST use the NVMe Flush command to flush the
   volatile write cache.  If there is no volatile write cache on the
   server, attempts to access this NVMe feature cause errors; a Get
   Features command specifying the Volatile Write Cache feature
   identifier is expected to fail with a status of Invalid Field in
   Command.

3.  Security Considerations

   Since no protocol changes are proposed here, no new security
   considerations apply.  The protocol does, however, assume that the
   NVMe Authentication commands are implemented as part of the NVMe
   Security Protocol, as the format of the data to be transferred
   depends on the Security Protocol.  Authentication Receive commands
   return the appropriate data corresponding to an Authentication Send
   command as defined by the rules of the Security Protocol.  As the
   current draft only supports the NVMe over Fabrics in-band protocol,
   the authentication requirements for security commands are based on
   the security protocol indicated by the SECP field in the command
   and DO NOT require authentication when used for NVMe in-band
   authentication.  When used for other purposes, in-band
   authentication of the commands is required.

4.  IANA Considerations

   The document does not require any actions by IANA.

5.  Normative References

   [NVME]     NVM Express, Inc., "NVM Express Base Specification,
              Revision 1.4", June 10, 2019.

   [NVMEoF]   NVM Express, Inc., "NVM Express over Fabrics,
              Revision 1.1", July 26, 2019.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8154]  Hellwig, C., "Parallel NFS (pNFS) Small Computer System
              Interface (SCSI) Layout", RFC 8154, May 2017.

   [SBC3]     INCITS Technical Committee T10, "SCSI Block Commands-3",
              ANSI INCITS 514-2014, ISO/IEC 14776-323, 2014.

   [SPC4]     INCITS Technical Committee T10, "SCSI Primary
              Commands-4", ANSI INCITS 513-2015, 2015.

   [FC-NVMe]  INCITS Technical Committee T11, "Fibre Channel - Non-
              Volatile Memory Express", ANSI INCITS 540, 2018.

   [RFC7525]  Sheffer, Y., Holz, R., and P. Saint-Andre,
              "Recommendations for Secure Use of Transport Layer
              Security (TLS) and Datagram Transport Layer Security
              (DTLS)", BCP 195, RFC 7525, May 2015.

Authors' Addresses

   Sorin Faibish
   Dell EMC
   228 South Street
   Hopkinton, MA 01774
   United States of America

   Phone: +1 508-249-5745
   Email: faibish.sorin@dell.com


   David Black
   Dell EMC
   176 South Street
   Hopkinton, MA 01748
   United States of America

   Phone: +1 774-350-9323
   Email: david.black@dell.com


   Christoph Hellwig

   Email: hch@lst.de