Internet-Draft | LAYOUT_RECOVERY | November 2024 |
Haynes & Myklebust | Expires 24 May 2025 | [Page] |
The Parallel Network File System (pNFS) allows for a file's metadata (MDS) and data (DS) to be on different servers. When the metadata server is restarted, the client can still modify the data file component. During the recovery phase of startup, the metadata server and the data servers work together to recover state (which files are open, last modification time, size, etc.). If the client has not encountered errors with the data files, then the state can be recovered, avoiding resilvering of the data files. If the client has encountered errors, however, it has no means of reporting them to the metadata server. As such, the metadata server has to assume that the file needs resilvering. This document presents an extension to RFC8435 to allow the client to update the metadata and avoid the resilvering.¶
This note is to be removed before publishing as an RFC.¶
Discussion of this draft takes place on the NFSv4 working group mailing list (nfsv4@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/nfsv4/. Working Group information can be found at https://datatracker.ietf.org/wg/nfsv4/about/.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 24 May 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
In the Network File System version 4 (NFSv4) with a Parallel NFS (pNFS) Flexible File Layout ([RFC8435]) server, during recovery after a restart, there is no mechanism for the client to inform the metadata server about an error that occurred during a WRITE (see Section 18.32 of [RFC8881]) operation to the data servers during the outage.¶
Using the process detailed in [RFC8178], the revisions in this document become an extension of NFSv4.2 [RFC7862]. They are built on top of the external data representation (XDR) [RFC4506] generated from [RFC7863].¶
See Section 1.1 of [RFC8435] for a set of definitions.¶
The key words 'MUST', 'MUST NOT', 'REQUIRED', 'SHALL', 'SHALL NOT', 'SHOULD', 'SHOULD NOT', 'RECOMMENDED', 'NOT RECOMMENDED', 'MAY', and 'OPTIONAL' in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
When a metadata server restarts, clients are provided a grace recovery period where they are allowed to recover any state that they had established. With open files, the client can send an OPEN (see Section 18.16 of [RFC8881]) operation with a claim type of CLAIM_PREVIOUS (see Section 9.11 of [RFC8881]). The client uses the RECLAIM_COMPLETE (see Section 18.51 of [RFC8881]) operation to notify the metadata server that it is done reclaiming state.¶
The NFSv4 Flexible File Layout Type allows for the client to mirror files (see Section 8 of [RFC8435]). With client-side mirroring, it is important for the client to inform the metadata server of any I/O errors encountered with one of the mirrors. This is the only way for the metadata server to determine that one or more of the mirrors is corrupt and then repair the mirrors via resilvering. The client can use LAYOUTRETURN (see Section 18.44 of [RFC8881]) and the ff_ioerr4 (see Section 9.1.1 of [RFC8435]) structure to inform the metadata server of I/O errors.¶
A problem is that when the metadata server restarts and the client has errors it needs to report, it cannot do so. Section 12.7.4 of [RFC8881] requires that the client MUST stop using layouts. While the intent there is that the client MUST stop doing I/O to the storage devices, it is also true that the layout stateids are no longer valid. The LAYOUTRETURN needs a layout stateid to proceed and the client cannot get a layout during grace recovery (see Section 12.7.4 of [RFC8881]) to recover layout state. As such, clients have no choice but to not recover files with I/O errors. In turn, the metadata server MUST assume that the mirrors are inconsistent and pick one for resilvering. It is a MUST because even if the metadata server can determine that the client did modify data during the outage, it MUST NOT assume those modifications were consistent.¶
To fix this issue, the metadata server MUST accept for the lrf_stateid in LAYOUTRETURN (see Section 18.44.1 of [RFC8881]) the anonymous stateid of all zeros (see Section 8.2.3 of [RFC8881]). The client can use this anonymous stateid to inform the metadata server of errors encountered. The metadata server can then accurately resilver the file by picking the mirror(s) that do not have any associated errors.¶
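As an illustration of the anonymous stateid mentioned above, the following is a minimal sketch of its wire form, assuming the stateid4 layout of a 32-bit seqid followed by a 12-byte opaque "other" field. The names Stateid4, to_xdr, and is_anonymous are hypothetical and not taken from any implementation.

```python
import struct
from dataclasses import dataclass

OTHER_LEN = 12  # size of the opaque "other" field in stateid4


@dataclass(frozen=True)
class Stateid4:
    """Hypothetical in-memory form of a stateid4 (seqid + 12-byte other)."""
    seqid: int
    other: bytes

    def to_xdr(self) -> bytes:
        # XDR encodes the unsigned 32-bit seqid big-endian; a fixed-length
        # opaque of 12 bytes needs no padding because 12 is a multiple of 4.
        if len(self.other) != OTHER_LEN:
            raise ValueError("stateid4.other must be exactly 12 bytes")
        return struct.pack(">I", self.seqid) + self.other


# The anonymous stateid: seqid and other are both all zeros.
ANONYMOUS_STATEID = Stateid4(seqid=0, other=bytes(OTHER_LEN))


def is_anonymous(stateid: Stateid4) -> bool:
    """Return True only for the all-zeros anonymous stateid."""
    return stateid.seqid == 0 and stateid.other == bytes(OTHER_LEN)
```

Under these assumptions, the lrf_stateid sent during grace is simply 16 zero bytes on the wire.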
During the grace period, if the client sends an lrf_stateid in the LAYOUTRETURN with any value other than the anonymous stateid of all zeros, then the metadata server MUST now respond with an error of NFS4ERR_GRACE (see Section 15.1.9.2 of [RFC8881]). After the grace period, if the client sends an lrf_stateid in the LAYOUTRETURN with a value of the anonymous stateid of all zeros, then the metadata server MUST now respond with an error of NFS4ERR_NO_GRACE (see Section 15.1.9.3 of [RFC8881]).¶
Also, when the metadata server builds the reply to a LAYOUTRETURN whose lrf_stateid is the anonymous stateid of all zeros, it MUST NOT bump the seqid of the lorr_stateid.¶
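The two stateid rules for LAYOUTRETURN around the grace period can be sketched as a single check. This is only an illustration: the function name and the in_grace flag are hypothetical, the status values are represented by their names rather than their numeric codes, and a result of NFS4_OK means only that this particular check passed, not that the operation as a whole succeeds.

```python
from enum import Enum


class Status(Enum):
    NFS4_OK = "NFS4_OK"
    NFS4ERR_GRACE = "NFS4ERR_GRACE"
    NFS4ERR_NO_GRACE = "NFS4ERR_NO_GRACE"


def check_layoutreturn_stateid(seqid: int, other: bytes, in_grace: bool) -> Status:
    """Apply the grace-period rules to the lrf_stateid of a LAYOUTRETURN."""
    anonymous = seqid == 0 and other == bytes(12)
    if in_grace and not anonymous:
        # Layout stateids are not valid while the server is in grace.
        return Status.NFS4ERR_GRACE
    if not in_grace and anonymous:
        # The anonymous stateid is only accepted during grace.
        return Status.NFS4ERR_NO_GRACE
    return Status.NFS4_OK
```

A server that passes this check would still apply its usual LAYOUTRETURN processing to the request afterwards.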
If the metadata server detects that the layout being returned in the LAYOUTRETURN does not match the current mirror instances found for the file, then it MUST ignore the LAYOUTRETURN and resilver the file in question.¶
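One way to picture how a metadata server might combine the layout check with the reported errors is sketched below. It assumes each mirror instance can be named by an opaque string identifier, which is a simplification made for this example; the function name is also hypothetical. A return value of None stands for "ignore the LAYOUTRETURN and resilver using the server's own policy".

```python
from typing import Iterable, List, Optional


def resilver_sources(current_mirrors: Iterable[str],
                     returned_mirrors: Iterable[str],
                     errored_mirrors: Iterable[str]) -> Optional[List[str]]:
    """Return the mirrors with no reported errors, or None on a layout mismatch."""
    current = list(current_mirrors)
    if set(returned_mirrors) != set(current):
        # The returned layout does not describe the file's current mirror
        # instances, so the client's error report is not used.
        return None
    errored = set(errored_mirrors)
    # Mirrors without any associated error are candidates to resilver from.
    return [m for m in current if m not in errored]
```

For example, with current mirrors a, b, and c and an error reported against b, the candidates are a and c.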
The metadata server MUST resilver any files which are neither explicitly recovered with a CLAIM_PREVIOUS nor have a reported error via a LAYOUTRETURN. The client has most likely restarted and lost any state.¶
A write intent occurs when a client opens a file and gets a layout with an iomode of LAYOUTIOMODE4_RW from the metadata server. The metadata server MUST track outstanding write intents and when it restarts, it MUST track recovery of those write intents. The method that the metadata server uses to track write intents is implementation specific, i.e., outside of the scope of this document.¶
The decision to resilver a file depends on how the client recovers the file before the grace period ends. If the client reclaims the file and reports no errors, the metadata server MUST NOT resilver the file. If the client reports an error on the file, then the file MUST be resilvered. If the client does not reclaim or report an error before the grace period ends, then under the old behavior, the metadata server MUST resilver the file.¶
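The three cases above reduce to a small decision function. The sketch below is illustrative only; the function and parameter names are hypothetical, and it considers a single client's recovery of a single file at the end of the grace period.

```python
def must_resilver(reclaimed: bool, error_reported: bool) -> bool:
    """Decide whether a file needs resilvering once the grace period ends.

    reclaimed:      the client reclaimed the file (OPEN with CLAIM_PREVIOUS).
    error_reported: the client reported an I/O error via LAYOUTRETURN.
    """
    if error_reported:
        # A reported error always requires resilvering.
        return True
    # With no error reported, only an explicit reclaim avoids resilvering.
    return not reclaimed
```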
The resilvering process is broadly to:¶
The metadata server MUST NOT resilver a file if there are clients with outstanding write intents; multiple clients might have the file open with write intents. As it MUST track write intents, it MUST also track the need to resilver. That is, if the metadata server restarts during the grace period, it MUST restart the file recovery if it replays the write intent; otherwise, it MUST start the resilvering if it replays the resilvering intent.¶
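A per-file tracker for write intents and the need to resilver might look like the sketch below. It rests on several assumptions that are not stated in the text: a write intent is treated as outstanding until its client has recovered the file or the grace period has ended, clients are identified by a string, and persistence of the tracker across a second restart is left out. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FileRecoveryState:
    """Hypothetical per-file record kept by the metadata server during grace."""
    pending_clients: Set[str] = field(default_factory=set)  # unrecovered write intents
    needs_resilver: bool = False

    def client_recovered(self, client_id: str, reported_error: bool) -> None:
        # The client reclaimed the file or reported errors for it.
        self.pending_clients.discard(client_id)
        if reported_error:
            self.needs_resilver = True

    def grace_ended(self) -> None:
        # Any write intent never recovered forces a resilver.
        if self.pending_clients:
            self.needs_resilver = True
            self.pending_clients.clear()

    def may_start_resilver(self) -> bool:
        # Resilvering waits until no write intent is still outstanding.
        return self.needs_resilver and not self.pending_clients
```

With two clients holding write intents, an error reported by the first marks the file for resilvering, but may_start_resilver() stays false until the second client recovers or the grace period ends.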
Whether the metadata server prevents all I/O to the file until the resilvering is done, forces all I/O to go through the metadata server, or allows a proxy server to update the new data file as it is being resilvered is an implementation choice. The constraint is that the metadata server is responsible for the reconstruction of the data file and for the consistency of the mirrors.¶
If the metadata server does allow the client access to the file during the resilvering, then the client MUST have the same layout (set of mirror instances) after the metadata server restart as before. One way that such a resilvering can occur is for a proxy server to be inserted into the layout. That server will be copying a good mirror instance to a new instance. As it gets I/O via the layout, it will be responsible for updating the copy it is performing. The requirement is that the proxy server MUST stay in the layout until the grace period is finished.¶
The metadata server has no expectations for the client to use this new functionality. Therefore, if the client does not use it, the metadata server will function normally.¶
If the client does use the new functionality and the metadata server does not support it, then the metadata server MUST reply with an NFS4ERR_BAD_STATEID to the LAYOUTRETURN. If the client detects an NFS4ERR_BAD_STATEID error in this scenario, it should fall back to the old behavior of not reporting errors.¶
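The client-side fallback can be sketched as follows. The send_layoutreturn callable is a hypothetical stand-in for whatever the client uses to issue a LAYOUTRETURN and obtain its status name; the error entries are represented as plain strings for brevity, and the anonymous stateid is shown as its 16 zero bytes on the wire.

```python
from typing import Callable, Sequence

# Encoded anonymous stateid: 4-byte seqid plus 12-byte "other", all zeros.
ANONYMOUS_STATEID = bytes(16)


def report_io_errors(send_layoutreturn: Callable[[bytes, Sequence[str]], str],
                     io_errors: Sequence[str]) -> bool:
    """Try to report I/O errors during grace; return True if they were accepted."""
    if not io_errors:
        return False  # nothing to report
    status = send_layoutreturn(ANONYMOUS_STATEID, io_errors)
    if status == "NFS4ERR_BAD_STATEID":
        # The server does not support the extension: fall back to the old
        # behavior and do not report the errors.
        return False
    return status == "NFS4_OK"
```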
There are no new security considerations beyond those in [RFC7862].¶
There are no IANA considerations for this document.¶
Tigran Mkrtchyan, Jeff Layton, and Rick Macklem provided reviews of the document.¶