Internet-Draft Abbreviated-Title October 2022
Wang, et al. Expires 27 April 2023
Workgroup:
Network Working Group
Internet-Draft:
draft-wang-ffd-framework-00
Published:
Intended Status:
Informational
Expires:
27 April 2023
Authors:
H. Wang
Huawei
F. Qin
China Mobile
L. Zhao
Huawei
S. Chen
Huawei

Framework of Fast Fault Detection for IP-based Networks

Abstract

IP-based distributed systems and the software applications that run on them often use heartbeats to track the status of the network topology. However, heartbeat intervals are typically long, which prolongs fault detection. IP-based storage networks are a typical example. When a fault occurs in an IP-based storage network, NVMe connections need to be switched over. Currently, no effective method is available for quick detection, so switchover is triggered only by keepalive timeout, which results in poor performance.

This document defines a basic framework in which the network assists host devices to quickly detect application connection failures caused by network faults.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 27 April 2023.

1. Introduction

IP-based distributed systems are widely used, and the network is opaque to application-side systems. When an IP network connected to a distributed system encounters a fault that affects IP connectivity, the application system cannot quickly detect the fault. To detect the fault quickly, the application system needs to send keepalives more frequently or deploy a dedicated detection mechanism, either of which adds overhead to the application system.

[I-D.guo-ffd-requirement] describes the requirements of these applications. The most typical application scenario is IP-based NVMe.

IP-based NVMe is the implementation of NVMe over Fabrics that best fits NVMe semantics, and it is the development trend for future high-speed storage networks. IP-based NVMe for high-speed storage places high requirements on IP networks. In an IP-based NVMe network, when a failure that affects an IP connection occurs, for example, an access link failure or a failure in the switching network that route convergence cannot repair, the NVMe connection cannot immediately detect the fault. In current implementations, such a failure can be detected only by keepalive timeout, so the resulting interruption generally lasts more than 10 seconds. To speed up detection, hosts and storage devices can use fast keepalive or BFD. However, this approach places additional load on hosts and storage devices, which makes it difficult to use in large-scale IP-based NVMe deployments.
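
The trade-off between detection time and keepalive load can be illustrated with simple arithmetic. The following is a minimal sketch and not part of any NVMe specification. The function names, the keepalive intervals, the missed-probe count, and the number of connections are all assumptions chosen for illustration.

```python
def approx_detection_s(interval_s: float, missed_probes: int) -> float:
    """Approximate time to declare a peer down when a failure is declared
    only after `missed_probes` consecutive keepalives go unanswered."""
    return interval_s * missed_probes


def keepalive_rate_per_s(connections: int, interval_s: float) -> float:
    """Keepalive messages per second that one device has to send."""
    return connections / interval_s


# Hypothetical values: 10,000 connections, failure declared after 3 misses.
slow_detect = approx_detection_s(interval_s=5.0, missed_probes=3)
fast_detect = approx_detection_s(interval_s=0.5, missed_probes=3)
slow_rate = keepalive_rate_per_s(connections=10_000, interval_s=5.0)
fast_rate = keepalive_rate_per_s(connections=10_000, interval_s=0.5)

print(f"slow keepalive: detect in ~{slow_detect} s at {slow_rate} msg/s")
print(f"fast keepalive: detect in ~{fast_detect} s at {fast_rate} msg/s")
```

With these assumed numbers, shortening the interval by a factor of ten shortens detection by the same factor but also multiplies the keepalive rate by ten, which is the scaling problem described above.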

2. Terminology

NoF : NVMe over Fabrics

FC : Fibre Channel

NVMe : Non-Volatile Memory Express

SAN : Storage Area Network

3. Reference Models

This document describes the framework using IP-based NVMe as a typical application.

An IP-based NVMe network mainly includes three types of roles: an initiator (referred to as a host), a switch, and a target (referred to as a storage device). Initiators and targets are also referred to as client endpoints and server endpoints, respectively. Hosts and storage devices use the IP-based NVMe protocol to transmit data over the network to provide high-performance storage services.

               +--+      +--+      +--+      +--+
   Host        |H1|      |H2|      |H3|      |H4|
(Initiator)    +/-+      +-,+      +.-+      +/-+
                |         | '.   ,-`|         |
                |         |   `',   |         |
                |         | ,-`  '. |         |
              +-\--+    +--`-+    +`'--+    +-\--+
              | SW1|    | SW2|    | SW3|    | SW4|
              +--,-+    +---,,    +,.--+    +-.--+
                  `.          `'.,`         .`
                    `.   _,-'`    ``'.,   .`
    IP              +--'`+            +`-`-+
  Network           | SW5|            | SW6|
                    +--,,+            +,.,-+
                    .`   `'.,     ,.-``   ',
                  .`         _,-'`          `.
              +--`-+    +--'`+    `'---+    +-`'-+
              | SW7|    | SW8|    | SW9|    |SW10|
              +-.,-+    +-..-+    +-.,-+    +-_.-+
                | '.   ,-` |        | `.,   .' |
                |   `',    |        |    '.`   |
                | ,-`  '.  |        | ,-`  `', |
  Storage      +-`+      `'\+      +-`+      +`'+
  (Target)     |S1|      |S2|      |S3|      |S4|
               +--+      +--+      +--+      +--+
               Figure 1 : NVMe over IP-based Network

Figure 1 shows a dual-plane NVMe over IP-based network that applies to large-scale storage device access networks. Storage devices on the dual-homed access network provide NVMe services using two different IP addresses.

When an access link (for example, the S1-SW7 link) or a network-side link (for example, the SW7-SW5 link) fails, H1 can no longer access the IP address that S1 uses on its connection to SW7, but H1 cannot quickly detect the failure. Only after the keepalive timeout does H1 detect the failure and switch the NVMe connection to the IP address that S1 uses through SW8.
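
The host-side switchover described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an NVMe implementation. The class name `TargetPaths`, its methods, and the two documentation-range addresses standing in for the S1 addresses via SW7 and SW8 are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class TargetPaths:
    """Addresses through which one dual-homed storage device is reachable."""
    name: str
    addresses: List[str]                       # in order of preference
    failed: Set[str] = field(default_factory=set)

    def active(self) -> Optional[str]:
        """Return the first address that is not known to have failed."""
        for addr in self.addresses:
            if addr not in self.failed:
                return addr
        return None

    def mark_failed(self, addr: str) -> None:
        """Record a failure, learned from a keepalive timeout or from the network."""
        self.failed.add(addr)


# Hypothetical addresses: S1 via SW7 and S1 via SW8.
s1 = TargetPaths("S1", ["192.0.2.10", "198.51.100.10"])
before = s1.active()
s1.mark_failed("192.0.2.10")    # the S1-SW7 path is reported as down
after = s1.active()
print(before, "->", after)
```

The switchover logic is the same whether the failure is learned by timeout or by notification. What the framework changes is how soon `mark_failed` can be called.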

4. Functional Components

An IP-based NVMe SAN consists of storage devices, hosts, and switches. The storage device provides services. The host initiates an NVMe connection to the storage device. That is, the host is the Client Endpoint, and the storage device is the Server Endpoint.

4.1. Server Endpoint (Storage Device)

As a service provider, the server endpoint does not need to detect the status of the client. To enable the network to know the information about the server, the server needs to advertise its information to the access switch.

To reduce the complexity of the server endpoint, it is suggested that LLDP be extended to support registration.

 +-----------+           +--------+
 | Server EP |           | Switch |
 | (Storage) |           |        |
 +----/------+           +----/---+
      |                       |
      |    Register Msg       |
      |---------------------->|
      |                       |
      \                       \
   Figure 2 : Server Endpoint
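
One possible way to carry a Register message in LLDP is an organizationally specific TLV (type 127 in IEEE 802.1AB), whose value starts with a 3-octet OUI and a 1-octet subtype. The sketch below is only an illustration of that idea. This document does not define the encoding, and the OUI, the subtype, the role codes, and the layout of the information field are all hypothetical.

```python
import ipaddress
import struct

LLDP_TLV_ORG_SPECIFIC = 127          # IEEE 802.1AB organizationally specific TLV
PLACEHOLDER_OUI = b"\x00\x00\x00"    # illustration only; a real extension would
                                     # use the OUI of the defining organization
SUBTYPE_REGISTER = 1                 # hypothetical subtype for "Register"
ROLE_SERVER, ROLE_CLIENT = 1, 2      # hypothetical role codes


def encode_register_tlv(ip: str, role: int) -> bytes:
    """Encode a registration as one LLDP TLV.

    Assumed information layout: role (1 octet), address length (1 octet),
    then the packed IPv4 or IPv6 address.
    """
    addr = ipaddress.ip_address(ip).packed
    info = bytes([role, len(addr)]) + addr
    value = PLACEHOLDER_OUI + bytes([SUBTYPE_REGISTER]) + info
    if len(value) > 511:                      # TLV length field is 9 bits
        raise ValueError("TLV value too long")
    # TLV header: 7-bit type followed by 9-bit length, in network byte order.
    header = (LLDP_TLV_ORG_SPECIFIC << 9) | len(value)
    return struct.pack("!H", header) + value


tlv = encode_register_tlv("192.0.2.10", ROLE_SERVER)
print(tlv.hex())
```

Reusing the existing TLV container keeps the endpoint change small, which matches the stated goal of reducing server endpoint complexity.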

4.2. Client Endpoint (Host)

The client needs to quickly obtain the IP reachability status of the server endpoint. To do so, the client sends a subscription request to the access switch. In addition, so that the network knows the location of the client endpoint, the client endpoint needs to register its information with the access switch. When the switch network detects a failure that the client endpoint has subscribed to, the access switch notifies the corresponding client endpoint of the fault state.

Also, to reduce the complexity of client endpoints, it is recommended that LLDP be extended to support subscriptions. For notification messages sent by the switch to client endpoints, it is recommended that a Layer 2 protocol extension be used to control the notification scope.

 +-----------+           +--------+
 | Client EP |           | Switch |
 |  (Host)   |           |        |
 +----/------+           +----/---+
      |                       |
      |    Register Msg       |
      |---------------------->|
      |                       |
      |    Subscribe Msg      |
      |---------------------->|
      |                       |
      |   Notification Msg    |
      |<----------------------|
      |                       |
      \                       \
    Figure 3 : Client Endpoint
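
The register, subscribe, and notify exchange in Figure 3 implies that the access switch keeps some state per client endpoint. The following is a minimal sketch of that state, not a definition of the protocol. The class `AccessSwitch`, its method names, the requirement that a client register before subscribing, and the documentation-range addresses are assumptions.

```python
from collections import defaultdict


class AccessSwitch:
    """Registration and subscription state held by one access switch."""

    def __init__(self, name):
        self.name = name
        self.registered = {}                  # endpoint IP -> role
        self.subscribers = defaultdict(set)   # watched server IP -> client IPs

    def register(self, ip, role):
        """Handle a Register message from a directly attached endpoint."""
        self.registered[ip] = role

    def subscribe(self, client_ip, server_ip):
        """Handle a Subscribe message; assumes the client registered first."""
        if client_ip not in self.registered:
            raise KeyError(f"{client_ip} has not registered")
        self.subscribers[server_ip].add(client_ip)

    def notify(self, failed_ip):
        """Return the Notification messages to send as (client, failed_ip) pairs."""
        return sorted((c, failed_ip) for c in self.subscribers.get(failed_ip, ()))


sw1 = AccessSwitch("SW1")
sw1.register("203.0.113.1", "client")          # H1, hypothetical address
sw1.subscribe("203.0.113.1", "192.0.2.10")     # H1 watches S1's address
print(sw1.notify("192.0.2.10"))
print(sw1.notify("192.0.2.99"))                # nobody subscribed
```

Only subscribed clients receive a notification, which is one simple way to keep the notification scope under control.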

4.3. Network Device

Network devices, such as access switches, can quickly detect failures on local access links. However, the client endpoint that needs to learn of the failure may not be connected to the switch that detects it. Therefore, the switch that detects the failure needs to synchronize the information to other switches so that they can notify the endpoints that subscribed to it.

On a large-scale network, a reflector can be used to reduce the number of connections needed for information synchronization between switches.

To ensure that synchronization messages are delivered reliably to other switches, a reliable transport protocol, such as TCP or QUIC, must be used.

 +--------+    +-----------+   +--------+
 | Switch |    | Reflector |   | Switch |
 +----/---+    +-----/-----+   +---/----+
      |              |             |
      |   Sync Msg   |             |
      |------------->|   Sync Msg  |
      |              |------------>|
      \              \             \
    Figure 4 : Network Device
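
The benefit of a reflector can be seen by counting synchronization sessions: a full mesh of n switches needs n(n-1)/2 sessions, whereas a single reflector needs n. The sketch below shows this count and a toy relay. The class `Reflector` and its methods are hypothetical, and in-memory lists stand in for the reliable transport sessions.

```python
def full_mesh_sessions(n: int) -> int:
    """Sessions needed when every switch syncs directly with every other."""
    return n * (n - 1) // 2


def reflector_sessions(n: int) -> int:
    """Sessions needed when every switch syncs only with one reflector."""
    return n


class Reflector:
    """Relays each Sync message to every attached switch except the sender."""

    def __init__(self):
        self.inboxes = {}                 # switch name -> received messages

    def attach(self, switch):
        self.inboxes[switch] = []

    def sync(self, sender, msg):
        for switch, inbox in self.inboxes.items():
            if switch != sender:
                inbox.append((sender, msg))


print(full_mesh_sessions(10), reflector_sessions(10))

reflector = Reflector()
for name in ("SW1", "SW7", "SW8"):
    reflector.attach(name)
reflector.sync("SW7", "192.0.2.10 unreachable")
print(reflector.inboxes)
```

For the ten switches drawn in Figure 1, the count drops from 45 sessions to 10, at the cost of making the reflector a component whose availability matters.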

5. Procedures

This section uses IP-based NVMe as an example to walk through the complete deployment process of this framework.

5.1. Network Deployment

The IP-based NVMe uses the standard IP technology. Network deployments typically use the current IP technologies. For example, OSPF is usually deployed as an underlay protocol.

5.2. Hosts and Storage Devices

Hosts and storage devices are connected to the IP network. As shown in Figure 1, they may access the network in single-homing or dual-homing mode. The administrator assigns access IP addresses to the hosts and storage devices. In most scenarios, routes to these addresses can be advertised through the underlay protocol.

To enable IP network devices to learn about these access nodes, hosts and storage devices need to register their own network information, such as IP addresses and roles, with the access switches after accessing the network. In addition, the host needs to send a subscription request to the access switch to tell it which storage devices the host is interested in.

5.3. Status Information Sync and Notification

Hosts and storage devices are connected to different switches. To enable these switches to obtain the registration and subscription information of these hosts and storage devices, the information needs to be synchronized between the switches.

 +------+        +--------+   +-----------+   +--------+     +---------+
 | Host |        | Switch |   | Reflector |   | Switch |     | Storage |
 +--/---+        +----/---+   +-----/-----+   +---/----+     +----/----+
    |  Register Msg   |             |             |               |
    |---------------->|             |             |               |
    |  Subscribe Msg  |             |             |               |
    |---------------->|  Sync Msg   |             |               |
    |                 |------------>|   Sync Msg  |               |
    |                 |             |------------>|  Register Msg |
    |                 |             |   Sync Msg  |<--------------|
    |                 |   Sync Msg  |<------------|               |
    |                 |<------------|             |--/            |
    |                 |             |             |  |Fault       |
    |                 |             |             |  |Detection   |
    |                 |             |   Sync Msg  |<--            |
    |                 |   Sync Msg  |<------------|               |
    | Notification Msg|<------------|             |               |
    |<----------------|             |             |               |
    \                 \             \             \               \
            Figure 5 : Information Advertisement

After detecting a local failure, the switch calculates the IP addresses affected by the failure. If an access endpoint on the same switch has subscribed to an affected IP address, the switch notifies that access endpoint of the fault. In addition, the switch needs to synchronize the failed IP addresses to other switches on the network. After receiving this information, the other switches notify the access endpoints that need it.
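
The local-failure procedure above can be sketched end to end as follows. This is an illustration under stated assumptions rather than a specification. The class `Switch`, the port name, and the addresses are hypothetical, and a direct peer list stands in for synchronization through a reflector over a reliable transport.

```python
class Switch:
    """Toy switch that maps a failed port to affected IPs and notifies subscribers."""

    def __init__(self, name):
        self.name = name
        self.port_ips = {}        # local port -> IPs registered on that port
        self.subscriptions = {}   # watched IP -> locally attached client IPs
        self.peers = []           # other switches that receive Sync messages

    def register(self, port, ip):
        self.port_ips.setdefault(port, set()).add(ip)

    def subscribe(self, client_ip, watched_ip):
        self.subscriptions.setdefault(watched_ip, set()).add(client_ip)

    def on_port_down(self, port):
        """Local failure: compute affected IPs, notify locally, then sync to peers."""
        affected = self.port_ips.get(port, set())
        notes = self._notify(affected)
        for peer in self.peers:
            notes += peer.on_sync(affected)
        return notes

    def on_sync(self, failed_ips):
        """Sync message from another switch: notify local subscribers only."""
        return self._notify(failed_ips)

    def _notify(self, failed_ips):
        return [(self.name, client, ip)
                for ip in sorted(failed_ips)
                for client in sorted(self.subscriptions.get(ip, ()))]


sw1, sw7 = Switch("SW1"), Switch("SW7")
sw7.peers.append(sw1)
sw7.register("port-to-S1", "192.0.2.10")       # S1 registers on SW7
sw1.subscribe("203.0.113.1", "192.0.2.10")     # H1 on SW1 watches S1
notifications = sw7.on_port_down("port-to-S1")
print(notifications)
```

Each tuple names the switch that sends the notification, the client that receives it, and the failed address, so the example shows SW1 notifying H1 about a failure that SW7 detected.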

When a network device or a link between network devices fails, routes on the network converge. In some cases, services cannot be restored even after route convergence, for example, when the SW7-SW5 link shown in Figure 1 is faulty. As a result, H1 cannot access the IP address that S1 uses to connect to SW7. In this case, after detecting the failure, the network device calculates the IP addresses affected by the failure and notifies the access endpoints that subscribed to them. As shown in Figure 1, SW1 calculates that the IP address used by S1 to connect to SW7 is unreachable. Therefore, SW1 notifies H1 of the failure so that H1 can quickly switch to another storage device.
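
The calculation of affected IP addresses after a network-side failure can be sketched as a reachability check. The code below is a simplified illustration: it assumes a global view of the topology and uses breadth-first search, whereas a real switch would more likely derive the result from its routing table after convergence. The link list is an assumed reading of part of Figure 1, and the addresses and function names are hypothetical.

```python
from collections import deque


def reachable(links, src):
    """Return the set of nodes reachable from src over undirected links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def unreachable_ips(links, failed_link, src, ip_attach):
    """IPs whose attachment switch cannot be reached from src once
    failed_link is removed from the topology."""
    failed = frozenset(failed_link)
    remaining = [link for link in links if frozenset(link) != failed]
    seen = reachable(remaining, src)
    return sorted(ip for ip, sw in ip_attach.items() if sw not in seen)


# Assumed partial topology: SW1 and SW3 uplink to SW5, which serves SW7 and SW9.
links = [("SW1", "SW5"), ("SW3", "SW5"), ("SW5", "SW7"), ("SW5", "SW9")]
ip_attach = {"192.0.2.10": "SW7", "192.0.2.30": "SW9"}   # hypothetical IPs

lost = unreachable_ips(links, ("SW7", "SW5"), "SW1", ip_attach)
print(lost)
```

Under these assumptions, only the address attached to SW7 becomes unreachable from SW1, so only subscribers to that address need to be notified.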

6. Security Considerations

NA

7. IANA Considerations

NA

8. References

8.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

8.2. Informative References

[I-D.guo-ffd-requirement]
Guo, L., Feng, Y., Zhao, J., Qin, F., Zhao, L., and H. Wang, "Requirement of Fast Fault Detection for IP-based Network", Work in Progress, Internet-Draft, draft-guo-ffd-requirement-00, <https://www.ietf.org/archive/id/draft-guo-ffd-requirement-00.txt>.

Authors' Addresses

Haibo Wang
Huawei
No. 156 Beiqing Road
Beijing
100095
P.R. China

Fengwei Qin
China Mobile
Beijing
China

Lily Zhao
Huawei
No. 3 Shangdi Information Road
Beijing
100085
P.R. China

Shuanglong Chen
Huawei
No. 156 Beiqing Road
Beijing
100095
P.R. China