Network Working Group                                             L. Guo
Internet-Draft                                                     CAICT
Intended status: Informational                                   Y. Feng
Expires: 27 April 2023                                      China Mobile
                                                                 J. Zhao
                                                           China Telecom
                                                                  F. Qin
                                                            China Mobile
                                                                 L. Zhao
                                                                 H. Wang
                                                                  Huawei
                                                         24 October 2022


        Requirement of Fast Fault Detection for IP-based Network
                      draft-guo-ffd-requirement-00

Abstract

   IP-based distributed systems and software at the application layer
   often use heartbeats to track the status of the network topology.
   However, heartbeat intervals are typically long, which prolongs fault
   detection.  This document describes the requirements for a fast fault
   detection solution for IP-based networks.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 27 April 2023.


Guo, et al.               Expires 27 April 2023                 [Page 1]

Internet-Draft              Abbreviated-Title               October 2022


Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   3
   3.  Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . .   3
     3.1.  IP-based NVMe . . . . . . . . . . . . . . . . . . . . . .   3
     3.2.  Distributed Storage . . . . . . . . . . . . . . . . . . .   6
     3.3.  Cluster Computing . . . . . . . . . . . . . . . . . . . .   8
   4.  References  . . . . . . . . . . . . . . . . . . . . . . . . .   8
     4.1.  Normative References  . . . . . . . . . . . . . . . . . .   8
     4.2.  Informative References  . . . . . . . . . . . . . . . . .   9
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   9

1.  Introduction

   In the face of ever-expanding data, even a powerful single-server
   system cannot meet the requirements of data analysis and storage.  At
   the same time, as Ethernet bandwidth and scale have increased,
   distributed systems that communicate over the network have emerged
   and developed rapidly.  Heartbeats are a common technique for
   maintaining the network topology in distributed systems and at the
   software application layer.  However, if the heartbeat interval is
   set too short, transient network congestion may cause a healthy peer
   to be misjudged as faulty.  If it is set too long, faults are
   detected slowly.  In general, operators have to balance these
   parameters against the conditions of the network.  IP-based NVMe,
   distributed storage, and cluster computing are used in core
   application scenarios, where the performance requirements and the
   impact of faults on services keep increasing.  This document
   describes the application scenarios and capability requirements for
   fast fault detection in scenarios such as IP-based NVMe, distributed
   storage, and cluster computing (including artificial intelligence
   clusters).
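
   The trade-off described above can be illustrated with a minimal
   Python sketch.  It treats the heartbeat setting as a reply timeout,
   which is an assumption.  The delay samples, the timeout values, and
   the function name are invented for illustration and are not taken
   from any deployment or protocol.

```python
# Illustrative only: the reply delays and timeout values are assumed.

def false_alarms(reply_delays_s, timeout_s):
    """Count replies from a healthy peer that arrive after the timeout."""
    return sum(1 for delay in reply_delays_s if delay > timeout_s)


# Reply delays (seconds) of a healthy peer; two are slowed by congestion.
delays = [0.02, 0.03, 0.90, 0.02, 1.40, 0.03]

for timeout in (0.5, 2.0, 5.0):
    alarms = false_alarms(delays, timeout)
    print(f"timeout={timeout}s: false alarms={alarms}, "
          f"a real fault is noticed no sooner than {timeout}s after a probe")
```

   With these assumed samples, the shortest timeout misjudges the
   healthy peer twice, while the longer timeouts avoid false alarms at
   the cost of slower detection of a real fault.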


2.  Terminology

   FC: Fibre Channel

   NVMe: Non-Volatile Memory Express

   IP-based NVMe: NVMe transported over Ethernet using RDMA or TCP

   NoF: NVMe over Fabrics

3.  Use Cases


3.1.  IP-based NVMe

   For a long time, key storage applications with high performance
   requirements have mainly been based on FC networks.  As transmission
   rates have increased, the storage medium has evolved from HDDs to
   solid-state storage, and the protocol has evolved from SATA to NVMe.
   The emergence of NVMe brings new opportunities.  As the NVMe
   protocol has developed, its application scenarios have been extended
   from PCIe to other fabrics, which addresses the limits on NVMe
   scalability and transmission distance.  Block storage protocols use
   NoF in place of SCSI, reducing the number of protocol interactions
   between application hosts and storage systems.  End-to-end NVMe
   greatly improves performance.

   The fabrics used for NoF include Ethernet, Fibre Channel, and
   InfiniBand.  Comparing FC-NVMe with Ethernet-based or
   InfiniBand-based alternatives generally comes down to the advantages
   and disadvantages of the underlying networking technologies.  Fibre
   Channel fabrics are noted for their lossless data transmission,
   predictable and consistent performance, and reliability.  Large
   enterprises tend to favor FC storage for mission-critical workloads.
   However, Fibre Channel requires special equipment and storage
   networking expertise to operate and can be more costly than IP-based
   alternatives.  Like FC, InfiniBand is a lossless network that
   requires special hardware.  IP-based NVMe storage products tend to be
   more plentiful than FC-NVMe-based options, and most storage startups
   focus on IP-based NVMe.  Unlike FC, however, an Ethernet switch does
   not send notifications when the status of a device changes.  When a
   device is faulty, the host has to rely on the NVMe link heartbeat
   mechanism and takes tens of seconds to complete service failover.


                   +--------------------------------------+
                   |          NVMe Host Software          |
                   +--------------------------------------+
                   +--------------------------------------+
                   |   Host Side Transport Abstraction    |
                   +--------------------------------------+

                      /\      /\      /\      /\      /\
                     /  \    /  \    /  \    /  \    /  \
                      FC      IB     RoCE    iWARP   TCP
                     \  /    \  /    \  /    \  /    \  /
                      \/      \/      \/      \/      \/

                   +--------------------------------------+
                   |Controller Side Transport Abstraction |
                   +--------------------------------------+
                   +--------------------------------------+
                   |          NVMe SubSystem              |
                   +--------------------------------------+
                Figure 1: NVMe SubSystem

   This section describes the application scenarios and capability
   requirements for IP-based NVMe storage to achieve fast fault
   detection similar to that of FC.

   An NVMe over RDMA or IP-based storage network includes three types
   of roles: an initiator (referred to as a host), a switch, and a
   target (referred to as a storage device).  Initiators and targets are
   also referred to as endpoint devices.


                            +--+      +--+      +--+      +--+
                Host        |H1|      |H2|      |H3|      |H4|
             (Initiator)    +/-+      +-,+      +.-+      +/-+
                             |         | '.   ,-`|         |
                             |         |   `',   |         |
                             |         | ,-`  '. |         |
                           +-\--+    +--`-+    +`'--+    +-\--+
                           | SW |    | SW |    | SW |    | SW |
                           +--,-+    +---,,    +,.--+    +-.--+
                               `.          `'.,`         .`
                                 `.   _,-'`    ``'.,   .`
                    IP           +--'`+            +`-`-+
               Network           | SW |            | SW |
                                 +--,,+            +,.,-+
                                 .`   `'.,     ,.-``   ',
                               .`         _,-'`          `.
                           +--`-+    +--'`+    `'---+    +-`'-+
                           | SW |    | SW |    | SW |    | SW |
                           +-.,-+    +-..-+    +-.,-+    +-_.-+
                             | '.   ,-` |        | `.,   .' |
                             |   `',    |        |    '.`   |
                             | ,-`  '.  |        | ,-`  `', |
               Storage      +-`+      `'\+      +-`+      +`'+
               (Target)     |S1|      |S2|      |S3|      |S4|
                            +--+      +--+      +--+      +--+
           Figure 2: NVMe over IP-based Network

   Hosts and storage devices are connected to the network separately.
   To achieve high reliability, each host and each storage device is
   connected to dual network planes simultaneously.  A host can read
   and write data once an NVMe connection is established between the
   host and the storage device.

   When a storage device link fails during operation, the host cannot
   detect the fault status of the indirectly connected device at the
   transport layer.  With the IP-based NVMe protocol, the host uses the
   NVMe heartbeat to detect the status of the storage device.  The
   heartbeat message interval is 5 seconds.  Therefore, it takes tens
   of seconds to determine that the storage device is faulty and to
   perform service switchover using the multipath software.  This
   exceeds the failure tolerance time of core applications.  To meet
   customer experience and business reliability requirements, fault
   detection and failover for IP-based NVMe need to be enhanced.
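
   The arithmetic behind this delay can be sketched as follows.  Only
   the 5-second interval comes from the text above.  The number of
   consecutive missed heartbeats that triggers a fault declaration
   (missed_limit) is an assumed example value, and the function name is
   hypothetical.

```python
# Illustrative arithmetic. Only the 5 s interval comes from the text;
# the missed-heartbeat threshold is an assumed example value.

def detection_window_s(interval_s, missed_limit):
    """Return (best, worst) time from fault to detection, in seconds.

    Assumes a fault is declared at the instant the `missed_limit`-th
    consecutive heartbeat goes unanswered, ignoring per-probe timeouts.
    """
    best = (missed_limit - 1) * interval_s   # fault just before a heartbeat
    worst = missed_limit * interval_s        # fault just after a heartbeat
    return best, worst


best, worst = detection_window_s(interval_s=5.0, missed_limit=3)
print(f"detection takes between {best:.0f} s and {worst:.0f} s")
```

   Under these assumptions, detection alone takes 10 to 15 seconds.
   Path switchover by the multipath software adds to this window.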


   This document proposes a fast fault detection solution in which
   switches participate.  The solution uses the ability of switches to
   detect faults quickly at the physical layer and link layer.  The
   switch synchronizes the detected fault information across the IP
   network and then notifies the endpoint devices of the fault status.

   Fault detection procedure: The following steps allow the host to
   detect the fault status of the storage device and quickly switch to
   the standby path.

   1.  If a storage fault occurs, the access switch detects the fault at
       the storage network layer or link layer.

   2.  The switch synchronizes the status to other switches on the
       network.

   3.  The switch notifies the hosts of the storage fault.

   4.  The host quickly tears down its connection to the storage device
       and triggers the multipathing software to switch services to the
       redundant path.  The fault should be detected within 1 second.

           +----+       +-------+     +-------+    +-------+
           |Host|       |Switch |     |Switch |    |Storage|
           +----+       +-------+     +-------+    +-------+
              |             |            |-+           |
              |             |            |1|           |
              |             |            |-+           |
              |             |<----2------|             |
              |             |            |             |
              |<----3-------|            |             |
              |             |            |             |
              |<----4-------|------------|-----------> |
              |             |            |             |
        Figure 3: Switches interact with hosts and storage devices
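
   The four steps above can be modeled with a small Python sketch.  It
   is a simplified illustration under stated assumptions: the class
   names, method names, and port labels are hypothetical, and no real
   switch API or notification protocol is implied.

```python
# Hypothetical model of the four-step procedure; no real protocol or
# switch API is implied, and every name here is illustrative.

class Host:
    def __init__(self, name, paths):
        self.name = name
        # Ordered list of storage ports; the first entry is the active path.
        self.paths = list(paths)

    def on_fault(self, storage_port):
        # Step 4: drop the faulty path so the next one becomes active.
        if storage_port in self.paths:
            self.paths.remove(storage_port)

    @property
    def active_path(self):
        return self.paths[0] if self.paths else None


class Switch:
    def __init__(self, name):
        self.name = name
        self.peers = []          # other switches in the IP network
        self.hosts = []          # directly attached hosts
        self.known_faults = set()

    def link_down(self, storage_port):
        # Step 1: the access switch detects the fault on its own port.
        self.learn_fault(storage_port)

    def learn_fault(self, storage_port):
        if storage_port in self.known_faults:
            return               # already processed; stops flooding loops
        self.known_faults.add(storage_port)
        for peer in self.peers:  # Step 2: synchronize to other switches
            peer.learn_fault(storage_port)
        for host in self.hosts:  # Step 3: notify attached hosts
            host.on_fault(storage_port)


host = Host("H1", paths=["S1-portA", "S1-portB"])
host_switch, storage_switch = Switch("SW-host"), Switch("SW-storage")
host_switch.hosts.append(host)
host_switch.peers.append(storage_switch)
storage_switch.peers.append(host_switch)

storage_switch.link_down("S1-portA")
print(host.active_path)  # prints "S1-portB"
```

   The duplicate check in learn_fault is a design choice of this sketch
   that keeps the synchronization from looping between peer switches.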

3.2.  Distributed Storage

   Devices in a distributed storage cluster are interconnected through
   a network (the back-end IP network) to form the cluster.  When a
   node or a link on a node fails, the other nodes in the storage
   cluster cannot detect the fault status of the indirectly connected
   device at the transport layer.  Over IP, the management or master
   nodes in a storage cluster use heartbeats to detect the status of
   storage nodes.  It takes 10 seconds or more to determine that a
   storage node is faulty and to switch services to another healthy
   storage node.  Services cannot be accessed during the fault.  To
   meet customer experience and service reliability requirements, fault
   detection and failover for IP-based cluster nodes need to be
   enhanced.

                Storage      +--+      +--+      +--+      +--+
                cluster      |S1|      |S2|      |S3|      |S4|
                             +--+      +--+      +--+      +--+
                              |           '.   ,-`          |
                              |            .`',_            |
                              |    _ ..--`       `'--.._    |
                            +-\--+                       +-\--+
                            | SW |                       | SW |
                            +--,-+_                     _+-.--+
                                `. `'--..._   _ .. -- '`_.`
                                  `.    _,-'` -._     .`
                BACK Storage      +--'`+         +`-`-+
                IP Network        | SW |         | SW |
                                  +----+         +----+
            Figure 4: Distributed storage

   The fast fault detection solution in this proposal can be used in
   this scenario.  The solution takes advantage of the switch's ability
   to quickly detect faults at the physical layer and link layer.  The
   switch synchronizes the detected fault information across the IP
   network and then notifies the management node or master node of the
   storage cluster of the fault status.

   Fault detection procedure:

   1.  If a storage fault occurs, the access switch detects the fault at
       the storage network layer or link layer.

   2.  The switch synchronizes the status to other switches on the
       network.

   3.  The switch notifies the storage management or master node of the
       storage fault.  The fault should be detected within 1 second.


      +------+       +-------+     +-------+    +-------+
      |master|       |Switch |     |Switch |    |Storage|
      +------+       +-------+     +-------+    +-------+
         |               |            |-+           |
         |               |            |1|           |
         |               |            |-+           |
         |               |<----2------|             |
         |               |            |             |
         |<----3---------|            |             |
         |               |            |             |

   Figure 5: Switches interact with the master node
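
   The following Python sketch contrasts heartbeat-based detection with
   a switch notification as seen by a master node.  The class name,
   method names, and timestamps are hypothetical.  The 10-second
   heartbeat timeout is an assumed example value chosen to match the
   delay described in this section.

```python
# Hypothetical sketch: a master node that accepts both heartbeats and
# switch fault notifications. All names and values are illustrative.

class ClusterMaster:
    def __init__(self, nodes, heartbeat_timeout_s):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.last_seen = {node: 0.0 for node in nodes}
        self.faulty = set()

    def on_heartbeat(self, node, now_s):
        self.last_seen[node] = now_s

    def on_switch_notification(self, node):
        # Step 3: a notification marks the node faulty immediately.
        self.faulty.add(node)

    def check_heartbeats(self, now_s):
        # Fallback path: mark nodes whose heartbeats have timed out.
        for node, seen in self.last_seen.items():
            if now_s - seen > self.heartbeat_timeout_s:
                self.faulty.add(node)

    def healthy_nodes(self):
        return sorted(set(self.last_seen) - self.faulty)


master = ClusterMaster(["S1", "S2", "S3"], heartbeat_timeout_s=10.0)
for node in ("S1", "S2", "S3"):
    master.on_heartbeat(node, now_s=100.0)

# S2 fails at t=100.5 s. Heartbeats alone do not reveal it yet.
master.check_heartbeats(now_s=101.0)
before = master.healthy_nodes()

# The switch notification arrives and the master reacts at once.
master.on_switch_notification("S2")
after = master.healthy_nodes()
print(before, after)
```

   In this sketch the heartbeat check is kept as a fallback, so a lost
   notification would still be caught once the timeout expires.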

3.3.  Cluster Computing

   In cluster computing scenarios, such as HPC and AI cluster
   applications, faults and failures may occur on any node at any time.
   To implement cluster high availability (HA), cluster services can be
   switched over from one node to another.  Such a cluster is called an
   HA cluster, and a switchover has no obvious impact on the users of
   the cluster.  HA cluster software implements automatic fault
   detection and service switchover.  An HA cluster with only two nodes
   is also called dual-system hot backup: two servers back each other
   up, and when one server is faulty, the other takes over its
   services.  In this way, the system can provide services continuously
   without manual intervention.  Dual-system hot backup is only one
   type of HA cluster.  An HA cluster system can support more than two
   nodes and provides more advanced functions than dual-system hot
   backup to meet the changing requirements of users.  Generally, HA
   cluster software can use Heartbeat and Pacemaker to implement HA.
   With this approach, the fault detection time is longer than 30
   seconds.

   The fast fault detection solution in this proposal can be used in
   this scenario.  The switchover time can then be reduced to seconds
   (an RTO at the sub-minute level), which corresponds to the highest
   level in the disaster recovery standard.

   The fault detection procedure is similar to that of distributed
   storage (Section 3.2).
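
   A minimal Python sketch of service takeover in an HA cluster is
   shown below.  It is an illustration under stated assumptions: the
   class, method, and node names are hypothetical, and the sketch does
   not model Heartbeat, Pacemaker, or any other real cluster software.

```python
# Hypothetical sketch of service takeover in an HA cluster; it does not
# model Heartbeat, Pacemaker, or any other real cluster software.

class HACluster:
    def __init__(self, nodes):
        self.healthy = list(nodes)     # in order of takeover preference
        self.active = self.healthy[0]  # node currently running the service

    def on_node_fault(self, node):
        """Handle a fault report, for example one delivered by a switch."""
        if node in self.healthy:
            self.healthy.remove(node)
        if node == self.active:
            # Switch the service to the next healthy node, if any remains.
            self.active = self.healthy[0] if self.healthy else None


# Dual-system hot backup is the two-node case.
pair = HACluster(["server-a", "server-b"])
pair.on_node_fault("server-a")
print(pair.active)   # prints "server-b"

# The same logic covers clusters with more than two nodes.
trio = HACluster(["n1", "n2", "n3"])
trio.on_node_fault("n2")   # a standby fails; the service stays on n1
trio.on_node_fault("n1")   # the active node fails; n3 takes over
print(trio.active)   # prints "n3"
```

   The sooner on_node_fault is invoked after a failure, the shorter the
   service interruption, which is where fast fault detection helps.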

4.  References

4.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.


4.2.  Informative References

   [ODCC-2020-05016]
              Open Data Center Committee, "NVMe over RoCEv2 Network
              Control Optimization Technical Requirements and Test
              Specifications", 2020.

Authors' Addresses

   Liang Guo
   CAICT
   No.52, Hua Yuan Bei Road, Haidian District,
   Beijing
   100191
   China
   Email: guoliang1@caict.ac.cn


   Yi Feng
   China Mobile
   12 Chegongzhuang Street, Xicheng District
   Beijing
   China
   Email: fengyiit@chinamobile.com


   Jizhuang Zhao
   China Telecom
   South District of Future Science and Technology in Beiqijia Town,
   Changping District
   Beijing
   China
   Email: zhaojzh@chinatelecom.cn


   Fengwei Qin
   China Mobile
   12 Chegongzhuang Street, Xicheng District
   Beijing
   China
   Email: qinfengwei@chinamobile.com


   Lily Zhao
   Huawei
   No. 3 Shangdi Information Road, Haidian District
   Beijing
   China
   Email: Lily.zhao@huawei.com


   Haibo Wang
   Huawei
   No. 156 Beiqing Road
   Beijing
   P.R. China
   Email: rainsword.wang@huawei.com

