Internet-Draft FARE in SUN June 2026
Xu, et al. Expires 12 December 2026
Workgroup:
Network Working Group
Internet-Draft:
draft-xu-rtgwg-fare-in-mp-son-00
Published:
Intended Status:
Standards Track
Expires:
Authors:
X. Xu
China Mobile
Z. He
Broadcom
N. Wang
Intel
N. Wang
Hygon
W. Wan
Sugon
H. Wang
Moore Threads
J. Guo
Biren Technology
X. Li
Enflame Technology
T. Zhou
Resnics Technology
Y. Yang
Centec
Y. Xia
Tencent
W. Zhang
Tencent
P. Wang
Baidu
Y. Zhuang
Huawei Technologies
F. Yang
Cloudnine Information Technologies
C. Li
Metanet Networking Technology
X. Wang
Ruijie Networks
R. Glebov
Yandex

Fully Adaptive Routing Ethernet in Multi-Plane Scale-Out Networks

Abstract

FARE‑BGP enables weighted ECMP load balancing using a path‑bandwidth extended community. FARE‑in‑SUN extends this mechanism from switches to GPUs for scale‑up networks, which are typically multi‑plane. Large AI training clusters increasingly adopt multi‑plane scale-out network topologies. This document further extends FARE‑BGP from switches to RoCE NICs (RNICs) for such multi‑plane scale‑out networks. The document also presents two techniques to address route scalability concerns caused by the injection of numerous host routes.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 12 December 2026.

1. Introduction

Large AI training clusters (beyond 100,000 GPUs) increasingly use multi‑plane scale‑out network topologies (see Figure 1) to reduce the total number of switches and links. In such a topology, a high‑speed RNIC is split into multiple lower‑speed lanes, each connected to an independent Clos fabric (a “plane”). Because there are no links between planes, the RNIC itself must decide which plane to use for each packet or flow. In other words, the RNIC must know the reachability of each plane and then perform global load balancing across planes.


   =========================================
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   #                              Plane-1  #
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   =========================================

   ===================================     ===================================
   # +-----+ +-----+ +-----+ +-----+ #     # +-----+ +-----+ +-----+ +-----+ #
   # |RNIC1| |RNIC2| |RNIC3| |RNIC4| #     # |RNIC1| |RNIC2| |RNIC3| |RNIC4| #
   # +-----+ +-----+ +-----+ +-----+ #     # +-----+ +-----+ +-----+ +-----+ #
   #              Server-1           #     #             Server-n            #
   =================================== ... ===================================

   =========================================
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   #                              Plane-2  #
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   =========================================


                              Figure 1

(For simplicity, the diagram above omits the connections between RNICs and leaf switches. In practice, each RNIC is multi‑homed to one leaf switch in every plane.)

FARE‑in‑SUN [I-D.xu-rtgwg-fare-in-sun] describes how to extend the FARE‑BGP protocol [I-D.xu-idr-fare] from switches to GPUs for scale‑up networks. Because scale‑up networks share the same multi‑plane architectural pattern as multi‑plane scale‑out networks, the adaptive routing approach defined in FARE‑in‑SUN can be applied directly to multi‑plane scale‑out networks.

The solution described in this document is almost identical to FARE‑in‑SUN, with two essential differences. First, FARE‑BGP is extended from switches to RNICs rather than to GPUs. Second, the number of routes differs greatly. In a scale‑up network, the number of route entries is small (typically a few hundred), so they can be installed directly on GPUs. In an isolated multi‑plane scale‑out network with 100,000 GPUs and four planes, each plane may propagate up to 100,000 host routes, for a total of 400,000 routes. Storing all these routes on an RNIC is impractical. Therefore, the RNIC must reduce the size of its routing table using the techniques described in Sections 3.1 and 3.2.

This document describes how to extend Fully Adaptive Routing Ethernet using BGP (FARE-BGP for short), as described in [I-D.xu-idr-fare], from switches to RNICs in multi-plane scale-out networks.

2. Terminology

This memo makes use of the terms defined in [RFC2119].

3. Solution Description

In an isolated multi‑plane scale‑out network, an RNIC connects to each plane and is configured as a stub BGP speaker in each plane. It establishes separate BGP sessions with the attached leaf switch of each plane. BGP neighbor discovery [I-D.xu-idr-neighbor-autodiscovery] can be used to simplify configuration.

Through these sessions, the RNIC learns routes to remote GPUs together with the path‑bandwidth extended community. Because the RNIC participates in BGP with each plane independently, it aggregates per‑plane path‑bandwidth information and performs weighted load balancing across planes. The RNIC thus performs the same Weighted Equal‑Cost Multi‑Path (WECMP) functions as a FARE‑capable switch, distributing traffic in proportion to the path bandwidth of each ECMP route.
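As a rough illustration of the weighted load balancing described above, the following Python sketch converts per-plane path bandwidth into traffic shares and selects a plane for a flow. The function names, the plane labels, the Gb/s unit, and the CRC32-based flow hash are assumptions made for this example only; FARE‑BGP does not define them, and a real RNIC would implement the selection in hardware.

```python
import zlib


def plane_weights(path_bandwidth):
    """Normalize per-plane path bandwidth (assumed Gb/s) into traffic shares."""
    # A plane advertising zero bandwidth is treated as unusable (assumption).
    usable = {plane: bw for plane, bw in path_bandwidth.items() if bw > 0}
    total = sum(usable.values())
    if total == 0:
        raise ValueError("destination is unreachable via every plane")
    return {plane: bw / total for plane, bw in usable.items()}


def select_plane(flow_key, path_bandwidth):
    """Pick a plane for a flow in proportion to the per-plane path bandwidth."""
    weights = plane_weights(path_bandwidth)
    # Map the flow key to a stable point in [0, 1) so a flow stays on one plane.
    point = (zlib.crc32(flow_key.encode()) % 10_000) / 10_000
    cumulative = 0.0
    ordered = sorted(weights)
    for plane in ordered:
        cumulative += weights[plane]
        if point < cumulative:
            return plane
    return ordered[-1]  # guard against floating-point rounding


# Hypothetical path bandwidth learned for one remote GPU, one value per plane.
learned = {"plane-1": 400, "plane-2": 200, "plane-3": 400, "plane-4": 0}
shares = plane_weights(learned)
chosen = select_plane("10.0.0.1->10.0.9.9:4791", learned)
```

With these inputs, plane-1 and plane-3 each receive twice the share of plane-2, and plane-4 is never selected. Hashing on a flow key keeps all packets of a flow on one plane; choosing a plane per packet instead would only change how `point` is derived.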

Two modes of WECMP are supported:

In an isolated multi‑plane scale‑out network with 100,000 GPUs and four planes, each plane may propagate up to 100,000 host routes – a total of 400,000 routes. Storing all these routes on an RNIC is impractical. Two complementary approaches can reduce the number of routes the RNIC must store.

3.1. Route Aggregation with Explicit Unreachable Host Route Advertisement

A straightforward approach is route aggregation, i.e., aggregating host routes when advertising them from leaf to spine. However, naive aggregation can cause route blackholes: if a specific host within an aggregate becomes unreachable via a plane, the aggregated route still points to that plane. Consequently, traffic destined for that host is still forwarded according to the aggregated route and then dropped.

To address this issue, the switches MUST explicitly advertise unreachable host routes for a given RNIC to the other RNICs. When an RNIC becomes unreachable via a particular plane, the leaf switch advertises this unreachability to the other RNICs using one of two methods:

Upon receiving such an advertisement, the RNIC updates its forwarding table as follows:

Example: Suppose an RNIC has a default route (0.0.0.0/0) with next‑hops pointing to planes A, B, C, and D. Host X (a specific /32) becomes unreachable via plane A. The RNIC learns an unreachable advertisement for X. It then creates a host route for X with next‑hops set to {B, C, D} – i.e., the original aggregated next‑hops minus the next‑hop associated with plane A. Traffic to X will never be sent to plane A, avoiding blackholes.
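The example above can be sketched in Python as follows. This is a minimal model, assuming IPv4 host routes and a forwarding table keyed by prefix; the class name `RnicFib`, its methods, and the address 10.0.0.7 used for Host X are hypothetical. How the exception route is removed once the host becomes reachable again is not modeled.

```python
import ipaddress


class RnicFib:
    """Toy RNIC forwarding table: aggregates plus exception host routes."""

    def __init__(self):
        self.routes = {}  # ip_network -> set of planes usable as next hops

    def add_aggregate(self, prefix, planes):
        self.routes[ipaddress.ip_network(prefix)] = set(planes)

    def lookup(self, address):
        """Longest-prefix match; returns a copy of the usable plane set."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return set()
        best = max(matches, key=lambda net: net.prefixlen)
        return set(self.routes[best])

    def on_unreachable(self, host, plane):
        """Handle an unreachable advertisement for `host` learned via `plane`."""
        host_net = ipaddress.ip_network(f"{host}/32")  # IPv4 assumed
        if host_net not in self.routes:
            # Start from the next hops of the covering aggregate.
            self.routes[host_net] = self.lookup(host)
        self.routes[host_net].discard(plane)


fib = RnicFib()
fib.add_aggregate("0.0.0.0/0", {"A", "B", "C", "D"})
fib.on_unreachable("10.0.0.7", "A")       # Host X is unreachable via plane A
print(sorted(fib.lookup("10.0.0.7")))     # ['B', 'C', 'D']
print(sorted(fib.lookup("10.0.0.8")))     # ['A', 'B', 'C', 'D']
```

Only Host X gains a host route; every other destination continues to match the default route and may use all four planes.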

This technique dramatically reduces BGP table size on the RNIC: the RNIC only needs to store aggregated routes (e.g., a handful of default routes per plane) plus explicit unreachable host routes for the small number of hosts that are actually unreachable. The majority of reachable hosts are covered by aggregates and require no per‑host state. The approach is especially effective when unreachability is rare, which is typical in well‑managed clusters.
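To make the saving concrete, the short Python calculation below compares the route state with and without aggregation. The figures for hosts and planes come from this document; the single aggregate per plane and the count of 50 unreachable hosts are hypothetical values chosen only for illustration.

```python
def rnic_route_count(hosts, planes, aggregates_per_plane, unreachable_hosts):
    """Return (routes without aggregation, routes with aggregation)."""
    full_table = hosts * planes
    # One exception host route per distinct unreachable host (assumption).
    aggregated = aggregates_per_plane * planes + unreachable_hosts
    return full_table, aggregated


full, reduced = rnic_route_count(
    hosts=100_000, planes=4, aggregates_per_plane=1, unreachable_hosts=50
)
print(full, reduced)  # 400000 54
```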

Switches within each plane do not need to install the unreachable host routes into their FIBs.

3.2. Prefix‑ORF‑Based Route Filtering

Since a given RNIC communicates only with a limited subset of GPUs (due to AI training parallelism patterns), it’s possible for the RNIC to filter routes to retain only those it actually needs.

The RNIC sends Address Prefix ORF entries to its BGP peer (leaf switch) per plane. These entries indicate the host routes for remote RNICs the local RNIC is interested in. The peer filters outbound route updates accordingly, sending only the requested routes. In this way, the RNIC stores only a limited number of routes.
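The filtering behavior described above can be sketched in Python as follows. This is a simplified model, assuming IPv4 /32 host routes and permit-only entries; it does not reproduce the Address Prefix ORF wire encoding, sequence numbers, or prefix-length ranges. The function names, addresses, and bandwidth values are hypothetical.

```python
import ipaddress


def build_orf_entries(peer_addresses):
    """RNIC side: one permit entry (a /32 host prefix) per remote RNIC of interest."""
    return {ipaddress.ip_network(f"{addr}/32") for addr in peer_addresses}


def filter_outbound(updates, orf_entries):
    """Leaf side: send only the routes whose prefix matches a permit entry."""
    return [
        (prefix, bandwidth)
        for prefix, bandwidth in updates
        if ipaddress.ip_network(prefix) in orf_entries
    ]


# The local RNIC only talks to two remote RNICs in this plane.
orf = build_orf_entries(["10.0.1.2", "10.0.3.4"])

# Candidate updates at the leaf: (host prefix, path bandwidth in Gb/s).
updates = [("10.0.1.2/32", 400), ("10.0.2.9/32", 400), ("10.0.3.4/32", 200)]

sent = filter_outbound(updates, orf)
print(sent)  # [('10.0.1.2/32', 400), ('10.0.3.4/32', 200)]
```

The RNIC would send a separate set of entries to its leaf switch in each plane, so the same filter applies independently per plane.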

Switches do not need to install host routes for remote RNICs. Therefore, the FIB-suppression mechanism described in Virtual Aggregation Auto-configuration [I-D.ietf-grow-va-auto] could be reused.

4. Acknowledgements

TBD.

5. IANA Considerations

TBD.

6. Security Considerations

TBD.

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.

7.2. Informative References

[I-D.ietf-grow-va-auto]
Francis, P., Xu, X., Ballani, H., Jen, D., Raszuk, R., and L. Zhang, "Auto-Configuration in Virtual Aggregation", Work in Progress, Internet-Draft, draft-ietf-grow-va-auto-05, <https://datatracker.ietf.org/doc/html/draft-ietf-grow-va-auto-05>.

[I-D.xu-idr-fare]
Xu, X., Hegde, S., Patel, K., He, Z., Wang, J., Huang, H., Zhang, Q., Wu, H., Liu, Y., Xia, Y., Wang, P., Tiezheng, and R. Glebov, "Fully Adaptive Routing Ethernet using BGP", Work in Progress, Internet-Draft, draft-xu-idr-fare-05, <https://datatracker.ietf.org/doc/html/draft-xu-idr-fare-05>.

[I-D.xu-idr-neighbor-autodiscovery]
Xu, X., Talaulikar, K., Bi, K., Tantsura, J., Triantafillis, N., and X. Chen, "BGP Neighbor Discovery", Work in Progress, Internet-Draft, draft-xu-idr-neighbor-autodiscovery-13, <https://datatracker.ietf.org/doc/html/draft-xu-idr-neighbor-autodiscovery-13>.

[I-D.xu-rtgwg-fare-in-sun]
Xu, X., He, Z., Wang, N., Wang, H., Guo, J., Li, X., Zhou, T., Yang, Y., Xia, Y., Zhang, W., Wang, P., Zhuang, Y., Yang, F., Li, C., and X. Wang, "Fully Adaptive Routing Ethernet in Scale-Up Networks", Work in Progress, Internet-Draft, draft-xu-rtgwg-fare-in-sun-02, <https://datatracker.ietf.org/doc/html/draft-xu-rtgwg-fare-in-sun-02>.

[RFC7306]
Shah, H., Marti, F., Noureddine, W., Eiriksson, A., and R. Sharp, "Remote Direct Memory Access (RDMA) Protocol Extensions", RFC 7306, DOI 10.17487/RFC7306, <https://www.rfc-editor.org/info/rfc7306>.

Authors' Addresses

Xiaohu Xu
China Mobile
Zongying He
Broadcom
Nan Wang
Intel
Nan Wang
Hygon
Wei Wan
Sugon
Hua Wang
Moore Threads
Jian Guo
Biren Technology
Xiang Li
Enflame Technology
Tianyou Zhou
Resnics Technology
Yongtao Yang
Centec
Yinben Xia
Tencent
Weifeng Zhang
Tencent
Peilong Wang
Baidu
Yan Zhuang
Huawei Technologies
Fajie Yang
Cloudnine Information Technologies
Chao Li
Metanet Networking Technology
Wang Xiaojun
Ruijie Networks
Roman Glebov
Yandex