Internet-Draft | Gap Analysis of Fantel | June 2025 |
Geng, et al. | Expires 11 December 2025 | [Page] |
Modern networks require fast, adaptive Traffic Engineering (TE) to support demanding applications like AI training and real-time services. Existing mechanisms for load balancing, protection, and flow control often lack responsiveness and scalability. This document analyzes key gaps in current TE solutions and proposes fast notification as a low-latency, event-driven enhancement. Fast notification enables real-time network awareness and quicker reactions to dynamic conditions, improving overall network efficiency and reliability.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 11 December 2025.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
In use cases such as AI training, a lossless and adaptive network is required to ensure reliable and congestion-free data transfer. These workloads demand high throughput, low latency, and zero packet loss across dynamically shifting traffic patterns.¶
To meet these demands, networks rely on Traffic Engineering (TE) mechanisms, including load balancing, protection, and flow control. However, existing solutions face limitations in responsiveness, coverage, and operational overhead, especially in high-speed, large-scale environments.¶
This document provides a gap analysis focused on three key TE areas: load balancing, protection, and flow control.¶
For each area, we analyze current limitations and explore how fast notification mechanisms can help fill these gaps.¶
Load balancing ensures efficient utilization of available bandwidth and reduces congestion. In modern networks, dynamic load balancing is essential but often lacks real-time responsiveness.¶
In-situ OAM (IOAM) provides visibility into traffic by embedding telemetry data directly in packets. It enables measurement of path latency, loss, and performance metrics.¶
However, IOAM has notable drawbacks:¶
Telemetry Export Delays: IOAM data is extracted and reported by the device CPU to a controller. This adds latency and limits responsiveness.¶
Controller Reaction Time: Centralized controllers typically process telemetry in software, resulting in delayed decision-making.¶
Gap: These factors reduce the effectiveness of real-time load balancing.¶
To address the above:¶
Proactive Signaling: Fast notification can signal network conditions (e.g., congestion) before service degradation occurs.¶
Event-Driven Control: Control loops can dynamically adjust traffic distribution without relying on polling or telemetry aggregation.¶
Lightweight Signaling: Avoids the overhead of traditional telemetry processing.¶
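As an illustration of event-driven control, the following minimal sketch shows how a node might shift load away from a path as soon as it receives a congestion notification, rather than waiting for a telemetry export and a controller decision. The function name `rebalance`, the path names, and the `shift` fraction are hypothetical and are not defined by this document.

```python
from typing import Dict


def rebalance(weights: Dict[str, float], congested: str,
              shift: float = 0.5) -> Dict[str, float]:
    """Move a fraction `shift` of the congested path's weight to the
    remaining paths, in proportion to their current weights.

    Hypothetical handler: this document does not define a rebalancing policy.
    """
    if congested not in weights or len(weights) < 2:
        return dict(weights)

    moved = weights[congested] * shift
    others = {p: w for p, w in weights.items() if p != congested}
    total_others = sum(others.values())

    updated = {}
    for path, weight in others.items():
        # Fall back to an even split if all other paths currently carry nothing.
        share = weight / total_others if total_others > 0 else 1.0 / len(others)
        updated[path] = weight + moved * share
    updated[congested] = weights[congested] - moved
    return updated


# Assumed starting distribution over three illustrative paths.
weights = {"path_a": 0.5, "path_b": 0.25, "path_c": 0.25}

# A fast notification reports congestion on path_a; react immediately.
updated = rebalance(weights, "path_a")
```

In this sketch the total weight is preserved, so the adjustment only redistributes traffic and never drops it.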
Protection mechanisms ensure service continuity in case of failures. While existing tools like BFD and FRR are widely deployed, they have inherent limitations in speed and scope.¶
BFD is designed for rapid fault detection by sending frequent control packets between peers. While widely used, it presents the following limitations:¶
Overhead vs. Frequency Tradeoff: Higher probe frequency improves detection time but increases CPU and bandwidth usage.¶
Scalability Issues: Maintaining many BFD sessions in large-scale networks strains the control plane.¶
Path Detection Limitations: In scenarios with multiple ECMP paths, BFD struggles to detect the status of specific paths, making it difficult to identify partial failures or asymmetrical degradations.¶
Gap: BFD struggles to balance detection speed with system overhead.¶
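The tradeoff can be made concrete with some rough arithmetic. The sketch below assumes asynchronous-mode BFD, where the detection time is approximately the transmit interval multiplied by the detect multiplier. The session count and intervals are illustrative values, not measurements.

```python
def bfd_detection_time_ms(tx_interval_ms: float, detect_mult: int) -> float:
    """Approximate detection time: interval times detect multiplier."""
    return tx_interval_ms * detect_mult


def bfd_packets_per_second(tx_interval_ms: float, sessions: int) -> float:
    """Control packets one node transmits per second across all sessions."""
    return sessions * (1000.0 / tx_interval_ms)


# Illustrative comparison for a node holding 1000 sessions (assumed figure).
for interval_ms in (50, 10):
    detect = bfd_detection_time_ms(interval_ms, detect_mult=3)
    pps = bfd_packets_per_second(interval_ms, sessions=1000)
    print(f"interval={interval_ms} ms  detection~{detect} ms  tx load={pps:.0f} pps")
```

Under these assumptions, cutting the interval from 50 ms to 10 ms reduces detection time fivefold but also multiplies the transmitted control packet rate by five.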
FRR reroutes traffic upon link or node failures. However:¶
Local-Only Protection: Typically protects against only adjacent failures.¶
Gap: FRR lacks flexibility and responsiveness in complex topologies.¶
Protection mechanisms that depend on routing protocol convergence may take hundreds of milliseconds to restore forwarding.¶
Gap: Delay-sensitive services cannot tolerate slow failover.¶
Equal-Cost Multi-Path (ECMP) routing uses multiple paths for load sharing. However:¶
Gap: It lacks fast detection of path degradation or failure, making real-time traffic rebalancing difficult.¶
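A minimal sketch of hash-based ECMP path selection helps show why this matters: flows stay pinned to a path by their hash until the path set changes, so a degraded path keeps receiving traffic until the failure is detected. The SHA-256 hash, the path names, and the `on_path_down` handler are stand-ins chosen for illustration; real devices use their own hash functions and failure handling.

```python
import hashlib
from typing import List, Tuple

# (src_ip, dst_ip, protocol, src_port, dst_port)
FlowKey = Tuple[str, str, int, int, int]


def pick_path(flow: FlowKey, paths: List[str]) -> str:
    """Select a path by hashing the 5-tuple (illustrative hash only)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]


def on_path_down(paths: List[str], failed: str) -> List[str]:
    """Hypothetical notification handler: drop the failed path from the set."""
    return [p for p in paths if p != failed]


paths = ["path_a", "path_b", "path_c"]
flows = [(f"10.0.0.{i}", "10.0.1.1", 17, 40000 + i, 4791) for i in range(100)]

# Before the notification, some flows may still hash onto path_b.
before = [pick_path(f, paths) for f in flows]

# After a fast notification that path_b is down, no flow selects it.
healthy_paths = on_path_down(paths, "path_b")
after = [pick_path(f, healthy_paths) for f in flows]
```

Note that plain modulo hashing, as used here, may also move flows that were not on the failed path; deployments that care about this typically use a more stable selection scheme.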
Flow control ensures congestion-free transmission and optimal throughput. Current mechanisms either react too slowly or lack granular, real-time information.¶
Congestion control is based on end-to-end feedback such as packet loss or RTT increases.¶
End-to-End Delay Sensitivity: Sender-driven control relies on detecting congestion from end-to-end signals, which arrive at least one RTT after congestion begins. In bursty traffic scenarios such as data centers, this delay may result in bufferbloat or packet loss.¶
Ambiguity in Signal Source: End-to-end signals also make it hard to distinguish sustained congestion from transient fluctuations, leading to overreaction or misjudgment in rate adaptation.¶
Gap: These signals are slow and reactive, especially in high-latency or long-RTT environments.¶
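The cost of a slow feedback loop can be estimated with simple arithmetic: until the sender reacts, traffic arriving faster than the bottleneck can drain accumulates in the buffer. The sketch below uses illustrative rates and delays that are assumptions, not figures from this document.

```python
def excess_bytes_before_reaction(arrival_gbps: float, drain_gbps: float,
                                 feedback_delay_us: float) -> float:
    """Bytes that pile up at a bottleneck before the sender can react.

    1 Gbps sustained for 1 microsecond carries 1000 bits, hence the factor.
    """
    excess_gbps = max(0.0, arrival_gbps - drain_gbps)
    excess_bits = excess_gbps * feedback_delay_us * 1000
    return excess_bits / 8


# Assumed example: 200 Gbps of incast traffic into a 100 Gbps link.
for delay_us in (20, 5):
    queued = excess_bytes_before_reaction(200, 100, delay_us)
    print(f"feedback delay={delay_us} us  queued~{queued:.0f} bytes")
```

With these assumed numbers, a 20 microsecond feedback delay lets roughly 250 KB accumulate, while a 5 microsecond delay limits the buildup to about 62.5 KB.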
Receiver-driven congestion control uses feedback signals from the receiver to adjust the sender's transmission rate.¶
Control Loop Latency: These signals still traverse the network and are subject to RTT delays, which is especially problematic in high-speed, dynamic environments.¶
Bandwidth Overhead: In large-scale or short-flow-intensive environments like data centers, signaling from massive numbers of receivers can impose significant bandwidth and processing overhead.¶
ECN marks packets to indicate congestion, avoiding drops. However:¶
Gap: ECN still relies on end-to-end signaling and lacks precise real-time feedback.¶
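To illustrate the signaling path, the sketch below compares how long an ECN mark takes to reach the sender (forward to the receiver, then echoed back) with a notification sent from the congestion point straight back to the sender. It assumes symmetric per-link delays and assumes that the notification is addressed to the sender; neither assumption is mandated by this document, and the function names are hypothetical.

```python
from typing import List


def ecn_feedback_delay_us(link_delays_us: List[float], congested_node: int) -> float:
    """Delay until the sender learns of congestion via an ECN echo.

    link_delays_us[i] is the delay of the link between node i and node i+1,
    where node 0 is the sender. The mark travels from the congested node to
    the receiver and is then echoed back over the whole path.
    """
    forward_remaining = sum(link_delays_us[congested_node:])
    return forward_remaining + sum(link_delays_us)


def direct_notification_delay_us(link_delays_us: List[float],
                                 congested_node: int) -> float:
    """Delay if the congested node notifies the sender directly (assumption)."""
    return sum(link_delays_us[:congested_node])


# Assumed five-link path with 2 us per link; congestion at the first switch.
links = [2.0, 2.0, 2.0, 2.0, 2.0]
ecn_delay = ecn_feedback_delay_us(links, congested_node=1)
direct_delay = direct_notification_delay_us(links, congested_node=1)
```

In this example the ECN echo takes 18 microseconds to reach the sender, whereas the direct notification takes 2 microseconds; the gap grows the closer the congestion point is to the sender.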
INT provides path-level telemetry by inserting metadata at each hop, which is returned to the sender via the ACK. Some congestion control algorithms, such as HPCC, utilize INT for precise load-awareness.¶
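As a rough sketch of how per-hop metadata can be used, the code below models one INT record per hop and estimates each hop's utilization from its queue depth and transmit rate, loosely following the style of HPCC. The field names, the simplified formula, and the sample values are assumptions for illustration and do not reproduce any specific INT encoding or the full HPCC algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HopRecord:
    """Hypothetical per-hop metadata carried in a packet."""
    hop_id: str
    queue_bytes: int      # queue depth observed at this hop
    tx_rate_gbps: float   # measured output rate of the egress link
    link_gbps: float      # capacity of the egress link


def hop_utilization(rec: HopRecord, base_rtt_us: float) -> float:
    """Simplified estimate: queued fraction of one BDP plus rate fraction."""
    # Bytes the link carries in one base RTT: Gbps * us * 1000 bits / 8.
    bdp_bytes = rec.link_gbps * base_rtt_us * 1000 / 8
    return rec.queue_bytes / bdp_bytes + rec.tx_rate_gbps / rec.link_gbps


def bottleneck(records: List[HopRecord], base_rtt_us: float) -> HopRecord:
    """Return the hop with the highest estimated utilization."""
    return max(records, key=lambda r: hop_utilization(r, base_rtt_us))


# Assumed records as they might be echoed back to the sender in an ACK.
records = [
    HopRecord("leaf1", queue_bytes=0, tx_rate_gbps=50.0, link_gbps=100.0),
    HopRecord("spine1", queue_bytes=62500, tx_rate_gbps=100.0, link_gbps=100.0),
]
worst = bottleneck(records, base_rtt_us=10.0)
```

Here `spine1` is identified as the bottleneck with an estimated utilization of 1.5, meaning the sender would need to reduce its rate for that hop to drain.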
However:¶
This document highlights the following gaps in Traffic Engineering mechanisms and how fast notification can enhance each area:¶
Area | Key Gap | Fast Notification Enhancement |
---|---|---|
Load Balancing | Slow telemetry export and software control delays | Event-driven signaling for immediate adjustment |
Protection | BFD/FRR trade off speed for overhead; slow convergence | Lightweight, fast fault alerts across the network topology |
Flow Control | TCP/ECN feedback too slow for real-time adaptation | Real-time congestion feedback from network infrastructure |
Fast notification mechanisms provide a low-latency, low-overhead method for improving responsiveness across load balancing, protection, and flow control. These capabilities are increasingly vital to support demanding applications like distributed AI training and real-time cloud services.¶