Large Language Models (LLMs) are pushing the boundaries of technology. The scale they have reached vastly exceeds the capacity of any single compute unit (XPU); this requires a distributed approach in which multiple XPUs are connected via a "backend" network, typically within a single data center. We are approaching the point where the scale exceeds that of a single data center, thus requiring multiple such data centers connected via a "data center interconnect" network. Training and inferencing are expensive and critical operations, so they are typically scheduled, i.e., the (compute) resources they need are carefully estimated, allocated and deployed so that these resources are used efficiently. However, while the investment in compute in these LLM processing clusters dwarfs the investment in networking, it is becoming increasingly clear that the latter can greatly impact the former. This has been the focus of recent industry and standards venues, including the fantel Birds of a Feather meeting at IETF 123, @Scale: Networking, and the Open Compute Project.¶
This memo proposes that the same care be taken with networking resources: that they be estimated, allocated and deployed alongside compute resources; that contingency plans be in place for network glitches; and that a holistic view be taken in order to optimize the running of training and inferencing jobs.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 23 April 2026.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Large Language Models (LLMs) are pushing the industry to ever greater scale, both in training and in inference. This leads to more critical use of backend networks and higher stakes in producing timely results. A major lesson from recent work is that the network cannot be taken for granted: a dropped or delayed packet can delay, stall or even abort a Machine Learning (ML) job, requiring more effort in checkpointing and managing job restarts, in dealing with network congestion, and in dealing with network failures. These problems are exacerbated in multi-tenant clusters, where multiple jobs are run and job isolation becomes a key requirement. The fantel Birds of a Feather (BoF) meeting illustrated well the role the network plays in ML jobs, the potential for network events to disrupt jobs, and some early thoughts on how to handle these events. While the BoF was very successful in exposing these issues, we believe that adding a proactive approach would be beneficial; this can go hand in hand with the reactive approach of dealing effectively with network events.¶
This memo proposes that network resources be reserved/scheduled in coordination with the ML job scheduler, which is responsible for reserving compute resources (Central Processing Units [CPUs], Graphics Processing Units [GPUs], XPUs, memory, storage, ...). This is especially useful when multiple jobs are run in a cluster; examples are GPUaaS (GPU as a Service) and running several inference jobs simultaneously. Reserving network resources reduces the probability of disruptive network events and improves job isolation. This is the network analog of reserving compute resources, and ideally the two can be done at the same time. Essentially, when an ML job is scheduled, the "size" of the job (type of model, complexity of model, number of parameters, etc.) determines how many CPU/GPU/XPU cores are needed and how much memory and storage is needed; typically, the same parameters determine the amount of network resources needed during the various collective (i.e., inter-XPU) communication stages (Broadcast, AllReduce, Reduce, etc.). Job placement (i.e., which XPUs are allocated to this job) also determines the source(s) and destination(s) of the communication. If, at the time the job is scheduled, network resources are also reserved (and potentially, backup resources are put in place), the probability that network events will disrupt the job is reduced (although not eliminated).¶
One can do both: couple network resource scheduling with fast event detection, signaling and mitigation for an overall much-reduced impact of network events on job progress. For very long running jobs, network resource reservation can also be done when going from one communication phase to another (such as from Broadcast to AllReduce, or to a quiescent phase).¶
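As a rough illustration of this coupling, the following Python sketch shows how a job scheduler hook might turn a job's placement and per-phase bandwidth estimates into network reservation requests. It is a minimal sketch under stated assumptions: JobSpec, StubController and reserve_tunnel are hypothetical names and do not correspond to any existing scheduler or TE controller API.¶

   # Illustrative only: a hypothetical hook that reserves per-phase network
   # bandwidth when the compute scheduler places an ML job.

   from dataclasses import dataclass

   @dataclass
   class JobSpec:
       job_id: str
       xpus: list                  # placement chosen by the compute scheduler
       phase_gbps: dict            # e.g. {"Broadcast": 20, "AllReduce": 100}

   class StubController:
       """Stand-in for a TE controller; real controller APIs are out of scope."""
       def reserve_tunnel(self, src, dsts, bandwidth_gbps, tag):
           return {"src": src, "dsts": dsts, "gbps": bandwidth_gbps, "tag": tag}

   def reserve_network(job, controller):
       """One unidirectional, multi-egress reservation per (source XPU, phase)."""
       reservations = []
       for phase, gbps in job.phase_gbps.items():
           for src in job.xpus:
               dsts = [x for x in job.xpus if x != src]
               reservations.append(controller.reserve_tunnel(
                   src=src, dsts=dsts, bandwidth_gbps=gbps,
                   tag=f"{job.job_id}/{phase}"))
       return reservations

   job = JobSpec("job-42", xpus=["X1", "X2", "X6"],
                 phase_gbps={"Broadcast": 20, "AllReduce": 100})
   print(len(reserve_network(job, StubController())))   # 6 reservations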
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This section provides definitions for terms and abbreviations that are used in this memo.¶
XPU: one of several types of processing units: central processing unit (CPU), graphics processing unit (GPU), language processing unit (LPU), tensor processing unit (TPU) and the like. They fall under the category of "compute resources".¶
TE: traffic engineering¶
ML: machine learning, a powerful technique to learn from data without explicit programming, used to solve problems of AI.¶
DSF: disaggregated scheduled fabric, a methodology for packet spraying in networks with multipathing.¶
DCI: data center interconnect¶
Consider the ML cluster in Figure 1:¶
            S1 .............. S2
           /  \ ............ /  \
          /    \ .......... /    \
        L1      L2        L3      L4
        /\      /\        /\      /\
       /  \    /  \      /  \    /  \
      X1  X2  X3  X4    X5  X6  X7  X8

      Note: L1 & L2 are connected to S2; L3 & L4 are connected to S1.
            All links are 400G links.

                  Figure 1: ML cluster
The bottom layer consists of XPUs X1 through X8. The next layer up consists of "leaf" switches L1 through L4. The top layer consists of "spine" switches S1 and S2. All links between layers are 400Gbps; thus there is no oversubscription in the network, provided:¶
However, "fair" load balancing is insufficient unless the load balancing is done on a per-packet (or better, per-cell) basis ("packet spraying") [DSF]. If load balancing is done on a per-flow basis ("flow level multipathing"), it is highly unlikely to be perfectly balanced across the next hops, in which case one next hop may see too much traffic, leading to congestion, packet delays or even packet drops. Disaggregated Scheduled Fabric (DSF) uses per-packet or per-cell load balancing, but it comes at a cost, and may not scale (and scale is a big consideration in these networks).¶
With flow level multipathing, say X1 and X2 are both sending 400G of traffic to L1. L1 tries to load balance X1's traffic to S1 and S2 (in principle, 200G each). In practice, that may turn out to be 220G to S1 and 180G to S2. L1 does the same with X2's traffic; let's say this goes 190G to S1 and 210G to S2. The L1-S1 link will be congested, with 410G of traffic.¶
On the "downward" side (traffic going to the XPUs), there can be an "in-cast" problem: say both X1 and X3 are sending traffic to X6. In the worst case, each sends 400G for a total of 800G to X6, but the L3-X6 link can only transmit 400G. Thus, half the traffic will be dropped.¶
If the entire cluster (here, XPUs X1 through X8) is working on a single ML job, things are a bit simpler (but the issues remain). However, if this cluster is used for inferencing, or multi-tenant workloads, additional considerations arise. Tenant 1 (or inferencing job 1) (T1) may be using XPU X1 and part of X6; tenant 2 (or job 2) (T2) may be using XPU X3 and another part of X6.¶
If T1 and T2 simultaneously require communication to X6, there could be contention for the L3-X6 link. Again, this could lead to congestion, and hence delayed or dropped packets. But now, the issue is inter-tenant.¶
As stated in the Introduction (Section 1), such delayed or dropped packets can have big consequences for the jobs that are running. Issues such as these are the motivation for DSF, packet spraying and fast congestion notification.¶
In shared compute environments, such as a compute cluster or a cloud, a scheduler is commonly used to orchestrate access to compute resources. SLURM [SLURM] is a commonly used scheduler in Linux clusters; its documentation says "First, [SLURM] allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work." Another is KAI [KAI] which says "KAI Scheduler is a robust, efficient, and scalable Kubernetes scheduler that optimizes GPU resource allocation for AI and machine learning workloads." There are several other schedulers in common use.¶
A scheduler offers several features. The following are taken from SLURM:¶
Accounting¶
Advanced reservation¶
Gang scheduling (time sharing for parallel jobs)¶
Backfill scheduling¶
Topology optimized resource selection¶
Resource limits by user or bank account¶
Sophisticated multifactor job prioritization algorithms¶
KAI offers the following:¶
Batch Scheduling¶
Bin Packing & Spread Scheduling¶
Workload Priority¶
Hierarchical Queues¶
Resource distribution¶
Fairness Policies¶
Workload Consolidation¶
Elastic Workloads¶
Dynamic Resource Allocation (DRA)¶
GPU Sharing¶
To summarize, a compute scheduler allows effective and optimal sharing of compute resources among multiple tenants and multiple jobs, while ensuring fairness, enforcing limits and enabling accounting. Without a scheduler, multitenancy and multiple jobs would be impractical and chaotic.¶
Note that multi-tenancy is implicit. There may be ways to reserve resources for a particular tenant or group of tenants without allocating them, but the documentation doesn't say how.¶
In shared network environments (which almost all networks are), a scheduler can be used to orchestrate access to network resources -- primarily bandwidth, but also highly prized links(*), QoS, etc.¶
The primary task of network resource scheduling is to reserve resources along a pathway (tunnel) from one or more XPUs (ingresses) to another set of XPUs (egresses). Note that the paradigm here is one of unidirectional reservations; this is more general than bidirectional reservations, as the traffic requirements may not be symmetric.¶
Given that X1 wants to send 20Gbps to {X2, X3, X4}, one would create a tunnel from X1 to {X2, X3, X4} with 20Gbps capacity. Note that this traffic might be unicast (distributing different parts of a matrix to the recipients) or broadcast (distributing the same information to all). If further, one wanted to use certain links exclusively, one can color links in the network and state that this tunnel must/must not use links of a certain color. Thus, link coloring is a tool that network administrators can use to hold back links for a subset of job types. The compute analogy would be to hold back some XPUs, mark them "blue" and allow only a subset of jobs to use those XPUs.¶
Link coloring allows a provider to partition their network to optimally serve their customers. While links in a Clos network (as most ML clusters are) are perfectly symmetrical, once one gets into "distributed clusters" that are connected via DCI links, link coloring and other link attributes will find greater use.¶
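To make the reservation and coloring mechanics concrete, here is a minimal Python sketch of constraint-based path selection with bandwidth booking. The topology, the colors and the 20G request are invented for illustration, and the multi-egress case (X1 to {X2, X3, X4}) is simplified to a single egress for brevity.¶

   # Minimal sketch of TE-style constrained path selection: find a path
   # with enough available bandwidth that avoids excluded link colors,
   # then book the reservation.  Topology and numbers are illustrative.

   from collections import deque

   # Unidirectional links (matching the unidirectional-reservation paradigm).
   links = {
       ("X1", "L1"): {"avail": 400, "color": None},
       ("L1", "S1"): {"avail": 400, "color": "blue"},
       ("L1", "S2"): {"avail": 400, "color": None},
       ("S1", "L2"): {"avail": 400, "color": "blue"},
       ("S2", "L2"): {"avail": 400, "color": None},
       ("L2", "X3"): {"avail": 400, "color": None},
   }

   def find_path(src, dst, bw, exclude_colors=()):
       """Breadth-first search over links that have enough available
       bandwidth and an acceptable color."""
       queue, seen = deque([(src, [])]), {src}
       while queue:
           node, path = queue.popleft()
           if node == dst:
               return path
           for (a, b), attrs in links.items():
               if (a == node and b not in seen
                       and attrs["avail"] >= bw
                       and attrs["color"] not in exclude_colors):
                   seen.add(b)
                   queue.append((b, path + [(a, b)]))
       return None

   def reserve(src, dst, bw, exclude_colors=()):
       path = find_path(src, dst, bw, exclude_colors)
       if path is None:
           raise RuntimeError("no feasible path")
       for link in path:
           links[link]["avail"] -= bw      # book the bandwidth
       return path

   # Reserve 20G from X1 to X3 while avoiding "blue" links.
   print(reserve("X1", "X3", 20, exclude_colors=("blue",)))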
Reserving bandwidth means that a particular job J1 (probably) won't step on another job J2's traffic. Say J1 is using a tunnel T1 with a reservation of 20G, and J2 is using a tunnel T2 with a reservation of 50G. The reservation procedure ensures any links T1 and T2 traverse in common have sufficient bandwidth for both T1 and T2 (and any other tunnels with reservations). Of course, J1 may use more than its allocated bandwidth; this can negatively impact J2. To reduce/prevent this, one can apply a policer at the ingress of J1's tunnels to ensure that J1 sends no more than its allocated share over each tunnel. This policer can drop traffic over the limit, or simply mark it as such, so that if the other jobs on a common link are not using their full quota, J1's traffic can go through.¶
This last point is crucial for multi-tenancy. A provider who cannot provide hard (or at least soft) guarantees to their customers that they will in fact get the resources they asked (and paid) for will soon be out of business.¶
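The following toy policer sketch in Python illustrates the drop-versus-mark behavior described above; the rates, the burst size and the Policer interface are assumptions, not a description of any particular implementation.¶

   # Toy single-rate policer: traffic within the reserved rate is forwarded;
   # excess is either dropped or marked so that it can still use
   # otherwise-idle capacity downstream.  Rates and sizes are illustrative.

   class Policer:
       def __init__(self, rate_gbps, burst_gbits, mode="mark"):
           self.rate = rate_gbps          # token refill rate (reserved rate)
           self.burst = burst_gbits       # bucket depth
           self.tokens = burst_gbits
           self.mode = mode               # "drop" or "mark"
           self.last_t = 0.0

       def police(self, now, size_gbits):
           # Refill tokens for the elapsed interval, capped at the burst size.
           self.tokens = min(self.burst,
                             self.tokens + (now - self.last_t) * self.rate)
           self.last_t = now
           if size_gbits <= self.tokens:
               self.tokens -= size_gbits
               return "forward"                       # in-profile traffic
           return "drop" if self.mode == "drop" else "mark"   # out-of-profile

   # 20G reservation for J1's tunnel; excess is marked rather than dropped.
   p = Policer(rate_gbps=20, burst_gbits=2, mode="mark")
   print(p.police(now=0.0, size_gbits=1.5))    # forward
   print(p.police(now=0.01, size_gbits=1.5))   # marked (bucket nearly empty)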
Elastic bandwidth is a very useful feature that goes along with elastic compute. If a job's requirements are: start me off with 5 XPUs, but expand that to 8 as the need arises, and shrink it back down to 5 when no longer needed, then the job's bandwidth requirements are likely to grow and shrink in tandem. Thus, in addition to making binding reservations, one must be able to adjust those reservations as needs change.¶
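A rough sketch of such an adjustment loop, in the spirit of auto-bandwidth but with invented thresholds and bounds, might look like this:¶

   # Sketch of an "auto-bandwidth"-style loop: periodically compare measured
   # tunnel usage with the current reservation and resize within policy
   # bounds.  All thresholds and numbers are assumptions for illustration.

   def adjust_reservation(current_gbps, measured_gbps,
                          min_gbps=5, max_gbps=100,
                          headroom=1.2, shrink_threshold=0.5):
       """Grow to measured*headroom when usage nears or exceeds the
       reservation; shrink when usage falls well below it."""
       if measured_gbps * headroom > current_gbps:
           new = measured_gbps * headroom                   # grow with headroom
       elif measured_gbps < current_gbps * shrink_threshold:
           new = max(measured_gbps * headroom, min_gbps)    # shrink, keep a floor
       else:
           new = current_gbps                               # leave as is
       return min(max(new, min_gbps), max_gbps)             # clamp to policy bounds

   # A job grows from 5 to 8 XPUs: demand climbs, the reservation follows.
   print(adjust_reservation(current_gbps=20, measured_gbps=30))   # 36.0
   print(adjust_reservation(current_gbps=36, measured_gbps=8))    # 9.6 (shrinks)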
Finally, not all jobs (and all customers) are created equal. Priority and preemption are powerful tools in schedulers to give preference to certain jobs over others. Without these tools, a provider would be helpless if their cluster were overrun with low priority jobs. In addition, it would be nice to have a graceful way of managing preemption.¶
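As a simple illustration of priority-based preemption (tunnel names, priority values and bandwidths are made up), a scheduler might select preemption victims as follows:¶

   # Toy preemption-victim selection: when a higher-priority tunnel needs
   # bandwidth on a full link, lower-priority tunnels are chosen for (soft)
   # preemption, lowest priority first.  Entirely illustrative.

   def select_preemptees(link_tunnels, needed_gbps, new_priority):
       """link_tunnels: list of (name, priority, gbps), lower number = higher
       priority.  Return tunnels to preempt, or None if infeasible."""
       victims, freed = [], 0
       # Only strictly lower-priority tunnels (higher numbers) may be preempted.
       candidates = sorted((t for t in link_tunnels if t[1] > new_priority),
                           key=lambda t: t[1], reverse=True)
       for t in candidates:
           if freed >= needed_gbps:
               break
           victims.append(t)
           freed += t[2]
       return victims if freed >= needed_gbps else None

   tunnels = [("T1", 3, 50), ("T2", 5, 100), ("T3", 7, 40)]
   print(select_preemptees(tunnels, needed_gbps=120, new_priority=2))
   # -> [('T3', 7, 40), ('T2', 5, 100)], freeing 140G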
All the features mentioned in the last section are available today, in bandwidth-aware traffic engineering (TE).¶
TE constraints allow a user to specify constraints on the path a tunnel will take. These can include acceptable/unacceptable colors and other link properties.¶
Bandwidth reservation allows the allocation of bandwidth resources to a tunnel. Policers are a useful adjunct to enforce limits.¶
Elastic bandwidth (aka "auto-bandwidth") allows a tunnel to dynamically adjust its reservations (within limits).¶
Priority and preemption are implemented by all vendors. Graceful preemption is possible using "soft preemption".¶
There is one missing piece with "regular" TE: ML clusters (and Clos networks in general) make heavy use of multipathing, and often have multiple ingresses and egresses for their communications. Current traffic engineering techniques focus on a single path tunnel from one ingress to one egress. However, a new technique for multipath TE that allows for multiple ingresses and egresses is being developed that could have relevance here [I-D.kompella-teas-mpte].¶
In this section, we look at compute scheduling features, and ask whether the corresponding feature exists in network scheduling.¶
| SLURM Compute Scheduling Features | Network Scheduling (Feature Availability) |
| --- | --- |
| Accounting | Yes |
| Advanced reservation | Yes (bandwidth calendaring) |
| Gang scheduling | Yes (primary effort is on compute) |
| Backfill scheduling | N/A |
| Topology optimized resource selection | Yes |
| Resource limits by user or bank account | Yes (via controller policy; enforcement via policers) |
| Sophisticated multifactor job prioritization algorithms | No (maybe N/A) |
| KAI Features | Network Scheduling (Feature Availability) |
| --- | --- |
| Batch Scheduling | Yes (via multi-ingress/multi-egress tunnels) |
| Bin Packing & Spread Scheduling | Yes ("least-fill", "max-fill") |
| Workload Priority | Yes |
| Hierarchical Queues | Yes (via QoS in the data plane) |
| Resource distribution | Yes (via tunnel priority) |
| Fairness Policies | Yes |
| Workload Consolidation | N/A |
| Elastic Workloads | Yes ("auto-bandwidth") |
| Dynamic Resource Allocation (DRA) | N/A (multivendor is a given) |
| GPU Sharing | Yes (link sharing) |
As can be seen, almost all features are supported; some other features are supported in network scheduling that may not have analogies in compute scheduling.¶
With flow level multipathing, say X1 and X2 both send 400G of traffic to L1. L1 tries to load balance X1's traffic to S1 and S2 (in principle, 200G each). In practice, that may turn out to be 220G to S1 and 180G to S2. However, L1 knows that it is only supposed to send 200G of X1's traffic to S1, so it adjusts its load balancing weights ("adaptive load balancing") until the traffic sent to each of S1 and S2 is 200G. L1 does the same with X2's traffic; if all works well, L1 will send a total of 400G to each of S1 and S2.¶
On the "downward" side (traffic going to the XPUs), there can be an "in-cast" problem: say both X1 and X3 are sending traffic to X6. Now, X1 has a TE tunnel to X6 with only 200G; similarly for X3. So, in principle, the L3-X6 link should only carry 400G.¶
Reservations can be temporarily exceeded; that is equally true with compute reservations. Depending on the enforcement policies, an oversubscription situation should be temporary and is clearly visible (since accounting is easy), allowing more severe enforcement should it be persistent.¶
As mentioned in the Introduction, to make optimal use of ML clusters, especially when multiple smaller jobs (e.g., inferencing) are run, and multi-tenancy is in play, network scheduling takes on increasing importance as a proactive measure to prevent network events such as congestion. (This works orthogonally to packet spraying.) One can add fast network event notification as a reactive measure. Together, these techniques present a more holistic approach and should allow much better utilization of ML resources.¶
None, for now.¶