Problems with existing DetNet bounded latency queuing mechanisms

Futurewei Technologies USA, 2220 Central Expressway, Santa Clara, CA 95050, USA (tte@cs.fau.de)
Stewart Bryant Ltd (sb@stewartbryant.com)

DETNET

The purpose of this memo is to explain the challenges and limitations of
existing (standardized) bounded latency queuing mechanisms for desirable (large scale)
MPLS and/or IP based networks to allow them to support DetNet services. These
challenges relate to low-cost, high-speed hardware implementations,
desirable network design approaches, system complexity, reliability, scalability,
cost of signaling, performance and jitter experience for the DetNet applications.
Many of these problems are rooted in the use of per-hop, per-flow (DetNet) forwarding and
queuing state, but highly accurate network wide time synchronization can be another
challenge for some networks.

This memo does not intend to propose a specific queuing solution; rather, in the same way in which it describes the challenges of existing mechanisms, it reviews how those problems are addressed by currently proposed new queuing mechanisms.

The architectural evolution of IP/MPLS networks () in service provider
and other “larger-than-building” (), shared-infrastructure service networks ()
has led to a range of requirements against per-hop forwarding mechanisms
which are not supported by the current DetNet MPLS forwarding plane and per-hop, per-flow queuing model (Section 3.2), especially with respect to the QoS support of per-hop bounded latency. The authors of this memo think that solutions for
these requirements are relatively easily added to the existing DetNet architecture by
adding support for already existing and/or proposed, but not standardized per-hop forwarding and queuing options.The following sub-sections summarize the problem, solution goals and requirements as perceived
by the authors. The reasoning for these is explained in the following sections.Note that requirements are somewhat overlapping in so far as solving one of them also solves
others, but each addresses the problems from a different perspective and is therefore easier to understand for different stakeholders. For example, operators that want to see DetNet support for Segment Routing (SR) would not naturally consider this to be the same requirement as DetNet supporting the DiffServ architecture, even though a solution would have a hard time supporting only one of the two.

Forwarders with bounded latency need to support interface speeds of 100 Gbps up to
Tbps, likely over a period of 10 years from initial deployment of possible DetNet solutions.
Hundreds of interfaces may need to be supported in a single forwarder (fan-in/fan-out).Supporting bounded latency at these speeds and fan-in/fan-out raises cost and feasibility
challenges beyond those that led to the past IETF IntServ (GS) standards (, ) or the more recent TSN bounded latency solutions.

Note that these high-speed and scale requirements cause challenges even when DetNet bounded latency traffic is intended to be used for only a small percentage of an interface's traffic.

Both high-speed hardware and network architecture design (for reasons of simplicity and minimization
of shared risk functions) favor architectures that support a lightweight transit-hop
forwarding plane design that requires no forwarding plane or control plane operations whose
scale support depends on the number of services/service-instances (e.g.: DetNet flows) offered,
but at best only on the size of the network (e.g.: no per-flow, per-hop state).There should be DetNet bounded latency options that work in conjunction with per-transit-hop
stateless traffic forwarding, such as Shortest Path First (SPF) routing with IP/MPLS,
engineered steering (e.g.: SR) and stateless replication, such as Bit Indexed Explicit Replication with/without Tree Engineering (BIER, BIER-TE).There should be DetNet bounded latency options that for the purpose of traffic engineering
(including assurance of bounded latency across the network) only require per-flow
Path Computation Engine (PCE) signaling to network ingress/egress router, but not to transit hop routers.There should be DetNet bounded latency options that support the DiffServ QoS model instead
of only the IntServ model.There should be DetNet bounded latency options that together with the other requirements also
provide a better than worst-case jitter for DetNet traffic.The DetNet architecture should support signaling and forwarding that would make support for
automatically application instantiated DetNet flows scalable and lightweight to operate.To help readers understand especially the per-hop stateless requirement from above,
the following sections summarize the historical evolution of technologies and operational principles that the authors think are relevant for understanding the requirements outlined above and that they ask to see supported in DetNet.

The original (first and only) packet forwarding layer queuing option for bounded latency standardized in the IETF is “Guaranteed Service” (GS);
see the DetNet bounded latency document, section 6.5. At the time the RFC was
published (1997), the standardized signaling was proposed to be RSVP ,
and the use of RSVP with GS was standardized in .The function to support GS bounded latency in the forwarding plane
is the per-flow reshaping on every forwarder hop along the path where GS packets
of one flow may get delayed in the egress interface queue due to packets from
other GS flows. In typical networks, this is every hop along the path.

Early (1990/2000) forwarders for which RSVP was implemented were using
so-called “software” forwarding. This meant that the forwarding plane was
implemented through a general purpose CPU without additional hardware support
for QoS functions such as shaping or queuing. While these forwarders did
support traffic flow shaping, GS was never implemented on them
and their RSVP implementations also did not support (i.e., ignored) the RSVP TSPEC/RSPEC signaling parameters used for bounded latency. Instead, RSVP
implementations only supported the parameters for bandwidth reservation,
which was henceforth called Call Admission Control (CAC).In one instance, a software forwarder implementation with RSVP supported
the Controlled Load (CL) service , which does not provide for
bounded but instead for controlled latency. This service is achieved by creating
a per-flow queue and applying weighted fair queuing (WFQ) with weights
according to the reserved bandwidth of the flows (see , section 11).
This functionality did not proliferate into later generations of routers
because the execution cost of WFQ was too high for a multitude of flows and the scheduling was too inaccurate in interrupt-driven CPU software forwarding with higher speed interfaces (100 Mbps…1 Gbps).

With the rise of forwarding planes with “acceleration” through ASIC based
Forwarding Plane Elements (FPE) instead of general purpose CPUs and/or dedicated
QoS hardware, shaping support in forwarders evolved to be available, if at all, only on DiffServ (DS) boundary nodes, but not on DS interior nodes. This included both shaping and complex queuing such as WFQ.

The DS architecture, , was specifically targeted to enable the evolving,
now common Service Provider network services architecture, in which “high-touch”
service functions are only performed on so-called Provider Edge (PE) routers,
which as required are DS boundary nodes, whereas the hop-by-hop forwarding
through so-called Provider (P) (core) routers is meant to utilize only a reduced
set of forwarding functions, specifically excluding per-hop, per-flow QoS
forwarding plane functions such as shaping or policing. DiffServ therefore allowed building higher-speed, lower-cost forwarding plane P routers. It also enabled building equally higher-speed, lower-cost PE routers by supporting boundary node functions only on (lower-speed) customer-facing interfaces/line cards,
but not on core facing interfaces.With the advent of MPLS , RSVP was extended to support MPLS through the
RSVP-TE extensions. RSVP-TE manages p2p (later on also p2mp) MPLS Label Switched Paths (LSP),
which when signaled through RSVP-TE are also called RSVP-TE tunnels. These can be seen
as the equivalent of IP flows that RSVP manages for IP. RSVP-TE tunnels can support
a variety of traffic engineering functions, but none of the implementations
known to the authors ever implemented GS or CL services, specifically because
hardware forwarding for service provider networks was not designed to support
these QoS functions for P Label Switched Routers (LSR).

Because CL/GS were not targeted with RSVP-TE, the signaling extensions for Interior Gateway Protocols (IGP) required in the classical RSVP-TE reservation model (such as for IS-IS) have no parameters to signal per-hop GS queuing latency or buffer capacity utilization. As a result, the existing IGP signaling for RSVP-TE only allows RSVP-TE to perform bandwidth calculations, but not non-queuing path latency resource calculations, and therefore no latency-based traffic engineering.

Even though RSVP-TE implementations support only DiffServ (but not GS/CL) with respect
to per-hop QoS functions, its traffic-steering (path selection) and signaling
model introduced per-flow (per-tunnel) control plane and forwarding plane
overhead onto every P-hop. Through the 2000s, this RSVP-TE overhead was seen as
undesirable complexity and overhead by many service providers using it. There
was also a much larger number of service providers that desired some of the benefits provided
by RSVP-TE, but who were not willing to commit to the complexity, costs and operational risk
introduced into the network by complex per-flow signaling of RSVP-TE. The on-path, per-hop signaling
of RSVP-TE, for example, introduced so much overhead that reconvergence of RSVP-TE
paths after a failure or recovery took as much as 20 minutes in networks with
10,000 or more RSVP-TE tunnels.

The design of RSVP-TE's (decentralized) on-path signaling model proved specifically problematic under high resource utilization. In the original, decentralized RSVP-TE deployment model, ingress PE LSRs would perform so-called Constrained Shortest Path First (CSPF) calculations to determine the shortest path with enough free resources for a new flow. Afterwards, the ingress PE would signal the path via RSVP-TE. The IGP would signal to all ingress PEs how many (bandwidth) resources were left on every link. Under high load, when multiple ingress PEs were performing this process in parallel, this would cause high load, churn and reservation collisions.

These problems of decentralized RSVP-TE plus IGP signaling led to the introduction of
a so-called Path Computation Element (PCE) based architecture, in which the (competing and uncoordinated)
traffic engineering computations on every decentralized RSVP-TE ingress LSR were replaced by a centralized PCE function (or at least a coordinated PCE function), which would send the calculated results back as a path object to the headend LSR, thereby limiting the functions of RSVP-TE to the signaling of a steered traffic path through the network to establish the hop-by-hop LSP. The use of a PCE can likewise eliminate all the reservation-state-dependent signaling from the RSVP-TE IGP extensions, because all the reservation calculations need to happen only on the PCE. Nevertheless, the PCE does not eliminate the per-hop signaling overhead of RSVP-TE to establish LSPs, and hence it did not eliminate, for example, the majority of the platform and convergence cost of RSVP-TE in the network, especially for the control plane of P nodes, and could hence not resolve the concerns of service providers who had chosen not to adopt RSVP-TE.

The introduction of the centralized PCE obsoleted most of the remaining reasons for RSVP-TE: headends did
not need to do path calculation, and P routers did not need to manage the available and allocated bandwidth for TE tunnels. In most service provider use cases this left RSVP-TE only serving as a very complex solution for traffic steering, with the PCE doing the rest. This ultimately led to the design of the Segment Routing architecture and its mapping to the MPLS forwarding plane, SR-MPLS. Later, a mapping to IPv6 was defined with SRv6. SR relies on strict or loose hop-by-hop source routing information contained in each packet header, therefore eliminating the need to set up per-path flow state via RSVP-TE, and, in conjunction with DiffServ for hop-by-hop QoS, allowed a completely per-hop, per-flow stateless forwarding solution that is arguably lightweight, easy to implement at high performance and scalable to a large number of flows.

In the same way as SR eliminated the need for hop-by-hop traffic steering forwarding state
from RSVP-TE in P-routers for unicast traffic, Bit Indexed Explicit Replication
(BIER) solves this problem for shortest path multicast replication state across
P-routers, by replacing it with a BIER packet header and therefore eliminating
any per-application/flow, per-hop forwarding state for multicast in P-routers. BIER also removed the associated overhead of prior ingress replication solutions that Service Providers were looking into to avoid the per-hop state.

Finally, BIER-TE adds traffic steering with replication to the BIER architecture
and calls this Tree Engineering. Likewise, this is without the need for per-hop/per-flow steering
or replication state.Service Provider networks have evolved especially in the past 25 years into an architecture,
where high-speed, low-cost and high-reliability are based on designs that eliminate or reduce
as much as possible any form of unnecessary control-plane and even more so per-flow, per-application
plane complexity from P-routers/transit-nodes.This has led to the development of the DiffServ QoS architecture that eliminated IntServ/per-flow
QoS from P-routers, and later on to the evolution from MPLS/RSVP-TE to SR and BIER that
eliminated per-flow/tunnel forwarding/steering and replication state from the same P-nodes.

Finally, early experience with Traffic Engineering churn under high load and today's requirements for often NP-complete optimization led to an architectural preference for an off-path/centralized model of TE calculations via PCE, to also free P-routers from signaling complexity and to perform dynamic/service-dependent signaling only to PE-routers.

The following subsections look further into the background of why per-hop, per-flow state can be problematic and discuss problems beyond this core issue.

RSVP-TE was (and is) solely used for services where the operator of a domain explicitly
provisions RSVP-TE tunnels across its domain (for example using a PCE) and can therefore
fairly easily know the worst-case scaling impact. For example, the number of tunnels is not a chance value arising through dynamic subscriber action; the number of tunnels in the network is primarily impacted by topological changes and the (over time relatively rare) occurrences of additional services and/or service instances being provisioned. For RSVP-TE there was never (to the
knowledge of the authors) an end-to-end application layer interface such as there was for RSVP over IP,
for example as supported by earlier versions of Microsoft Windows QoS enabled IP sockets.When per-flow operations including per-hop signaling or even worse per-hop forwarding plane
or QoS state is not a result of well-controlled provisioning or well plannable/predictable
failure behavior but instead driven by applications not under the control of network
operators, the per-hop state requirements can become much more an operational and cost problem,
because of its unpredictability.

The widest experience with dynamic, application-based signaling in Service Provider networks likely exists for IP multicast, where creation of per-hop forwarding/replication state is
triggered by applications not under the control of network operations but by customer managed
applications/application-instances. Managing the amount of state and the control plane load
on P-routers was and is one of the major concerns when operationalizing IP Multicast services in SPs.

Service Provider L2-VPN and L3-VPN services can offer IP Multicast via architectures such as
that attempt to solve/reduce the problem of customer-application-driven, per-multicast-application state in a variety of ways, but they all come with their own problems:

In ingress replication, the ingress PE sends a separate unicast copy to every egress PE. This
creates significant excess traffic on links close to the ingress-PE and potentially higher-cost
ingress-PE attachment speeds.

In L3VPN aggregate trees, the traffic for multiple trees is sent across a common tree reaching the superset of all egress PEs of all included trees. This reduces the number of trees from one per customer application to a lower number of aggregates, but it creates potentially significant excess traffic towards egress PEs that do not need all the aggregated traffic, and may even result in a requirement for higher access link speeds towards those egress routers.

Finally, the per-P-router stateless BIER solution solved these issues. It does not require
any per-P-router, per-tree state creation, and achieves up to 256x better traffic efficiency than ingress replication (with 256-bit-long BIER bit strings).

With DetNet services being targeted primarily at so-called private networks such as (but not
limited to) those for industrial, theme parks, power supply systems, road, river, airport and train
transportation networks, it is important to understand how concerns for SP networks will
apply to such private networks:While the aforementioned evolution of MPLS networks focused on large-scale service provider
networks, the very same architectural evolution is or will also happen in any private
MPLS networks in the same way as the DiffServ architecture equally became the only
widely adopted QoS architecture in any larger scale (campus or beyond) private networks.While some of the scaling, cost, performance and reliability issues mentioned above for
service providers may not equally apply to smaller scale private networks, past experience
has shown that it is unlikely for a critical mass for different solutions to develop across a large variety of vertical private types of networks. For this reason, larger-scale enterprise networks have in the past preferred to adopt solutions that had proven themselves through SP deployments, were based on cross-vendor IETF architecture principles, and had widely interoperable vendor implementations.

Another reason for private network operators to look to service-provider-class designs is that it also simplifies potential service-provider-based management of the network
and/or outsourcing of the network to a service provider. This was seen often when large
enterprises that had to support multi-tenants evolved from ad-hoc network virtualization
solutions (such as VRF-lite) over to BGP/MPLS-VPN designs and later outsourced those very networks.In that same line of future proofing, networking technologies first developed for enterprises
would also be picked up and reused in Service Provider networks as long as they would fit.
IP Multicast for example had (since about 1996) ca. 10 years of deployment for
business critical enterprise use cases (such as financial market data distribution), before
it was adopted widely for IPTV in service providers.Whereas the previous section points to the practice and benefits to share technologies
between private and SP network, this section highlights one core additional
requirement of SP networks not found in most private networks from which pre-DetNet
deterministic service requirements will likely originate.In architectural terms, the desire and need to minimize or avoid per-application/flow
forwarding/control-plane state and per-hop control plane interactions (be it through
on-path signaling or direct PCE to P-router signaling) is not primarily a matter of SP/private
networks or not even of size, but foremost a matter of whether or not the network
itself is seen as the (a) communications fabric of a large distributed application
or (b) as an independently running shared infrastructure across
a potentially wide variety of application/services with diverging requirements.(a) is the dominant view of the network specifically from many (single) mission specific networks
such as many industrial networks and even non-public High Performance Compute (HPC) center architectures.
In either of these cases, a single architectural entity can control both the network infrastructure and the application to build a mission-optimized compound.

For example, switches in HPC data centers traditionally had very shallow interface packet
buffering for cost reasons, resulting in inferior performance under peak load with predominant
older TCP congestion control stacks. Instead of using better, more expensive switches, it
was easier to improve application device TCP stacks, leading for example to BBR TCP. While
this is very much in line with the desired Internet architecture, which puts significant responsibility onto transport layer protocols in hosts (not limited to TCP) to behave "fairly" or "ideally", the reality even in many private, mission-centric networks such as manufacturing plants is different. Dealing with misbehaving user devices or applications is one of the main challenges. In the DC example, that is the case when a DC offers public cloud services, where TCP stacks cannot be controlled, and hence deeper buffers and/or better AQM are a core requirement.

In general: In networks following the (b) shared infrastructure design principle, any resource
that needs to be shared across different services or even service instances becomes
a potential three party reliability and costing issue between the provider running
the network and the two (or more) parties whose services utilize the common resource.
Minimizing the total amount of shared resources that are complex, failure-prone and hard to quantify in a cost-effective manner is thus at the base of any shared infrastructure network design.

This again points to the model where all network control can happen at the edge, and
due to the absence of per-hop, per-flow state there simply is no shared flow state table
that needs to be managed across multiple different services/subscribers.

Some bounded latency solutions require accurate clock synchronization across the network nodes performing the bounded latency algorithm. The most commonly used (family of) protocol(s) for this is
the Precision Time Protocol (PTP), standardized in IEEE1588 and various
market-specific profiles thereof.

PTP can achieve long-term Maximum Time Interval Errors (MTIE) of as little as tens of nanoseconds. MTIE is the maximum time difference between the clocks of two PTP nodes measured over a long period of time.

Implementing PTP in devices comes with a range of design requirements. At high degrees
of accuracy, PTP requires accordingly accurate local oscillators, which include hardware such as regulated heating to operate at constant temperature. It also requires accurate distribution of the clock across all components of the system, which can be especially challenging in modular, large-scale devices, and accurate insertion and retrieval of timestamp fields into and from packet headers.

While PTP is becoming more and more widely available, consistent support of high
accuracy across all target type of switches and routers in wide area networks
cannot be taken for granted as a feasible new requirement raised for DetNet where it did not exist before. Today, PTP is often found in mobile network
fronthauls, but not their backhauls or any other broadband aggregation, distribution
or core networks. This is because there is, as of today, no strong
business case requirement for PTP at high precision in those networks, whereas
technologies such as eCPRI raise such requirements against mobile fronthauls.
Instead, those other networks most often resort to NTP deployments with at best msec accuracy, which is typically sufficient for control-plane and operational event tracing as their main, accuracy-defining use case.

The larger the network and the more varied (multi-vendor) the deployed equipment, the
higher will also be the operational cost of maintaining and controlling the accuracy
of a PTP service. This primarily has been cited in the past as a reason to
not deploy PTP even if hardware was supporting it. This operational challenge
will especially apply when PTP support may be required for only a small
percentage of traffic in a high speed wide area network. The revenue from the
service needs to cover the operational cost incurred by its exclusive components
(hardware, software and operations).

This section discusses how low-jitter bounded latency services can be highly beneficial for DetNet applications.

Depending on the bounded latency algorithm, the jitter experienced by packets varies
based on the amount of competing traffic. In algorithms and their resulting
end-to-end service which this memo calls “in-time”, such as GS and , the experienced queuing latency in the absence of any competing traffic is zero, and in the presence of the maximum amount of permissible competing traffic, the latency is the maximum, guaranteed bounded latency. As a result, the jitter provided by these algorithms is the highest possible.

In algorithms and their resulting end-to-end service which this memo calls “on-time”,
the experienced latency is completely or most significantly independent of the amount of competing
traffic, and the jitter is therefore zero or minimal. In these algorithms, the network
buffers packets when they are earlier than guaranteed, whereas in-time algorithms
deliver packets (almost) as fast as possible.This memo argues that on-time queuing algorithms provide an additional value-add
over in-time algorithms, especially for use in metropolitan or wide-area networks.
Whatever algorithm is used, the receiving application only has a guarantee for
the maximum bounded latency, and the real (shorter) latency of any received packet is no
indication for the latency of the next packet. Instead, the receiver application has
to be prepared for each and any future packet to arrive with the worst possible, i.e., the bounded, latency.

The majority of applications require some higher layer function to operate synchronously with the sender application: rendering of audio/video and other media information needs
to happen at the same frequency or event intervals at which the media was encoded. When
these applications receive packets earlier than the time at which they can be processed
(which is equal or close to the bounded latency), these applications buffer media
in a so-called playout buffer and release them only at that target time. Likewise,
remote control loops including industrial Programmable Logic Controller (PLC) loops or
remote controlling of robots or cars is typically based on synchronous operations.
In these applications, early packets are also delayed to then be processed “synchronously”
later.
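As an illustration of the playout buffering just described, the following minimal sketch (hypothetical names, not taken from any particular application or specification) releases each packet at its send time plus the bounded latency; the memory it needs is determined by the worst-case jitter, which motivates the challenge described next.

```python
import heapq

# Minimal playout-buffer sketch (illustrative only): packets are released at
# send_time + bounded_latency, so the receiver needs enough memory to hold up
# to (bounded latency - minimum latency) of traffic, i.e. the worst-case
# jitter of an "in-time" service.

class PlayoutBuffer:
    def __init__(self, bounded_latency_ms):
        self.bounded_latency_ms = bounded_latency_ms
        self._heap = []  # entries: (playout_time_ms, seq, packet)
        self._seq = 0

    def on_receive(self, packet, send_time_ms, now_ms):
        playout_time_ms = send_time_ms + self.bounded_latency_ms
        heapq.heappush(self._heap, (playout_time_ms, self._seq, packet))
        self._seq += 1
        # how long this packet will sit here: bounded latency minus actual latency
        return playout_time_ms - now_ms

    def due(self, now_ms):
        """Return all packets whose target playout time has been reached."""
        out = []
        while self._heap and self._heap[0][0] <= now_ms:
            out.append(heapq.heappop(self._heap)[2])
        return out
```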
In all cases where applications need to buffer (or otherwise remember) received data when it is too early, in-time queuing raises the challenge for application developers of being able to predict the network's worst possible jitter, and this can be particularly challenging for embedded, if not constrained, receiver devices with minimal memory to buffer/remember. When these devices are designed against one particular type of network with well-known low jitter, they will not necessarily operate correctly in networks with larger jitter. And in metropolitan and WAN networks, jitter with in-time services can be highly variable based on the network design and the relative location of the communicating nodes in the topology (see the example network design later in this memo).

One example of such issues was encountered when digital TV receivers (Set Top Boxes, STB)
designed for (mostly synchronous) digital cable transmission were evolved to become
IPTV STB, but the playout buffer of < 50 msec was not sufficient to compensate for
a > 50 msec jitter experienced in IP metropolitan networks.Note that this section does not claim that all applications will benefit from on-time
service, nor that no application would benefit more from in-time service than from
on-time service. Nevertheless, the authors are not aware of instances of applications for which in-time service would be more beneficial than on-time service.
Of course, this comparison is only about the benefit to the application and other
factors such as the cost/scale of the service for the network itself have also to be
taken into account.

The problems of cost and operational feasibility in shared-infrastructure networks specifically apply to the scaling of hardware resources such as per-application-flow
forwarding or QoS state in high-speed network routers: Even if the business case makes it
clear that only e.g. 1 Gbps worth of traffic may require this advanced state (such as
multicast replication or per-flow shaping for bounded latency), it will be
more expensive to build this functionality into a 100 Gbps transit switch/router than into a 1 Gbps
switch/router. This too is based on experience from migrating services of low-speed
mission specific networks, such as IP multicast onto high speed, shared-infrastructure
service provider networks.The reason for this higher cost at higher speed is that the 1 Gbps worth of “advanced” traffic still has
to be built into 100-times-faster hardware, and each of the “advanced” packets forwarded would need to be replicated/shaped 100 times faster.

While this packet processing issue may look like it applies equally to both per-hop, per-flow stateful forwarding and solely in-packet based mechanisms, in practice per-flow state may require a lot more high-speed memory access because of the need to access
an entry from a state table. In most cases, this table space can only be made to work
at line-rate packet processing when it is on-chip, hence it is not only the most expensive, it is also crucial to scale right. And as the 1 vs. 100 Gbps example above showed, it is very hard to come by an appropriate scale smaller than “would work for 100% of traffic”, because network operators providing shared infrastructure networks really do not want to be responsible for predicting how individual services may grow in adoption by making a specific hardware selection that constrains any such growth.

Last, but not least, on-chip high-speed state tables become even more expensive when they
not only have to be read, but also have to be written at line rate, and, even worse, when they have to support line-rate read/write/read control loops:

The main issue with scaling state in hardware routers is that designs will be hesitant
to work against unclear growth predictions. Even if at some point in time only 1 Gbps
of DetNet traffic was expected to be required on a 100 Gbps platform,
hardware designers will much more likely want to scale against the worst (best) case
service growth expectation so that customers will not feel that they would buy into a
product that becomes obsolete under success.Whereas steering state, such as MPLS label entries can easily scale to hundreds of
thousands, the same is not clear about shapers or interleaved regulators. They are more
challenging because they require fast (on-chip) read-write memory for the state
variables, especially when forwarding is parallelized across multiple execution units. This incurs additional complexity to split up the state and its packets across multiple execution units and/or to provide consistent shared read/writeable memory across execution units.

Even write-only (as opposed to cross-execution-unit readable) memory has traditionally been a scarcer resource the faster the forwarding engines are. This can be seen from the (often very limited) scale of packet monitoring state such as for IPFIX.

But the main issue with per-hop, per-flow forwarding state that could be quite dynamic
because it might be triggered by applications is the control plane to forwarding-plane-state
interactions. Updating hardware forwarding engine state tables is often one
of the key performance limits of routers. Adding significant additional state with likely
ongoing changes is easily seen as a big contributor to churn in the control plane
and a likely reason for stability issues and reduced peak performance under key events such as
reconvergence of all or large parts of IGP or BGP routing tables.The following picture shows an example, worst-case network
topology of interest (in the opinion of the authors) for bounded
latency considerations. This section does not claim that greenfield
rollouts may need or want to use all aspects of this topology. What this memo does claim is that many existing brownfield networks, especially in large metropolitan areas, show all or many of these
aspects, and that it would be prudent for bounded latency network
technologies to support networks like these so as to not create
new constraints against network designers by only supporting
physical network topologies optimized for a particular type of
service (bounded latency).An example metropolitan scale network as shown in may consist of one or more
rings of forwarders. A ring provides the minimum cost n+1 redundancy
between the ring nodes, especially when, as is common in metropolitan networks,
new fibre cannot cost-effectively be put into new optimum trenches,
but existing fibre and/or trenches have to be used. This is specifically true when the area includes not densely populated suburban areas (higher cost per subscriber and mile for rollouts).

Multiple, so-called subtended rings typically occur when existing
networks are expanded into new areas: A new ring is simply connected
at two most economic points into the existing infrastructure. Likewise,
such a topology may become more complicated over time by addition
of capacity, which resulting from TE planning calculations may not
follow any of the pre-existing ring paths.Edge Data-Center (DC), connections to Exchanges/Peerings or national
cores of the provider itself, as well as all subscribers including
Mobile Network Towers, and IoT devices connect to these ring directly
via PE edge-forwarders and (more often) via additional CE type devices.
P nodes may also double as PE nodes.In densely populated regions, P, or PE nodes may have a high number
of attached devices, shown in the picture with the example of 100
PE forwarders connecting to a single P forwarder (or rather two P forwarders for redundancy and therefore support of PREOF).

In summary, the following aspects of these networks are relevant
for bounded latency:Link speeds today are at least 100 Gbps and will be Tbps in the
near future. Even if only a small percentage of that traffic has to
support bounded latency, the queuing mechanisms need to support these
high-speed interfaces.Fan-in/out at PE or P nodes may be (worst case) in the order of
hundred(s) of incoming interfaces. Bounded latency mechanisms whose
number of queues depend on the number (#I) of interfaces in a more
than linear fashion, such as (#I^2) in the case of , may
introduce significant challenges for cost-effective hardware.Through the advent of decentralized edge Data Center and peerings between
different operators and content providers, traffic flows of interest
will not solely be hub-and-spoke between one central site and the subscribers. Instead, arbitrary traffic-engineered paths across the topology
between any two edges need to be supportable in scale with the
bounded latency queuing mechanism.The total number of edge (#E) nodes (PE or CE) for a bounded latency service
can easily be in the thousands. Aggregation of bounded latency flows
on the order of (#E^2), which is the best option in per-hop, per-flow
solutions such as , is likely insufficient to significantly reduce
the number of flows that need to be managed across P nodes in such
bounded latency queuing mechanisms.The total number of P nodes may be in the hundreds and bounded latency
flows in the tens of thousands. It should also be expected that such
flows are not necessarily long-term static but may need to be provisionable
in the time-scale order of for example telephone calls (such as flows supporting
remote control of devices or operations). Bounded latency solutions that
require per-flow, per-node state maintenance on the P nodes themselves
may therefore be undesirable from a network operational/complexity/reliability
perspective, but also from a hardware engineering cost perspective, especially
with respect to the control plane cost of dynamically setting up per-flow
bounded latency for a flow whenever there is a new flow, or for all of them whenever
there are topology or load changes that make rerouting desirable.Beyond queuing concerns, path selection too specifically for deterministic
services is a challenge in these networks:Path lengths may be significantly longer than e.g. 3 hops.
In large metropolitan networks, they can reach 20 or more hops.
Speed of light end-to-end in these networks will be in the order of
low number of msec. End-to-end queuing latency can be in the same range,
if not higher.To avoid undesirable re-routing under failure when PREOF and
engineered disjoint paths are used, efficient hop-by-hop traffic steering needs to be supported. In networks designed for source routing (e.g.: SR), efficiently encoded strict hop-by-hop steering for as many as those (e.g.: 20) hops may be desirable to support.

The DetNet bounded latency document gives an overview of the math for the most well-known
existing deterministic bounded latency algorithms/solutions. This section reviews the relevant
currently standardized algorithms from the perspective of the above listed problems for high-speed,
high-scale, shared-services infrastructures, and to provide additional background about them.

GS is described in section 6.5 of the DetNet bounded latency document, and its historical evolution and challenges have been described above. We skip further detailing of its issues here to concentrate on IEEE Time-Sensitive Networking - Asynchronous Traffic Shaping (TSN-ATS), which in general is seen as superior to GS for high-speed hardware implementation. All the concerns described in the TSN-ATS section apply equally or even more to GS.

Section 6.4 of the DetNet bounded latency document describes the bounded latency approach used for TSN Asynchronous
Traffic Shaping . Like GS, this bounded latency solution also relies on per-flow
shaper state, except that it uses optimized shapers called “Interleaved Regulator” as explained
in section 4.2.1 of .

The concept of interleaved regulators and their simplification over traditional per-flow shapers result from mathematical work done over the last 10 years, starting with .

In a system with e.g. N=10,000 flows each with a shaper, the forwarder needs to have
10,000 shapers each of which would need to calculate the earliest feasible send-time
of the first queued packet of the flow and all these send-times would need to be compared
by a scheduler picking the absolute first packet to send. Of course it is unlikely that
the router would have to queue at least one packet for all queues at any point in time,
but the complexity to implement the scheduler scales with N.With interleaved regulators,
there is still the per-flow state required to hold each flow's traffic parameters and its next-packet earliest departure time, but instead of requiring a scheduler to compare N entries, packets are queued into one out of #IIF * #PRIO FIFO queues, one queue for all the packets arriving from the same Incoming InterFace (IIF) and targeting the same worst-case queuing latency/PRIOrity (PRIO) on this hop. The shaper now only needs to calculate the earliest
departure time of the head of each of these M= #IIF * #PRIO queues and the complexity of
a scheduler to select the first packet across those interleave regulators is therefore
reduced by a factor of O(N/M).

Unfortunately, while industrial Ethernet switches today often have no more than 24 IIF, aggregation routers in metropolitan networks may have thousands of IIF, so the benefit of interleaved regulators over per-flow shapers will likely be much higher in classical TSN environments than it would be for likely DetNet target routers in metropolitan networks.
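To make the difference concrete, the following minimal Python sketch (hypothetical names; a simplification for illustration, not an implementation of TSN-ATS) contrasts the two cases described above: per-flow state is still kept for each of the N flows, but the scheduler only compares the heads of the M = #IIF * #PRIO interleaved-regulator FIFOs.

```python
from collections import deque

# Hedged, minimal sketch (hypothetical names): per-flow regulator state plus
# M = #IIF * #PRIO interleaved-regulator FIFOs. The scheduler compares only
# the M FIFO heads instead of N per-flow shapers.

class FlowState:
    """Per-flow state: reserved rate and the earliest departure time of the
    flow's next packet (updated on every transmission of that flow)."""
    def __init__(self, rate_bytes_per_s):
        self.rate = rate_bytes_per_s
        self.next_eligible = 0.0

class InterleavedRegulator:
    """One FIFO per (incoming interface, priority); only its head is examined."""
    def __init__(self):
        self.fifo = deque()  # entries: (flow_id, length_bytes)

    def head_eligible_time(self, flows):
        if not self.fifo:
            return None
        flow_id, _ = self.fifo[0]
        return flows[flow_id].next_eligible

def pick_next(regulators, flows, now):
    """Pick the next packet to send by comparing only the M FIFO heads."""
    best = None
    for reg in regulators:
        t = reg.head_eligible_time(flows)
        if t is not None and t <= now and (best is None or t < best[0]):
            best = (t, reg)
    if best is None:
        return None  # nothing eligible right now
    _, reg = best
    flow_id, length = reg.fifo.popleft()
    st = flows[flow_id]
    # advance the flow's earliest departure time according to its reserved rate
    st.next_eligible = max(st.next_eligible, now) + length / st.rate
    return flow_id, length
```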
In addition, the aforementioned core problems for shapers (), namely control plane interactions, read/write/read cycle access and scale, equally apply to interleaved regulators, so the main optimization benefit of interleaved regulators is for their original targets, low-speed (1..10 Gbps) switches with a limited number of interfaces, but to a much lower degree for likely important types of DetNet deployments.

TSN Cyclic Queuing and Forwarding (CQF), as described in section 6.6 of the DetNet bounded latency document, is a per-flow, per-transit-hop
stateless forwarding mechanism, which solves the concerns with per-hop, per-flow state issues
described earlier in this memo. It also provides an on-time service in which the per-hop and
end-to-end jitter is very small, namely in the order of a cycle time. CQF operates by forwarders sending packets in periodic cycles.
These cycles are derived from clock synchronization: The start of each cycle (and by implication
the end of the prior cycle) are simply periodically increasing clock timestamps that have to be
synchronized across adjacent forwarders, usually via PTP. This method to operate cycles
allows CQF to operate without additional data packet headers, but it is also the reason for the two issues of CQF, which both relate to the so-called dead time (DT).

For the receiving node to correctly associate a packet to the same cycle as the sending
node, the last bit of the last packet in the cycle on the sending node needs to be received by the
receiving node before the cycle ends. explains that DT is the sum of latencies 1,2,3,4
as of Figure 1, but that is missing the MTIE between the forwarders:
If a cycle is for example 10 usec, and the PTP MTIE is 1 usec, then only 9 usec of
the cycle could be used (without even yet considering the other factors contributing to MTIE).
If MTIE is not taken into account, a packet might arrive in time from the perspective of the
sending forwarder, but not from the perspective of the receiving node whose clock is 1 usec earlier.

In practice, MTIE should be equal to or lower than 1% of the cycle time. When forwarders and links
increase in speed, cycle times could become proportionally smaller to reduce per-hop cycle time latency.
When this is done, MTIE needs to become equally smaller, raising the cost of the solution. Therefore, CQF has a challenge with higher-speed networks.

The second and even more important problem is that DT includes the link latency (2 in Figure 1). With a speed of light in fibre of 200,000 km/s, the link latency is 10 usec for 2 km. This makes CQF very problematic and limited in metropolitan and wide-area networks. If the longest link of a network was 10 km, this would cause a DT on that link of 50 usec, and with a cycle time of 100 usec, only 50% of the bandwidth could be used for cycle-time (bounded latency) traffic (excluding all other DT factors).
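The following back-of-the-envelope sketch (assuming, for simplicity, that only propagation delay and MTIE contribute to DT) reproduces the two examples above: a 10 usec cycle with 1 usec MTIE, and a 100 usec cycle over a 10 km link.

```python
# Back-of-the-envelope check of the dead-time effect described above.
# Assumptions: only propagation delay and MTIE contribute to DT; all other
# DT contributions (serialization, internal forwarding latency) are ignored.

def usable_cycle_fraction(cycle_us, link_km, mtie_us, prop_us_per_km=5.0):
    # ~200,000 km/s in fibre -> roughly 5 usec of propagation delay per km
    dead_time_us = link_km * prop_us_per_km + mtie_us
    return max(0.0, (cycle_us - dead_time_us) / cycle_us)

print(usable_cycle_fraction(cycle_us=10.0, link_km=0.0, mtie_us=1.0))    # 0.9
print(usable_cycle_fraction(cycle_us=100.0, link_km=10.0, mtie_us=0.0))  # 0.5
```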
When links are subject to thermal expansion (also known as sag on hanging wires), such as broadband copper wires (cable networks), their length can also change by as much as 20% between noon and night temperatures, which, without changes to the design, has to be taken into account as part of DT.

In conclusion, CQF solves many of the problems discussed in this memo, but its reliance on timestamp-synchronized cycles may pose undesirable challenges for the required accuracy of PTP in high-speed networks and especially limits the ability to support wider-scale networks due to DT.

As this memo outlines, per-hop, per-flow stateless forwarding is the one core requirement to support Gbps-speed metropolitan or wide-area networks.

This section gives an overview and evaluation from the perspective of the authors of
this memo of currently known non-standardized proposals for per-hop-stateless forwarding
with the explicit goal and/or possibility of bounded latency forwarding, and in relationship to the concerns and desires described in the previous sections.

To overcome the challenges outlined above, tagged-CQF proposals describe a modified CQF mechanism in which the timestamp-based cycle indication of CQF is replaced by indicating the sender's cycle in an appropriate packet header field, so that the receiver can accordingly map the received
packet to the right local cycle.

This approach completely eliminates the link latency as a factor impacting the effectiveness of the mechanism, because in this approach the link latency does not impact the DT. Instead, the link latency is used to calculate which cycle from the sender needs to be mapped to which cycle on the receiver, and this is programmed during setup of links into the receiving router's cycle mapping table.
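A minimal sketch of such a per-link cycle mapping might look as follows (the names and the number of cycle tags are assumptions for illustration, not taken from any of the proposals):

```python
# Minimal sketch (hypothetical names) of the per-link cycle mapping described
# above: the receiver maps the cycle tag carried in the packet header to one
# of its own local cycles; the mapping is computed once at link setup time.

NUM_CYCLE_TAGS = 4  # number of distinct cycle tags in the packet header (assumed)

class CycleMap:
    def __init__(self, link_latency_us, cycle_us, extra_cycles=1):
        # how many cycles "late" a packet may arrive; extra_cycles absorbs
        # link-latency variability and MTIE as discussed below
        self.offset = (int(link_latency_us // cycle_us) + extra_cycles) % NUM_CYCLE_TAGS

    def local_cycle(self, tag_in_packet):
        """Map the sender's cycle tag to the local cycle in which to enqueue."""
        return (tag_in_packet + self.offset) % NUM_CYCLE_TAGS

# Example: 50 usec link latency, 100 usec cycles -> sender cycle 2 maps to local cycle 3.
cmap = CycleMap(link_latency_us=50.0, cycle_us=100.0)
print(cmap.local_cycle(2))  # 3
```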
Depending on the number of cycles configured, it is also possible to compensate for variability in the link latency and for higher MTIE (picture TBD). If one more cycle is used, for example, this would allow MTIE to be on the order of one cycle time, as opposed to a likely target of 1% of the cycle time as in CQF, reducing the required PTP clock accuracy by a factor of 100. This possible reduction in the required accuracy of operations through appropriate configuration does not only cover PTP but also extends into any forwarding operation within the nodes; e.g., it could also reduce the cost of implementing forwarding hardware at higher speeds accordingly.

In MPLS networks, packet-tagged CQF with a small number of cycle tags (such as 3 or 4)
could easily be realized and standardized by relying on E-LSPs, where 3 or 4 EXP code points would be used to indicate the cycle value. Given that such deterministic bounded latency traffic is not subject to congestion control, it also does not require additional ECN EXP code points, so those would remain available for, e.g., best-effort traffic that should use the same E-LSP.
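As a toy illustration of this encoding (assuming, purely for illustration, that two of the three TC/EXP bits carry a four-valued cycle tag; this is not a standardized code point assignment):

```python
# Toy illustration only (assumption: 2 of the 3 MPLS TC/EXP bits carry a
# 4-valued cycle tag, leaving the remaining code point space for other uses).

CYCLE_BITS = 0b011  # low two bits of the 3-bit TC/EXP field

def encode_tc(cycle_tag, other_bits=0b000):
    """Pack a cycle tag (0..3) into the TC/EXP field."""
    assert 0 <= cycle_tag <= 3
    return (other_bits & ~CYCLE_BITS & 0b111) | cycle_tag

def decode_cycle(tc):
    """Extract the cycle tag from a received TC/EXP value."""
    return tc & CYCLE_BITS

print(encode_tc(2), decode_cycle(encode_tc(2)))  # 2 2
```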
Another proposal applies the tagged-CQF mechanism to Segment Routing (SR) by proposing SR-style header elements to indicate the per-segment/hop cycle. This eliminates the need to set up a cycle mapping table on every hop. It is unclear to the authors of this memo
how big a saving this is given how the PCE would need to update all the ingress router per-flow
configurations where header imposition happens when links change, whereas the mapping table approach
would require only localized changes on the affected routers.

SRTSN describes a mechanism in which a source-routed header in the spirit of a Segment Routing (SR) header can be used to enable per-transit-hop, per-flow stateless latency control. For every hop, a maximum latency is specified. The draft outlines a control plane which, similarly to packet-tagging based CQF, would put the work of admitting flows, determining their paths and admitting their resources along those paths onto some form of PCE/SDN controller.

The basic principle of forwarding in this proposal is to put received packets into
a priority heap and schedule them in order of their urgency (shortest latency) for this hop.
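A minimal sketch of such a per-hop deadline scheduler (illustrative names only, not prescribed by the draft) could look like this:

```python
import heapq

# Hedged sketch of the priority-heap scheduling idea described above; the
# field and function names are illustrative and not taken from the draft.

class DeadlineQueue:
    def __init__(self):
        self._heap = []  # entries: (absolute per-hop deadline, seq, packet)
        self._seq = 0    # tie-breaker keeping the heap ordering well-defined

    def enqueue(self, packet, arrival_time_us):
        # the source-routed header carries the maximum latency allowed on this hop
        deadline = arrival_time_us + packet["hop_max_latency_us"]
        heapq.heappush(self._heap, (deadline, self._seq, packet))
        self._seq += 1

    def dequeue(self):
        """Transmit the packet whose per-hop deadline expires first (most urgent)."""
        if not self._heap:
            return None
        _, _, packet = heapq.heappop(self._heap)
        return packet
```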
The draft explicitly does not prescribe specific algorithms for the forwarders to take the indicated latency for the hop into account in a way that the controller can calculate the resource availability, such as specific queuing or scheduling algorithms.

It is not entirely clear to the authors of this memo whether the sole indication of such deadline latencies is sufficient to completely eliminate per-transit-hop, per-flow state and still achieve deterministic latency. Consider that the packet's latency budget for a hop could be used to derive a priority for the packet on that hop relative to other packets with higher or lower latency budgets for this hop. As was shown in the research work leading up to , the priority queuing on
each hop alone is not sufficient to achieve a simple, solely per-hop calculated
latency bound under high load because of the problem of multi-hop burst aggregation and the
resulting hard-to-calculate upper latency bound. To overcome that calculation issue, shapers, or their optimization, interleaved regulators, are used in TSN-ATS and GS. Shapers/interleaved regulators require maintaining per-flow state across packets from the same flow.

Nevertheless, with appropriate mathematical models for SDN controllers it may be possible to develop
deterministic per-hop forwarding models relying not only on the per-hop indicated latency but
also on additional constraints such as limited number of hops or sufficiently low degrees of maximum
admitted amounts of traffic. Alternatively, this may be used for yet-to-be-developed latency models that
are not 100% deterministic, but close enough in probability such that the amount of late
packets would be in the same order as otherwise unavoidable problems such as BER based packet loss.To that end, the author of has conducted simulations of the
proposed mechanism, contrasting it with other mechanisms. These results, which will be
published elsewhere, show that this mechanism excels in cases with high load and a
small number of flows with tight budgets. However, some small percentage of packets
will miss their end-to-end latency bounds, and must be treated as lost packets.Depending on the algorithms chosen, solutions may or may not rely on strong, weak, or no
clock synchronization across nodes.

“High-Precision Latency Forwarding over Packet-Programmable Networks”, a NOMS 2020 conference paper, describes a framework for per-transit-hop, per-flow stateless forwarding based on three packet parameters: the minimum and maximum desired end-to-end latency, set by the sender and not changed by the network, and the experienced latency, updated by every hop. Routers supporting this LBF mechanism also extend their routing (e.g., IGP) to be able to calculate the non-queuing latency towards the destination. The in-packet parameters and this future latency prediction are used to prioritize packets in queuing, including giving packets higher priority when they are late due to prior-hop-incurred latency, or delaying them when they are too early.
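The following sketch illustrates how such a per-hop computation could combine the in-packet parameters with the locally known remaining path latency (the parameter names are paraphrases by the authors of this memo, not taken from the paper):

```python
# Illustrative sketch of the per-hop LBF computation described above (the
# parameter names are paraphrases, not taken from the paper).

def lbf_local_window(pkt, remaining_path_latency_us):
    """Return (hold_for_us, local_deadline_us) for queuing this packet."""
    experienced = pkt["experienced_latency_us"]
    # slack relative to the end-to-end min/max latency objectives, given the
    # (routing-derived) non-queuing latency still ahead of the packet
    slack_min = pkt["min_e2e_latency_us"] - experienced - remaining_path_latency_us
    slack_max = pkt["max_e2e_latency_us"] - experienced - remaining_path_latency_us
    hold_for_us = max(0.0, slack_min)        # delay packets that are too early
    local_deadline_us = max(0.0, slack_max)  # prioritize packets that are nearly late
    return hold_for_us, local_deadline_us

pkt = {"min_e2e_latency_us": 2000.0, "max_e2e_latency_us": 3000.0,
       "experienced_latency_us": 800.0}
print(lbf_local_window(pkt, remaining_path_latency_us=1500.0))  # (0.0, 700.0)
```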
LBF started as more fundamental research into how the application experience could be improved when applications are allowed to indicate such differential min/max latency Service Level Objectives (SLO). Benefits include the ability to compensate for queuing latency incurred on prior hops, but also to automatically prioritize packets on a single hop based on their future path length, all without the need for any explicit admission control.

The LBF algorithm requires no clock synchronization across nodes. Instead, it assumes mechanisms to know or learn link latencies, while the remaining latencies (as defined in the DetNet architecture), e.g., the latency through a forwarder, can be calculated locally.

The authors have not yet tried to define a mathematical model that would allow deriving completely deterministic behavior for this original LBF algorithm in conjunction with a PCE/SDN controller. Due to the absence of per-flow (shaper/interleaved-regulator) state, the authors believe that deterministic solutions would, as outlined above for SRTSN (), likely only be possible under additional assumed constraints.

Bounded latency solutions for DetNet have been designed by trying to adopt solutions developed either
several decades ago (GS) or recently for limited-scope and limited-scale L2 networks.

To allow DetNet solutions to explore opportunities in larger speed and scale shared network
infrastructures, both private and service provider networks, it is highly desirable for
DetNet WG (and/or other IETF WGs claiming responsibility in conjunction with DetNet as the driver)
to explore the opportunities to standardize additional, and in the opinion of the authors better, per-hop forwarding models in support of (near) deterministic bounded latency, by means of standardizing per-flow-stateless/"DiffServ"-style per-hop forwarding behaviors (PHB) with
appropriate network packet header parameters.This document has no security considerations (yet?).This document has no IANA considerations.Thanks for Yaakov Stein for reviewing and proposing text for .Resource ReSerVation Protocol (RSVP) -- Version 1 Functional SpecificationThis memo describes version 1 of RSVP, a resource reservation setup protocol designed for an integrated services Internet. RSVP provides receiver-initiated setup of resource reservations for multicast or unicast data flows, with good scaling and robustness properties. [STANDARDS-TRACK]The Use of RSVP with IETF Integrated ServicesThis note describes the use of the RSVP resource reservation protocol with the Controlled-Load and Guaranteed QoS control services. [STANDARDS-TRACK]Specification of the Controlled-Load Network Element ServiceThis memo specifies the network element behavior required to deliver Controlled-Load service in the Internet. [STANDARDS-TRACK]Specification of Guaranteed Quality of ServiceThis memo describes the network element behavior required to deliver a guaranteed service (guaranteed delay and bandwidth) in the Internet. [STANDARDS-TRACK]An Architecture for Differentiated ServicesThis document defines an architecture for implementing scalable service differentiation in the Internet. This memo provides information for the Internet community.Multiprotocol Label Switching ArchitectureThis document specifies the architecture for Multiprotocol Label Switching (MPLS). [STANDARDS-TRACK]RSVP-TE: Extensions to RSVP for LSP TunnelsThis document describes the use of RSVP (Resource Reservation Protocol), including all the necessary extensions, to establish label-switched paths (LSPs) in MPLS (Multi-Protocol Label Switching). Since the flow along an LSP is completely identified by the label applied at the ingress node of the path, these paths may be treated as tunnels. A key application of LSP tunnels is traffic engineering with MPLS as specified in RFC 2702. [STANDARDS-TRACK]Multicast in MPLS/BGP IP VPNsIn order for IP multicast traffic within a BGP/MPLS IP VPN (Virtual Private Network) to travel from one VPN site to another, special protocols and procedures must be implemented by the VPN Service Provider. These protocols and procedures are specified in this document. [STANDARDS-TRACK]Multicast Using Bit Index Explicit Replication (BIER)This document specifies a new architecture for the forwarding of multicast data packets. It provides optimal forwarding of multicast packets through a "multicast domain". However, it does not require a protocol for explicitly building multicast distribution trees, nor does it require intermediate nodes to maintain any per-flow state. This architecture is known as "Bit Index Explicit Replication" (BIER). When a multicast data packet enters the domain, the ingress router determines the set of egress routers to which the packet needs to be sent. The ingress router then encapsulates the packet in a BIER header. The BIER header contains a bit string in which each bit represents exactly one egress router in the domain; to forward the packet to a given set of egress routers, the bits corresponding to those routers are set in the BIER header. The procedures for forwarding a packet based on its BIER header are specified in this document. 
Elimination of the per-flow state and the explicit tree-building protocols results in a considerable simplification.Encapsulation for Bit Index Explicit Replication (BIER) in MPLS and Non-MPLS NetworksBit Index Explicit Replication (BIER) is an architecture that provides optimal multicast forwarding through a "multicast domain", without requiring intermediate routers to maintain any per-flow state or to engage in an explicit tree-building protocol. When a multicast data packet enters the domain, the ingress router determines the set of egress routers to which the packet needs to be sent. The ingress router then encapsulates the packet in a BIER header. The BIER header contains a bit string in which each bit represents exactly one egress router in the domain; to forward the packet to a given set of egress routers, the bits corresponding to those routers are set in the BIER header. The details of the encapsulation depend on the type of network used to realize the multicast domain. This document specifies a BIER encapsulation that can be used in an MPLS network or, with slight differences, in a non-MPLS network.Segment Routing ArchitectureSegment Routing (SR) leverages the source routing paradigm. A node steers a packet through an ordered list of instructions, called "segments". A segment can represent any instruction, topological or service based. A segment can have a semantic local to an SR node or global within an SR domain. SR provides a mechanism that allows a flow to be restricted to a specific topological path, while maintaining per-flow state only at the ingress node(s) to the SR domain.SR can be directly applied to the MPLS architecture with no change to the forwarding plane. A segment is encoded as an MPLS label. An ordered list of segments is encoded as a stack of labels. The segment to process is on the top of the stack. Upon completion of a segment, the related label is popped from the stack.SR can be applied to the IPv6 architecture, with a new type of routing header. A segment is encoded as an IPv6 address. An ordered list of segments is encoded as an ordered list of IPv6 addresses in the routing header. The active segment is indicated by the Destination Address (DA) of the packet. The next active segment is indicated by a pointer in the new routing header.IS-IS Traffic Engineering (TE) Metric ExtensionsIn certain networks, such as, but not limited to, financial information networks (e.g., stock market data providers), network-performance criteria (e.g., latency) are becoming as critical to data-path selection as other metrics.This document describes extensions to IS-IS Traffic Engineering Extensions (RFC 5305). These extensions provide a way to distribute and collect network-performance information in a scalable fashion. The information distributed using IS-IS TE Metric Extensions can then be used to make path-selection decisions based on network performance.Note that this document only covers the mechanisms with which network-performance information is distributed. The mechanisms for measuring network performance or acting on that information, once distributed, are outside the scope of this document.This document obsoletes RFC 7810.Deterministic Networking Use CasesThis document presents use cases for diverse industries that have in common a need for "deterministic flows". "Deterministic" in this context means that such flows provide guaranteed bandwidth, bounded latency, and other properties germane to the transport of time-sensitive data. 

IS-IS Traffic Engineering (TE) Metric Extensions:
In certain networks, such as, but not limited to, financial information
networks (e.g., stock market data providers), network-performance
criteria (e.g., latency) are becoming as critical to data-path selection
as other metrics. This document describes extensions to IS-IS Traffic
Engineering Extensions (RFC 5305). These extensions provide a way to
distribute and collect network-performance information in a scalable
fashion. The information distributed using IS-IS TE Metric Extensions
can then be used to make path-selection decisions based on network
performance. Note that this document only covers the mechanisms with
which network-performance information is distributed. The mechanisms
for measuring network performance or acting on that information, once
distributed, are outside the scope of this document. This document
obsoletes RFC 7810.

Deterministic Networking Use Cases:
This document presents use cases for diverse industries that have in
common a need for "deterministic flows". "Deterministic" in this
context means that such flows provide guaranteed bandwidth, bounded
latency, and other properties germane to the transport of
time-sensitive data. These use cases differ notably in their network
topologies and specific desired behavior, providing as a group broad
industry context for Deterministic Networking (DetNet). For each use
case, this document will identify the use case, identify representative
solutions used today, and describe potential improvements that DetNet
can enable.

Deterministic Networking Architecture:
This document provides the overall architecture for Deterministic
Networking (DetNet), which provides a capability to carry specified
unicast or multicast data flows for real-time applications with
extremely low data loss rates and bounded latency within a network
domain. Techniques used include 1) reserving data-plane resources for
individual (or aggregated) DetNet flows in some or all of the
intermediate nodes along the path of the flow, 2) providing explicit
routes for DetNet flows that do not immediately change with the network
topology, and 3) distributing data from DetNet flow packets over time
and/or space to ensure delivery of each packet's data in spite of the
loss of a path. DetNet operates at the IP layer and delivers service
over lower-layer technologies such as MPLS and Time-Sensitive
Networking (TSN) as defined by IEEE 802.1.

Segment Routing with the MPLS Data Plane:
Segment Routing (SR) leverages the source-routing paradigm. A node
steers a packet through a controlled set of instructions, called
segments, by prepending the packet with an SR header. In the MPLS data
plane, the SR header is instantiated through a label stack. This
document specifies the forwarding behavior to allow instantiating SR
over the MPLS data plane (SR-MPLS).

Deterministic Networking (DetNet) Data Plane: MPLS:
This document specifies the Deterministic Networking (DetNet) data
plane when operating over an MPLS Packet Switched Network. It leverages
existing pseudowire (PW) encapsulations and MPLS Traffic Engineering
(MPLS-TE) encapsulations and mechanisms. This document builds on the
DetNet architecture and data plane framework.

Segment Routing over IPv6 (SRv6) Network Programming:
The Segment Routing over IPv6 (SRv6) Network Programming framework
enables a network operator or an application to specify a packet
processing program by encoding a sequence of instructions in the IPv6
packet header. Each instruction is implemented on one or several nodes
in the network and identified by an SRv6 Segment Identifier in the
packet. This document defines the SRv6 Network Programming concept and
specifies the base set of SRv6 behaviors that enables the creation of
interoperable overlays with underlay optimization.

Tree Engineering for Bit Index Explicit Replication (BIER-TE):
This memo describes per-packet stateless strict and loose path
steered replication and forwarding for Bit Index Explicit Replication
packets (RFC8279). It is called BIER Tree Engineering (BIER-TE) and
is intended to be used as the path steering mechanism for Traffic
Engineering with BIER.
BIER-TE introduces a new semantic for bit positions (BP) that
indicate adjacencies, as opposed to BIER in which BPs indicate Bit-
Forwarding Egress Routers (BFER). BIER-TE can leverage BIER
forwarding engines with few changes. Co-existence of BIER and
BIER-TE forwarding in the same domain is possible, for example by
using separate BIER sub-domains (SDs). Except for the optional
routed adjacencies, BIER-TE does not require a BIER routing underlay,
and can therefore operate without depending on an Interior Gateway
Routing protocol (IGP).
As it operates on the same per-packet stateless forwarding
principles, BIER-TE can also be a good fit to support multicast path
steering in Segment Routing (SR) networks.
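A minimal sketch of the adjacency-bit semantic (the topology, bit
assignments, and link names below are invented for illustration, not
taken from the BIER-TE specification): each bit set in the packet names
one adjacency on the intended tree, and a node forwards one copy over
each of its local adjacencies whose bit is set, clearing its own local
bits in the copies.

   # Illustrative BIER-TE forwarding at one node; bit assignments are invented.
   LOCAL_ADJACENCIES = {3: "link_to_R2", 4: "link_to_R3"}   # local bits only
   LOCAL_MASK = sum(1 << bp for bp in LOCAL_ADJACENCIES)

   def bier_te_forward(bitstring):
       """Send one copy per local adjacency whose bit is set; clear local bits."""
       copies = []
       for bp, link in LOCAL_ADJACENCIES.items():
           if bitstring & (1 << bp):
               copies.append((link, bitstring & ~LOCAL_MASK))
       return copies

   # The controller sets one bit per adjacency of the explicit tree,
   # e.g. bit 3 (this node's link to R2) and bit 7 (a downstream adjacency):
   print(bier_te_forward(0b10001000))
   # -> [('link_to_R2', 128)], i.e. only the downstream bit 7 remains set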

DetNet Bounded Latency:
This document references specific queuing mechanisms, defined in
other documents, that can be used to control packet transmission at
each output port and achieve the DetNet qualities of service. This
document presents a timing model for sources, destinations, and the
DetNet transit nodes that relay packets that is applicable to all of
those referenced queuing mechanisms. Using the model presented in
this document, it should be possible for an implementor, user, or
standards development organization to select a particular set of
queuing mechanisms for each device in a DetNet network, and to select
a resource reservation algorithm for that network, so that those
elements can work together to provide the DetNet service.
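The value of such a timing model is that, once each hop's queuing
mechanism is characterized, per-hop worst-case delays can be composed
into an end-to-end bound. A minimal illustration of that composition
(the numbers and delay components below are invented and are not the
model defined in that document):

   # Composing an end-to-end latency bound from per-hop worst-case components.
   hops = [
       # (queuing bound, processing delay, propagation delay), microseconds
       (20.0, 5.0, 50.0),
       (20.0, 5.0, 120.0),
       (40.0, 5.0, 80.0),
   ]
   bound_us = sum(queuing + processing + propagation
                  for queuing, processing, propagation in hops)
   print(f"worst-case end-to-end latency bound: {bound_us} us")   # 345.0 us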

Segment Routed Time Sensitive Networking:
Routers perform two distinct user-plane functionalities, namely
forwarding (where the packet should be sent) and scheduling (when the
packet should be sent). One forwarding paradigm is segment routing,
in which forwarding instructions are encoded in the packet in a stack
data structure, rather than programmed into the routers. Time
Sensitive Networking and Deterministic Networking provide several
mechanisms for scheduling under the assumption that routers are time
synchronized. The most effective mechanisms for delay minimization
involve per-flow resource allocation.
SRTSN is a unified approach to forwarding and scheduling that uses a
single stack data structure. Each stack entry consists of a
forwarding portion (e.g., IP addresses or suffixes) and a scheduling
portion (deadline by which the packet must exit the router). SRTSN
thus fully implements network programming for time sensitive flows,
by prescribing to each router both to-where and by-when each packet
should be sent.
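A minimal sketch of such a stack of (to-where, by-when) entries (the
field layout and the earliest-deadline-first egress queue shown below
are assumptions for illustration, not the encoding defined in that
draft):

   # Illustrative SRTSN-style stack processing at one router.
   import heapq
   from itertools import count

   packet = {
       "payload": b"...",
       "stack": [("10.0.3.0/24", 900e-6),    # bottom: last hop, exit by 900 us
                 ("10.0.2.0/24", 600e-6),
                 ("10.0.1.0/24", 300e-6)],   # top: this hop, exit by 300 us
   }

   egress_queue = []          # served in deadline order (earliest first)
   tie_breaker = count()

   def process(packet):
       """Pop the top entry: it tells this router to-where and by-when."""
       prefix, deadline = packet["stack"].pop()
       heapq.heappush(egress_queue, (deadline, next(tie_breaker), prefix, packet))

   process(packet)
   deadline, _, prefix, _ = heapq.heappop(egress_queue)
   print(prefix, deadline)    # 10.0.1.0/24 0.0003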

Segment Routing (SR) Based Bounded Latency:
One of the goals of DetNet is to provide bounded end-to-end latency
for critical flows. This document defines how to leverage Segment
Routing (SR) to implement bounded latency. Specifically, the SR
Identifier (SID) is used to specify transmission time (cycles) of a
packet. When forwarding devices along the path follow the
instructions carried in the packet, the bounded latency is achieved.
This is called Cycle Specified Queuing and Forwarding (CSQF) in this
document.
Since SR is a source routing technology, no per-flow state is
maintained at intermediate and egress nodes; SR-based CSQF naturally
supports flow aggregation, which is deemed to be a key capability to
allow DetNet to scale to large networks.
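A minimal sketch of cycle-specified forwarding (the cycle length, SID
layout, and cycle numbers below are invented for illustration, not the
actual CSQF encoding): each segment pins the transmission cycle at one
hop, so end-to-end latency is bounded by the span of cycles regardless
of competing traffic.

   # Illustrative cycle-specified forwarding along a 3-hop segment list.
   CYCLE_US = 10.0                                      # one transmission cycle
   segment_list = [("R1", 4), ("R2", 6), ("R3", 9)]     # (node, transmit cycle)

   for node, cycle in segment_list:
       # Each node holds the packet until the cycle named in its segment,
       # then transmits it within that cycle.
       print(f"{node}: transmit in cycle {cycle} "
             f"(starts at t={cycle * CYCLE_US:.1f} us)")

   first_cycle, last_cycle = segment_list[0][1], segment_list[-1][1]
   latency_bound_us = (last_cycle - first_cycle + 1) * CYCLE_US
   print(f"end-to-end latency bound ~ {latency_bound_us:.1f} us")   # 60.0 us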

Large-Scale Deterministic IP Network:
This document presents the overall framework and key method for
Large-scale Deterministic Network (LDN). LDN can provide bounded
latency and delay variation (jitter) without requiring precise time
synchronization among nodes, or per-flow state in transit nodes.

A Queuing Mechanism with Multiple Cyclic Buffers:
This document presents a queuing mechanism with multiple cyclic
buffers.

IEEE Std 802.1Qch-2017: IEEE Standard for Local and Metropolitan Area
Networks -- Bridges and Bridged Networks -- Amendment 29: Cyclic
Queuing and Forwarding. IEEE Time-Sensitive Networking (TSN) Task Group.

Urgency-Based Scheduler for Time-Sensitive Switched Ethernet Networks.

P802.1Qcr -- Bridges and Bridged Networks Amendment: Asynchronous
Traffic Shaping.

High-Precision Latency Forwarding over Packet-Programmable Networks.