Network Working Group                                          L. Dunbar
Internet Draft                                                 Futurewei
Intended status: Standard                                    K. Majumdar
Expires: October 25, 2023                                      Microsoft
                                                               G. Mishra
                                                                 Verizon
                                                                 H. Song
                                                               Futurewei
                                                          April 25, 2023

                   IP Layer Metrics for Edge Services
               draft-dunbar-cats-edge-service-metrics-00

Abstract

   This draft describes the IP Layer metrics and the methods for
   measuring the running status and environment of edge services, so
   that an IP network can dynamically optimize the forwarding of
   low-latency edge service traffic without any knowledge above the IP
   layer.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document may not be modified,
   and derivative works of it may not be created, except to publish it
   as an RFC and to translate it into languages other than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

Dunbar, et al.          Expires October 25, 2023                [Page 1]
                 IP Layer Metrics for 5G Edge Services

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on October 25, 2023.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document. Code
   Components extracted from this document must include Simplified BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Use Case: 5G Edge Computing
      1.2. Problem 1: Selecting 5G Edge Service Instance Location
      1.3. Problem 2: UE mobility creates unbalanced anycast
           distribution
   2. Conventions used in this document
   3. IP-Layer Metrics Definitions for 5G Edge Services
      3.1. IP-Layer Service ID
      3.2. IP-Layer metric for Service Instances Load Measurement
      3.3. Capacity Index in the overall cost
      3.4. Site Preference Index in the overall cost
      3.5. RTT to an ANYCAST Address in 5G EC
   4. Algorithm in Selecting the optimal Target Location
   5. Scope of IP Layer Metrics Advertisement
   6. Manageability Considerations
   7. Security Considerations
   8. IANA Considerations
   9. References
      9.1. Normative References
      9.2. Informative References
   10. Acknowledgments

1. Introduction

1.1. Use Case: 5G Edge Computing

   In the 5G Edge Computing environment [3GPP-EdgeComputing], one
   application or service can have multiple instances hosted in
   different Edge Computing data centers. Those Edge Computing (mini)
   data centers are usually very close to, or co-located with, 5G base
   stations to minimize latency and optimize the user experience.

   When a UE (User Equipment) sends packets using the destination
   address from a DNS reply or from its own cache, the packets are
   carried in a PDU session through the 5G Core [5GC] to the 5G UPF-PSA
   (User Plane Function - PDU Session Anchor). The UPF-PSA decapsulates
   the 5G GTP outer header and forwards the packets from the UEs to the
   ingress router of the Edge Computing (EC) Local Data Network (LDN).
   The LDN for 5G EC, which is an IP network from the 5GC perspective,
   is responsible for forwarding the packets to the intended
   destinations.

   Routers in the local IP network should be able to select the "best"
   or "closest" location out of many service instances. However,
   distance alone may not be a sufficient metric, because many
   locations may be in close proximity. Moreover, one of the main aims
   of locating the service instance close to the user is to provide
   lower latency.

   When a UE moves and attaches to another UPF, the packets from the UE
   can enter the IP network from a different ingress router. It is
   desirable for the IP network to continue forwarding the packets from
   the UE to the established service instance. As a user keeps moving
   further away, a closer service instance might be able to serve the
   UE better. Network measurements, including the latency of various
   paths, are provided to the ingress router to assist in re-selection.

1.2. Problem 1: Selecting 5G Edge Service Instance Location

   Having multiple locations closer to UEs to host one service can
   greatly improve the user experience.
   But selecting an optimal location for the service traffic from a UE
   may not be that simple. Using DNS to reply with the address of the
   service instance location closest to the requesting UE can encounter
   issues such as:

   - A UE can cache results indefinitely. When the UE moves to a 5G
     cell site very far away, the cached address may still be used,
     which can incur a large network delay.

   - The service instance at the location whose address is returned by
     DNS might be heavily loaded, causing slow or no response, while
     lightly utilized instances of the same service are available at
     other locations in close proximity.

   - The proximity information present in the network (routing) layer
     is not inherently leveraged, resulting in a loss of performance.

   - The local DNS resolver becomes the unit of traffic management.

   Increasingly, ANYCAST is used by application providers and CDNs
   because it makes it possible to dynamically load balance across
   locations that host the application/service instances based on
   network conditions. Selecting service instance locations by ANYCAST
   address leverages the proximity information present in the network
   (routing) layer and eliminates the single point of failure and
   bottleneck at the DNS resolvers and application-layer load
   balancers. Another benefit of using an ANYCAST address is that it
   removes the dependency on UEs that use their cached destination IP
   addresses for extended periods.

   But selecting an ANYCAST location purely based on network conditions
   can result in the location selected by network routing information
   being overutilized while underutilized locations are available close
   by.

1.3. Problem 2: UE mobility creates unbalanced anycast distribution

   Another problem of using an ANYCAST address for multiple locations
   of one service in a 5G environment is that UEs frequently move from
   one 5G site to another.
   The frequent movement of UEs can make it difficult to plan where the
   service instances should be hosted. When a large number of UEs using
   a particular service congregate unpredictably, the ANYCAST location
   selected based on routing distance can be heavily utilized, while
   the instances of the same service at other nearby locations are
   underutilized.

   +--+
   |UE|---\+---------+                 +------------------+
   +--+    |  5G     |   +-----------+ |  S1: aa08::4450  |
   +--+    | Site A  +----+        +----+                 |
   |UE|----|         | Ra |        | R1 | S2: aa08::4460  |
   +--+    |         +----+        +----+                 |
  +---+    |         |   |           | |  S3: aa08::4470  |
  |UE1|---/+---------+   |           | +------------------+
  +---+                  |IP Network |        L-DN1
                         |(3GPP N6)  |
     |                   |           | +------------------+
     |                   |           | |  S1: aa08::4450  |
     |                   |         +----+                 |
     |                   |         | R3 | S2: aa08::4460  |
     v                   |         +----+                 |
                         |           | |  S3: aa08::4470  |
                         |           | +------------------+
                         |           |        L-DN3
   +--+                  |           |
   |UE|---\+---------+   |           | +------------------+
   +--+    |  5G     |   |           | |  S1: aa08::4450  |
   +--+    | Site B  +----+        +----+                 |
   |UE|----|         | Rb |        | R2 | S2: aa08::4460  |
   +--+    |         +----+        +----+                 |
   +--+    |         |   +-----------+ |  S3: aa08::4470  |
   |UE|---/+---------+                 +------------------+
   +--+                                       L-DN2

      Figure 1: Multiple ANYCAST instances in different edge DCs

   This document describes the measurements at the IP Layer that can
   reflect the running status and environment of the service instances
   at specific locations. This document also describes a method of
   incorporating those measurements with the IP routing cost to arrive
   at better criteria for selecting service instance locations.

2. Conventions used in this document

   CATS: Computing-Aware Traffic Steering takes into account the
      dynamic nature of computing resource metrics and network state
      metrics to steer service traffic to a service instance.

   Service: A monolithic function. A composite service can be built by
      orchestrating monolithic services.
   Service instance: A run-time environment (e.g., a server or a
      process on a server) that makes the functionality of a service
      available. One service can have multiple instances running at the
      same or different network locations.

   CS-ID: The CATS Service ID is an identifier representing a service,
      which the clients use to access said service. Such an identifier
      identifies all of the instances of the same service, no matter
      where they are actually running. The CS-ID is independent of
      which service instance serves the service demand. Usually
      multiple instances provide a (logically) single service, and
      service demands are dispatched to the different instances by
      choosing one instance among all available instances.

   CB-ID: The CATS Binding ID is an identifier of a single service
      instance of a given CS-ID. Different service instances provide
      the same service identified through a single CS-ID, but with
      different CATS Binding IDs.

   Service demand: The demand for a specific service identified by a
      specific CS-ID.

   Service request: The request for a specific service instance.

   CATS-router: A network device (usually at the edge of the network)
      that makes forwarding decisions based on CATS information to
      steer traffic belonging to the same service demand to the same
      chosen service instance.

   Ingress CATS-Router: A network edge router that serves as a service
      access point for CATS clients. It steers the service packets onto
      an overlay path to an Egress CATS-Router linked to the most
      suitable edge site to access a service instance.

   CATS-ER: An egress CATS-Router, i.e., the egress endpoint of an
      overlay path to a service instance. CATS-ER is used to describe
      the last router to which the service instances are attached. In a
      5G EC environment, the CATS-ER can be the gateway router to the
      Edge Computing Data Center.
   C-SMA: The CATS Service Metric Agent, responsible for collecting
      service capabilities and status, and for reporting them to the
      C-PS.

   C-NMA: The CATS Network Metric Agent, responsible for collecting
      network capabilities and status, and for reporting them to the
      C-PS.

   C-PS: The CATS Path Selector determines the path toward the
      appropriate service location and service instances to meet a
      service demand given the service status and network status
      information.

   C-TC: The CATS Traffic Classifier is responsible for determining
      which packets belong to a traffic flow for a particular service
      demand, and for steering them on the path to the service instance
      as determined by the C-PS.

   Edge DC: Edge Data Center, which provides the Edge Hosting
      Environment for the edge services. It might be co-located with a
      5G base station and might host 5G core functions in addition to
      frequently used edge server instances.

   gNB: next generation Node B

   PSA: PDU Session Anchor (UPF)

   SSC: Session and Service Continuity

   UE: User Equipment

   UPF: User Plane Function

   ANYCAST Instance: Refers to the service instance at a specific
      location which is reachable by the ANYCAST address.

   Service Instance Location: Represents a cluster of servers at one
      location serving the same service. One service may have a Layer 7
      load balancer, whose address(es) are reachable from the external
      IP network, in front of a set of service instances. From the IP
      network perspective, this whole group of instances is considered
      as one service instance at the location.

   EC: Edge Computing

   Edge Computing Hosting Environment: An environment, such as physical
      or virtual machines, that hosts the service instances.

   NOTE: The above terminologies are the same as those used in 3GPP TR
      23.758.

   L-DN: Local Data Network

   RTT: Round Trip Time

   RTT-ANYCAST: A list of round-trip times to a group of routers that
      have the ANYCAST instances directly attached.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3. IP-Layer Metrics Definitions for 5G Edge Services

3.1. IP-Layer Service ID

   From the network perspective, a service identifier, or IP Layer
   Service ID, is an ANYCAST address that can represent multiple
   service instances at the different locations that host the service.

3.2. IP-Layer metric for Service Instances Load Measurement

   There are many network techniques and protocols to optimize
   forwarding and ensure QoS, such as DSCP/DiffServ, Traffic Engineered
   (TE) solutions, Segment Routing, etc. But most applications and
   services don't expose their internal logic to network operators.
   Their communications are generally encrypted, and most do not
   respond to PING or ICMP messages initiated by routers or network
   elements.

   This document specifies the IP Layer metrics and algorithms that
   enable IP networks to dynamically optimize the forwarding of 5G edge
   computing service traffic without any knowledge above the IP layer.
   Without knowledge of the application's internal logic, the network
   layer (IP Layer) can monitor the traffic patterns to and from the
   service instances at each location to gauge the running status of
   the service at that location. The proposed IP Layer metrics and
   algorithm enable IP networks to be more aware of service behavior
   without depending on information from the services themselves.

   First, the network needs to discover which router(s) have the
   service instances attached. Those routers are called CATS Egress
   Routers, or CATS-ERs for short. A CATS-ER is usually the gateway
   router to an Edge Computing Data Center. To discover whether it is
   the CATS-ER for a specific edge service, a router can periodically
   send a reverse ARP (IPv4) or Neighbor Discovery scan with the
   address of the Service ID to discover whether service instances are
   hosted in its edge computing data center. If so, the router is
   identified as a CATS-ER for the Service ID. For one Service ID,
   there can be many CATS-ERs at different EC Data Centers.

   For a service instance at a specific location, which is identified
   at the IP layer by the address of the service instance, the CATS-ER
   can measure the amount of traffic destined towards the address and
   the amount of traffic from the address, such as:

   - Total number of packets to the attached service instance
     (ToPackets);

   - Total number of packets from the attached service instance
     (FromPackets);

   - Total number of bytes to the attached service instance (ToBytes);

   - Total number of bytes from the attached service instance
     (FromBytes).

   The actual load measurement for the service instance attached to a
   CATS-ER can be based on one of the metrics above, or on all four
   metrics with a different weight applied to each, such as:

      LoadIndex = w1*ToPackets + w2*FromPackets + w3*ToBytes
                  + w4*FromBytes

   where 0 <= wi <= 1 and w1 + w2 + w3 + w4 = 1.
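
   The LoadIndex formula above can be illustrated with a minimal Python
   sketch. The TrafficCounters container and the load_index function
   are hypothetical names introduced only for this illustration; the
   draft does not define an encoding or an implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TrafficCounters:
    """Counters a CATS-ER might keep for one service instance address
    over one measurement period (hypothetical container)."""
    to_packets: int
    from_packets: int
    to_bytes: int
    from_bytes: int


def load_index(c: TrafficCounters, weights: Sequence[float]) -> float:
    """Weighted sum of the four counters, following the LoadIndex formula."""
    if len(weights) != 4:
        raise ValueError("exactly four weights (w1..w4) are expected")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must satisfy 0 <= wi <= 1")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    w1, w2, w3, w4 = weights
    return (w1 * c.to_packets + w2 * c.from_packets
            + w3 * c.to_bytes + w4 * c.from_bytes)


# Example values are made up for illustration.
counters = TrafficCounters(to_packets=100, from_packets=200,
                           to_bytes=1000, from_bytes=2000)
print(load_index(counters, (0.25, 0.25, 0.25, 0.25)))  # prints 825.0
```

   Byte counts are typically much larger than packet counts, so a
   deployment would probably normalize the counters before weighting
   them. The draft does not specify such a step, and the sketch omits
   it.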
   The weight of each metric contributing to the load index of the
   service instance attached to a CATS-ER can be configured, or learned
   by self-adjusting based on user feedback.

   The raw measurements are useful when the CATS-ERs cannot be
   configured with a consistent algorithm to compute the aggregated
   load index and the raw measurements are needed by a central analytic
   system.

   The CATS-ER can advertise either the aggregated Load Index or the
   raw measurements periodically, by BGP UPDATE messages (in-band) or
   BGP-LS (via controller), to a group of routers that have traffic
   destined towards the ANYCAST addresses of those services.

   It is better to have applications or their controllers directly
   report their own workload running status to the network. When it is
   not feasible to have the third-party application controller provide
   the workload information to the network operators, the proposed IP
   Layer load measurements provide an intelligent estimate of the
   instance running status at a specific location.

3.3. Capacity Index in the overall cost

   The Capacity Index indicates the capacity value for a site or a pod
   where the edge services are hosted. An edge site can be at full
   capacity, at reduced capacity, or completely out of service.

   Cloud site/pod failures and degradation include, but are not limited
   to, a site capacity degradation or an entire site going down for a
   variety of reasons, such as a fiber cut connecting to the site or
   among pods within one site, cooling failures, insufficient backup
   power, cyber attacks, too many changes outside of the maintenance
   window, etc. Fiber cuts are not uncommon within a cloud site or
   between sites.

   When those failure events happen, the edge (egress) router visible
   to the ingress routers can be running fine.
   Therefore, the ingress routers with paths to the egress routers
   can't use BFD to detect the failures.

   When a failure occurs at an edge site (or pod), many instances can
   be impacted. In addition, the routes (i.e., the IP addresses) in an
   edge cloud site might not be aggregated nicely. Instead of sending
   many BGP UPDATE messages, one for each instance, to the impacted
   ingress routers, the egress router can send one single BGP UPDATE
   indicating the capacity of the site. The ingress routers can switch
   all or a portion of the instances that are associated with the site,
   depending on how much the site is degraded.

   Site capacity should be represented as the percentage of the site
   that is available, e.g., 100%, 50%, or 0%. When a site goes dark,
   the index is set to 0. An index of 50 means that 50% of the capacity
   is functioning.

3.4. Site Preference Index in the overall cost

   As described in [IPv6-StickyService] and [ISPF-EXT-EC], an EC sticky
   service needs to connect a UE to the service instance that has been
   serving the UE before the UE moved to a new 5G site, unless there is
   a failure at that location.

   To stick a flow from one specific UE to a specific site, a "Site
   Preference Index" is created. The value of the Site Preference Index
   can be manipulated so that packets of some flows are steered towards
   an instance location farther away in routing distance. The Site
   Preference Index enables some sites to be more preferred than others
   for handling the UE traffic to an instance.

3.5. RTT to an ANYCAST Address in 5G EC

   ANYCAST as used in the 5G Edge Computing environment is slightly
   different from ANYCAST as typically deployed. A typical ANYCAST
   address is used to represent instances in vastly different
   geographical locations, such as different continents. The ANYCAST
   address for "app.net" in Asia leads packets to a server instance of
   "app.net" hosted in Asia.
   Therefore, the RTT for "app.net" in Asia is a single value that
   represents the round-trip time to the server in Asia that hosts
   "app.net".

   A 5G Edge Computing environment can have one service hosted in
   multiple Edge Computing DCs in close proximity. Routers, i.e., the
   ingress routers to the 5G LDN (Local Data Network), can forward
   packets for the ANYCAST address of "app.net" to different egress
   routers that have "app.net" instances attached. If "app.net" is
   hosted in four different 5G Edge Computing Data Centers, all those
   DCs have the same ANYCAST address for "app.net". The RTT to the
   "app.net" ANYCAST address needs to be a group of values (instead of
   the single RTT value to a unicast address). Each entry in the group
   should include the specific unicast address (e.g., the loopback
   address) of the CATS-ER to which the service instance is attached.
   The RTT to the "app.net" ANYCAST address is represented as:

      List of {Egress Router address, RTT value}

   This list is called "RTT-ANYCAST".

   To better optimize the ANYCAST traffic, each router adjacent to a 5G
   PSA needs to periodically measure the RTT to the list of CATS-ERs
   that advertise the ANYCAST address. The RTT to the egress router at
   Site-i is considered the RTT to the ANYCAST instance at Site-i.

4. Algorithm in Selecting the optimal Target Location

   The goal of the algorithm is to equalize the traffic among multiple
   locations of the same ANYCAST address. The main benefit of using
   ANYCAST is to leverage the IP-layer information to equalize the
   traffic among multiple locations of the same service, usually
   identified by one or a group of ANYCAST addresses.

   For the 5G Edge Computing environment, the ingress router to each
   LDN needs to be notified of the Load Index and Capacity Index of the
   service instances at the different EC sites to make an intelligent
   decision on where to forward the traffic from UEs for the service.
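
   As an illustration only, the per-site state that an ingress router
   would need to hold (the RTT-ANYCAST entries of Section 3.5 together
   with the advertised Load Index and Capacity Index) might be
   organized as in the following Python sketch. The SiteMetrics record,
   the table layout, and the two helper functions are assumptions made
   for this example, and the 2001:db8:: loopback addresses are
   documentation addresses, not values from this draft.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SiteMetrics:
    """State for one CATS-ER that advertises an ANYCAST address
    (hypothetical record)."""
    egress_router: str    # unicast (e.g., loopback) address of the CATS-ER
    rtt_ms: float         # RTT measured by the ingress router
    load_index: float     # aggregated Load Index (Section 3.2)
    capacity_index: int   # percentage, 0 (dark) to 100 (Section 3.3)


# Keyed by the ANYCAST address that serves as the IP Layer Service ID.
ServiceTable = Dict[str, List[SiteMetrics]]


def rtt_anycast(table: ServiceTable, anycast: str) -> List[Tuple[str, float]]:
    """Return the RTT-ANYCAST list: {Egress Router address, RTT value}."""
    return [(s.egress_router, s.rtt_ms) for s in table.get(anycast, [])]


def usable_sites(table: ServiceTable, anycast: str) -> List[SiteMetrics]:
    """Assumption of this sketch: a Capacity Index of 0 (site dark)
    removes the site from the candidate set."""
    return [s for s in table.get(anycast, []) if s.capacity_index > 0]


# Example values are made up for illustration.
table: ServiceTable = {
    "aa08::4450": [
        SiteMetrics("2001:db8::1", rtt_ms=4.0, load_index=825.0,
                    capacity_index=100),
        SiteMetrics("2001:db8::2", rtt_ms=7.5, load_index=300.0,
                    capacity_index=0),
    ],
}
print(rtt_anycast(table, "aa08::4450"))
```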
   The algorithm needs to take the following attributes into
   consideration:

   - Load Measurement Index [Section 3.2],

   - Capacity Index [Section 3.3],

   - Preference Index [Section 3.4], and

   - network delay [Section 3.5].

   Here is an algorithm for a router, e.g., the router directly
   attached to the 5G PSA, to compare the cost to reach the service
   instances at Site-i or Site-j:

                      Load-i * CP-j               Pref-j * Delay-i
   Cost-i = min(w * (---------------) + (1-w) * (------------------))
                      Load-j * CP-i               Pref-i * Delay-j

   Load-i: Load Index at Site-i. It is the weighted combination of the
      total packets and bytes sent to and received from the service
      instance at Site-i during a fixed time period.

   CP-i (Capacity-i): Capacity Index at Site-i. A higher value means
      higher capacity.

   Delay-i: Network latency measurement (RTT) to the CATS-ER that has
      the service instances attached at Site-i.

   Pref-i (Preference Index): Network Preference Index for Site-i. A
      higher value means higher preference.

   w: Weight for load and site information, which is a value between 0
      and 1. If smaller than 0.5, network latency and the site
      preference have more influence; otherwise, server load and its
      capacity have more influence.

5. Scope of IP Layer Metrics Advertisement

   Each service might be used by a small group of UEs. Therefore, it is
   not necessary for a CATS-ER to advertise the IP Layer metrics to all
   other routers in the 5G LDN. Likewise, each EC Data Center may only
   host a small number of low-latency services.

   "Service ID Bound Group Routers" refers to a group of routers that
   are interested in a group of specific ANYCAST addresses. The IP
   Layer metrics for a specific Service ID should be advertised among
   the routers in the "Service ID Bound Group Routers".

   BGP RT Constrained Distribution [RFC4684] can be used to form the
   "Service ID Bound Group Routers".
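
   Returning briefly to the cost comparison of Section 4, the
   expression inside min() can be sketched in Python as below. The Site
   record and the relative_cost function are hypothetical names. The
   draft does not state the set over which min() is taken, so this
   sketch evaluates only the pairwise term for a given Site-i and
   Site-j and leaves the minimization to the caller.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Inputs for one site (hypothetical record for this sketch)."""
    load: float      # Load-i, Section 3.2
    capacity: float  # CP-i, higher means more capacity, Section 3.3
    pref: float      # Pref-i, higher means more preferred, Section 3.4
    delay: float     # Delay-i, RTT to the CATS-ER, Section 3.5


def relative_cost(i: Site, j: Site, w: float) -> float:
    """Pairwise term of the Section 4 formula for Site-i against Site-j."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1")
    load_den = j.load * i.capacity
    net_den = i.pref * j.delay
    if load_den == 0 or net_den == 0:
        # Assumption: sites with zero load, capacity, preference, or delay
        # need separate handling that the draft does not describe.
        raise ValueError("zero denominator; handle such sites separately")
    load_term = (i.load * j.capacity) / load_den
    net_term = (j.pref * i.delay) / net_den
    return w * load_term + (1.0 - w) * net_term


# Example values are made up: Site A is closer but four times as loaded.
site_a = Site(load=800.0, capacity=100.0, pref=1.0, delay=5.0)
site_b = Site(load=200.0, capacity=100.0, pref=1.0, delay=10.0)
print(relative_cost(site_a, site_b, w=0.5))  # prints 2.25
print(relative_cost(site_b, site_a, w=0.5))  # prints 1.125
```

   With these made-up inputs and w = 0.5, Site B has the lower relative
   cost even though its delay is twice that of Site A, because its load
   is one quarter of Site A's. Two sites with identical inputs yield a
   value of 1.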
   Since there are many more Service IDs than routers in the 5G LDN, a
   more practical way to form the "Service ID Bound Group Routers" is
   for each ingress router to query a network controller, upon
   receiving the first packet to a specific ANYCAST address, to be
   included in the "Service ID Bound Group Routers". There should be a
   timer associated with each ingress router, as the UE that uses the
   Service ID might move away. When the timer expires, the ingress
   router is removed from the "Service ID Bound Group Routers".

6. Manageability Considerations

   To be added.

7. Security Considerations

   To be added.

8. IANA Considerations

   To be added.

9. References

9.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
             Networks (VPNs)", RFC 4364, February 2006.

   [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
             2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
             May 2017.

   [RFC8200] Deering, S. and R. Hinden, "Internet Protocol, Version 6
             (IPv6) Specification", RFC 8200, July 2017.

9.2. Informative References

   [3GPP-EdgeComputing] 3GPP TS 23.548 V18.1.1, "3rd Generation
             Partnership Project; Technical Specification Group
             Services and System Aspects; 5G System Enhancements for
             Edge Computing; Stage 2", Release 18, April 2023.

   [RFC5512] Mohapatra, P. and E. Rosen, "The BGP Encapsulation
             Subsequent Address Family Identifier (SAFI) and the BGP
             Tunnel Encapsulation Attribute", RFC 5512, April 2009.

   [BGP-SDWAN-Port] Dunbar, L., Wang, H., and W. Hao, "BGP Extension
             for SDWAN Overlay Networks",
             draft-dunbar-idr-bgp-sdwan-overlay-ext-03,
             work-in-progress, November 2018.

   [SDWAN-EDGE-Discovery] Dunbar, L., Hares, S., Raszuk, R., and K.
             Majumdar, "BGP UPDATE for SDWAN Edge Discovery",
             draft-dunbar-idr-sdwan-edge-discovery-00,
             work-in-progress, July 2020.

   [Tunnel-Encap] Rosen, E., et al., "The BGP Tunnel Encapsulation
             Attribute", draft-ietf-idr-tunnel-encaps-10,
             work-in-progress, August 2018.

10. Acknowledgments

   Acknowledgements to XXX for their review and contributions.

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Linda Dunbar
   Futurewei
   Email: ldunbar@futurewei.com

   Kausik Majumdar
   Microsoft
   Email: kmajumdar@microsoft.com

   Gyan Mishra
   Verizon
   Email: gyan.s.mishra@verizon.com

   HaoYu Song
   Futurewei
   Email: haoyu.song@futurewei.com