Workgroup: Network Working Group
Internet-Draft: draft-rcr-opsawg-operational-compute-metrics-02
Published: March 2024
Intended Status: Informational
Expires: 5 September 2024
Authors:
  S. Randriamasy, Nokia Bell Labs
  L. M. Contreras, Telefonica
  J. Ros-Giralt, Qualcomm Europe, Inc.
  R. Schott, Deutsche Telekom

Joint Exposure of Network and Compute Information for Infrastructure-Aware Service Deployment

Abstract

Service providers are starting to deploy computing capabilities across the network for hosting applications such as distributed AI workloads, AR/VR, vehicle networks, and IoT, among others. In this network-compute environment, knowing information about the availability and state of the underlying communication and compute resources is necessary to determine both the proper deployment location of the applications and the most suitable servers on which to run them. Further, this information is used by numerous use cases with different interpretations. This document proposes an initial approach towards a common understanding and exposure scheme for metrics reflecting compute and communication capabilities.

About This Document

This note is to be removed before publishing as an RFC.

The latest revision of this draft can be found at https://giralt.github.io/draft-rcr-opsawg-operational-compute-metrics/draft-rcr-opsawg-operational-compute-metrics.html. Status information for this document may be found at https://datatracker.ietf.org/doc/draft-rcr-opsawg-operational-compute-metrics/.

Source for this draft and an issue tracker can be found at https://github.com/giralt/draft-rcr-opsawg-operational-compute-metrics.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 5 September 2024.

1. Introduction

Operators are starting to deploy distributed computing environments in different parts of the network to address different service needs, including latency, bandwidth, processing capabilities, and storage. This translates into the emergence of a number of data centers (both in the cloud and at the edge) of different sizes (e.g., large, medium, small), each characterized by a distinct dimensioning of CPU, memory, and storage capabilities, as well as of the bandwidth capacity for forwarding the traffic generated in and out of the data center.

The proliferation of the edge computing paradigm further increases the potential footprint and heterogeneity of the environments where a function or application can be deployed, resulting in different unit costs for CPU, memory, and storage. This increases the complexity of deciding where a given function or application should best be deployed or executed. This decision should be jointly influenced, on the one hand, by the available resources in a given computing environment and, on the other hand, by the capabilities of the network path connecting the traffic source with the destination.

Network- and compute-aware function placement and selection has become of utmost importance in the last decade. The availability of such information is taken for granted by the numerous service providers and bodies that are specifying them. However, deployments may reach out to data centers running different implementations, with different understandings and representations of compute capabilities, which makes smooth operation a challenge. While standardization efforts on the representation and exposure of network capabilities are well advanced, similar efforts on compute capabilities are in their infancy.

This document proposes an initial approach towards a common understanding and exposure scheme for metrics reflecting compute capabilities. It aims to leverage existing work in the IETF on compute metrics definitions to build synergies. It also aims to reach out to working or research groups in the IETF that would consume such information and have particular requirements.

2. Conventions and Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Problem Space and Needs

Visibility and exposure of both (1) network and (2) compute resources to the application is critical to enable the proper functioning of the new class of services arising at the edge (e.g., distributed AI, driverless vehicles, AR/VR, etc.). To understand the problem space and the capabilities that are lacking in today's protocol interfaces needed to enable these new services, we focus on the life cycle of a service.

At the edge, compute nodes are deployed near communication nodes (e.g., co-located in a 5G base station) to provide computing services that are close to users, with the goal of (1) reducing latency, (2) increasing communication bandwidth, (3) enabling privacy/personalization (e.g., federated AI learning), and (4) reducing cloud costs and energy. Services are deployed on the communication and compute infrastructure through a two-phase life cycle that involves first a service deployment stage and then a service selection stage (Figure 1).

 +-------------+      +--------------+      +-------------+
 |             |      |              |      |             |
 |  New        +------>  Service     +------>  Service    |
 |  Service    |      |  Deployment  |      |  Selection  |
 |             |      |              |      |             |
 +-------------+      +--------------+      +-------------+
Figure 1: Service life cycle.

Service deployment. This phase is carried out by the service provider and consists of deploying a new service (e.g., distributed AI training/inference or an XR/AR service) on the communication and compute infrastructure. The service provider needs to properly size the amount of communication and compute resources assigned to this new service to meet the expected user demand. The decision on where the service is deployed and how many resources are requested from the infrastructure depends on the levels of QoE that the provider wants to guarantee to the user base. To make a proper deployment decision, the provider must have visibility on the resources available from the infrastructure, including communication resources (e.g., latency and bandwidth) and compute resources (e.g., CPU, GPU, memory, storage). For instance, to run a Large Language Model (LLM) with 175 billion parameters, a total aggregated memory of 400 GB and 8 GPUs are needed. The service provider needs an interface to query the infrastructure, extract the available compute and communication resources, and decide which subset of resources is needed to run the service.
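
As a rough illustration of the sizing step described above, the following Python sketch filters candidate sites against the requirements of a service. The site names, field names, and all numeric values except the 400 GB and 8 GPU figures from the LLM example are hypothetical; this is a sketch under those assumptions, not a proposed interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteResources:
    """Hypothetical snapshot of what one site exposes to the provider."""
    name: str
    free_gpus: int
    free_memory_gb: float
    latency_ms: float       # latency from the user population to the site
    bandwidth_gbps: float   # bandwidth available towards the site


@dataclass(frozen=True)
class ServiceRequirements:
    min_gpus: int
    min_memory_gb: float
    max_latency_ms: float
    min_bandwidth_gbps: float


def feasible_sites(sites, req):
    """Return the sites meeting every compute and communication requirement."""
    return [
        s for s in sites
        if s.free_gpus >= req.min_gpus
        and s.free_memory_gb >= req.min_memory_gb
        and s.latency_ms <= req.max_latency_ms
        and s.bandwidth_gbps >= req.min_bandwidth_gbps
    ]


# Illustrative values only (assumptions, not measurements).
SITES = [
    SiteResources("edge-a", free_gpus=4, free_memory_gb=256,
                  latency_ms=5, bandwidth_gbps=10),
    SiteResources("metro-b", free_gpus=8, free_memory_gb=512,
                  latency_ms=15, bandwidth_gbps=40),
    SiteResources("cloud-c", free_gpus=64, free_memory_gb=4096,
                  latency_ms=60, bandwidth_gbps=100),
]
# 8 GPUs and 400 GB come from the LLM example; the network bounds are assumed.
LLM_REQ = ServiceRequirements(min_gpus=8, min_memory_gb=400,
                              max_latency_ms=20, min_bandwidth_gbps=10)

candidates = [s.name for s in feasible_sites(SITES, LLM_REQ)]
# candidates == ["metro-b"]: edge-a lacks GPUs, cloud-c exceeds the latency bound
```

The example shows why both kinds of information are needed: compute data alone would admit "cloud-c", and network data alone would admit "edge-a".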

Service selection. This phase is initiated by the user, through a client application that connects to the deployed service. Two main decisions must be made in the service selection stage: compute node selection and path selection. In the compute node selection step, as the service is generally replicated in N locations (e.g., by leveraging a microservices architecture), the application must decide which of the service replicas it connects to. Similar to the service deployment stage, this decision requires knowledge about the communication and compute resources available at each replica. In the path selection step, on the other hand, the application must decide which path it uses to connect to the service. This decision depends on the communication properties (e.g., bandwidth and latency) of the available paths. Similar to the service deployment case, the service provider needs an interface to query the infrastructure and extract the available compute and communication resources in order to make informed node and path selection decisions. Ideally, the node and path selection decisions should be jointly optimized, since in general the best end-to-end performance is achieved by taking both decisions into account together. In some cases, however, such decisions may be owned by different players. For instance, in some network environments, the path selection may be decided by the network operator, whereas the node selection may be decided by the application. Even in these cases, it is crucial to have a proper interface (for both the network operator and the service provider) to query the available compute and communication resources from the system.
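
The benefit of jointly optimizing node and path selection can be illustrated with a small Python sketch. It assumes, purely for illustration, that end-to-end time can be modelled as the sum of a per-replica processing time and a per-path network latency; the replica names, path names, and values are hypothetical.

```python
# Hypothetical per-replica processing time and per-path latency, both in ms.
COMPUTE_MS = {"replica-1": 10.0, "replica-2": 25.0}
PATHS_MS = {
    "replica-1": {"path-a": 80.0, "path-b": 95.0},
    "replica-2": {"path-c": 20.0, "path-d": 30.0},
}


def select_separately(compute_ms, paths_ms):
    """Pick the fastest node first, then the fastest path to that node."""
    node = min(compute_ms, key=compute_ms.get)
    path = min(paths_ms[node], key=paths_ms[node].get)
    return node, path, compute_ms[node] + paths_ms[node][path]


def select_jointly(compute_ms, paths_ms):
    """Pick the (node, path) pair with the lowest end-to-end time."""
    options = [
        (compute_ms[node] + latency, node, path)
        for node, paths in paths_ms.items()
        for path, latency in paths.items()
    ]
    total, node, path = min(options)
    return node, path, total


separate = select_separately(COMPUTE_MS, PATHS_MS)
joint = select_jointly(COMPUTE_MS, PATHS_MS)
# separate == ("replica-1", "path-a", 90.0)
# joint    == ("replica-2", "path-c", 45.0)
```

With these assumed values, choosing the fastest replica first leads to a slow path, while the joint decision accepts a slower replica in exchange for a much better path.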

Table 1 summarizes the problem space, the information that needs to be exposed, and the stakeholders that need this information.

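Table 1: Problem space, needs, and stakeholders.
+--------------------+---------------+--------------------------+
| Action to take     | Information   | Who needs it             |
|                    | needed        |                          |
+--------------------+---------------+--------------------------+
| Service placement  | Compute and   | Service provider         |
|                    | communication |                          |
+--------------------+---------------+--------------------------+
| Service selection/ | Compute       | Network/service provider |
| node selection     |               | and/or application       |
+--------------------+---------------+--------------------------+
| Service selection/ | Communication | Network/service provider |
| path selection     |               | and/or application       |
+--------------------+---------------+--------------------------+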

4. Use Cases

4.1. Distributed AI Workloads

Generative AI is a technological feat that opens up many applications such as holding conversations, generating art, developing a research paper, or writing software, among many others. Yet this innovation comes with a high cost in terms of processing and power consumption. While data centers are already running at capacity, it is projected that transitioning current search engine queries to leverage generative AI will increase costs by 10 times compared to traditional search methods [DC-AI-COST]. As (1) computing nodes (CPUs and GPUs) are deployed to build the edge cloud through technologies like 5G and (2) billions of mobile user devices globally provide a large untapped computational platform, shifting part of the processing from the cloud to the edge becomes a viable and necessary step towards enabling the AI transition. There are at least four drivers supporting this trend:

  • Computational and energy savings: Due to savings from not needing large-scale cooling systems and the high performance-per-watt efficiency of the edge devices, some workloads can run at the edge at a lower computational and energy cost [EDGE-ENERGY], especially when considering not only processing but also data transport.

  • Latency: For applications such as driverless vehicles which require real-time inference at very low latency, running at the edge is necessary.

  • Reliability and performance: Peaks in cloud demand for generative AI queries can create large queues and latency, and in some cases even lead to denial of service. In some cases, limited or no connectivity requires running the workloads at the edge.

  • Privacy, security, and personalization: A "private mode" allows users to strictly utilize on-device (or near-the-device) AI to enter sensitive prompts to chatbots, such as health questions or confidential ideas.

These drivers lead to a distributed computational model that is hybrid: Some AI workloads will fully run in the cloud, some will fully run in the edge, and some will run both in the edge and in the cloud. Being able to efficiently run these workloads in this hybrid, distributed, cloud-edge environment is necessary given the aforementioned massive energy and computational costs. To make optimized service and workload placement decisions, information about both the compute and communication resources available in the network is necessary too.

Consider as an example a large language model (LLM) used to generate text and hold intelligent conversations. LLMs produce a single token per inference, where a token is roughly equivalent to a word. Pipelining and parallelization techniques are used to optimize inference, but this means that a model like GPT-3 could potentially go through all of its 175 billion parameters to generate a single word. To efficiently run these computationally intensive workloads, it is necessary to know the availability of compute resources in the distributed system. Suppose that a user is driving a car while conversing with an AI model. The model can run inference on a variety of compute nodes, ordered from lower to higher compute power as follows: (1) the user's phone, (2) the computer in the car, (3) the 5G edge cloud, and (4) the datacenter cloud. Correspondingly, the system can deploy four different models with different levels of precision and compute requirements. The simplest model, with the fewest parameters, can run on the phone, requiring less compute power but yielding lower accuracy. Three other models, ordered by increasing accuracy and computational complexity, can run in the car, the edge, and the cloud. The application can identify the right trade-off between accuracy and computational cost, combined with metrics of communication bandwidth and latency, to decide which of the four models to use for every inference request. Note that this is similar to the resolution/bandwidth trade-off commonly found in the image encoding problem, where an image can be encoded and transmitted at different levels of resolution depending on the available bandwidth in the communication channel. In the case of AI inference, however, not only bandwidth but also compute is a scarce resource.

Extensions to Application-Layer Traffic Optimization (ALTO) [RFC7285] that support the exposure of compute resources would allow applications to make optimized decisions on selecting the right computational resource, supporting the efficient execution of hybrid AI workloads.
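
The per-request model choice in the driving example can be sketched as follows. The accuracy scores, inference times, and round-trip times are hypothetical values chosen for illustration, and the selection rule (most accurate tier whose total response time fits a latency budget) is one possible policy, not one defined by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    location: str
    accuracy: float        # hypothetical relative accuracy score in [0, 1]
    inference_ms: float    # hypothetical time to produce a response
    network_rtt_ms: float  # round-trip time to reach the node (0 on-device)
    available: bool = True


def pick_tier(tiers, latency_budget_ms):
    """Return the most accurate available tier within the budget, or None."""
    within_budget = [
        t for t in tiers
        if t.available
        and t.inference_ms + t.network_rtt_ms <= latency_budget_ms
    ]
    return max(within_budget, key=lambda t: t.accuracy, default=None)


# The four tiers from the example, ordered from lower to higher compute power.
TIERS = [
    ModelTier("phone", accuracy=0.70, inference_ms=120, network_rtt_ms=0),
    ModelTier("car", accuracy=0.80, inference_ms=90, network_rtt_ms=2),
    ModelTier("edge", accuracy=0.90, inference_ms=60, network_rtt_ms=20),
    ModelTier("cloud", accuracy=0.97, inference_ms=40, network_rtt_ms=150),
]

relaxed = pick_tier(TIERS, latency_budget_ms=250)  # relaxed.location == "cloud"
tight = pick_tier(TIERS, latency_budget_ms=100)    # tight.location == "edge"
```

The decision needs compute information (inference time, availability) and communication information (round-trip time) at the same moment, which is the joint exposure this document argues for.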

4.2. Open Abstraction for Edge Computing

Modern applications such as AR/VR, V2X, or IoT require bringing compute closer to the edge in order to meet strict bandwidth, latency, and jitter requirements. While this deployment process resembles the path taken by the main cloud providers (notably, AWS, Facebook, Google, and Microsoft) to deploy their large-scale datacenters, the edge presents a key difference: datacenter clouds (both in terms of their infrastructure and the applications run by them) are owned and managed by a single organization, whereas edge clouds involve a complex ecosystem of operators, vendors, and application providers, all striving to provide a quality end-to-end solution to the user. This implies that, while the traditional cloud has been implemented for the most part by using vertically optimized and closed architectures, the edge will necessarily need to rely on a complete ecosystem of carefully designed open standards to enable horizontal interoperability across all the involved parties. This document envisions ALTO playing a role as part of the ecosystem of open standards that are necessary to deploy and operate the edge cloud.

As an example, consider a user of an XR application who arrives at his/her home by car. The application runs by leveraging compute capabilities from both the car and the public 5G edge cloud. As the user parks the car, 5G coverage may diminish (due to building interference) making the home local Wi-Fi connectivity a better choice. Further, instead of relying on computational resources from the car and the 5G edge cloud, latency can be reduced by leveraging computing devices (PCs, laptops, tablets) available from the home edge cloud. The application's decision to switch from one domain to another, however, demands knowledge about the compute and communication resources available both in the 5G and the Wi-Fi domains, therefore requiring interoperability across multiple industry standards (for instance, IETF and 3GPP on the public side, and IETF and LF Edge [LF-EDGE] on the private home side). ALTO can be positioned to act as an abstraction layer supporting the exposure of communication and compute information independently of the type of domain the application is currently residing in.

Future versions of this document will elaborate further on this use case.

4.3. Optimized Placement of Microservice Components

Current applications are transitioning from a monolithic service architecture towards the composition of microservice components, following cloud-native trends. The set of microservices can have associated SLOs that impose constraints not only in terms of required compute resources (CPU, storage, ...), which depend on the compute facilities available, but also in terms of performance indicators such as latency and bandwidth, which impose restrictions on the networking capabilities connecting the computing facilities. Even more complex constraints, such as affinity among certain microservice components, could require complex calculations for selecting the most appropriate compute nodes, taking into consideration both network and compute information.

Thus, service/application orchestrators can benefit from the information exposed by ALTO at the time of deciding the placement of the microservices in the network.
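
The following Python sketch shows how compute capacity, latency bounds between communicating microservices, and an affinity constraint can interact in a placement decision. The node names, microservice names, and values are hypothetical, and the exhaustive search is only practical for very small instances; a real orchestrator would use its own scheduling logic.

```python
from itertools import product

# Hypothetical inputs: free CPU cores per node and inter-node latency (ms).
NODE_CPU = {"n1": 4, "n2": 8, "n3": 2}
LATENCY_MS = {("n1", "n2"): 5, ("n1", "n3"): 12, ("n2", "n3"): 30}

SERVICE_CPU = {"frontend": 2, "inference": 6, "cache": 2}
# Pairs of microservices that exchange traffic, with a latency bound (ms).
MAX_LATENCY_MS = {("frontend", "inference"): 10, ("frontend", "cache"): 15}
# Affinity constraint: these microservices must share a node.
AFFINITY = [("frontend", "cache")]


def latency(a, b):
    """Symmetric latency lookup; co-located services see zero latency."""
    if a == b:
        return 0
    return LATENCY_MS.get((a, b), LATENCY_MS.get((b, a)))


def valid(placement):
    """Check CPU capacity, affinity, and latency bounds for one placement."""
    used = {}
    for svc, node in placement.items():
        used[node] = used.get(node, 0) + SERVICE_CPU[svc]
    if any(used[n] > NODE_CPU[n] for n in used):
        return False
    if any(placement[a] != placement[b] for a, b in AFFINITY):
        return False
    return all(latency(placement[a], placement[b]) <= bound
               for (a, b), bound in MAX_LATENCY_MS.items())


def place():
    """Search all placements; return the valid one with least total latency."""
    services = list(SERVICE_CPU)
    best, best_cost = None, None
    for nodes in product(NODE_CPU, repeat=len(services)):
        placement = dict(zip(services, nodes))
        if not valid(placement):
            continue
        cost = sum(latency(placement[a], placement[b])
                   for a, b in MAX_LATENCY_MS)
        if best_cost is None or cost < best_cost:
            best, best_cost = placement, cost
    return best


result = place()
# result == {"frontend": "n1", "inference": "n2", "cache": "n1"}
```

With these assumed values, the affinity pair only fits on "n1" once "inference" occupies "n2", and the 5 ms latency between the two nodes satisfies the 10 ms bound, so neither the compute view nor the network view alone would suffice to find the placement.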

6. Metrics Exposure

Regarding metrics exposure, one can distinguish between (1) how the metrics are exposed and (2) which kinds of metrics need to be exposed. The infrastructure resources can be divided into network-related and compute-related resources. Network-based resources can roughly be subdivided according to the network structure into edge, backbone, and cloud resources.

This section gives a brief outlook on these resources to stimulate additional discussion with related work going on in other IETF working groups or standardization bodies.

6.1. Edge Resources

Edge resources refer to metrics such as latency, bandwidth, compute latency, and traffic breakout.

6.2. Network Resources

Network resources relate to the traditional network infrastructure. Table 2 provides an overview of some commonly used metrics.

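Table 2
+---------+------------------+
| Network | Kind of Resource |
+---------+------------------+
| Path #1 | QoS              |
|         |   Latency        |
|         |   Bandwidth      |
|         |   RTT            |
|         |   Packet Loss    |
|         |   Jitter         |
+---------+------------------+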
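
As an illustration of how some of the per-path metrics in Table 2 could be derived from per-link measurements, the Python sketch below composes latency, bandwidth, and packet loss along one path. The link values are hypothetical, and the loss formula assumes independent losses on each link; jitter is left out because it does not compose in such a simple way.

```python
from math import prod

# Hypothetical per-link measurements along one path.
LINKS = [
    {"latency_ms": 2.0, "bandwidth_mbps": 1000, "loss": 0.001},
    {"latency_ms": 8.0, "bandwidth_mbps": 400, "loss": 0.002},
    {"latency_ms": 5.0, "bandwidth_mbps": 10000, "loss": 0.0},
]


def path_metrics(links):
    """Compose per-link metrics into end-to-end path metrics."""
    return {
        # One-way latency adds up along the path.
        "latency_ms": sum(link["latency_ms"] for link in links),
        # Throughput is limited by the bottleneck link.
        "bandwidth_mbps": min(link["bandwidth_mbps"] for link in links),
        # A packet survives only if it survives every link
        # (assumes independent losses).
        "loss": 1 - prod(1 - link["loss"] for link in links),
    }


metrics = path_metrics(LINKS)
# metrics["latency_ms"] == 15.0, metrics["bandwidth_mbps"] == 400
```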

6.3. Cloud Resources

Table 3 provides an example of parameters that could be exposed:

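Table 3
+------------+---------+--------------------------------+
| CPU        | Compute | Sum of available CPU resources |
+------------+---------+--------------------------------+
| Memory     | Compute | Sum of available memory        |
+------------+---------+--------------------------------+
| Storage    | Storage | Sum of available storage       |
+------------+---------+--------------------------------+
| Configmaps | Object  | Sum of config maps             |
+------------+---------+--------------------------------+
| Secrets    | Object  | Sum of possible secrets        |
+------------+---------+--------------------------------+
| Pods       | Object  | Sum of possible pods           |
+------------+---------+--------------------------------+
| Jobs       | Object  | Sum of all parallel jobs       |
+------------+---------+--------------------------------+
| Services   | Object  | Sum of parallel services       |
+------------+---------+--------------------------------+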
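
The "sum of available" entries in Table 3 could be computed from per-node figures as in the following Python sketch. The node records and field names are hypothetical. The sketch also reports the largest amount of CPU free on any single node, which is an addition made here for illustration and is not part of Table 3.

```python
# Hypothetical per-node figures: total capacity and amount currently in use.
NODES = [
    {"name": "node-1", "cpu_cores": 16, "cpu_used": 10,
     "memory_gib": 64, "memory_used": 48},
    {"name": "node-2", "cpu_cores": 32, "cpu_used": 8,
     "memory_gib": 128, "memory_used": 32},
]


def aggregate_available(nodes):
    """Sum the unused CPU and memory across all nodes of a site."""
    free_cpu = [n["cpu_cores"] - n["cpu_used"] for n in nodes]
    free_mem = [n["memory_gib"] - n["memory_used"] for n in nodes]
    return {
        "cpu": sum(free_cpu),
        "memory": sum(free_mem),
        # Not in Table 3: the most CPU any single node can offer.
        "cpu_largest_node": max(free_cpu),
    }


summary = aggregate_available(NODES)
# summary == {"cpu": 30, "memory": 112, "cpu_largest_node": 24}
```

An aggregate sum is compact to expose but can hide fragmentation: with these assumed values, 30 cores are free in total, yet a workload that needs 28 cores on one node would not fit anywhere.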

8. Guiding Principles

The driving principles for designing an interface to jointly extract network and compute information are as follows:

P1. Leverage metrics across working groups to avoid reinventing the wheel. For instance:

P2. Aim for simplicity, while ensuring the combined efforts do not leave technical gaps in supporting the full life cycle of service deployment and selection. For instance, the CATS working group is covering path selection from a network standpoint, while ALTO (e.g., [RFC7285]) covers the exposure of network information to the service provider and the client application. However, there is currently no effort being pursued to expose compute information to the service provider and the client application for service placement or selection.

9. Gap Analysis

From this related work it is evident that compute-related metrics can serve several purposes, ranging from service instance instantiation to service instance behavior, and then to service instance selection. Some of the metrics could refer to the same object (e.g., CPU) but with a particular usage and scope.

In contrast, the network metrics are more uniform and straightforward. It is therefore necessary to consistently define a set of metrics that can assist operations across the different concerns identified so far, so that networks and systems share a common understanding of the perceived compute performance. Combined with network metrics, the joint network and compute performance behavior will support informed decisions specific to each of the operational concerns related to the different parts of a service life cycle.

10. Security Considerations

TODO Security

11. IANA Considerations

This document has no IANA actions.

12. References

12.1. Normative References

[I-D.du-cats-computing-modeling-description]
Du, Z., Fu, Y., Li, C., Huang, D., and Z. Fu, "Computing Information Description in Computing-Aware Traffic Steering", Work in Progress, Internet-Draft, draft-du-cats-computing-modeling-description-02, <https://datatracker.ietf.org/doc/html/draft-du-cats-computing-modeling-description-02>.
[I-D.ietf-alto-performance-metrics]
Wu, Q., Yang, Y. R., Lee, Y., Dhody, D., Randriamasy, S., and L. M. Contreras, "Application-Layer Traffic Optimization (ALTO) Performance Cost Metrics", Work in Progress, Internet-Draft, draft-ietf-alto-performance-metrics-28, <https://datatracker.ietf.org/doc/html/draft-ietf-alto-performance-metrics-28>.
[I-D.ldbc-cats-framework]
Li, C., Du, Z., Boucadair, M., Contreras, L. M., and J. Drake, "A Framework for Computing-Aware Traffic Steering (CATS)", Work in Progress, Internet-Draft, draft-ldbc-cats-framework-06, <https://datatracker.ietf.org/doc/html/draft-ldbc-cats-framework-06>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC7285]
Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S., Previdi, S., Roome, W., Shalunov, S., and R. Woundy, "Application-Layer Traffic Optimization (ALTO) Protocol", RFC 7285, DOI 10.17487/RFC7285, <https://www.rfc-editor.org/rfc/rfc7285>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/rfc/rfc8174>.

12.2. Informative References

[DC-AI-COST]
"Generative AI Breaks The Data Center - Data Center Infrastructure And Operating Costs Projected To Increase To Over $76 Billion By 2028", Forbes, Tirias Research Report.
[EDGE-ENERGY]
"Estimating energy consumption of cloud, fog, and edge computing infrastructures", IEEE Transactions on Sustainable Computing.
[I-D.contreras-alto-service-edge]
Contreras, L. M., Randriamasy, S., Ros-Giralt, J., Perez, D. A. L., and C. E. Rothenberg, "Use of ALTO for Determining Service Edge", Work in Progress, Internet-Draft, draft-contreras-alto-service-edge-10, <https://datatracker.ietf.org/doc/html/draft-contreras-alto-service-edge-10>.
[I-D.dunbar-cats-edge-service-metrics]
Dunbar, L., Majumdar, K., Mishra, G. S., Wang, H., and H. Song, "5G Edge Services Use Cases", Work in Progress, Internet-Draft, draft-dunbar-cats-edge-service-metrics-01, <https://datatracker.ietf.org/doc/html/draft-dunbar-cats-edge-service-metrics-01>.
[I-D.llc-teas-dc-aware-topo-model]
Lee, Y., Liu, X., and L. M. Contreras, "DC aware TE topology model", Work in Progress, Internet-Draft, draft-llc-teas-dc-aware-topo-model-03, <https://datatracker.ietf.org/doc/html/draft-llc-teas-dc-aware-topo-model-03>.
[LF-EDGE]
"Linux Foundation Edge", <https://www.lfedge.org/>.
[NFV-INF]
"ETSI GS NFV-INF 010, v1.1.1, Service Quality Metrics", <https://www.etsi.org/deliver/etsi_gs/NFV-INF/001_099/010/01.01.01_60/gs_NFV-INF010v010101p.pdf>.
[NFV-TST]
"ETSI GS NFV-TST 008 V3.3.1, NFVI Compute and Network Metrics Specification", <https://www.etsi.org/deliver/etsi_gs/NFV-TST/001_099/008/03.03.01_60/gs_NFV-TST008v030301p.pdf>.
[RFC7666]
Asai, H., MacFaden, M., Schoenwaelder, J., Shima, K., and T. Tsou, "Management Information Base for Virtual Machines Controlled by a Hypervisor", RFC 7666, DOI 10.17487/RFC7666, <https://www.rfc-editor.org/rfc/rfc7666>.

Acknowledgments

TODO acknowledge.

Authors' Addresses

S. Randriamasy
Nokia Bell Labs
L. M. Contreras
Telefonica
Jordi Ros-Giralt
Qualcomm Europe, Inc.
Roland Schott
Deutsche Telekom