Internet-Draft: AI-model
Workgroup: CATS
Published: July 2023
Intended Status: Standards Track
Expires: 7 January 2024
Author: Q. An, Alibaba Group

Use Case of Computing-Aware AI large model

Abstract

AI models, especially AI large models, have developed rapidly and are widely deployed to serve users and many industries. Because AI large models involve mega-scale data and parameters and consume substantial computing and network resources, distributed computing is a natural choice for deploying them.

This document describes the key concepts and deployment scenarios of AI large models, to demonstrate the necessity of jointly considering computing and network resources to meet the requirements of AI tasks.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 7 January 2024.

1. Introduction

An AI large model is an artificial intelligence model trained on massive amounts of data using deep learning techniques. These models are characterized by their large size, high complexity, and high computational requirements. They have become increasingly important in fields such as natural language processing, computer vision, and speech recognition, and many have been widely adopted in industry applications such as search engines and virtual assistants.

There are usually two types of AI large models: general models and customized models. A general model can handle multiple tasks and domains and has wider applicability and flexibility, but may not perform as well as a customized model on domain-specific tasks. A customized model is trained for a specific industry or domain and focuses on solving specific problems, but may not be applicable to other domains. A general model usually involves mega-scale parameters, while a customized model involves large- or medium-scale parameters.

An AI large model also has two key phases: training and inference. Training is the process of developing an AI model by feeding it large amounts of data and optimizing it to improve its performance. During this process, the model is adjusted and refined until it achieves high accuracy and predictive ability. Training therefore places high demands on accuracy, computing, and memory resources. Inference, on the other hand, is the process of using the trained model to make predictions or decisions based on new input data. The model is deployed in a production environment, where it receives real-world data and makes predictions based on that data. Inference therefore focuses more on the balance among computing resources, latency, and power cost.

There are mainly four types of AI tasks:

Vision, audio, and multimodal tasks often place high demands on network and computing resources.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Computing-Aware AI large model

This section presents different use cases of AI large model deployment, some of which will benefit from dynamic selection of service instances and traffic steering.

3.1. AI large model deployment

3.1.1. All-in-cloud deployment

Figure 1 shows the all-in-cloud AI model deployment. The cloud is highly suitable for training, but may have issues for inference. The first is latency, especially for delay-sensitive AI applications; even if real-time interaction is not needed, high latency affects user experience. The second is the high cost of ensuring privacy protection in cloud-based inference.

                          Training + Inference
         +------------------------------------------------------+
         |                                                      |
         |                       Cloud                          |
         |                                                      |
         |   +---------------+          +-------------------+   |
         |   | General Model |          | Customized Models |   |
         |   +---------------+          +-------------------+   |
         +--------+-----------------+---------------+-----------+
                  |                 |               |
                  |                 |               |
             +----+---+        +----+---+       +---+----+
             | Device |        | Device |   ... | Device |
             +--------+        +--------+       +--------+

Figure 1: All-in-Cloud

3.1.2. Cloud-device co-inference deployment

Figure 2 shows the cloud-device co-inference deployment. It can achieve low latency because AI inference runs locally on the device. However, it may support only a limited set of AI tasks because, in most cases, only a compressed or pruned model can be deployed on a device.

                          Training + Inference
         +------------------------------------------------------+
         |                                                      |
         |                       Cloud                          |
         |                                                      |
         |   +---------------+          +-------------------+   |
         |   | General Model |          | Customized Models |   |
         |   +---------------+          +-------------------+   |
         +--------+-----------------+------------------+--------+
                  |                 |                  |
                  |                 |                  |
             +----+-----+      +----+-----+       +----+-----+
             |  Device  |      |  Device  |   ... |  Device  |
             | +------+ |      | +------+ |       | +------+ |
             | |Pruned| |      | |Pruned| |       | |Pruned| |
             | |Model | |      | |Model | |       | |Model | |
             | +------+ |      | +------+ |       | +------+ |
             +----------+      +----------+       +----------+
               Inference         Inference          Inference

Figure 2: Cloud-device co-inference

3.1.3. Cloud-edge co-inference deployment

Figure 3 shows the cloud-edge co-inference AI model deployment. It can achieve low latency because AI inference is deployed near the device, and it places low demand on device resources. However, when handling AI inference tasks, if the traffic load between device and edge is high or the edge computing resource is overloaded, traffic steering is needed to ensure QoS.

                          Training + Inference
         +------------------------------------------------------+
         |                                                      |
         |                       Cloud                          |
         |                                                      |
         |                 +---------------+                    |
         |                 | General Model |                    |
         |                 +---------------+                    |
         +--------------------------+---------------------------+
                                    |
                                    |     Training + Inference
       +----------------------------+-----------------------------+
       |  +--------------+  +--------------+   +--------------+   |
       |  |     Edge     |  |     Edge     |   |     Edge     |   |
       |  | +----------+ |  | +----------+ |   | +----------+ |   |
       |  | |Customized| |  | |Customized| |   | |Customized| |   |
       |  | |  Models  | |  | |  Models  | |   | |  Models  | |   |
       |  | +----------+ |  | +----------+ |   | +----------+ |   |
       |  +--------------+  +--------------+   +--------------+   |
       +----------+-----------------+---------------+-------------+
                  |                 |               |
                  |                 |               |
             +----+---+        +----+---+       +---+----+
             | Device |        | Device |   ... | Device |
             +--------+        +--------+       +--------+

Figure 3: Cloud-edge co-inference

3.1.4. Cloud-edge-device co-inference deployment

Figure 4 shows the cloud-edge-device co-inference AI model deployment. It is a more flexible, but also more complex, deployment. It can achieve low latency because AI inference is deployed locally or near the device, and the device can keep working when the edge is not available. Careful consideration is needed to ensure that the edge is used only when the trade-offs are right. As with the cloud-edge co-inference deployment, traffic steering is needed.


                          Training + Inference
         +------------------------------------------------------+
         |                                                      |
         |                       Cloud                          |
         |                                                      |
         |                 +---------------+                    |
         |                 | General Model |                    |
         |                 +---------------+                    |
         +--------------------------+---------------------------+
                                    |
                                    |     Training + Inference
       +----------------------------+-----------------------------+
       |  +--------------+  +--------------+   +--------------+   |
       |  |     Edge     |  |     Edge     |   |     Edge     |   |
       |  | +----------+ |  | +----------+ |   | +----------+ |   |
       |  | |Customized| |  | |Customized| |   | |Customized| |   |
       |  | |  Models  | |  | |  Models  | |   | |  Models  | |   |
       |  | +----------+ |  | +----------+ |   | +----------+ |   |
       |  +--------------+  +--------------+   +--------------+   |
       +----------+-----------------+---------------+-------------+
                  |                 |                  |
                  |                 |                  |
             +----+-----+      +----+-----+       +----+-----+
             |  Device  |      |  Device  |   ... |  Device  |
             | +------+ |      | +------+ |       | +------+ |
             | |Pruned| |      | |Pruned| |       | |Pruned| |
             | |Model | |      | |Model | |       | |Model | |
             | +------+ |      | +------+ |       | +------+ |
             +----------+      +----------+       +----------+
               Inference         Inference          Inference


Figure 4: Cloud-edge-device co-inference
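
The trade-off between device, edge, and cloud described for Figure 4 can be illustrated with a minimal Python sketch. This is not part of the draft's specification: the Tier fields, the latency estimates, and the latency budget are all hypothetical names and metrics introduced only for illustration. The sketch picks the lowest-latency tier that is reachable and whose model can handle the task, so a device with a pruned model is used when it suffices and the edge or cloud is used otherwise.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tier:
    """One place where inference could run (hypothetical illustration)."""
    name: str               # "device", "edge", or "cloud"
    available: bool         # whether the tier is currently reachable
    est_latency_ms: float   # assumed end-to-end latency estimate for the task
    supports_task: bool     # whether the model hosted here can handle the task


def choose_tier(tiers: List[Tier], latency_budget_ms: float) -> Optional[Tier]:
    """Return the lowest-latency usable tier, preferring tiers within budget."""
    # A pruned on-device model may not support every task, and an edge
    # site may be unreachable, so filter on both conditions first.
    usable = [t for t in tiers if t.available and t.supports_task]
    if not usable:
        return None
    # Prefer tiers that meet the latency budget; otherwise fall back to
    # the best usable tier rather than rejecting the task.
    within_budget = [t for t in usable if t.est_latency_ms <= latency_budget_ms]
    pool = within_budget or usable
    return min(pool, key=lambda t: t.est_latency_ms)
```

For example, if the device's pruned model cannot handle a task but an edge site can, choose_tier returns the edge; if the edge is unreachable and the device supports the task, it returns the device.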

3.2. Why traffic steering is needed

Many AI tasks, such as vision, audio, and multimodal tasks, place high demands on network and computing resources. It is also common for the same customized model to be deployed at multiple edge sites to achieve load balancing and high reliability.

An edge site's computing resources and network information should be considered together to make a suitable traffic steering decision. For example, if the available computing resource at the nearest edge site is low, the traffic of AI tasks should be steered to another edge site with more available resources. Likewise, if multiple AI tasks arrive at an edge, such as a delay-sensitive task (live streaming with an AI-generated avatar) and a delay-tolerant task (text-to-image), the delay-tolerant task should be steered to another edge if the nearest edge's resources are limited.
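
The steering examples above can be sketched as a small Python function. This is an illustrative sketch under stated assumptions, not a mechanism defined by this document: the EdgeSite fields, the MIN_FREE_COMPUTE threshold, and the select_site function are hypothetical, and a real system would obtain computing and network metrics through whatever mechanism the CATS framework defines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EdgeSite:
    """An edge site hosting the same customized model (hypothetical)."""
    name: str
    network_delay_ms: float  # assumed delay from the client to this site
    free_compute: float      # assumed fraction of compute available, 0.0 to 1.0


# Illustrative threshold: below this, a site is treated as resource-limited.
MIN_FREE_COMPUTE = 0.2


def select_site(sites: List[EdgeSite], delay_sensitive: bool) -> Optional[EdgeSite]:
    """Pick an edge site using both network delay and free compute."""
    if not sites:
        return None
    nearest = min(sites, key=lambda s: s.network_delay_ms)
    # If the nearest site has enough compute, no steering is needed.
    if nearest.free_compute >= MIN_FREE_COMPUTE:
        return nearest
    # The nearest site is limited: keep its remaining capacity for
    # delay-sensitive tasks, and steer delay-tolerant tasks elsewhere.
    if delay_sensitive and nearest.free_compute > 0.0:
        return nearest
    others = [s for s in sites
              if s is not nearest and s.free_compute >= MIN_FREE_COMPUTE]
    if not others:
        return nearest  # nowhere better to go; stay at the nearest site
    if delay_sensitive:
        # The nearest site is exhausted: use the closest site with capacity.
        return min(others, key=lambda s: s.network_delay_ms)
    # Delay-tolerant: prefer the site with the most available compute.
    return max(others, key=lambda s: s.free_compute)
```

With three sites where the nearest has little free compute, a delay-sensitive task stays on the nearest site while a delay-tolerant task is steered to the site with the most available compute. The threshold and the tie-breaking policy are design choices made for this sketch only.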

4. IANA Considerations

This document makes no request of IANA.

5. Security Considerations

TBD

6. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.

Author's Address

Qing An
Alibaba Group
China