In-situ Flow Information Telemetry

Futurewei, 2330 Central Expressway, Santa Clara, USA, haoyu.song@futurewei.com
China Mobile, No. 32 Xuanwumenxi Ave., Xicheng District, Beijing, 100032, P.R. China, qinfengwei@chinamobile.com
China Telecom, P. R. China, chenhuan6@chinatelecom.cn
LG U+, South Korea, daenamu1@lguplus.co.kr
SK Telecom, South Korea, jongyoon.shin@sk.com
Operation and Management Area
OPSAWG
iFIT

For efficient network operation, most network operators rely on traditional
Operation, Administration and Maintenance (OAM) methods, which
include proactive and reactive techniques, running in active and
passive modes. As networks increase in scale, they become more
susceptible to measurement inaccuracy and misconfiguration errors.

With the advent of programmable data planes, emerging on-path telemetry
techniques provide unprecedented flow insight and
fast notification of network issues (e.g., jitter, increased latency, packet loss,
significant bit error variations, and unequal load-balancing).

This document outlines an In-situ Flow Information Telemetry (iFIT) reference framework,
which enumerates several high level components and describes how
these components can be assembled to achieve a complete and closed-loop working solution for on-path
telemetry.

iFIT addresses several deployment challenges for
on-path telemetry techniques, especially in carrier networks.
As an open framework, it does not detail the implementation
of the components or the interfaces between them.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 when, and only when, they appear in all
capitals, as shown here.

The sheer complexity of today's networks requires radical rethinking
of existing methods used for network monitoring and troubleshooting. Current
dynamic networks require "on-path" fault monitoring and traffic
measurement solutions for a wide range of use cases which include
intelligent management of existing network traffic, and better
traffic visibility of emerging applications such as large scale
Virtual Server (VS) mobility, fluid content distribution, and
elastic bandwidth allocation.

Furthermore, the ability to expedite failure detection, fault
localization, and recovery is needed, particularly when
soft failures or path degradation occur without
causing extreme or obvious disruption. This is extremely important
because these types of network issues are often difficult to localize
with existing Operation, Administration and Maintenance (OAM) methods
and reduce overall network efficiency.

Future networks must also support application-aware networking.
Application-aware networking is an emerging industry term,
typically used to describe the capacity of an intelligent network to
maintain current information about the user and application connections
that use network resources so that the operator can
optimize network resource usage and monitoring to ensure
application and traffic optimality.

Application-aware network operation is important for user SLA compliance, service path enforcement,
fault diagnosis, and network resource optimization.
A family of on-path flow telemetry techniques, including
In-situ OAM (IOAM),
Postcard Based Telemetry (PBT),
In-band Flow Analyzer (IFA),
Enhanced Alternate Marking (EAM), and
Hybrid Two Steps (HTS), are emerging, which can provide flow information on
the entire forwarding path on a per-packet basis in real time.
These on-path flow telemetry techniques are very different from the previous active and passive OAM schemes
in that they directly modify the user packets.
Given the unique characteristics of the aforementioned techniques, we may categorize these on-path
telemetry techniques as the hybrid OAM type III, supplementing the classification defined
in . These techniques are
invaluable for application-aware network operations not only in data center and enterprise networks but also in carrier networks which may cross multiple domains.
Carrier network operators have shown strong interest in utilizing such techniques for various purposes. For example, it is vital for operators who
offer bandwidth-intensive, latency- and loss-sensitive
services such as video streaming and gaming to closely monitor the relevant flows in real time as the indispensable first step for any further measure.

However, successfully applying such techniques in carrier networks requires attention to performance, deployability, and flexibility. Specifically,
several practical challenges need to be addressed:

C1: On-path flow telemetry incurs extra packet processing which may strain the network data plane.
The potential impact on the forwarding performance creates an unfavorable "observer effect" which not only damages the fidelity of
the measurement but also defeats the purpose of the measurement.

C2: On-path flow telemetry can generate a huge amount of OAM data which may claim too much transport bandwidth and inundate the servers for data collection,
storage, and analysis. Increasing the data handling capacity is technically viable but expensive.
For example, assume IOAM is applied to all the traffic. Each
node collects a few tens of bytes of telemetry data for each packet. The whole forwarding path might accumulate a data trace with a size similar to
the average size of the original packets.
Exporting the telemetry data would then consume almost half of the network bandwidth.

C3: The collectible data defined currently are essential but limited. As the network operation evolves to
be declarative (intent-based) and automated, and as the trends
of network virtualization, network convergence, and packet-optical integration continue, more data will be needed in an on-demand and interactive
fashion. Flexibility and extensibility in data definition, acquisition, and filtering must be considered.

C4: If we were to apply some on-path telemetry technique in today's carrier networks,
we must provide solutions tailored to the provider's network deployment base that
support an incremental deployment strategy.
That is, we need to support established encapsulation schemes for various predominant protocols such as Ethernet, IPv4, and MPLS with backward compatibility
and properly handle various transport tunnels.

C5: Applying only a single underlying telemetry technique may lead to defective results. For example, packet drop can cause the loss of the flow
telemetry data, and the packet drop location and reason remain unknown if only the In-situ OAM trace option is used.
A comprehensive solution needs the flexibility to
switch between different underlying techniques and adjust the configurations and parameters at runtime.

C6: Development of simplified on-path telemetry primitives and models,
including: telemetry data (e.g., nodes, links, ports, paths, flows,
timestamps) query primitives. These may be used by an API-based
telemetry service for external applications, for
monitoring end-to-end latency of network paths and
calculating application latency.

This section defines and explains some terms used in this document.

On-path Telemetry: Acquiring data about a packet on its forwarding path.
The term refers to a class of data plane telemetry techniques which collect data about user flows and packets along
their forwarding paths. IOAM, PBT, IFA, EAM, and HTS are all on-path telemetry techniques.
Such techniques may need to mark user packets, or insert instructions or data into the headers of user packets.

iFIT: In-situ Flow Information Telemetry.

iFIT Framework: A reference framework that supports network OAM applications to apply data plane on-path telemetry techniques.

iFIT Application: A network OAM application that applies the iFIT framework.

iFIT Domain: The network domain that participates in an iFIT application.

iFIT Node: A network node that is in an iFIT domain and is capable of iFIT-specific functions.

iFIT Head Node: The entry node to an iFIT domain. Usually the instruction header encapsulation, if needed, happens here.

iFIT End Node: The exit node of an iFIT domain. Usually the instruction header decapsulation, if needed, happens here.

To address the aforementioned challenges, we propose an architectural framework based on multiple network operators' requirements and common industry practice,
which can help to build a workable on-path flow telemetry solution.
We name the framework "In-situ Flow Information Telemetry" (iFIT) to reflect the fact that this framework
is dedicated to the on-path telemetry data about user/application flow experience.
As an architectural framework for building a complete solution, iFIT works a level higher than specific data plane OAM techniques, be they active, passive, or hybrid.
The framework is built upon a few high level architectural components (Section 4). By assembling these components,
a closed loop can be formed to provide a complete solution for static, dynamic, and interactive telemetry applications (Section 5).

iFIT is an open framework. It does not enforce any specific implementation on each component, nor does it define interfaces (e.g., API, protocol)
between components.
The choice of underlying on-path telemetry techniques and other implementation details is determined by the application implementer.
The network architecture that applies iFIT is shown in Figure 1.
The iFIT domain is confined between the iFIT head nodes and the iFIT end nodes. An iFIT domain may cross multiple network domains.
An iFIT application uses a controller to configure all the iFIT nodes.
The configuration determines what telemetry data are collected. After processing and analyzing the telemetry data, the iFIT application
may instruct the controller to modify the iFIT node configuration and affect the future telemetry data collection.
How applications communicate with the controller is out of scope for this document.

iFIT supports two basic on-path telemetry data collection modes: passport mode (e.g., IOAM
trace option and IFA), in which telemetry data are carried in user packets and exported at the iFIT end nodes, and postcard mode (e.g., PBT),
in which each node in the iFIT domain may export telemetry data through independent OAM packets. Note that the boundary between the two modes
can be blurry. An application may need to mix the two modes. first uses the analogy of passport and postcard to describe how the packet trace data can be collected and exported.
In the passport mode, each node on the path adds the telemetry data to the user packets. The accumulated data trace is exported at a configured end node.
In the postcard mode, each node directly exports the telemetry data using an independent packet while the user packets remain intact.

A prominent advantage of the passport mode is that it naturally retains the telemetry data correlation along the entire path. The passport mode also reduces
the number of data export packets and the bandwidth consumed by the data export packets.
These can help to make the data collector and analyzer's work easier.
On the other hand, the passport mode requires more processing on the user packets and increases the size of user packets,
which can cause various problems. Some other issues are
documented in .

The postcard mode provides a perfect complement to the passport mode. It addresses most of the issues faced by the passport mode,
at a cost of needing extra effort to
correlate the postcard packets.

The high level components of iFIT are listed as follows:

* Smart flow and data selection policy to address the challenge C1 described in Section 1.
* Smart data export to address the challenge C2.
* Dynamic network probe to address C3.
* Encapsulation and tunneling to address C4.
* On-demand technique selection and integration to address C5.

Note that this document does not directly address the challenge C6, which is left as a concern for iFIT application implementers.

Next we provide a detailed description of each component.

In most cases, it is impractical to enable the
data collection for all the flows and for all the packets in a flow due to the potential performance and bandwidth impact. Therefore, a workable solution
must select only a subset of flows and flow packets to enable the data collection, even though this means the loss of some information.
In the data plane, the Access Control List (ACL) provides an ideal means to determine the subset of flow(s).
describes how one can set a sample rate or probability to a flow to
allow only a subset of flow packets to be monitored, how one can collect a different set of data for different packets, and how one can disable or enable data
collection on any specific network node. The document further introduces an enhancement to IOAM to allow any node to accept or deny the data collection
in full or partially.
Based on these flexible mechanisms, iFIT allows applications to apply smart flow and data selection policies to suit the requirements. The applications can
dynamically change the policies at any time based on the network load, processing capability, focus of interest, and any other criteria.
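
As a rough illustration of the flow and data selection policy described above, the following sketch combines an ACL-like match with a per-rule sampling probability and a per-rule data set. All names here (Packet, SelectionRule, select, and the data type labels) are hypothetical and chosen for illustration only; they are not defined by iFIT or by any underlying telemetry technique.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dport: int


@dataclass
class SelectionRule:
    """Hypothetical ACL-like rule: match fields plus a sampling probability."""
    dst: Optional[str] = None       # None acts as a wildcard
    dport: Optional[int] = None     # None acts as a wildcard
    sample_prob: float = 1.0        # fraction of matching packets to monitor
    data_set: tuple = ("node_id", "timestamp")  # illustrative data types

    def matches(self, pkt: Packet) -> bool:
        return self.dst in (None, pkt.dst) and self.dport in (None, pkt.dport)


def select(pkt, rules, rng):
    """Return the data set to collect for pkt, or None if it is not monitored."""
    for rule in rules:  # first match wins, as in a typical ACL
        if rule.matches(pkt):
            if rng.random() < rule.sample_prob:
                return rule.data_set
            return None
    return None


# Assumed example policy: monitor every HTTPS packet to one server,
# and roughly 10% of all remaining packets with a smaller data set.
rules = [
    SelectionRule(dst="10.0.0.1", dport=443, sample_prob=1.0),
    SelectionRule(sample_prob=0.1, data_set=("node_id",)),
]
rng = random.Random(7)
decision = select(Packet("10.0.0.9", "10.0.0.1", 443), rules, rng)
```

An application could replace the rule list at any time to reflect a new policy, which is the dynamic behavior the text describes.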
Network operators are usually
more interested in elephant flows, which consume more resources
and are sensitive to changes in network conditions. A CountMin Sketch can be used on the
data path of the head nodes to identify and report the elephant
flows periodically. The controller maintains a current set of elephant flows and
dynamically enables the on-path telemetry for only these flows.

Applying on-path telemetry on all packets of selected
flows can still be out of reach. A sampling rate should be set for these
flows so that telemetry is enabled only on the sampled packets. However, the
head nodes have no clue about the proper sampling rate. An overly
high rate would exhaust the network resources and even cause
packet drops; an overly low rate, on the contrary, would result in the
loss of information and inaccurate measurements.

An adaptive approach can be used based on the network conditions to
dynamically adjust the sampling rate. Every node gives user traffic
forwarding higher priority than telemetry data export. In case of
network congestion, the telemetry can sense some signals from
the data collected (e.g., deep buffer size, long delay, packet drop,
and data loss). The controller may use these signals to adjust the packet
sampling rate. In each adjustment period (i.e., RTT of the feedback
loop), the sampling rate is either decreased or increased in response
to the signals. An Additive Increase/Multiplicative Decrease (AIMD) policy
similar to the TCP congestion control mechanism can be used for the rate adjustment.

The flow telemetry data can catch the dynamics of the network and the interactions between user traffic and network. Nevertheless, the data inevitably contain
redundancy. It is advisable to remove the redundancy from the data in order to reduce the data transport bandwidth and server processing load.
In addition to efficient export data encoding (e.g., IPFIX or
protobuf), iFIT nodes have several other ways to reduce the export data by
taking advantage of network devices' capability and programmability.
iFIT nodes can cache the data and send the accumulated data in batches
if the data is not time sensitive.
Various deduplication and compression techniques can be applied to the batched data.
From the application perspective, an application may only be interested in some special events which can be derived from the telemetry data. For example,
if only the cases where the forwarding delay of a packet exceeds a threshold or a flow changes its forwarding path are of interest, it is unnecessary to send
the original raw data to the data collecting and processing servers. Rather, iFIT takes advantage of the in-network computing capability of network devices
to process the raw data and only push the event notifications to the subscribing applications.
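
The following sketch shows one way the two example events above (delay over a threshold, and a flow changing its forwarding path) could be derived from raw records so that only notifications are pushed. It is an assumption-laden illustration: the EventFilter class, the record fields, and the event names are hypothetical, and a real device would implement such logic in its data plane rather than in Python.

```python
from dataclasses import dataclass, field


@dataclass
class EventFilter:
    """Hypothetical in-network filter that turns raw records into events."""
    delay_threshold_us: int
    last_path: dict = field(default_factory=dict)  # flow_id -> last seen path

    def process(self, flow_id, path, delay_us):
        """Return the event notifications triggered by one raw record."""
        events = []
        if delay_us > self.delay_threshold_us:
            events.append(("delay_exceeded", flow_id, delay_us))
        previous = self.last_path.get(flow_id)
        if previous is not None and previous != path:
            events.append(("path_changed", flow_id, path))
        self.last_path[flow_id] = path
        return events


flt = EventFilter(delay_threshold_us=100)
quiet = flt.process("flow-1", ("A", "B", "C"), 40)     # nothing to report
slow = flt.process("flow-1", ("A", "B", "C"), 250)     # delay event only
moved = flt.process("flow-1", ("A", "D", "C"), 30)     # path change event only
```

Records that trigger no event produce an empty list, so nothing is exported for them.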
Such events can be expressed as policies. A policy can request data export only on change, on exception, on timeout, or on threshold.

Network operators are
interested in anomalies such as path change, network congestion, and packet drop. Such anomalies are hidden in raw telemetry data
(e.g., path trace, timestamp). Such anomalies can be described as events and programmed into the device data plane.
Only the triggered events are exported. For example, if a
new flow appears at any node, a path change event is triggered;
if the packet delay exceeds a predefined threshold in a node, the
congestion event is triggered; if a packet is dropped due to buffer
overflow, a packet drop event is triggered.

The export data reduction due to such optimization is substantial.
For example, given a single 5-hop 10 Gbps path, assume a moderate number of 1 million packets per second are monitored,
and the telemetry data plus the export packet overhead consume less than 30 bytes per hop.
Without such optimization,
the bandwidth consumed by the telemetry data can easily exceed 1 Gbps (>10% of the path bandwidth).
When the optimization is used, the bandwidth consumed by the telemetry data is negligible.
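
The unoptimized figure can be checked with a short calculation. The sketch below takes 30 bytes per hop as the assumed per-record size (the upper bound given above); the function name is hypothetical.

```python
def export_bandwidth_bps(packets_per_second, hops, bytes_per_hop):
    """Raw export bandwidth if every hop's record for every monitored packet is exported."""
    return packets_per_second * hops * bytes_per_hop * 8


raw_bps = export_bandwidth_bps(1_000_000, 5, 30)
share = raw_bps / 10e9  # fraction of the 10 Gbps path
print(f"{raw_bps / 1e9:.1f} Gbps, {share:.0%} of the path")
# prints "1.2 Gbps, 12% of the path"
```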
Moreover, the pre-processed telemetry data
greatly simplify the work of data analyzers.

Due to limited data plane resources and network bandwidth,
it is unlikely one can monitor all the data all the time. On the other hand, the data needed by applications may
be arbitrary but ephemeral. It is critical to meet the dynamic data requirements with limited resources.

Fortunately, data plane programmability allows iFIT to dynamically load new data probes. These on-demand probes are called
Dynamic Network Probes (DNP). DNP is the technique to enable probes for customized data collection
in different network planes. When working with IOAM or PBT, DNP is loaded to the data plane through incremental programming or
configuration. The DNP can effectively conduct data generation, processing, and aggregation.

DNP introduces enough flexibility and extensibility to iFIT. It can implement the optimizations for export data reduction mentioned in the previous section.
It can also generate custom data as required by today's and tomorrow's applications.
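
As one illustration of a probe that could be loaded on demand, the CountMin Sketch mentioned earlier for elephant flow selection can be sketched as follows. This is a minimal software model under stated assumptions: the width, depth, hash construction, and threshold are arbitrary choices made here, and the helper elephant_flows is hypothetical. A data plane implementation would use hardware hash units and register arrays instead.

```python
import hashlib


class CountMinSketch:
    """Approximate per-flow counters; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One independent hash per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([row])).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key):
        # The minimum over rows limits the error caused by hash collisions.
        return min(self.rows[row][self._index(row, key)] for row in range(self.depth))


def elephant_flows(sketch, flow_ids, threshold):
    """Return the candidate flows whose estimated count reaches the threshold."""
    return [f for f in flow_ids if sketch.estimate(f) >= threshold]


sketch = CountMinSketch(width=256, depth=4)
sketch.add("flow-big", 5000)
for i in range(50):
    sketch.add(f"flow-small-{i}", 3)
reported = elephant_flows(sketch, ["flow-big", "flow-small-0"], threshold=1000)
```

A head node could report such a list periodically, and the controller could then enable on-path telemetry for only the reported flows.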
Following are some possible DNPs that can be dynamically deployed to support iFIT applications:

* A flow sketch is a compact online data structure for approximate flow statistics which can be used to facilitate
flow selection. The aforementioned CountMin Sketch is such an example. Since a sketch consumes data plane resources, it should only be deployed when
needed.
* The policies that choose flows and packet sampling rate can change during the lifetime of an application.
* An application may need to interactively count flows based on different flow granularity or maintain hit counters for selected
flow table entries.
* DNP can be used to program the events that conditionally trigger data export.

Since the introduction of IOAM, IOAM option header encapsulation schemes have been proposed for various network protocols, but
some protocols, such as
MPLS and IPv4, which are still prevalent in carrier networks, have been omitted. iFIT provides solutions to apply the on-path flow telemetry techniques in such networks.
PBT-M does not introduce new headers to the packets so the trouble of encapsulation
for a new header is avoided. In case a technique that requires a new header is preferred,
provides a means to encapsulate
the extra header using an MPLS extension header. As for IPv4, it is possible to encapsulate the new header in an IP option.
For example, the Router Alert Option (RAO) can be used to indicate the presence of the new header. A recent proposal
that introduces the IPv4 extension header may lead to a long term solution.
In carrier networks, it is common for user traffic to traverse various tunnels for QoS, traffic
engineering, or security. iFIT supports both the uniform mode and the pipe mode for tunnel support as described in
. With such flexibility, the operator can either gain a true end-to-end visibility or apply
a hierarchical approach which isolates the monitoring domain between customer and provider.
With multiple underlying data collection and export techniques at its disposal, iFIT can flexibly adapt to different network conditions and
different application requirements.

For example, depending on the types of data that are of interest, iFIT may choose either IOAM or PBT to collect the data; if an application needs
to track down where the packets are lost, it may switch from IOAM to PBT.

iFIT can further integrate multiple data plane monitoring and measurement techniques together and present a comprehensive data plane telemetry
solution to network operating applications.

The iFIT architectural components can work together to form closed-loop applications, as shown in Figure 2.

An iFIT application may pick a suite of telemetry techniques based on its requirements and apply an initial technique to the data plane.
It then configures the iFIT head nodes to
decide the initial target flows/packets and telemetry data set, the encapsulation and tunneling scheme based on the
underlying network architecture, and the iFIT-capable nodes to decide the initial telemetry data export policy.
Based on the network condition and the analysis results of the telemetry data,
the iFIT application can change the telemetry technique, the flow/data selection policy,
and the data export approach in real time without breaking the normal network operation.
Many such dynamic changes can be done through loading and unloading DNPs.

We should avoid confusion between this closed telemetry loop and the closed control loop. The latter term is often used in the context of network automation.
In such a closed control loop, telemetry also plays an important role. Based on the telemetry results,
applications can automatically change the network policy or configuration.
In such a context, iFIT is just a part of the loop.

The closed-loop nature of the iFIT framework allows numerous new
applications which enable future network operation architecture.
describes an intelligent performance management based on the
network condition. The idea is to split the monitored network
into clusters. The cluster partition, which can be applied to every
type of network graph, and the possibility to combine clusters at
different levels enable the so-called Network Zooming. It allows
a controller to calibrate the network telemetry, so that it can
start without examining in depth and monitor the network as a
whole. In case of necessity (packet loss or too high delay), an
immediate detailed analysis can be configured. In particular,
the controller, which is aware of the network topology, can set up
the most suited cluster partition by changing the traffic filter
or activating new measurement points, and the problem can be localized
with a step-by-step process.

An iFIT application on top of the controllers
can manage such a mechanism, and the iFIT closed-loop architecture allows its dynamic and flexible
operation.

In this example, a user can express high level intents for network monitoring. The controller translates an intent
and configures the corresponding DNPs in iFIT nodes which collect necessary network information. Based on the real-time
information feedback, the controller runs a local algorithm to determine the suspicious flows. It then deploys ACLs to the iFIT head node to
initiate the high precision per-packet on-path telemetry for these flows.

iFIT is an open framework for applying on-path telemetry techniques.
Combined with algorithmic and architectural schemes that fit into the framework components, the iFIT framework enables a practical telemetry solution based on two basic on-path traffic
data collection modes: passport and postcard.

The operation of iFIT differs from both active OAM and passive OAM as defined in . It does not generate any active probe packets,
nor does it passively observe unmodified user packets. Instead, it modifies selected user packets to collect useful information about them. Therefore, the iFIT operation
can be considered the hybrid type III mode, which can provide more flexible and accurate network OAM.

More challenges and corresponding solutions for iFIT may need to be covered. For example,
how iFIT can fit in the big picture of autonomous networking and support closed control loops.
A complete iFIT framework should also consider the cross-domain operations.
We leave these topics for future revisions.

No specific security issues are identified other than those that have been discussed in the drafts on on-path flow information telemetry.

This document includes no request to IANA.

Other major contributors to this document include Giuseppe Fioccola, Daniel King, Zhenqiang Li, Zhenbin Li, Tianran Zhou, and James Guichard.

We thank Shwetha Bhandari, Joe Clarke, and Frank Brockners for their constructive suggestions for improving this document.

An improved data stream summary: the count-min sketch and its applications

Where is the debugger for my software-defined network?