Autonomic Networking Use Case for Distributed Detection of SLA Violations
draft-irtf-nmrg-autonomic-sla-violation-detection-00

Abstract

This document describes a use case for autonomic networking in distributed detection of SLA violations. It is one of a series of use cases intended to illustrate requirements for autonomic networking.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 25, 2014.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction
2. Problem Statement
3. Benefits of an Autonomic Solution
4. Intended User and Administrator Experience
5. Analysis of Parameters and Information Involved
5.1. Device Based Self-Knowledge and Decisions
5.2. Interaction with other devices
5.3. Information needed from Intent
5.4. Monitoring, diagnostics and reporting
6. Comparison with current solutions
7. Related IETF Work
8. Acknowledgements
9. IANA Considerations
10. Security Considerations
11. References
11.1. Normative References
11.2. Informative References
Authors' Addresses

1. Introduction

The Internet has been improving dramatically in terms of size and capacity, and accessibility in the last years. Besides that, the communication requirements of distributed services and applications running on top of the Internet have become increasingly accurate. Performance issues caused by violations on these requirements usually present significant financial loss to organizations and end users. Thus, the service level requirements of critical networked services provided have become a critical concern for network administrators. To ensure that SLAs are not being violated, which would usually incur in costly penalties, service levels need to be constantly monitored at the network infrastructure layer. To that end, network measurements must take place. Network measurement mechanisms are performed through either active or passive measurement techniques. In passive measurement, network conditions are said to be checked in a non intrusive way because no monitoring traffic is created by the measurement process itself. In the context of IP Flow Information EXport (IPFIX) WG, several documents were produced to define passive measurement mechanisms (e.g., flow records specification [RFC3954]). Active measurement, on the other hand, is intrusive because it injects synthetic traffic into the network to measure the network performance. The IP Performance Metrics (IPPM) WG produced documents that describe active measurement mechanisms, such as: One-Way Active Measurement Protocol (OWAMP) [RFC4656], Two-Way Active Measurement Protocol (TWAMP) [RFC5357], and Cisco Service Level Assurance Protocol (SLA) [RFC6812]. Active measurement mechanisms usually offer better accuracy and privacy than passive measurement mechanisms. Furthermore, active measurement mechanisms are able to detect end-to-end network performance problems in a fine-grained way. As a result, active is preferred over passive measurement for SLA monitoring. Measurement probes must be hosted and activated in network devices to compute the current network metrics (e.g., considering those described in [RFC4148]). This activation should dynamic in order to follow changes in network conditions, such as those related with routes being added or new customer demands.

2. Problem Statement

The activation of active measurement probes (sender and responder considering the architecture described by Cisco [RFC6812]) is expensive in terms of the resource consumption, e.g., CPU cycle and memory footprint, which could be useful for primary network functions (e.g., routing and switching). Besides that, the probes also increase the network load because of the injected traffic. The resources required and traffic generated by the measurement probes are a function of the number of measured network destinations, i.e., with more destinations the larger will be the resources and the traffic needed to deploy the probes. Thus, to have a better monitoring coverage it is necessary to deploy more probes what consequently turns increases consumed resources. Otherwise, enabling the observation of just a small subset of all network flows can lead to an insufficient coverage. The current best practice in feasible deployments of active measurement solutions to distribute the available measurement probes along the network consists in relying entirely on the human administrator expertise to infer which would be the best location to activate the probes. This is done through several steps. First, it is necessary to collect traffic information in order to grasp the traffic matrix. Then, the administrator uses this information to infer which are the best destinations for measurement probes. After that, the administrator activates probes on the chosen subset of destinations considering the available resources. This practice, however, does not scale well because it is still labor intensive and error-prone for the administrator to compute which probes should be activated given the set of critical flows that needs to be measured. Even worse, this practice completely fails in networks whose critical flows are too short in time and dynamic in terms of traversing network path, like in modern cloud environments. That is so because fast reactions are necessary to reconfigure the probes and administrators are not just enough in computing and activating the new set of probes required every time the network traffic pattern changes. Finally, the current active measurements practice usually covers only a fraction of the network flows that should be observed, which invariably leads to the damaging consequence of undetected SLA violations. Management software can be embedded inside network devices to control the deployment of active measurement mechanisms. In fact, this is done by some network equipment vendors, specially to avoid the starvation of the network devices (e.g., due to configuration errors and lack of experience from human administrators). However, the current approach do not enhance the active measurement capabilities in important terms, such as scalability and efficiency. For example, the number of local available measurements (and, consequently, detected SLA violations) is still bounded by the number of deployed probes. Thus, if the number of SLA violation is greater than the number of available probes, only a fraction of the violations will be observed. Also, devices cannot share resources and knowledge about the networking infrastructures in order to take advantage of remote management information (e.g., measurement results).

3. Benefits of an Autonomic Solution

The use case considered here is distributed autonomic detection of SLA violations. The use of Autonomic Netowrking (AN) properties can help the activation of measurement probes [P2PBNM-Nobre-2012]. Peer-to-Peer (P2P) technology can be embedded in network devices in order to improve the probe activation decisions using autonomic loops. Thus, it would be possible to coordinate the probe activation and to share measurement results among different network devices. The problem to be solved by AN in the present use case is how to steer the process of measurement probe activation by a complete solution that sets all necessary parameters for this activation to operate efficiently, reliably and securely, with minimal human intervention and without the need for. An autonomic solution for the distributed detection of SLA violations can provide several benefits. First, this solution could optimize the resource consumption and avoid resource starvation on the network devices. This optimization comes from different sources: sharing of measurement results, better efficiency in the probe activation decisions, etc. Second, the number of detected SLA violations could be increased. This increase is related with a better coverage of the network. Third, the solution could decrease the time necessary to detect SLA violations. Adaptivity features of an autonomic loop could capture faster the network dynamics than an human administrator. Finally, the solution could help to reduce the workload of human administrator, or, at least, to avoid their need to perform operational tasks. The active measurement model assumes that a typical infrastructure will have multiple network segments and Autonomous Systems (ASs), and a reasonably large number of several of routers and hosts. It also considers that multiple Service Level Objectives (SLOs) can be in place in a given time. Since interoperability in a heterogenous network is a goal, features found on different active measurement mechanisms (e.g. OWAMP, TWAMP, and IPSLA) and programability interfaces (e.g., Cisco's EEM and onePK) could be used for the implementation. The autonomic solution should include and/or reference specific algorithms, protocols, metrics and technologies for the implementation of distributed detection of SLA violations as a whole.

4. Intended User and Administrator Experience

The autonomic solution should avoid the human intervention in the distributed detection of SLA violations. Besides that, it could enable the control of SLA monitoring by less experienced human administrators. However, some information is necessary from the human administrator. For example, the human administrator should provide the SLOs regarding the SLA being monitored. The configuration and bootstrapping of network devices using the autonomic solution should be minimal for the human administrator. Probably it would be necessary just to inform the address of a device which is already using the solution and the devices themselves could exchange configuration data.

5. Analysis of Parameters and Information Involved

5.1. Device Based Self-Knowledge and Decisions

Each device has self-knowledge about the local SLA monitoring. This could be in the form of historical measurement data and SLOs. Besides that, the devices would have algorithms that could decide which probes should be activated in a given time. The choice of which algorithm is better for a specific situation would be also autonomic.

5.2. Interaction with other devices

Network devices could share information about service level measurement results. This information could speed up the detection of SLA violations and increase the number of detected SLA violations. In any case, it is necessary to assure that the results from remote devices have local relevancy. The definition of network devices that exchange measurement data, i.e., management peers, creates a new topology. Different approaches could be used to define this topology (e.g., correlated peers [P2PBNM-Nobre-2012]). To bootstrap peer selection, each device could use its known endpoints neighbors (e.g., FIB and RIB tables) as the initial seed to get possible peers.

5.3. Information needed from Intent

TBD

5.4. Monitoring, diagnostics and reporting

TBD

6. Comparison with current solutions

There is no standartized solution for distributed autonomic detection of SLA violations. Current solutions are restricted to ad hoc scripts running on a per node fashion to automate some administrator's actions. There some proposals for passive probe activation (e.g., DECON and CSAMP), but without the focus on autonomic features. It is also mentioning a proposal from Barford et al. to detect and localize links which cause anomalies along a network path.

7. Related IETF Work

The following paragraphs discuss related IETF work and are provided for reference. This section is not exhaustive, rather it provides an overview of the various initiatives and how they relate to autonomic distributed detection of SLA violations. 1. [LMAP]: The Large-Scale Measurement of Broadband Performance Working Group aims at the standards for performance management. Since their mechanisms also consist in deploying measurement probes the autonomic solution could be relevant for LMAP specially considering SLA violation screening. Besides that, a solution to decrease the workload of human administrators in service providers is probably highly desirable. 2. [IPFIX]: IP Flow Information EXport (IPFIX) aims at the process of standardization of IP flows (i.e., netflows). IPFIX uses measurement probes (i.e., metering exporters) to gather flow data. In this context, the autonomic solution for the activation of active measurement probes could be possibly extended to address also passive measurement probes. Besides that, flow information could be used in the decision making of probe activation. 3. [ALTO]: The Application Layer Traffic Optimization Working Group aims to provide topological information at a higher abstraction layer, which can be based upon network policy, and with application-relevant service functions located in it. Their work could be leveraged for the definition of the topology regarding the network devices which exchange measurement data.

8. Acknowledgements

We wish to acknowledge the helpful contributions, comments, and suggestions that were received from Bruno Klauser, Eric Voig, and Hanlin Fang.

9. IANA Considerations

This memo includes no request to IANA.

10. Security Considerations

The bootstrapping of a new device follows the approach of homenet [draft-autonomic-homenet], thus in order to exchange data a device should register first. This registration could be performed by a "Registrar" device or a cloud service provided by the organization to facilitate autonomic mechanisms. The new device sends its own credentials to the Registrar, and after successful authentication, receives domain information, to enable subsequent enrolment to the domain. The Registrar sends all required information: a device name, domain name, plus some parameters for the operation. Measurement data should be exchanged signed and encripted among devices since these data could carry sensible information about network infrastructures. Some attacks should be considering when analyzing the security of the autonomic solution Denial of service (DoS) attacks could be performed if the solution be tempered to active more local probe than the available resources allow. Besides that, results could be forged by a device (attacker) in order to this device be considered peer of a specific device (target). This could be done to gain information about a network.

11. References

11.1. Normative References

[P2PBNM-Nobre-2012]	Nobre, J., Granville, L., Clemm, A. and A. Prieto, "Decentralized Detection of SLA Violations Using P2P Technology, 8th International Conference Network and Service Management (CNSM)", 2012.
[RFC4656]	Shalunov, S., Teitelbaum, B., Karp, A., Boote, J. and M. Zekauskas, "A One-way Active Measurement Protocol (OWAMP)", RFC 4656, September 2006.
[RFC5357]	Hedayat, K., Krzanowski, R., Morton, A., Yum, K. and J. Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", RFC 5357, October 2008.
[RFC6812]	Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare, S. and E. Yedavalli, "Cisco Service-Level Assurance Protocol", RFC 6812, January 2013.
[draft-autonomic-homenet]	Behringer, M., Pritikin, M. and S. Bjarnason, "draft-behringer-homenet-trust-bootstrap", Internet-Draft draft-behringer-homenet-trust-bootstrap-02, February 2014.

11.2. Informative References

[RFC3954]	Claise, B., "Cisco Systems NetFlow Services Export Version 9", RFC 3954, October 2004.
[RFC4148]	Stephan, E., "IP Performance Metrics (IPPM) Metrics Registry", BCP 108, RFC 4148, August 2005.

Authors' Addresses

Jeferson Campos Nobre Federal University of Rio Grande do Sul Porto Alegre, Brazil EMail: jcnobre@inf.ufrgs.br

Lisandro Zambenedetti Granvile Federal University of Rio Grande do Sul Porto Alegre, Brazil EMail: granville@inf.ufrgs.br

Alexander Clemm Cisco Systems San Jose, USA EMail: alex@cisco.com

Alberto Gonzalez Prieto Cisco Systems San Jose, USA EMail: albertgo@cisco.com