TE Working Group                                        B. Christian
Internet Draft                                                 UUNET
Document: draft-christian-tewg-measurement-00.txt          B. Davies
Category: Informational                                        UUNET
                                                              H. Tse
                                                               UUNET
                                                            Jul 2000


          Operational measurements for Traffic Engineering


Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026 [1].

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

1. Abstract

This memo describes the measurements needed to accomplish Traffic Engineering (TE) in IP networks. This document will aid vendors in their choice of information to provide; it will assist network operators in determining the appropriate information to request; and it will demonstrate how measurements are used to accomplish TE.

The objective of this memo is to describe TE measurement. This memo will also describe (in brief) some methods for using the variables and some methods for gathering the information.

Christian/Davies/Tse          Informational - Dec2000

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [2].

3. What is Traffic Measurement?

Traffic Measurement (TM) is defined for the purposes of this document as a means of characterizing a flow of IP packets from one point to another. The characteristics of a traffic flow can be loosely defined as Throughput, Loss, Delay, Path, and Lifetime. These characteristics should be represented in every device that carries a flow of IP traffic. Delay variation and other measures are modifications of the above.

A traffic flow can become arbitrarily specific. An example would be the measurement of traffic on a physical link as compared to measuring traffic on a virtual link. A physical link with many virtual links will aggregate a number of smaller traffic flows. A flow can also be an aggregate of physical links in schemes such as link bundling or ECMP.

The measurement of traffic is meant to "facilitate reliable network operations." [AWD1] Traffic measurement provides a means for capacity planning as well as a means to work around congestion. Traffic measurement standards need to be protocol independent and should be portable across platforms.

Traffic measurement is accomplished with the goals of modifying the path of traffic, allocating capacity, reducing congestion, and observing trends.

4. Advantages of TE measurements

4.1. Real-time and long-term TE measurement

TE measurements are instrumental in providing real-time as well as long-term proactive TE. Network performance may be evaluated by examination of TE measurements. Measurements, such as throughput vs. maximum bandwidth, can indicate link utilization and link congestion on the network. Due to the transient nature of the network, the measurements must be able to derive the real-time characteristics of the network to be effective.

Over a period of time, measurement metrics should be able to provide for long-term TE. Long term TE includes traffic growth patterns, congestion issues, and traffic peak patterns.
Traffic growth and peak patterns can be derived from measurements such as throughput and peak rate. Measurements must facilitate proactive TE strategies to optimize the network or to avoid undesirable network conditions.

4.2. Measurements for traffic management

4.2.1. Load balancing

To perform TE is to be able to optimize network traffic flows and balance network traffic on multiple trunks. During load balancing, traffic will be partitioned at the incoming interface onto multiple virtual paths. In the case of virtual links, based on the TE measurements, secondary link(s) with the appropriate requirements may be created to accommodate load balancing. Measurements of available bandwidth, loss, and delay are critical in determining the feasibility of creating secondary connections.

Measurements, such as available bandwidth, change constantly. The network will not be in a steady load-balanced state because its flows change dynamically. In order to achieve a load-balanced steady state, TE measurements are needed to determine recomputation and optimization intervals.

4.2.2. Policy-based TE measurements

Policy-based TE provides flexibility in the specification of the network optimization objectives and constraints. Policy can be adjusted or fine-tuned on a continuous basis. Policy attributes on a network path include priority, preemption, resilience, resource classes and policing.

Policy-based TE measurements should compare the metric values with the thresholds based on the policy to trigger the appropriate actions. Policy-based measurements can be used to identify potential network traffic issues.

Comparison of the measurements and policy-based thresholds can be set up statically at a predefined time interval or dynamically at event occurrence. For instance, in the event of path preemption, the traffic pattern can be impacted and the traffic flow changes.
Measurements should be compared with the threshold values to ensure that proper actions are taken if the preemption induces undesirable effects on the traffic pattern.

Policy-based TE should be in compliance with the Policy Information Base (PIB) specifications.

Constraint-based routing (CBR) TE specifies a finer subset of the policy-based TE. CBR takes place when all the specified constraints are met by the TE measurements. Measurements must provide traffic characteristics in order to facilitate constraint-based routing comparison. Constraint specifications can include peak rate, committed rate and service levels. Policy-based TE measurements, such as bandwidth availability, can be compared with the peak rate and committed rate constraints to determine whether they are met.

4.2.3. Measurements for Path Protection/Restoration

Fault detection, path protection, and restoration are imperative in an operations environment. TE measurements are essential to ensure these mechanisms are in place. Faults can be identified using TE measurements such as packet loss or low throughput. Notifications may be generated automatically based on the observed value of these variables.

Other metrics can determine the amount of spare capacity for different failure recovery scenarios. For example:

   a. Prior to restoring traffic to the original path
   b. Prior to creating the protection path

Examination of TE measurement metrics can also be used to ensure that there is no overlap of the primary and secondary paths.

5. Throughput

Throughput is a measure of the amount of traffic that passes between a set of end points, where end points can be logically or physically defined. The amount of traffic is a measure of the quantity of bits that pass over a period of time. This is usually represented as Bits Per Second or BPS. Another facet of throughput is Packets Per Second or PPS.
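
The definition of throughput above (traffic passed over a period of time) can be sketched in code. This is a minimal illustration, not a vendor method: it assumes a device exposes cumulative byte and packet counters, and the names CounterSample, throughput, and average_packet_size_bytes are hypothetical. Counter wrap is ignored for brevity. Dividing the bit rate by the packet rate also yields an average packet size.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CounterSample:
    """One reading of hypothetical cumulative interface counters."""
    timestamp_s: float  # time the counters were read, in seconds
    octets: int         # cumulative bytes passed
    packets: int        # cumulative packets passed


def throughput(earlier: CounterSample, later: CounterSample) -> Tuple[float, float]:
    """Return (bits per second, packets per second) between two samples."""
    elapsed = later.timestamp_s - earlier.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    bps = (later.octets - earlier.octets) * 8 / elapsed
    pps = (later.packets - earlier.packets) / elapsed
    return bps, pps


def average_packet_size_bytes(bps: float, pps: float) -> float:
    """Average packet size implied by the two rates, in bytes."""
    if pps == 0:
        return 0.0
    return bps / 8 / pps
```

For example, two samples taken 300 seconds apart that differ by 37,500,000 bytes and 75,000 packets give 1,000,000 BPS, 250 PPS, and an average packet size of 500 bytes.
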
PPS is infrequently used. However, PPS in conjunction with BPS will allow the operator to determine average packet sizes. Average packet size is an important measure as some vendors can have problems passing small packets at line rate.

Both Medium and Long Term TE require a measure of throughput for intervention in scenarios of decreasing bandwidth availability as well as for planning future capacity needs. Throughput measurement will also be important in situations where new software creates the demand for dynamic IP flow controls. See [AWD2] for a more detailed explanation of TE over time.

Throughput for general usage is best measured at a regular interval. Most operators choose 5 minutes as their interval. This provides for an approach that is granular without being so aggressive that the amount of data recorded becomes overwhelming. The use of the 5-minute interval is best when active traffic measurement (active traffic measurement is measurement with network operator involvement) is not being performed. The choice of a 5-minute interval provides enough data to identify daily/monthly/weekly trends. This data is used to predict capacity needs and to identify points of rising congestion.

During periods of active traffic measurement, intervals of 5 seconds are not uncommon. Active throughput measurement is undertaken in order to provide a means of working with points of congestion. With active throughput measurement the operator will identify flows and choose alternate paths or other modifications of flow parameters. Active throughput measurement also provides a means of monitoring changes to network parameters and the impact on traffic during production traffic engineering efforts.

Vendors provide various levels of throughput measurement. Some vendors choose to measure throughput as the amount of IP traffic passed.
Unfortunately, with differing methods it becomes necessary to remember which vendor you are measuring and adjust appropriately. An example would be a switch vs. a router. Many switches report the throughput of their protocol (such as ATM) which is, of course, greater than the throughput possible for an IP packet encapsulated within the protocol.

A measure of throughput that relates most closely to what an IP packet perceives as throughput would include only the IP packet. Additional encapsulation can create a false sense of capacity since some methods of switching can take up significant amounts of bandwidth (see ATM).

The above statements seem to indicate that the best method for representing IP traffic is to subtract all additional forms of encapsulation from your measurements. This requires that the amount of space used for encapsulation be well known. For most encapsulation methods this works quite well since the amount of space necessary is well known.

The 95th percentile is used to determine flow utilization. The percentile allows the capacity planner to determine future needs while avoiding the statistical anomalies that are inherent in packet networks. For the network operator, 95 percent utilization is used to set alarms as well as to determine that flows are approaching their predefined thresholds.

6. Loss

A flow has certain requirements it must satisfy in order to be considered a quality service. The degree of loss is an important factor. No Internet service (or its component flows) will always be 100% loss free; therefore the loss constraint must be defined based on network dynamics and internal system constraints (topology, bandwidth, etc.).

What is acceptable loss? None is the preferred answer, but that is not always practical or possible. Generically, loss can be viewed as a quality attribute of a flow.
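
The idea of loss as a quality attribute judged against a defined loss constraint can be sketched as follows. This is an illustration only: it assumes that counts of packets sent and received are available for a flow, and the function names and the max_loss_ratio parameter are hypothetical rather than taken from any standard.

```python
def loss_ratio(packets_sent: int, packets_received: int) -> float:
    """Fraction of sent packets that did not arrive (0.0 means loss free)."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    # Clamp at zero so duplicated packets never yield a negative loss.
    lost = max(packets_sent - packets_received, 0)
    return lost / packets_sent


def meets_loss_constraint(packets_sent: int, packets_received: int,
                          max_loss_ratio: float) -> bool:
    """True when the observed loss is within the configured constraint."""
    return loss_ratio(packets_sent, packets_received) <= max_loss_ratio
```

With a constraint of 1% loss, a flow that delivers 990 of 1000 packets is within bounds, while one that delivers 980 is not.
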
The loss attribute of a flow, when compared to the predetermined constraints, allows for problem determination. Accounting and measurement (real-time and long-term) provide the necessary information for developing a solution and finding the best possible resolution based on the system constraints.

Traceroute & Ping at L3 allow the user to see loss and latency. Traceroute at L2 (in an overlay) can allow the user to see problems at L2. Physical outages and errors can lead to any number of higher level errors.

Loss can be caused by outages in a bandwidth guaranteed TDM system (such as SONET/SDH) where no statistical gain is generally achieved. Loss can also be attributed to statistical systems where demand outweighs supply:

   (input port 1 + input port 2) > output port 1

Protection schemes such as (1+1, 1:1, N:1) can be used to mitigate TDM loss. Buffering, scheduling, and randomized discard strategies can be used to mitigate statistical loss and protection schemes.

A laundry list of required values needed to mitigate, plan for, and resolve a flow's loss attribute would include:

Per traffic class loss statistics (e.g. UBR/ABR/VBR/CBR, multiple FECs, diffserv)
   - Intentional loss (RED, policy, contract enforcement)
   - Unintentional loss (buffer over-utilization, congestion, etc.)
   - Total loss (cause independent)

Per flow loss statistics (VC, DLCI, LSP)
   - Intentional loss
   - Unintentional loss
   - Total loss

Per interface loss statistics
   - Intentional loss
   - Unintentional loss
   - Total loss

7. Delay

Delay measurement, defined as the time it takes for a packet to travel from source to destination, is a must for any IP forwarding device. Delay directly affects the responsiveness of protocols such as TCP across the network. Round-trip packet delay, in some cases, may not be equal to twice the one-way packet delay due to asymmetric paths.
On an uncongested network, the delay value will provide the ability to measure propagation and transmission delay. Delay measurement is very useful as the use of real time and delay sensitive applications is growing.

Along with end-to-end delay, buffer delay should also be taken into consideration and measured separately. Buffer delay is defined for the purposes of this document as the time it takes for a node to transfer/switch a packet from the ingress to the egress interface. This value is dependent on the type/bandwidth of the ingress and egress interfaces. Vendors have different implementations of the memory pools used for packet buffering, e.g. per interface buffers or a global pool of memory buffers, resulting in different values when measuring buffer delay. In other words, different vendors can have different ingress to egress transit times.

Measurement of buffer delay will create the ability to determine the amount of time involved in transiting a device. This will help operators to determine congestion points as well as equipment performance under load. In test scenarios the measurement of buffer delay is academic since, in most situations, the path will not have a speed of light delay that is measurable.

Sending alerts based on buffer delay provides a means of determining congestion without relying on tools such as ping, which can add to the problem. Ping and similar tools are also external indicators of performance issues and may not monitor all paths through the network (ECMP for example). Enlarging buffer sizes will increase potential buffer delay, and some vendors provide methods for doing this.

Application level programs like ping and traceroute provide a means of measuring end-to-end delay. Most network management systems rely on pings to monitor performance of a given path.
Methodologies for delay measurement on a node level will vary depending on vendor implementation.

If all the nodes in the path of a packet are closely synchronized to a GPS clock, for example by way of NTP (Network Time Protocol), timestamps can be used to measure one-way packet delay. The source node will place a time-stamp in the packet and send it towards the destination. The destination node, upon receiving the packet, time-stamps it. The difference in value of the two time-stamps, along with any adjustment (adjustments may be necessary due to differences in clock synchronization), is the one-way packet delay. The process can be repeated periodically with 3 to 5 packets sent in each instance.

In addition to buffer delay, delay measurements can be impacted by frame translation. When IP traffic is being switched or routed from one device to another, a SAR (segmentation and reassembly) process can take place to translate the frame format. This will add delay to the switching or routing. Delay metrics for TE measurement can be optimized by engineering flows to avoid unnecessary frame translation or SAR.

8. Path

Path can be described as the hops that packets in a flow will take from ingress node to egress node. It is not uncommon for there to be three separate layers of path information: physical layer, switched layer, and IP layer. Programs such as traceroute and ping can provide a record of the nodes that a packet has to traverse. Ping and traceroute only provide IP layer information. When a traceroute UDP packet, or a ping with a record option set, is received by a node, the packet leaves the switching path and the information regarding the switching environment in the node is lost.

Path information provides the ability to determine a flow's preferred topology. Maintaining a history of previously preferred paths provides the ability to determine where a flow has previously lived and will provide the ability to prepare for network failures.
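
Keeping a history of the paths a flow has used can be sketched as below. This is one possible bookkeeping approach, not something this memo prescribes: the PathHistory class and its method names are hypothetical, a path is represented as an ordered tuple of node names, and the "preferred" path is assumed to be simply the one observed most often.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Path = Tuple[str, ...]


class PathHistory:
    """Counts how often each path has been observed for each flow."""

    def __init__(self) -> None:
        self._seen: Dict[str, Counter] = {}

    def record(self, flow_id: str, hops: Sequence[str]) -> None:
        """Record one observation of the hops a flow currently takes."""
        self._seen.setdefault(flow_id, Counter())[tuple(hops)] += 1

    def preferred_path(self, flow_id: str) -> Optional[Path]:
        """Most frequently observed path, or None if nothing is recorded."""
        counts = self._seen.get(flow_id)
        if not counts:
            return None
        return counts.most_common(1)[0][0]

    def other_paths(self, flow_id: str) -> List[Path]:
        """Paths the flow has used other than its preferred one."""
        preferred = self.preferred_path(flow_id)
        return [p for p in self._seen.get(flow_id, {}) if p != preferred]
```

The non-preferred paths returned by other_paths are candidates for where a flow may land again during an outage.
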
Historical path information is used to determine failure scenarios that would represent overload based on aggregate potential flows over failover links (links that are preferred during outages).

Hop count generally indicates on a node level how many nodes a packet has traversed in its quest for a destination. Simply counting the number of hops that a flow commonly prefers and sending an alert when the count exceeds thresholds will provide the ability to determine that a path has reached an unreasonable length or that network state has changed.

9. Lifetime

The lifetime of a flow is simply the measurement of the total time that the flow exists. As stated before, a flow can exist on a physical or logical interface and could be permanent (such as a backbone connection) or dynamic (perhaps a VPN connection at certain times of day). The lifetime can be used in several ways to help facilitate reliable network operations.

In a perfect world, a permanent flow would have an infinite lifetime. In reality, link outages, equipment failures, or scheduled maintenance will always cause a flow to have a finite lifetime. By tracking the lifetime of the flow, its performance and reliability may be characterized. The information gleaned from flow lifetimes could be applied to a network monitoring tool to alert operators to potential problems at lower OSI layers.

Dynamic flow lifetime information is also very useful to operators or capacity planners. The range of specialized IP services offered continues to grow, and planners will need to be able to maximize the use of their network resources (while minimizing loss, of course). By understanding the lifetime of flows on the network it is possible to optimize traffic to use the network to the fullest extent while still maintaining an acceptable level of quality.

10. Applications of TE measurement

Over a period of time, static and dynamic measurement metrics should be able to provide data for long-term TE. Long term TE includes traffic growth patterns, congestion issues and traffic peak patterns. Traffic growth and peak patterns can be derived from measurements such as peak and average rate. Measurements must facilitate proactive TE strategy planning to optimize the network and to avoid undesirable network conditions.

It is incumbent on the operator to determine the intervals at which measurements should be taken. The rate of change in the 95th percentile (throughput change over time) should cue the network operator to increase the frequency of TE efforts. An operator in the summer months may adjust flow parameters on a monthly basis, and in the winter months the operator may need to adjust on a weekly basis. Tracking the rate of change over time will help the operator predict this type of behaviour.

Policy-based TE measurements should compare metric values with thresholds based on the policy to trigger the appropriate actions. The policy-based measurements should be able to alert operators to potential traffic issues.

The comparison of measurements and policy-based thresholds can be set up statically at a pre-defined interval or dynamically at event occurrence. For instance, in the event of path preemption, the traffic pattern can be impacted and the traffic flow changed. Measurements should be compared with the threshold values to ensure that proper actions are taken if the preemption induces some undesirable effect on the traffic pattern.

Policy-based TE should be in compliance with Policy Information Base (PIB) specifications.

Constraint-based routing (CBR) TE specifies a finer subset of policy-based TE. CBR takes place when all the specified constraints are met by the TE measurements.
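
The rule that all specified constraints must be met by the measurements can be sketched as a simple check. This is a hedged illustration, not a routing algorithm: the Constraints and LinkMeasurement types and their fields are hypothetical, loss and delay limits stand in for "service levels", and it is an assumption of this sketch that available bandwidth must be at least the committed rate and at least the peak rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Constraints:
    """Hypothetical constraint set for a flow."""
    peak_rate_bps: float
    committed_rate_bps: float
    max_loss_ratio: float
    max_delay_ms: float


@dataclass
class LinkMeasurement:
    """Hypothetical TE measurements for a candidate link or path."""
    available_bandwidth_bps: float
    loss_ratio: float
    delay_ms: float


def unmet_constraints(m: LinkMeasurement, c: Constraints) -> List[str]:
    """Names of the constraints that the measurements fail to meet."""
    unmet = []
    if m.available_bandwidth_bps < c.committed_rate_bps:
        unmet.append("committed_rate")
    if m.available_bandwidth_bps < c.peak_rate_bps:
        unmet.append("peak_rate")
    if m.loss_ratio > c.max_loss_ratio:
        unmet.append("loss")
    if m.delay_ms > c.max_delay_ms:
        unmet.append("delay")
    return unmet


def satisfies(m: LinkMeasurement, c: Constraints) -> bool:
    """True only when every constraint is met."""
    return not unmet_constraints(m, c)
```

Returning the names of the unmet constraints, rather than only a boolean, lets an operator see which measurement ruled a path out.
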
Measurements must provide the explicit traffic characteristics in order to perform the comparison for CBR. Constraint specifications can include peak rate, committed rate and service levels. Policy-based TE measurements, such as bandwidth availability, can be compared with the peak rate and committed rate constraints to determine if they are met.

11. Additional TE measurement considerations

11.1. Protocol-independent link bundling considerations

In order to reduce the overhead in managing multiple virtual links that originate and terminate at the same ingress and egress points, there is a proposal to aggregate links for network optimization. Component links will have the same constraints, resource classes and attributes. Multiple virtual links are treated as a single IP link. TE measurements, such as bandwidth availability and throughput, should consider the measurements for bundled virtual links.

There are ongoing discussions on virtual link/channel bundling for various standards under development or enhancement, such as MPLS and optical networks. TE measurements for virtual link/channel bundling should be protocol independent and media independent to ensure portability and commonality in the measurements.

11.2. Feedback mechanisms for topology state considerations

As part of the constraint-based routing measurements, all nodes require topology state information. TE measurements should provide information such as link availability and the maximum constraints/resources that each link can meet.

Topology information, such as throughput, loss, and bandwidth availability, changes continuously in a large-scale environment. Information distribution methodology is usually based on flooding or a pre-determined algorithm for topology changes. Distributing and updating topology information takes time, while bandwidth measurements can change immediately.
As a result, not every node will have the same topology view. In a large-scale operations environment, the topology information discrepancies on different nodes can be a problem in the event of failure or during recovery.

TE measurements should consider the recent proposal for signaling protocols to include the actual link bandwidth availability at every link that they traverse. This feedback mechanism for topology will require additional TE measurements to provide the actual information as part of the reverse flowing messaging. The RSVP TLV-type of measurements should be protocol independent.

In addition to the feedback on the actual bandwidth, future TE measurements should consider information on the actual utilization, current congestion, and number of channels or wavelengths available as part of the feedback mechanism.

11.3. Optical network considerations

Optical network development is adding new dimensions to TE measurements. The role of optical switches in the traditional data router/switch network is increasing, and TE measurements need to provide information on optical performance.

Optical performance measurements for TE should include LOS, BER, insertion loss, OSNR, optical channel registration, optical compliance deviation, and optical power level. The information can be distributed to the edge devices that interface the optical layer and data layer. With these optical network measurements and IP data TE measurements, virtual paths/channels can be managed dynamically and performance can be optimized.

The development of a traffic engineering control plane function in the optical network will require additional TE measurements. There can be similarities in TE measurements for optical channels and labels, specifically resource availability and constraints for network dimensioning.

11.4. ICMP extensions for one-way performance metrics

TE measurements should consider the extension of ICMP for one-way traffic measurements. The new ICMP messages, type 41 and type 42, are probe packets for the probe request message and probe reply message, respectively. They can provide information on one-way delay based on timestamp information and one-way loss rate based on the encoded sequence number. The one-way delay and one-way loss can be useful in the TE one-way performance metric measurements.

11.5. New requirement considerations

Internet application development is increasing the complexity of the TE metrics. An example is TE multicast, which requires measurements to facilitate traffic optimization when multicast and unicast traffic co-exist. TE measurements for multicast need to provide information on constraints such as network utilization, channel availability, delay, loss and throughput when creating the multicast tree.

Similarly, additional considerations for TE measurements are needed for voice over IP applications.

12. Acknowledgments

Special thanks to Syed Malik, Josh Wepman, Brad Volz, Roshan Winslow, and Rick Glasser from UUNET. And yet more thanks to Ed Balas and Mark Davisson from Caimis and to Abha Ahuja from the University of Michigan.

13. Authors' Addresses

Blaine Christian
UUNET
Blaine@uu.net

Brian Davies
UUNET
Daviesb@uu.net

Heidi Tse
UUNET
Htse@uu.net

14. References

[AWD1] D. Awduche, J. Malcolm, J. Agogbua, M. O'Dell, J. McManus, "Requirements for Traffic Engineering over MPLS", RFC 2702, September 1999

[AWD2] D. Awduche, A. Chiu, A. Elwalid, I. Widjaja, X. Xiao, "A Framework for Internet Traffic Engineering", Work in Progress, May 2000