Integrated Services over Specific Link Layers                 E. Horlait
Internet Draft                                                 M. Bouyer
Document: draft-horlait-clep-00.txt                   Paris 6 University
                                                               July 1999


               CLEP (Controlled Load Ethernet Protocol):
     Bandwidth Management and Reservation Protocol for Shared Media


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026, except that the right to
   produce derivative works is not granted.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This memo is filed as draft-horlait-clep-00.txt, and expires
   February 1, 2000.  Please send comments to the authors.

   The protocol described in this memo is patented.

1. Abstract

   There are various aspects to Quality of Service management.  In this
   draft, we address the problem of bandwidth allocation and
   reservation over shared media (e.g. an Ethernet network).  To do so,
   we define a protocol (CLEP: Controlled Load Ethernet Protocol) in
   charge of the management, allocation and fair sharing of the
   available bandwidth among the users of the network.  Load control is
   done via token bucket filters on the outgoing interfaces of network
   elements.  Our protocol manages the parameters of these token
   buckets so as to perform admission control.  This service can be
   used alone, or together with the Resource ReSerVation Protocol
   (RSVP) [1].  An implementation framework for this proposal is given
   in section 3, and the distributed algorithm is described in
   section 4.

2. Conventions used in this document

   This document is based on the service defined in [2] and the service
   specification templates given in [3].  A summary of the most
   important definitions is given hereafter.

   Quality of Service (QoS)
      This refers to the nature of the achieved packet delivery.  A
      network offering dynamically controllable QoS allows individual
      applications to request packet delivery characteristics that fit
      their needs.

   Network Element
      A Network Element (or Element) is any component of an
      internetwork which directly handles data packets, and thus may
      exercise QoS control over the data flow.  Examples include (but
      are not limited to) routers, subnetworks, and end-node operating
      systems.

   Flow
      A Flow is a set of packets all covered by the same request for
      QoS control.  This may be the packets of a single application
      session, or the aggregated traffic of several application
      sessions.

   TSpec and RSpec
      A TSpec (Traffic Specification) is a description of the traffic
      pattern for which a QoS control service is requested.  A Service
      Request Specification (or RSpec) specifies the Quality of Service
      a flow wishes to request from a network element.

   QoS control Service
      A QoS control Service (or, when there is no ambiguity, Service)
      is a named set of QoS control capabilities provided by a single
      network element.
   Token Bucket
      A Token Bucket is a particular form of TSpec, consisting of a
      "token rate" r and a "bucket size" b.  Essentially, the r
      parameter specifies the continually sustained data rate, and b
      the extent to which the data rate may exceed the sustained level
      for short periods of time.

   Best effort traffic (or best effort flow)
      Best effort traffic (or a best effort flow) is a flow generated
      by an application that doesn't request any special QoS control
      service.  A privileged traffic (or privileged flow) is a flow
      which has a special QoS control requirement (e.g. in terms of
      bandwidth).

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [4].

3. Controlling the load of an Ethernet shared network

   In an Ethernet bus architecture, all the transmitters share the same
   resources.  This means that a transmitter doesn't have any guarantee
   about the bandwidth available for its own use, unless the other
   transmitters on the bus restrict their throughput.  A simple
   priority queuing algorithm will not meet the requirements of the
   Controlled-Load service: if one or several transmitters start
   overflowing the link, the other transmitters will see their
   throughput fall to a value close to zero.  All the transmitters on
   the bus must therefore restrict their throughput to a per-
   transmitter maximum value, which will be called R.  In any case,
   restricting the throughput of the transmitters will avoid neither
   collisions nor packet loss.  This implies that the offered service
   is still a best-effort service; but if the sum of the throughputs of
   all stations is less than the bandwidth of the link, each
   transmitter will statistically see a throughput close to its limit
   value R.

   There are several ways to limit the rate of a data flow.  The best
   suited method here is a leaky bucket or token bucket style filter.
   As we have to manage several flows with different levels of QoS,
   there should be one filter per flow, and the throughput R will be
   the sum of the rates of the different filters.  For each
   transmitter, there is at least one filter for the standard best-
   effort class of traffic, plus one filter per privileged flow.  As
   the TSpec provided for the flows which require special QoS control
   is characterized by a token bucket filter, we propose to implement
   the filter for these flows with a token bucket.  The filter for the
   best effort traffic will also be a token bucket.  Compared to a
   simple leaky bucket, a token bucket allows bursts of traffic, which
   minimizes the effect of the bandwidth limitation on usual traffic
   (NFS, TCP connections, ...).

   For implementation reasons, we define here a token bucket with two
   parameters (n, t), where n is the number of tokens and t is the time
   needed for a token to return to the free token pool.  The relation
   to the definition given in section 2 is: b = n and r = n/t.

   Packets generated by the applications are classified with respect to
   their QoS requirements before being submitted to the filters.  These
   filters must also ensure some flow conformance control.  The
   handling of best effort flows and that of privileged flows is, of
   course, quite different.  Packets from the best-effort flow are
   stored in a queue before being submitted to the filter.  If the
   queue overflows, the packets are simply discarded.
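   The following C fragment is a minimal sketch of such an (n, t) token
   bucket filter, with tokens counted in bytes.  It is only
   illustrative: the names (tb_filter, tb_refill, tb_conforms) are not
   taken from the actual implementation, and refilling the pool at rate
   n/t up to a ceiling of n tokens is used as an equivalent of
   returning each consumed token to the free pool t seconds later.

   #include <stddef.h>
   #include <sys/time.h>

   struct tb_filter {
       double n;            /* bucket size b, in bytes (pool ceiling) */
       double t;            /* seconds for a token to return to pool  */
       double tokens;       /* tokens currently in the free pool      */
       struct timeval last; /* time of the last refill                */
   };

   static double
   tv_delta(const struct timeval *a, const struct timeval *b)
   {
       return (a->tv_sec - b->tv_sec) + (a->tv_usec - b->tv_usec) / 1e6;
   }

   /* Refill the free token pool at rate r = n/t, capped at n. */
   static void
   tb_refill(struct tb_filter *tb)
   {
       struct timeval now;

       gettimeofday(&now, NULL);
       tb->tokens += tv_delta(&now, &tb->last) * (tb->n / tb->t);
       if (tb->tokens > tb->n)
           tb->tokens = tb->n;
       tb->last = now;
   }

   /*
    * Ask whether a packet of 'len' bytes conforms.  A conforming
    * packet consumes len tokens; a non-conforming one leaves the pool
    * untouched so the caller can queue or discard it.
    */
   static int
   tb_conforms(struct tb_filter *tb, size_t len)
   {
       tb_refill(tb);
       if (tb->tokens >= (double)len) {
           tb->tokens -= (double)len;
           return 1;
       }
       return 0;
   }

   A conforming packet is transmitted immediately; a best-effort packet
   that does not conform waits in its queue until tb_conforms()
   succeeds.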
   Packets from a privileged flow have to be handled differently, as
   the specification of the controlled load service requires that
   packets which don't conform to the TSpec be handled as best-effort
   packets.  However, they can't simply be appended to the best-effort
   queue: there may be many packets already waiting in that queue, so a
   non-conforming packet would be delayed significantly, and would
   probably be discarded by the receiver.  These packets therefore have
   to be forwarded as soon as possible, but without disrupting the best
   effort flow.  To achieve this, the following algorithm is used: if a
   packet from a privileged flow does not conform to its token bucket
   filter, it is forwarded as a best-effort packet, provided this
   doesn't create a resource shortage for the best effort flow; that is
   to say, provided there are more tokens in the best-effort free token
   pool than bytes of packets waiting in the best-effort queue plus the
   size of the packet to be forwarded.  Otherwise the packet is
   discarded.
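   The following fragment sketches this decision, reusing the tb_filter
   structure of the previous sketch.  The be_queue structure and its
   qlen_bytes field are illustrative placeholders for the real queue
   accounting.

   /* Byte count of the packets waiting in the best-effort queue. */
   struct be_queue {
       size_t qlen_bytes;
   };

   /* Returns 1 if a non-conforming privileged packet may be demoted
      to best effort, 0 if it must be discarded. */
   static int
   demote_to_best_effort(struct tb_filter *be_tb, struct be_queue *be_q,
       size_t pkt_len)
   {
       tb_refill(be_tb);   /* bring the free token pool up to date */

       /*
        * Demote only if the best-effort pool holds more free tokens
        * than the bytes already queued plus this packet, so that the
        * demoted packet cannot starve the regular best-effort flow.
        */
       return be_tb->tokens > (double)(be_q->qlen_bytes + pkt_len);
   }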
   Some applications generate very low rate data flows with traffic
   bursts, but need a much better reliability than that provided by the
   best-effort queue when the traffic exceeds the capacity of the token
   bucket filter.  Examples of such applications are routing protocols,
   NTP or RSVP.  Such protocols won't work at all with a high packet
   loss rate.  The generated flow does not require dedicated QoS
   handling with its own token bucket (such a flow is, in any case,
   somewhat difficult to characterize with a token bucket, because of
   its low rate); it just requires a special priority.  For this
   purpose, two queues with different priorities are placed before the
   best effort token bucket.  Figure 1 shows the overall architecture
   of a network element implementing our Controlled Load service.

   Best effort  --------+
   Flow (low -->        |---+             Token Bucket
   priority)    --------+   |             Filter Nbe, Tbe
                            |           +---+
   Best effort  --------+   +---------->|   |------------> Medium
   Flow (high -->       |---+           +---+     ^   ^
   priority)    --------+                         |   |
                                                  |   |
                             N1, T1               |   |
   Privileged   --------+      +---+              |   |
   Flow #1  -->         |----->|   |--------------+   |
                --------+      +---+                  |
        .                                             |
        .                      Nn, Tn                 |
   Privileged   --------+      +---+                  |
   Flow #n  -->         |----->|   |------------------+
                --------+      +---+

           Figure 1: Architecture of the Network Element

   It is to be noted that the maximum datagram size of the best effort
   flow is the MTU of the link, so the Nbe parameter of the best effort
   token bucket filter must be greater than this MTU.

4. The CLEP Protocol

   The network elements implementing the architecture described in the
   previous section need to exchange information in order to adjust
   their token bucket parameters.  Doing so, they are able to use the
   maximum available bandwidth of the underlying link without exceeding
   it.  This section describes the rules used to compute the parameters
   of the token buckets, as well as the network protocol used by the
   network elements to keep their states consistent.

   From the point of view of resource sharing among network elements,
   there are only two parameters to take into account: the amount of
   resources allocated to best-effort flows, and the amount of
   resources allocated to privileged flows.  These resources are
   evaluated as allocated bandwidth, so the values exchanged by the
   network elements are token bucket rates, defined by R = N/T.

   To compute the reserved and available bandwidth, every network
   element needs to know the amount of bandwidth reserved for the
   best-effort and privileged flows by all the other network elements.
   We call Rbe the rate of the best effort token bucket, and Rpriv the
   sum of the rates of the privileged token buckets.  These two
   parameters are exchanged between network elements using the CLEP
   protocol.

   Each network element periodically broadcasts on the link its Rbe and
   Rpriv parameters, as well as a flag WM (Wants More) indicating a
   need for more resources, and Rmin, the minimum value for Rbe (this
   value is set by the administrator of the network element).  Each
   network element keeps all the received (Rbe, Rpriv, Rmin, WM) tuples
   in a table, which is used to compute Rfree, the bandwidth available
   to this machine:

                    Rmax - sum(Rbe + Rpriv)
      Rfree = ------------------------------------
              Number of elements with WM flag set

   In this formula, Rmax is the total available bandwidth of the link.
   Another parameter, RfreeBE, is also evaluated.  RfreeBE is equal to
   Rfree if Rbe is less than the average per network element best
   effort bandwidth, and to Rfree - Rmax/100 otherwise.  Doing so,
   elements that use more bandwidth than the per-element average will
   decrease their resource consumption, while the others can still
   increase theirs.

   When a change occurs in the table, new values are computed
   immediately.  If RfreeBE becomes negative, the network element
   decreases its Rbe by

         Rbe - Rmin
      ----------------- * (-RfreeBE + 0.5)
       sum(Rbe - Rmin)

   unless Rbe is already at its minimal value (this formula has been
   designed to provide a fair decreasing process; both computations are
   sketched at the end of this section).  After computing these values,
   a broadcast message is sent over the network.  As all network
   elements perform the same computation, Rfree becomes positive again,
   unless all available resources are still allocated.

   Given this information, the admission control algorithm for a new
   reservation Dr is:

   - if Dr is less than or equal to Rfree, the reservation is accepted;
     Rpriv is increased and Rfree is decreased;

   - if Dr is greater than Rmax - sum(Rpriv + Rmin), the reservation is
     rejected;

   - in any other case, Rpriv is replaced by (Rpriv + Dr), and a
     message with these parameters is broadcast; the new Rfree should
     be negative, and a decreasing process of Rbe is started; after a
     certain time, if Rfree is still negative, the reservation is
     rejected; if Rfree has become positive, the reservation is
     accepted.

   A race condition can appear here: if two elements request a new
   reservation at the same time, the two reservations may both fail
   where one of the two alone would have succeeded.  In this case, it
   is possible to retry the reservation after a short random delay.

   A network element may decide to raise its Rbe if its best effort
   queue overflows too often.  In this case, it may raise it up to
   Rfree, depending on its own needs and those of the other network
   elements.  A network element may also decrease its Rbe if it does
   not use all the allocated bandwidth, or to redistribute the best
   effort bandwidth among other network elements requesting more
   resources.  This allows the network elements to dynamically use the
   available best-effort bandwidth, and to adapt their Rbe to their
   needs.
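   The following fragment sketches the Rfree and RfreeBE computations
   and the fair Rbe decrease described above.  The clep_peer and
   clep_state structures are illustrative, not taken from the actual
   implementation; in particular, comparing the local Rbe against the
   arithmetic mean of all advertised Rbe values is our reading of "the
   average per network element best effort bandwidth".

   #include <stddef.h>

   struct clep_peer {
       double rbe;     /* element's best-effort rate            */
       double rpriv;   /* sum of the element's privileged rates */
       double rmin;    /* element's administrative Rbe floor    */
       int    wm;      /* element's Wants More flag             */
   };

   struct clep_state {
       double rmax;           /* link bandwidth, same on every node  */
       struct clep_peer *tab; /* table of all elements, self included */
       size_t n;
   };

   static double
   clep_rfree(const struct clep_state *st)
   {
       double used = 0.0;
       size_t i, wanting = 0;

       for (i = 0; i < st->n; i++) {
           used += st->tab[i].rbe + st->tab[i].rpriv;
           if (st->tab[i].wm)
               wanting++;
       }
       /* Rfree = (Rmax - sum(Rbe + Rpriv)) / number of WM elements */
       return (st->rmax - used) / (wanting ? (double)wanting : 1.0);
   }

   static double
   clep_rfree_be(const struct clep_state *st, double local_rbe)
   {
       double rbe_sum = 0.0, rfree = clep_rfree(st);
       size_t i;

       for (i = 0; i < st->n; i++)
           rbe_sum += st->tab[i].rbe;
       /* Elements above the per-element average are pushed harder. */
       if (local_rbe < rbe_sum / (double)st->n)
           return rfree;
       return rfree - st->rmax / 100.0;
   }

   /* Local share of the decrease when RfreeBE has gone negative. */
   static double
   clep_rbe_decrease(const struct clep_state *st, double local_rbe,
       double local_rmin, double rfree_be)
   {
       double margin_sum = 0.0;
       size_t i;

       if (rfree_be >= 0.0 || local_rbe <= local_rmin)
           return 0.0;
       for (i = 0; i < st->n; i++)
           margin_sum += st->tab[i].rbe - st->tab[i].rmin;
       /* (Rbe - Rmin) / sum(Rbe - Rmin) * (-RfreeBE + 0.5) */
       return (local_rbe - local_rmin) / margin_sum
           * (-rfree_be + 0.5);
   }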
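   The admission control test itself can then be sketched as follows,
   on top of the previous fragment.  CLEP_PENDING stands for the third
   case above; the timer that finally accepts or rejects the pending
   reservation is not shown.

   enum clep_verdict { CLEP_ACCEPT, CLEP_REJECT, CLEP_PENDING };

   static enum clep_verdict
   clep_admit(const struct clep_state *st, double dr)
   {
       double rfree = clep_rfree(st);
       double floor_sum = 0.0;
       size_t i;

       for (i = 0; i < st->n; i++)
           floor_sum += st->tab[i].rpriv + st->tab[i].rmin;

       if (dr <= rfree)
           return CLEP_ACCEPT;   /* fits in the spare bandwidth */
       if (dr > st->rmax - floor_sum)
           return CLEP_REJECT;   /* can never fit, even if every
                                    element drops Rbe to Rmin   */
       return CLEP_PENDING;      /* raise Rpriv optimistically,
                                    broadcast, start the decrease */
   }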
5. Architecture of a CLEP Network Element

   In order to use this control method, a network element must
   implement some dedicated functions.  The base components of a node
   are the token bucket filters, the packet classifier, the CLEP
   daemon, and the signaling protocol.  Figure 2 gives an overview of
   the relationships between these components.

   The CLEP daemon is responsible for state data management and is in
   charge of computing the token bucket parameters that it sets in the
   system.  It receives and produces CLEP messages.  Applications can
   send their QoS requests to the CLEP daemon via a local interface.
   This same interface can also be used by signaling protocols like
   RSVP to issue QoS requests.  The CLEP daemon sets parameters in the
   packet classifier in order to adequately route packets from the
   applications to the token bucket filter and queue corresponding to
   their traffic class.  The token bucket module receives its
   parameters from the CLEP daemon and gives back statistics on queue
   length, bucket size, drops, and so on.

                           +---------------+
                           | Applications  |
                           +---------------+
                              |        |
              +---------------+        |
              |                        V
              |   +-------------+    +---------------+   Parameters
              |   |  Signaling  |<-->|  CLEP daemon  |-----------------+
              |   +-------------+    +---------------+                 |
              |                        |         ^                     V
              |                        |         |   Statistics  +---------------+
              |                        |         +---------------| Token Buckets |
              |                        | Parameters              +---------------+
              |                        V                                ^
              |                      +---------------+                  |
              +--------------------->|    Packet     |------------------+
                                     |  Classifier   |
                                     +---------------+

          Figure 2: Functional structure of a CLEP capable node

   As far as implementation is concerned, the token buckets and the
   packet classifier are to be implemented where the networking
   protocols are, that is, probably in the kernel.  The CLEP daemon,
   signaling and applications live in user space.

6. CLEP Protocol Elements

   The CLEP protocol uses UDP port 580.  The message structure is shown
   in figure 3.  All values are in network byte order.

   Vers
      Version of the protocol; the current version is 1.

   W
      Wants More flag.

   X
      Exit flag.

   Current value of Rbe
      An unsigned integer, in bytes per second.

   Value of Rmin
      An unsigned integer, in bytes per second.  This parameter is set
      by the node administrator.

   Current value of Rpriv
      An unsigned integer, in bytes per second.  This field conveys the
      current Rpriv value, or the expected one in case of a reservation
      request.

   Value of Rmax
      An unsigned integer, in bytes per second.  This parameter is set
      by the administrator of the node and must be the same for all
      nodes.  This field is used for consistency checks.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Vers      |                  Unused                   |W|X|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Current value of Rbe                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Value of Rmin                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Current value of Rpriv                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Value of Rmax                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   Figure 3: CLEP message structure

   Two timers are used for protocol control purposes: Tbroadcast and
   Tcheck.  After some experiments, we set Tbroadcast to 30 seconds and
   Tcheck to 1 second.

   At startup, a network element sets its Rpriv to 0 and its Rbe to
   Rmin, sends a CLEP message, and starts listening on the UDP port.
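   The following fragment sketches the construction and broadcast of
   one CLEP message.  The field order, the 32-bit rates in network byte
   order and UDP port 580 are those of figure 3; the exact bit
   positions of Vers, W and X inside the first word (version in the top
   byte, W and X in the two low-order bits) are an assumption of this
   sketch.

   #include <arpa/inet.h>
   #include <netinet/in.h>
   #include <stdint.h>
   #include <sys/socket.h>

   #define CLEP_PORT    580
   #define CLEP_VERSION 1
   #define CLEP_F_W     0x2U   /* Wants More */
   #define CLEP_F_X     0x1U   /* Exit       */

   struct clep_msg {           /* five 32-bit words, network order */
       uint32_t head;          /* Vers | unused | W | X            */
       uint32_t rbe, rmin, rpriv, rmax;
   };

   static void
   clep_send(int sock, const struct sockaddr_in *bcast, uint32_t rbe,
       uint32_t rmin, uint32_t rpriv, uint32_t rmax, int wm, int exiting)
   {
       struct clep_msg m;
       uint32_t head = (uint32_t)CLEP_VERSION << 24;

       if (wm)      head |= CLEP_F_W;
       if (exiting) head |= CLEP_F_X;
       m.head  = htonl(head);
       m.rbe   = htonl(rbe);   /* all rates in bytes per second */
       m.rmin  = htonl(rmin);
       m.rpriv = htonl(rpriv);
       m.rmax  = htonl(rmax);
       (void)sendto(sock, &m, sizeof(m), 0,
           (const struct sockaddr *)bcast, sizeof(*bcast));
   }

   At startup, the element would call clep_send() with rpriv set to 0
   and rbe set to rmin, as described above.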
   Under normal circumstances, without any modification of the local
   parameters, a CLEP message is sent every (Tbroadcast - Delta), where
   Delta is a random value in the range 0-1 second.  This random value
   avoids synchronization between the network elements.

   Every Tcheck, the network element checks its interfaces and
   increases or decreases its Rbe if needed and/or allowed.  It then
   sends the new Rbe value in a CLEP message, with the W flag set in
   case of an increase.  This flag is also set if the network element
   requires more best effort resources than currently available.

   Upon arrival of a CLEP message, the protocol version is checked.  If
   the version number doesn't match one of the versions supported by
   the network element, the message is dropped and an error is logged.
   If the Rmax parameter doesn't match that of the receiving element,
   an error message should also be logged.  If the sending network
   element is a new one (it has never sent CLEP messages before), it is
   added to the local table with the content of the message; otherwise
   its table entry is updated with the information of the incoming
   message.  The new RfreeBE is then computed.  If it is negative, Rbe
   should be decreased according to the rules given in section 4.  If
   Rbe has changed, a CLEP message with the new parameters must be sent
   as soon as possible.  The parameters may then not be changed again
   before a delay of Tbroadcast/2, unless RfreeBE becomes positive
   again.  If RfreeBE stays negative for more than 3*Tbroadcast/2, an
   error message should be logged.

   When a network element is to be shut down, it should send a CLEP
   message with the X flag set and all its parameters set to 0.  If the
   information about a network element in the local table has not been
   updated (no CLEP message received from this element) in the last
   2*Tbroadcast seconds, the element should be removed from the host
   table, and Rfree computed again.
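   The following fragment sketches the receive-side processing and the
   table aging described above, reusing the message layout of the
   previous sketch.  The fixed-size peer table and all names are
   illustrative, and fprintf(3) to stderr stands in for a real logging
   facility.

   #include <arpa/inet.h>
   #include <netinet/in.h>
   #include <stdint.h>
   #include <stdio.h>
   #include <string.h>
   #include <time.h>

   #define TBROADCAST 30       /* seconds, as set above */
   #define MAXPEERS   64       /* illustrative table size */

   /* One (Rbe, Rpriv, Rmin, WM) tuple, plus bookkeeping. */
   struct peer_entry {
       struct in_addr addr;
       uint32_t rbe, rpriv, rmin;
       int      wm;
       time_t   seen;          /* last time a message arrived */
       int      valid;
   };

   static struct peer_entry peers[MAXPEERS];

   static struct peer_entry *
   peer_find(struct in_addr a, int create)
   {
       int i, free_slot = -1;

       for (i = 0; i < MAXPEERS; i++) {
           if (peers[i].valid && peers[i].addr.s_addr == a.s_addr)
               return &peers[i];
           if (!peers[i].valid && free_slot < 0)
               free_slot = i;
       }
       if (create && free_slot >= 0) {  /* new sender: add it */
           memset(&peers[free_slot], 0, sizeof(peers[free_slot]));
           peers[free_slot].addr = a;
           peers[free_slot].valid = 1;
           return &peers[free_slot];
       }
       return NULL;
   }

   /* Process one incoming message (struct clep_msg, previous sketch). */
   static void
   clep_recv(const struct clep_msg *m, struct in_addr from,
       uint32_t my_rmax)
   {
       uint32_t head = ntohl(m->head);
       struct peer_entry *p;

       if ((head >> 24) != CLEP_VERSION) {   /* unsupported version */
           fprintf(stderr, "clep: bad version from %s\n",
               inet_ntoa(from));
           return;                           /* drop the message */
       }
       if (ntohl(m->rmax) != my_rmax)        /* consistency check */
           fprintf(stderr, "clep: Rmax mismatch from %s\n",
               inet_ntoa(from));

       if ((p = peer_find(from, 1)) == NULL)
           return;                           /* table full: ignore */
       if (head & CLEP_F_X) {                /* element shutting down */
           p->valid = 0;
           return;
       }
       p->rbe   = ntohl(m->rbe);
       p->rmin  = ntohl(m->rmin);
       p->rpriv = ntohl(m->rpriv);
       p->wm    = (head & CLEP_F_W) != 0;
       p->seen  = time(NULL);
       /* ...then recompute RfreeBE and, if it went negative, start
          the Rbe decrease of section 4. */
   }

   /* Expire peers silent for more than 2*Tbroadcast; the caller then
      recomputes Rfree. */
   static void
   clep_age_table(void)
   {
       time_t now = time(NULL);
       int i;

       for (i = 0; i < MAXPEERS; i++)
           if (peers[i].valid && now - peers[i].seen > 2 * TBROADCAST)
               peers[i].valid = 0;
   }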
7. Experiments and Results

   An implementation of this controlled load service using CLEP is
   available.  The development has been carried out on the NetBSD [5]
   operating system, versions 1.3 and 1.4.  The interface between CLEP
   and the ISI implementation of RSVP [6] is also running.  In parallel
   with the actual implementation, we have also developed a simulator
   of this system, using the NS [7] network simulator.  All this code
   (the simulator as well as the NetBSD code) is available upon
   request; please contact the authors.

8. References

   1. Braden, R., et al., "Resource ReSerVation Protocol (RSVP) --
      Version 1 Functional Specification", RFC 2205, Internet
      Engineering Task Force, 1997.

   2. Wroclawski, J., "Specification of the Controlled-Load Network
      Element Service", RFC 2211, Internet Engineering Task Force,
      1997.

   3. Shenker, S. and J. Wroclawski, "General Characterization
      Parameters for Integrated Service Network Elements", RFC 2215,
      Internet Engineering Task Force, 1997.

   4. Bradner, S., "Key words for use in RFCs to Indicate Requirement
      Levels", RFC 2119, Internet Engineering Task Force, 1997.

   5. NetBSD Operating System, The NetBSD Project,
      http://www.netbsd.org.

   6. RSVP, Reservation Setup Protocol, USC Information Sciences
      Institute, http://www.isi.edu/div7/rsvp/.

   7. Network Simulator (version 2), UCB/LBNL/VINT project,
      http://www-mash.cs.berkeley.edu/ns.

9. Acknowledgements

   This protocol has been specified, developed and implemented under a
   grant from ALCATEL CRC, France.  Thanks to Pascal Anelli from
   Universite Pierre et Marie Curie, Laboratoire LIP6, who developed
   the simulation model of CLEP.

10. Authors' Addresses

   Eric Horlait
   Universite Pierre et Marie Curie
   Laboratoire LIP6
   8, rue du Capitaine Scott
   75015 PARIS
   France
   Email: Eric.Horlait@lip6.fr

   Manuel Bouyer
   Universite Pierre et Marie Curie
   Laboratoire LIP6
   8, rue du Capitaine Scott
   75015 PARIS
   France
   Email: Manuel.Bouyer@lip6.fr