INTERNET-DRAFT                                              J. L. Adams
Individual submission                                                BT
rap working group                                           A. J. Smith
Expires May 31, 2002                               Cranfield University
                                                          December 2001

          A New QoS Mechanism for Mass-Market Broadband
               draft-adams-qos-broadband-00.txt

Status of this Memo

This document is an Internet-Draft and is subject to all provisions of
Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as 'work in progress'.

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

Abstract

This document describes a proposal that deals with congestion
conditions which may arise when a home or SME customer requests too
many simultaneous flows to be forwarded down a DSL link or other access
technology. It provides a way of guaranteeing certain flows while
making others (typically the latest, or another flow selected for
policy reasons) the target of focused packet discards. It has a number
of significant benefits over other possible solutions, such as
classical RSVP, and these are also listed in this document.

1 Introduction

Broadband services delivered over DSL to residential or SME consumers
have been the focus of much interest recently. The potential
opportunities include TV distribution for selected residential areas,
combined with voice and data services. Consumers may select lower value
packages to begin with and move progressively, through a process of
'upsell', towards higher value packages.
It is envisaged that an upsell is automatically configured after the
consumer selects it, e.g. using a browser. This creates a market
opportunity where lower value packages may be the normal offering in
early realisations, and higher value packages are added to the platform
in stages as vendor equipment develops. For example, service packages
may exclude TV in the early offering.

Among the higher value packages that could be added later to a service
platform is one which relies on a QoS function controlling the
aggregate mix of services forwarded to each consumer. This QoS function
would protect certain flows that could be pre-selected by the
consumers. Such flows would not be interrupted or subject to packet
discard.

This Internet-Draft proposes a new QoS function at the IP layer that
provides policy-based flow protection for consumers. We believe this
new function has advantages over classical RSVP, but it may be
accommodated within a more lightweight version of that protocol.

2 Background

2.1 Edge Nodes

Edge nodes exist in a network and channel all content from service
providers towards customers. While this could be achieved using a
separate ATM VC for each service type (TV, voice, and data), it becomes
very complex if extended so that, e.g., the data VC is no longer a
single VC but a separate VC for each of several types of data. In
particular, web streaming would need a separate VC if its QoS is to be
treated differently from other data types. It is therefore advantageous
to aggregate all flows onto a single ATM VC, because each flow can then
be given policy-controlled QoS treatment. This implies that the IP
layer has to handle the separate QoS requirements of each service type.
Several vendors have developed equipment (Edge Nodes) that channels
services using separate ATM VCs, and several vendors are now
considering how they can move to IP-based multiservice aggregation.
The device described in this document is an improved Edge Node. It
operates in conjunction with equipment that is functionally similar to
that currently deployed, except for a modification to the set-top box
so that it can recognise certain new alarm signals created by the
device described here.

The target requirements for Edge Nodes are that large numbers of
customers should be connected (ultimately this may be 100,000 and
upwards per Edge Node). The customer is connected to the Edge Node via
a DSL link. In the network a number of fibre interfaces may be used,
e.g. ATM, which is well known as a simple and effective technology for
picking out individual or aggregated content flows for a specific
customer and forwarding them down the correct DSL links. ATM currently
places some restrictions on the maximum link rate; most vendors
currently stop at 622 Mbit/s for ATM interfaces, because their products
do not include SAR chips that go faster than this rate. However, it is
expected that 2.5 Gbit/s ATM links will be commonly available within
the next two years; this would permit rates towards groups of customers
to be increased. A higher rate of a few megabit/s per customer would
permit the aggregation of TV and VoD signals into the mix. We may
anticipate patterns of demand such that there will be a mix of both
lower rate customers and higher rate (multimegabit per second)
customers on the same link, enabling that link to handle many hundreds
of customers in total.

2.2 Quality of Service

While much effort has been directed by vendors towards the development
of an Edge Node, there is one aspect where further improvements are
needed. An Edge Node must be able to control QoS when congestion
occurs, and this is the subject of the device described in this
document. As an example scenario, consider a customer connected to a
Virtual Private Network (VPN) which in turn is connected to various
content sites.
The customer has subscribed to a basic service package, which provides
a main content source that can include TV and data. This basic service
package can be extended, and the customer is able to select from extra
TV or data content sources. More generally, a customer can be connected
to multiple VPNs and receive additional content via the internet. All
these sources of traffic can combine to cause congestion. Both the
simpler case of a single VPN and the extended case of multiple VPNs
lead to the QoS issue.

An example of this occurs if a source of real-time video is demanded at
the same time as streamed media is being viewed by another person in
the same home. These two flows could both have low loss tolerance. If
the combined traffic load produced by these sources is larger than the
capacity of the link, then some information is lost (typically from
both flows) and the perceived QoS becomes unacceptable.

The discarding of packets is best handled at a single point in the
network for all downstream flows to a specific customer and, logically,
this best point is the Edge Node. We propose that a shaping function
should be located at the Edge Node, which controls an envelope of
traffic destined for any one customer at, or below, the customer's link
capacity. This removes the need for sophisticated traffic handling
functions in the DSLAM equipment.

3 The device

The device described in this document would modify and improve the
above proposed shaping function. It could also operate equally well in
other network locations, wherever packets are buffered and can be
examined in terms of their flow identities and class type. Currently,
when flows consist of different priority information, such as video and
data, shapers would first cause the discard of the lower priority flows
(typically the data flow) and protect the video flows.
However, our device addresses the problem of equal priority flows that
cause congestion and are unable to slow down through, e.g., TCP
congestion control.

3.1 Classical RSVP: some disadvantages

Classical RSVP can be used for congestion control of IP based flows.
However, there are disadvantages with the full heavyweight version of
the protocol. RSVP messages are separate from the higher-level call
request and acknowledgement messages that lead to, e.g., phone ringing
(i.e. the H.225 messages). To introduce RSVP into, e.g., the standard
voice signalling message sequence requires the suspension of this
sequence and then its resumption following the successful completion of
the RSVP message sequence. This kind of suspension-resumption
methodology would have to be added to the higher-level signalling
sequence of any kind of content, to prevent such content from starting
to flow before the reservations have been made.

If some flows are variable bit rate, RSVP is faced with difficult
choices, each of which presents disadvantages:

- To admit the latest reservation request based on some average rate,
  with the possibility that the flow will exceed this average rate for
  significantly long intervals and cause congestion and loss of packets
  to itself and other reserved-bandwidth flows.

- To admit the latest reservation request based on a peak rate, thereby
  wasting some of the available capacity through the condition that
  flows will only be admitted while the sum of their peak rates is less
  than the available capacity.

- To operate a function that tries to estimate the remaining available
  capacity on a link by estimating a percentile point of the current
  offered traffic load, and use this estimate as the condition for
  accepting or rejecting the latest reservation request.
Another disadvantage of RSVP (and indeed any other call admittance
procedure) is the need to keep state information on flow arrivals and
cessations so that guaranteed bandwidth can be returned to a notional
common pool of available capacity.

Yet another disadvantage is the need to suspend higher-level session
control protocols until RSVP has completed its reservations. This
requires certain timeouts to be implemented so that suspension does not
continue indefinitely, and various failure modes then need to be
catered for, requiring additional state information to be kept. For
example, the 'call state' reached at the point where suspension is
implemented needs to be kept, so that it can be torn down if necessary.

3.2 Device Advantages

All these disadvantages are overcome with the device described in this
document. With this device it is possible to:

- Admit variable bit rate flows without being constrained to accept
  only a set of flows whose peak rates sum to less than the available
  capacity.

- Admit such flows without knowing the remaining capacity of the link.

- Admit flows without being required to keep active/ceased state
  information on admitted flows.

- Admit flows without requiring a suspension of higher-level session
  control protocols.

- Provide guarantees to each of the admitted flows except under certain
  extreme traffic conditions, when selected flows will be targeted for
  packet loss, enabling other flows to continue without any loss or
  undesirable packet delays.

3.3 Device Operation Description

What follows is a detailed description of how the device operates to
achieve these advantages.

3.3.1 Start Packet

When a flow towards the customer commences, a new control packet must
be sent; we have called this a 'Start Packet'. There is no requirement
for the flow to wait for any processing or acknowledgement of its Start
Packet; it can start transmitting actual data packets immediately after
the Start Packet.
A Start Packet is an IP-layer control packet with an identifying field.
This field may be split into two parts, with one part carried in the
standard IP header. For example, setting bit 49 (not yet used for any
other purpose) would identify the packet as a control packet. The other
part of the field is the first element of the information field, which
further identifies it as a Start Packet or as an Alarm message packet.
The exact nature of this field needs agreement within the standards
community. In other respects, a Start Packet carries the same
information (destination address, source address, and
source/destination port numbers) as appears in the IP packet headers of
the stream of data packets which form the flow behind the Start Packet.

Because of its identifying field, the Start Packet would be
recognisable to a packet discard device located, for example, at the
edge of the network to the customer. The basic principle is that the
Start Packet contains information (such as the IP header fields of the
subsequent data packets) which is loaded into a register by the Edge
Node. Subsequent data packets are examined, and if their headers match
what is in the register, then such packets may be discarded when the
buffer is filled beyond a certain threshold value.

Note that although we describe the device as operating on flows towards
customers converging at Edge Nodes, there is no restriction that it
operate only in that direction, or at one particular buffer point.

3.3.2 Functionality

The device has a set of functions which are co-located with a buffer to
achieve the advantages listed above. The buffer is part of the proposed
shaping function specific to a single customer, and the output from
this buffer towards the customer is restricted in maximum rate to be
compatible with the capacity of the corresponding link. The function
that controls this rate limitation is a scheduling function.
The set of functions includes:

- The customer-specific buffer. The implementation of such a buffer
  need not be in the form of physically separate buffers per customer;
  it would normally be a single buffer shared by all customers, with
  flow accounting maintained on a per-customer basis.

- A packet discard function, which maintains a state machine specific
  to a customer (although it should be noted that the number of states
  maintained per customer is far more limited than the number required
  by RSVP). It also serves to detect newly arriving Start Packets that
  are routed to the customer-specific buffer. It is an assumption of
  this description that there is already a routing process set up that
  routes packets (including Start Packets) destined for a specific
  customer towards a customer-specific buffer, where a shaped output is
  enforced. The buffer is needed to absorb some degree of burstiness in
  the arrival rate.

- A main processor, which controls which flows specific to a given
  customer may be subject to focused discards, as discussed further
  below. As with the buffer, an actual implementation would normally
  run a virtual process per customer in a single processor capable of
  handling all customers on a link.

- A register, which maintains discard control information specific to a
  given customer. Again, an actual implementation would use a single
  register which is divided on a per-customer basis into a number of
  virtual registers.

3.3.3 Basic Operation

In its simplest operation, a succession of Start Packets (preceding a
succession of new flows) is sent towards a customer and loaded into the
(virtual) register such that each overwrites the previous one;
therefore the register always contains the latest flow. When a flow
identity is removed from the register (usually by being overwritten)
the corresponding flow becomes bandwidth guaranteed, except under
certain extreme traffic conditions to be discussed below.
This means that, normally, no packets are discarded from such a flow
when the buffer experiences congestion. We speak here of such flows as
having entered the guaranteed area. If, over an interval of time, a
sequence of flows starts with their corresponding Start Packets, then
the normal behaviour of the system described here allows the earlier
flows to move to the guaranteed area. The register will always retain
at least one flow identity, whose packets will be the subject of
focused discard if the buffer becomes too full.

A focused discard is triggered when the buffer has sent a control
signal to the main control logic indicating that a fill threshold level
has been exceeded. This subsequently instructs the discarding function
to commence packet discarding. Before beginning to discard packets, the
discarding function sends two control packets. The first is sent
forward towards the customer; this control packet is called the
'Congestion Notification' packet. It advises the application resident
in the customer's equipment that a network congestion condition has
occurred. An application may choose to continue receiving those data
packets that are not deleted by the discarding function, or it may
close down and indicate network busy to the user. The second is sent
backwards towards the source to indicate that this flow is about to
become the subject of focused packet discard. Again, the source may
choose to ignore this control packet, or it may terminate the flow.

The discard function then begins to discard all packets whose flow
identity matches the identity in the flow register. Packet discarding
continues until the buffer fill level is reduced to a lower threshold
value. The main control logic may also inform a network billing
function that flow discarding has commenced on a specific flow, if the
charging arrangements require this information.
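The basic operation above can be sketched in a few lines of Python. This is an illustrative model only: the class, thresholds, and control-packet names are assumptions for the sketch, not part of this draft's specification. The latest Start Packet overwrites the register (so the previous flow enters the guaranteed area), an upper fill threshold triggers the two control packets and focused discard, and a lower threshold ends it.

```python
# Illustrative sketch only: names, thresholds, and the packet model are
# assumptions, not specified by this draft.
UPPER, LOWER = 80, 60                  # buffer fill thresholds (percent)

class BasicDiscardDevice:
    def __init__(self):
        self.register = None           # latest flow identity only
        self.discarding = False
        self.sent_control = []         # control packets emitted so far

    def on_start_packet(self, flow_id):
        # Overwrite: the previously registered flow is now guaranteed.
        self.register = flow_id

    def on_buffer_fill(self, fill):
        if not self.discarding and fill > UPPER:
            # Before discarding, send the two control packets.
            self.sent_control.append(("Congestion Notification", "to customer"))
            self.sent_control.append(("Discard Warning", "to source"))
            self.discarding = True
        elif self.discarding and fill < LOWER:
            self.discarding = False    # fill fell below the lower threshold

    def accept(self, flow_id):
        """True if a data packet is forwarded, False if discarded."""
        return not (self.discarding and flow_id == self.register)

dev = BasicDiscardDevice()
dev.on_start_packet("flow-A")
dev.on_start_packet("flow-B")          # flow-A is now guaranteed
dev.on_buffer_fill(90)                 # congestion: focused discard begins
assert dev.accept("flow-A")            # guaranteed flow is untouched
assert not dev.accept("flow-B")        # latest flow is the discard target
```

The two thresholds give hysteresis, so discarding does not flap on and off around a single fill level.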
In some preferred arrangements the customer will be billed on a
flat-rate basis, and it may therefore be unnecessary to send any
indication to a billing function. If an application chooses to close
down on receipt of the Congestion Notification signal, then it is
responsible for sending the appropriate signals to the source end to
shut down the flow. These procedures are outside the scope of this
device and will vary from application to application.

3.3.4 More Refined Operations

In a refinement of the simplest way of operating such control functions
for packet discard, a field in the Start Packet is utilised. This field
is known as the 'Rate Advisory' field, and it conveys the peak bit-rate
of the flow. The register is now loaded so that it always retains a set
of flows whose rate advisories sum to N percent (e.g. 5 percent) of the
link bandwidth. The value N can be varied to suit certain known traffic
conditions. It caters for the degree of uncertainty that exists when
accepting variable bit-rate flows. Thus, if the combined set of flows
in the 'guaranteed area' bursts to a load level which is significantly
higher than the link capacity, then focused discard on the set of flows
in the register can be expected to reduce the load by up to N percent
of the link capacity. This provides sufficient flows to focus packet
discards on in order to get the buffer fill level back below the
threshold value.

Another refinement concerns how equal priority flows are subjected to
focused discards when several such flows are currently in the register.
It is possible to operate the discard function so that the latest flow
is the most vulnerable, and earlier flows (which are still retained
because they make up a combined set of flows whose rate advisories sum
to N percent) become less and less vulnerable, until they eventually
leave the window into the guaranteed area.
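The Rate Advisory window described above can be sketched as follows. The function name, flow identifiers, and rates are illustrative assumptions: a new flow enters at the vulnerable end, and the oldest flow is released to the guaranteed area only while the remaining advisories still sum to at least N percent of the link rate.

```python
# Illustrative sketch of the Rate Advisory window; all names and rates
# here are assumptions, not taken from the draft.
from collections import deque

def admit(window, flow_id, rate_advisory, link_rate, n_percent=5.0):
    """Add a flow to the window, releasing earlier flows to the
    guaranteed area while the rest still cover N percent of the link."""
    target = link_rate * n_percent / 100.0
    window.append((flow_id, rate_advisory))      # latest = most vulnerable
    while len(window) > 1:
        oldest_rate = window[0][1]
        if sum(r for _, r in window) - oldest_rate >= target:
            window.popleft()                     # oldest becomes guaranteed
        else:
            break                                # still needed in the window

w = deque()
admit(w, "tv-1", 4_000_000, 100_000_000)    # 4 Mbit/s on a 100 Mbit/s link
admit(w, "vod-2", 2_000_000, 100_000_000)
admit(w, "data-3", 1_500_000, 100_000_000)
# All three remain in the window: dropping "tv-1" would leave only
# 3.5 Mbit/s of advisories, below the 5 Mbit/s (N = 5 percent) target.
assert len(w) == 3
```

A single high-rate arrival can flush several earlier flows into the guaranteed area at once, since its advisory alone may cover the N percent target.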
The discarding function will try to control the forwarding rate towards
the buffer according to a leaky bucket principle, where only a limited
burst of packets above a defined rate is permitted to be forwarded to
the buffer. This defined rate is equal to the rate at which the buffer
can transmit packets towards the customer. The discarding function can
start by discarding every packet of the latest flow, and only select
additional packets from other flows in its register if it would
otherwise exceed its burst size restrictions.

There are other ways of operating the discard function, such as
policy-based controls where, instead of the latest flow being the one
chosen for total discard, another flow is chosen on the basis of policy
information stored in the discard function. A specific way of obtaining
such policy information is to make use of a second control field of the
Start Packet. This field is termed the sub-components field, and it
allows policy information to be captured within the register and read
by the discard function. When a flow consists of different media
components, such as video and data, this sub-components field stores
information relating to each component, including its priority in terms
of packet discard. The packet matching performed on the data packets
passing through the discard function then includes not only the
destination and source addresses but also other information that
uniquely identifies a sub-component. This may include source or
destination port numbers, or other information such as TOS QoS
settings.

The fraction of the link bandwidth that is used as a control for the
set of flows retained in the discard function effectively defines a
'window' size for flow retention. Thus a flow starts up, its identity
enters the discard function, and it exits when further additional flows
have arrived whose combined set of rate advisories makes it no longer
necessary to retain this earlier flow.
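The leaky bucket principle mentioned above can be sketched as follows. The parameter values are illustrative assumptions; the draft only requires that the defined rate equal the buffer's transmission rate towards the customer.

```python
# Illustrative leaky-bucket sketch; class name and parameters are
# assumptions, not drawn from the draft.
class LeakyBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0    # drain rate in bytes per second
        self.burst = burst_bytes      # maximum bucket depth (burst limit)
        self.level = 0.0
        self.last = 0.0

    def forward_ok(self, size_bytes, now):
        """True if the packet fits the permitted burst, else discard it."""
        # Drain the bucket for the elapsed time, then try to add the packet.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + size_bytes <= self.burst:
            self.level += size_bytes
            return True               # within the permitted burst: forward
        return False                  # would exceed the burst: discard

lb = LeakyBucket(rate_bps=8_000_000, burst_bytes=3000)  # 8 Mbit/s, 3 kB burst
assert lb.forward_ok(1500, now=0.0)
assert lb.forward_ok(1500, now=0.0)
assert not lb.forward_ok(1500, now=0.0)   # third back-to-back packet dropped
```

In the device, a packet refused by the bucket would be taken from the latest flow first, and from other registered flows only when that is not enough.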
The term 'within the window' is used here to describe flows that are
currently progressing towards guaranteed status.

3.3.5 Failure Conditions

In this section we describe the action taken by the device under
failure conditions.

- It is possible that a Start Packet may fail to arrive, having been
  lost in the network between the point of generation and the device
  buffer. The proposed solution is to make the source generate two
  Start Packets which are sent prior to any data packet and, regardless
  of the QoS setting of the subsequent data packets, to mark all Start
  Packets with a very high priority or class setting, thus making the
  loss of both packets very improbable.

- It is possible that too many flows are requested by the customer
  within a very short interval of time, so that their impact on the
  buffer cannot be assessed on a one-flow-at-a-time basis. The solution
  to this is to have a guard period: the minimum time that a flow
  identity can remain in the window. For example, a counter is reset
  for each such flow at the moment when it enters the window, and the
  flow identity must remain in the window until some number N of data
  packets has been sent and detected by the discard function,
  regardless of any other criteria governing exit from the window.
  This does not apply, however, to those flows which, for policy
  reasons, are not put in the window.

- It is possible that so many flows are requested by the customer
  within some very short interval of time that the control logic has
  insufficient space to handle all of their separate identities and
  guard periods within the window. The solution to this is for the main
  processor to have an alarm function that is triggered by such a
  condition. If triggered, the discard function is instructed to send a
  special Alarm signal towards the customer, indicating that the
  service is being abused outside of its expected parameters and that
  all flows are being discarded.
  The discard function now deletes all packets. The Alarm message will
  advise the customer to contact the network or service administrator,
  because the administrator will need to reset a discard flag and clear
  the existing window data after clearing down all flows.

- It is possible that a flow is maintained by the customer as active
  even though it has been silent for some time. It now starts up again
  and creates congestion. Normally, this would not cause a problem,
  since the flow is either still in the window, in which case its
  packets would start to be discarded, or it has moved to the
  guaranteed area, in which case other, newer flows will start to be
  discarded. There are, however, exceptional conditions when a high
  bit-rate real-time flow behaves in this way. Its rate could now
  exceed the protective window capacity (e.g. the N percent figure).
  This would also happen if some malicious flow delayed the onset of
  some very high rate until after it was likely to be guaranteed, and
  then overloaded the buffer.

  The solution to these abnormal conditions is to allow the discard
  function to randomly choose an additional flow (by selecting this
  information from any passing data packet), add such a random flow to
  the register window, and begin discarding on it. This flow could be
  distinguished from other flows by setting an additional parameter,
  called the aux flowid parameter, to 'emergency' (usually it is set
  to 'normal'). The discard function would also send an Alarm signal
  to the customer saying that the operation is outside of the expected
  service parameters. The discard function would be triggered into
  this mode of selecting one or more additional flows whenever the
  buffer fill level hits a second, higher threshold, generating an
  alarm signal. Once in this mode (emergency delete mode), the discard
  function can repeat the random selection of further flows any number
  of times until the buffer loading starts to reduce.
  If the buffer load starts to reduce, a buffer alarm off signal is
  generated, causing the discard function to perform a stability check
  before removing from the register any flowids which have their aux
  parameter set to the value emergency. The stability check is
  designed to prevent the discard function from removing emergency
  flowids from the register and then quickly needing to add a random
  new set, under conditions where the buffer is oscillating rapidly
  between alarm off and alarm on signals. It is preferable to keep the
  same set of emergency flowids under these conditions, which helps to
  limit the number of different flowids that become randomly selected.

  The stability check consists of the discard function inspecting
  flowids for emergency settings. If any are found, a timeout period
  is begun, and the function monitors buffer alarm on/off transitions
  during the timeout. If the buffer generates an alarm during this
  period, the emergency flows are not cleared from the register and
  remain the target of discard. This situation continues until an
  alarm off signal is generated by the buffer, which causes a further
  timeout period to commence. The discard function will always perform
  the stability check before removing aux=emergency flowids from the
  register. The final error trap used by the discard function protects
  against a timeout period beginning if there is already a timeout in
  progress.

- It is possible that a flow sends no Start Packet. This may cause
  existing flows in the window to be discarded if the additional flow
  (which has sent no Start Packet) causes congestion. In the extreme,
  it will trigger the same actions and alarm messages as described in
  the previous condition.

Under the circumstances of the abnormal conditions described in the
last two conditions, it is possible that some guaranteed flows are
subject to packet discard, but this should be an exceptional event that
is regarded as an alarm condition.
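The stability check described in the failure conditions above amounts to a small timer state machine, sketched here with illustrative names and an assumed timeout value: emergency flowids are cleared only after the buffer alarm has stayed off for a full timeout period, and a new timeout never starts while one is already in progress.

```python
# Illustrative sketch of the stability check; the class, method names,
# and timeout value are assumptions, not taken from the draft.
class StabilityCheck:
    def __init__(self, timeout):
        self.timeout = timeout
        self.deadline = None          # end of the current timeout, if any

    def on_alarm_off(self, now):
        if self.deadline is None:     # error trap: no nested timeouts
            self.deadline = now + self.timeout

    def on_alarm_on(self):
        self.deadline = None          # alarm recurred: keep emergency flowids

    def may_clear_emergency(self, now):
        """True once the alarm has stayed off for a whole timeout period."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            return True               # stable: emergency flowids can be removed
        return False

sc = StabilityCheck(timeout=5.0)
sc.on_alarm_off(now=0.0)
assert not sc.may_clear_emergency(now=2.0)   # timeout still running
sc.on_alarm_on()                             # oscillation: check restarts
sc.on_alarm_off(now=3.0)
assert sc.may_clear_emergency(now=9.0)       # quiet for a full period
```

Keeping the same emergency flowids across oscillations limits how many randomly selected flows are ever punished.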
4 Conclusions

The device described in this document offers a bridge between two
worlds. The narrower the interval that we have termed the window, the
more the device emulates the classic connection-oriented paradigm. The
latest connection is the only one in the window; it is therefore either
accepted or subject at any moment to full packet discard. If accepted,
it is placed in the guaranteed area as soon as a further new flow
starts up. On the other hand, the wider the window, the more the device
resembles the classic connectionless world, in which most flows are
vulnerable to packet discard when the buffer is too full.

Notice also that this device fits the connectionless paradigm in that
sources are only required to transmit a Start Packet and then, without
waiting further, start to transmit their data. There is no negotiation
(unlike classical RSVP), yet there are still guaranteed flows in the
case of window sizes that are some small fraction of the link rate. So
we effectively have a new QoS procedure that is based only on Start
Packets; no subsequent response packets are triggered or used. In place
of the additional control messages that would have been expected in the
classic 'circuit world', only warning indications are triggered on
flows just prior to packet discard.

The device may be refined by the addition of policy controls governing
how a flow gets into the window. A family may decide that viewing the
main film on a Thursday is the most important thing that day; that is
why they subscribed to the service, and they want it guaranteed. So,
even though the film is the latest flow, it moves straight into the
guaranteed area because of a policy database that can be written to by
the customer using, for example, a browser. This database information
is readable by the main control logic.
When the main control logic is informed by the discard function that a
new Start Packet has arrived, it checks the policy database and
determines whether the flow is to be added to the register, or simply
ignored so that it effectively passes straight to the guaranteed area.
If it is moved straight to the guaranteed area, the possibility of
recovering from buffer overload at the time when the movie starts is
still preserved, by focusing on the previous flows which remain in the
window.

John L Adams
BT
pp MLB G 7 Orion Building (B62-MH)
Adastral Park
Martlesham Heath
Ipswich IP5 3RE
UK
+44 1473 606321
john.l.adams@bt.com