Internet Draft                                         Robert Hancock
                                                     Eleanor Hepworth
                                                      Andrew McDonald
                                          Siemens/Roke Manor Research
Document: draft-hancock-nsis-overload-00.txt
Expires: December 2003                                      June 2003


       Handling Overload Conditions in the NSIS Protocol Suite


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
      http://www.ietf.org/ietf/1id-abstracts.txt
   The list of Internet-Draft Shadow Directories can be accessed at
      http://www.ietf.org/shadow.html.


Abstract

   The NSIS working group is considering protocols for signaling for
   resources for a traffic flow along its path in the network.  The
   requirements for such signaling are being developed in [2] and a
   framework in [3].  The framework describes a 2-layer protocol
   architecture, with a common lower NSIS 'transport' layer protocol
   (NTLP) supporting a variety of upper layer NSIS signaling layer
   protocols (NSLPs).

   It is an open issue where within this architecture to place the
   responsibility for handling overload conditions.  These conditions
   relate both to overload of the IP layer itself, as well as overload
   of buffer/processing resources within the NTLP/NSLPs.  This note
   discusses the requirements and the implications of various
   approaches, and proposes a way forwards.


Hancock et al.         Expires - December 2003               [Page 1]

                       NSIS: Overload Handling              June 2003


Table of Contents

   1. Introduction, Scope and Terminology
      1.1 Terminology; Flow and Congestion Control
   2. Requirements
   3. Implications of Doing Overload Handling within NSIS Protocols
   4. RSVP and Other Protocol Work
   5. Handling IP Overload ("Congestion Control")
   6. Handling NSIS Protocol Overload
   7. Security Considerations
   8. Conclusions
   Acknowledgments
   Author's Addresses
   Full Copyright Statement


1. Introduction, Scope and Terminology

   The NSIS working group is considering protocols for signaling for
   resources for a traffic flow along its path in the network.  The
   requirements for such signaling are being developed in [2] and a
   framework in [3].  The framework describes a 2-layer protocol
   architecture, with a common lower NSIS 'transport' layer protocol
   (NTLP) supporting a variety of upper layer NSIS signaling layer
   protocols (NSLPs).

   It is an open issue where within this architecture to place the
   responsibility for handling overload conditions; 'handling'
   includes detection as well as prevention and recovery.  These
   conditions relate both to overload of the network (IP) layer
   itself, as well as overload of buffer/processing resources within
   the NTLP/NSLPs.  This note discusses the requirements and the
   implications of various approaches, and proposes a way forwards.

   These issues have been intermittently discussed on the NSIS mailing
   list [4], and noted in some of the design-related drafts [5, 6, 7].
   [8] provides authoritative guidance specifically on how the problem
   of congestion should be approached within Internet protocol
   standards, and includes many important references.

   Note that this draft is specifically not about resource signaling
   to manage congestion within the network when it actually occurs -
   for example, traffic engineering to route data flows around
   congested network areas.  This is an important subject, but it is
   specifically about how resource management should be done, rather
   than about how signaling protocols should work.  This draft
   includes discussion of how to prevent signaling protocols from
   adding to the network congestion problem.

   After classifying the various types of signaling overload in
   section 1.1, section 2 describes the potential causes of overload
   and the (proposed) requirements for how they should be dealt with.
   Section 3 describes the basic implications for protocol design and
   implementation if they provide overload handling, and section 4
   briefly mentions how some other protocols related to network
   operation handle the problem.  Section 5 discusses how to handle
   network (IP layer) overload, and section 6 discusses overload
   within the NSIS protocol suite itself.  Security aspects are
   briefly mentioned in section 7, and section 8 concludes.

1.1 Terminology; Flow and Congestion Control

   Unless otherwise stated, this document follows the terminology
   given in the current NSIS framework [3].

   The overload problem is actually (at least) three problems:

   a) Overload in the IP layer, i.e. buffer congestion which causes IP
      packets to be dropped (affecting all flows, for signaling, data
      and other applications).

   b) Overload in the NTLP, meaning it cannot process incoming or
      outgoing packets fast enough.  This might be caused by processor
      overload or by lower (IP) level congestion.
      It affects all NSIS signaling applications, but not the rest of
      the network - assuming (a) is already handled.

   c) Overload in an NSLP, meaning it cannot process incoming or
      outgoing packets fast enough.  This might be caused by processor
      overload or by lower (NTLP/IP) level congestion.  It affects
      only this signaling application - assuming that (a) and (b) are
      already handled.

   Traditionally, networking discussions draw a distinction between
   congestion control - protecting the infrastructure - and flow
   control - protecting the end systems.  Making this distinction is
   somewhat subtle in the NSIS case, since the infrastructure includes
   end systems.  For example, overload within the NTLP could be
   prevented by NTLP-level flow control; however, it would still be
   seen as equivalent to network congestion by NSLPs, and be invisible
   to the IP layer (as congestion or anything else).  Therefore we
   work in terms of the more concrete concept of overload within
   particular protocol layers.  No doubt even finer distinctions could
   be drawn.


2. Requirements

   This section summarises the potential sources of overload, and just
   how critical it is to deal with them as part of protocol design.

   Load/overload could originate from the following causes:

   NORMAL: 'Normal' operation, as user applications initiate signaling
      for their flows.  (If this actually causes problems, the network
      or network elements probably just need re-engineering.)

   RETRY: Aggressive retry behaviour, as end-systems attempt to
      re-signal for failed or failing sessions, i.e. even if the flow
      itself is not active.  (This sort of behaviour is felt to be a
      real problem in traditional telephony networks, where the worst
      excesses of such devices are curbed by regulation.)

   REFRESH: Signaling refresh messages generated within the network
      may cause overload, if the refresh period is not appropriately
      chosen.

   RXMIT: Message retransmission (e.g.
      to achieve reliability in the face of congestive loss) is itself
      a potential cause of overload, and particularly worrying as a
      source of instability, since the retransmissions themselves add
      to the overload.

   REPAIR: If there is a path change within the network, local repair
      actions could cause a flood of signaling traffic over the
      neighbouring links.

   While the sources of NORMAL and RETRY are end systems or their
   proxies, the others are not.  Therefore, it is not possible to rely
   only on end-to-end load control mechanisms, unless the other
   sources can be discounted.

   While NORMAL and REFRESH are proportional (somehow) to data traffic
   (and should be a small proportion of it) and hence should not
   usually be a source of IP-level overload, the others are not.
   Hence, both signaling element and general network overload should
   be handled within the protocol design.

   Any of these factors, especially RETRY and REPAIR, can lead to
   overload within the signaling protocol processing.  The
   consequences of such overload would be reduced responsiveness
   within the network control plane, dropped signaling state for user
   sessions, and so on.  Modified operation under these circumstances
   is mainly signaling-application specific; however, the signaling
   applications usually need support at the protocol level to detect
   the overload condition in the first place.

   In the case where all nodes in the network are NSIS-aware, the IP
   overload problem essentially becomes a node implementation issue
   (allocation of forwarding resources on outgoing links).  However, a
   background assumption is that the NSIS protocols need to operate
   well over large-diameter NSIS-unaware clouds.

   A related issue is that causes REFRESH and REPAIR are mainly about
   signaling generated in support of particular signaling
   applications, rather than 'protocol maintenance' signaling.  This
   is therefore generated only at NSLP-aware nodes.
   (This is a consequence of the design decision that the NTLP only
   handles message forwarding, not state maintenance, and therefore
   cannot for example generate a flood of signaling application
   messages on a rerouting event.)

   While NSLP/NTLP overload failures are problems which are 'local' to
   the NSIS activity, there is no point in even attempting to
   standardise protocols which can contribute to network congestion
   (IP overload) in an uncontrolled way (see the warnings in [9]).

   The conclusion of this section is that overload both within the
   NSIS protocols and in the IP layer needs to be handled within the
   NSIS protocol designs, the latter with particular attention to
   robustness.


3. Implications of Doing Overload Handling within NSIS Protocols

   Overload handling generally implies having a feedback channel to
   complement the forward channel which carries the 'overload
   generating' traffic.  The nodes at each end of the feedback channel
   have to be sensitive to the presence of the overload and be able to
   reduce it; generally, the closer to the location of the overload
   the better (e.g. end-to-end mechanisms will be inefficient at
   dealing with a local overload caused by a rerouting event).

   The implication of this is that an NSIS protocol that purports to
   deal with overloads has to be bi-directional, and have state
   information at each end which tracks the current load situation.
   The more direct the feedback in the reverse direction the better.

   Overload protection mechanisms are often associated with
   reliability mechanisms, but they don't have to be (e.g. DCCP [10]);
   they can be considered independently.  Indeed, there may be a case
   for unreliability within the protocol (e.g. to delete aged
   messages), even though overload control is still needed.

   Avoidance of congestion (IP overload) generally has to be done by
   tracking packet drops at NSIS-unaware nodes.  The mechanisms can
   vary from very simple to very complex.
   At one extreme, a simple stop-and-wait protocol will work; at the
   other end, the full (and growing) sophistication of TCP can be
   used.  More sophistication is needed as the network length of the
   feedback channel and the desired throughput performance increase.
   This may be a situation where there is a case for different
   protocol options in different parts of the network.


4. RSVP and Other Protocol Work

   The base RSVP protocol as defined in [11] includes very limited
   overload detection and management capabilities.  The main aspect is
   the fact that refresh intervals can be locally adjusted, but this
   just allows management intervention rather than being an adaptive
   mechanism within the protocol itself.  RSVP extensions for
   reliability were introduced in [12], accompanied by an exponential
   backoff procedure to address overload cause RXMIT.

   Most end-to-end application protocols, subject to causes NORMAL and
   RETRY, handle the overload control problem either by using TCP/SCTP
   as transports, or with a variety of ad hoc application level
   techniques applied over UDP.

   Within the network, the protocols which could be victims of causes
   REFRESH, RXMIT and REPAIR are non-trivial routing protocols.  The
   most serious potential overload cause is a flood of routing
   messages as a new link is brought up.  Here, OSPF uses a simple
   stop-and-wait protocol, while BGP uses TCP.  The situation for the
   NSIS protocols is more severe, since the situation arises for any
   re-routing event (even one caused by link changes in a remote part
   of the network), and affects links which are already supposedly
   operational.

   In the Diameter Base protocol, which uses TCP/SCTP as a transport,
   higher layer overload is managed on a per-peer-connection basis by
   the explicit signaling of "busy" indications to the originating
   peer and the termination of the connection.
   The originating peer has the option to switch to an alternative
   next hop (load sharing), which is not possible within NSIS because
   the signaling has to be coupled to the data path.


5. Handling IP Overload ("Congestion Control")

   If NTLP can generate its own messages for any of causes REFRESH,
   RXMIT or REPAIR, then it has to do so in a way which cannot cause
   IP layer overload; there is no other option.  If this is the case,
   it would seem to make sense to rely on the same mechanism (whatever
   it is) to protect the IP layer from all NSIS overload causes.

   However, whether the NTLP generates such messages depends on other
   aspects of NTLP design and other decisions about NTLP
   functionality.  One could imagine a situation where a very
   lightweight NTLP had no intelligence to generate messages
   independently of NSLP operation, in which case protection
   responsibility could be pushed up to the individual NSLPs.  We
   can't tell whether this argument applies or not without more detail
   about the proposed NTLP design.

   Therefore, the question remains of whether it is sensible to
   allocate the problem to the NTLP in any case.  The following
   arguments would seem to apply:

   *) There is no need for different sorts of congestion control for
      different signaling applications.  (There may be different
      detailed reactions to congestion, i.e. how to generate fewer
      messages; however, detecting that fewer messages need to be sent
      is universal across all signaling applications.)  Therefore,
      there is no need to solve this in a signaling-application
      sensitive manner.

   *) Detecting the problem may be easier with closer interaction with
      the lower layers.  The NTLP is best placed to do this.

   *) Solving the problem is hard and important.  Therefore, it is
      better to do it once and for all, and make life less burdensome
      for future NSLP developers.

   The conclusion of this set of arguments appears to be that
   congestion control, i.e.
   protection of the IP layer from overloads caused by NSIS protocol
   operation, should be an NTLP function.


6. Handling NSIS Protocol Overload

   The other question is related to handling overloads within the NSIS
   protocol layers themselves, i.e. when the internal resources of the
   NEs are constrained.  It is clear that the NSLP should be in charge
   of adapting its own behaviour in response to overload situations,
   since the response will be specific to the signaling application.
   However, the method of detection and response depends on what
   overload detection and control features the NTLP provides, and what
   assumptions the NSLP can make about their presence (especially in
   remote nodes).

   Therefore, this section aims to identify the different options for
   how overload indications can be pushed up the protocol stack and/or
   out to the edge of the network (where the adaptation can take
   place) and how in particular the NTLP should support this.

   If the conclusion of section 5 is correct (i.e. NTLP enforcing IP
   layer congestion control), it is most likely that in any case there
   should be a flow-controlling API between the NSIS protocol layers.

   For providing overload indications towards the edge nodes, there
   seem to be three cases to consider.  The argument depends on
   whether there are intermediate nodes which are unaware of the NSLPs
   in use (see Figure 1).

   1) The NTLP provides the equivalent of a highly granular flow
      controlled delivery service up to the next NSLP-aware node, with
      no assumed constraints on NSLP behaviour.  The source is
      explicitly forced to throttle back the transmission of messages
      for the combination of source/destination/application.  The NSLP
      only has to detect the condition locally; in fact, it can only
      send messages which the local NTLP is prepared to deliver.  This
      makes life very easy for the NSLP, but NTLP design (in
      particular, buffer allocation and propagation of flow control
      information across nodes) is hard.

                                        +------+
                                        |  NE3 |
                                        |+----+|
                                        ||NSLP||
                                        |+----+|
       +------+    +------+             |  ||  |
       |  NE1 |    |  NE2 |             |+----+|
       |+----+|    |      |      |======||NTLP||===
       ||NSLP||    |      |      |      |+----+|
       |+----+|    |      |      |      +------+
       |  ||  |    |      |      |
       |+----+|    |+----+|  +------+   +------+
   ====||NTLP||====||NTLP||==|Router|   |  NE4 |
       |+----+|    |+----+|  +------+   |+----+|
       +------+    +------+      |      ||NSLP||
                                 |      |+----+|
                                 |      |  ||  |
                                 |      |+----+|
                                 |======||NTLP||====
                                        |+----+|
                                        +------+

              Figure 1: Signaling with NTLP-only hops

   2) The NTLP provides a flow controlled delivery service (as above),
      but operates under assumptions about upper layer sending windows
      which allow buffer management to be simplified.  For example, if
      only one message is allowed to be outstanding for a particular
      session at any time, the buffer requirements can be precisely
      calculated.

   3) The NTLP simply provides the service of delivery to the next
      NTLP node, e.g. NE1->NE2, NE2->NE3 in the figure.  Overload at
      an NSLP-unaware intermediate node (NE2) is handled by dropping
      packets there (or by more sophisticated but still IP-like
      behaviour).  The NSLPs in NE1 and NE3 have to detect this
      condition and somehow adapt accordingly (in particular, NE1 has
      to be able to detect that NE3 is overloaded but that NE4 may not
      be).

   Solutions (1) and (2) are both flow-control based, and require the
   maintenance of per-source-destination information in order to
   support flow control properly.  For example, in figure 1, the NTLP
   at NE2 would have to detect overload for the signaling application
   at NE3 and throttle signaling messages for it from NE1, while not
   affecting NE1->NE2->NE4 communications.  In addition, these
   solutions put complexity into the NTLP, and might infect it with
   knowledge about signaling flow topologies which it should really be
   ignorant of.

   Solution (3) puts some complexity into the NSLP behaviour which
   could be common to several applications; on the other hand, the
   flexibility to do it differently between different applications
   could be valuable.  This option does not preclude the NTLP from
   doing flow control, but it does place a requirement on the NSLP to
   cope with lost messages at least as pathological events (although
   this would have to be the case anyway, e.g. to cope with
   intermediate node failure).

   Note that these problems are mainly caused by the NSLP-unaware
   node, NE2, and the fact that the NTLP cannot bypass it.  In
   contrast, for direct communication (e.g. NE3<->NE4) it would be
   very easy to implement solution (1).  Flow-controlling solutions
   are also attractive because they can minimize the buffering taking
   place within the network and hence improve responsiveness.

   The conclusion of this argument appears to be that (3) is the
   preferred approach.  This conclusion is mainly driven by complexity
   arguments about the NTLP, and the existence of NSLP-unaware nodes;
   if both of these arguments could be dealt with, the conclusion
   might well be the opposite way around.


7. Security Considerations

   Malicious nodes can attack congestion control mechanisms to force
   nodes into a congestion avoidance state.  The NTLP design should
   protect against this type of attack where the network is open to
   it.

   Also, both NSIS overload protection approaches have to make some
   assumptions about fairness at the NTLP level; however, this seems
   to be unavoidable.


8. Conclusions

   1. The NTLP needs to prevent network overload in the IP layer
      between NTLP peers.

   2. However, NSLPs need to detect and adapt to overload within the
      NSIS protocols themselves.

   3. Detection may take place by noting messages dropped by the NTLP,
      as well as any flow control imposed by the NTLP.
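
   As an illustration only of conclusions 2 and 3, the following is a
   minimal sketch of how an NSLP might combine the two detection
   signals (messages presumed dropped, and sends refused by NTLP flow
   control) and adapt its sending interval.  It is not part of any
   NSIS specification: the class name, method names, window size,
   threshold and the doubling/halving reaction are all hypothetical
   choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class OverloadDetector:
    """Hypothetical NSLP-side overload detector (not an NSIS-defined API)."""

    loss_threshold: float = 0.2  # assumed: failure fraction that counts as overload
    window: int = 10             # assumed: number of recent send attempts tracked
    base_interval: float = 1.0   # assumed: normal spacing between sends (seconds)
    max_interval: float = 64.0   # assumed: upper bound on the backed-off spacing

    def __post_init__(self):
        self.outcomes = []       # True = delivered, False = dropped or refused
        self.interval = self.base_interval

    def overloaded(self) -> bool:
        """Report overload when recent failures reach the threshold."""
        if not self.outcomes:
            return False
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) >= self.loss_threshold

    def record_send(self, delivered: bool, flow_blocked: bool = False) -> None:
        """Record one send attempt and adapt the sending interval.

        delivered    -- False if the message is presumed dropped
        flow_blocked -- True if the local NTLP refused the send
        """
        ok = delivered and not flow_blocked
        self.outcomes.append(ok)
        self.outcomes = self.outcomes[-self.window:]
        if not ok and self.overloaded():
            # A failure while overloaded: double the interval, up to the cap.
            self.interval = min(self.interval * 2, self.max_interval)
        elif ok and not self.overloaded():
            # A success with overload cleared: halve back towards normal.
            self.interval = max(self.interval / 2, self.base_interval)
```

   In this sketch a dropped message and a flow-control refusal are
   treated identically; a real NSLP might well weight them
   differently, since the reaction is signaling-application specific.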


References

   1  Bradner, S., "The Internet Standards Process -- Revision 3",
      BCP 9, RFC 2026, October 1996.

   2  Brunner, M., "Requirements for QoS Signaling Protocols",
      draft-ietf-nsis-req-07.txt (work in progress), March 2003.

   3  Freytsis, I., R. E. Hancock, G. Karagiannis, J. Loughney, S. van
      den Bosch, "Next Steps in Signaling: Framework",
      draft-ietf-nsis-fw-02.txt (work in progress), March 2003.

   4  Archive at: www.ietf.org/mail-archive/working-groups/nsis/

   5  Braden, R. and B. Lindell, "A Two-Level Architecture for
      Internet Signaling", draft-braden-2level-signal-arch-01.txt
      (work in progress), November 2002.

   6  Schulzrinne, H., H. Tschofenig, X. Fu, A. McDonald, "CASP -
      Cross-Application Signaling Protocol",
      draft-schulzrinne-nsis-casp-01.txt (work in progress), March
      2003.

   7  McDonald, A., R. Hancock, E. Hepworth, "Design Considerations
      for an NSIS Transport Layer Protocol",
      draft-mcdonald-nsis-ntlp-considerations-00.txt (work in
      progress), January 2003.

   8  Floyd, S., "Congestion Control Principles", RFC 2914, September
      2000.

   9  http://www.ietf.org/ID-nits.html

   10 http://www.ietf.org/html.charters/dccp-charter.html

   11 Braden, R. et al., "Resource ReSerVation Protocol (RSVP) --
      Version 1 Functional Specification", RFC 2205, September 1997.

   12 Berger, L., Gan, D., Swallow, G., Pan, P., Tommasi, F. and S.
      Molendini, "RSVP Refresh Overhead Reduction Extensions", RFC
      2961, April 2001.


Acknowledgments

   The authors would like to thank all their colleagues and fellow
   participants in the NSIS working group and internal protocol
   discussions for exposing the complexities and subtleties in this
   subject area.  In particular, input was used from (in order of
   CRC{name}) Henning Schulzrinne, Xiaoming Fu, John Loughney, Melinda
   Shore, Hannes Tschofenig, Georgios Karagiannis, Ping Pan, Bob
   Braden, Sven Van den Bosch, Lars Westberg, Marcus Brunner, and
   Ruediger Geib.
   Henning in particular provided valuable education on flow control
   in signaling protocols.  Needless to say, the interpretation and
   conclusions should be blamed only on the authors.


Author's Addresses

   {Robert Hancock, Eleanor Hepworth, Andrew McDonald}
   Roke Manor Research
   Old Salisbury Lane
   Romsey, Hampshire
   SO51 0ZN
   United Kingdom

   email: {robert.hancock|eleanor.hepworth|andrew.mcdonald}@roke.co.uk


Full Copyright Statement

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.