Internet Engineering Task Force                      Raj Yavatkar, Intel
INTERNET-DRAFT                             Don Hoffman, Sun Microsystems
                                                 Yoram Bernet, Microsoft
                                                       Fred Baker, Cisco
                                                 Rick Kennedy, Cabletron
                                                           February 1997
                                                Expires: August 31, 1997

                    SBM (Subnet Bandwidth Manager):
            A Proposal for Admission Control over Ethernet

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

This document is a product of the ISSLL subgroup of the Integrated Services working group of the Internet Engineering Task Force. Comments are solicited and should be addressed to the working group's mailing list at issll@mercury.lcs.mit.edu, and/or the author(s).

Changes From Last Version

This draft reflects the following changes to its previous version:

1. An appendix that describes the message encapsulation rules and message formats has been added. Detailed message processing rules are available in a separate document (Postscript format) at ftp://ftpeng.cisco.com/fred/sbm/sbmrules.ps

2. An appendix that describes the DSBM Election algorithm has been added.

3. A section that describes the SBM operation in a switched topology has been added.
draft-yavatkar-sbm-ethernet-03.txt                              [Page 1]

INTERNET-DRAFT       SBM (Subnet Bandwidth Manager)       February, 1997

Abstract

This document outlines a signaling method and protocol for RSVP-based admission control over IEEE 802-style LANs. The proposed method is designed to work with the current generation of IEEE 802 LANs and should be considered as a first step towards discovering solutions for implementation of IntServ capabilities over such networks.

1. Introduction

RSVP and Integrated-Services specifications together define an admission control and traffic control framework for providing end-to-end QoS guarantees over the Internet. However, specific algorithms, mechanisms, and protocols are needed to map the proposed integrated services over specific link-layer technologies such as the IEEE 802-style LANs. Our goal is to propose a signaling method and protocol for achieving admission control over the 802-style networks such as Ethernet on the basis of the framework proposed in the RSVP and Int-Serv working groups.

Our proposal is based on the following architectural goals and assumptions:

I.
We define a signaling framework that provides a step-by-step solution to the problem of managing bandwidth over shared subnetworks (such as an Ethernet) that works with the existing, legacy LAN infrastructure and takes advantage of the additional functionality (such as an explicit support for integrated services) as it becomes available in the new generation of switches, hubs, or bridges. As a result, our proposal would allow for a range of LAN bandwidth management solutions that vary from one that exercises purely administrative control (over the amount of bandwidth consumed by RSVP-enabled traffic flows) to one that requires cooperation (and enforcement) from all the end-systems attached to a shared subnetwork.

II.
Our goal is to specify only a signaling method and protocol for LAN-based admission control over RSVP flows and leave the task of specifying link layer mechanisms for traffic control to the appropriate IEEE 802 working groups. Thus, the proposed mechanism explicitly controls only the total amount of traffic load imposed by RSVP-enabled flows on a shared LAN. However, the best-effort traffic generated by the TCP/IP sources is generally rate-adaptive (using "slow start" type congestion avoidance mechanisms or feedback-based rate adaptation used by sources based on RTP/RTCP protocols) and limits the amount of traffic generated to adapt to available capacity. A specification of an RSVP-based admission control mechanism for a LAN should typically suffice to control the total amount of traffic over a shared LAN. This is especially true in a switched Ethernet environment if switches and NICs support at least two levels of priority. In such a multi-priority LAN, assignment of higher priority to the RSVP traffic (to separate it from best-effort traffic) coupled with a combination of admission control (over RSVP traffic to keep it within a fraction of any link's capacity) and per-flow policing at end-systems should suffice to realize an approximation to the "controlled load" service specified in the int-serv working group.

III.
For traffic control, we assume that end-systems will police individual RSVP-enabled data flows to ensure that each flow stays within its traffic specification stipulated in its reservation request for admission control. Additional traffic scheduling mechanisms may also be employed to realize a particular QoS service class.

IV.
As an interim measure until 802-mediated traffic control mechanisms become available, we assume that all the RSVP nodes on a LAN will utilize the proposed admission control procedure to reserve bandwidth in advance of sending any RSVP-enabled data flows and will not send/forward such traffic if the reservation request fails. Thus, if all the multimedia traffic on a LAN is sent using RSVP for resource reservation, the proposed architecture would restrict the total multimedia traffic on any LAN segment within the bounds desired by a LAN administrator. This does not, however, assure that non-RSVP traffic will not interfere with the RSVP traffic unless traffic flow separation mechanisms are included in the underlying Ethernet infrastructure (e.g., IEEE 802.1P).

2. Overview

Our proposal assumes a logical entity called an SBM (Subnet Bandwidth Manager) that is responsible for handling admission control requests. We assume that an IP subnet corresponds to a single L2 (Layer 2) domain [3]. An L2 domain is defined to be a set of nodes and links interconnected without passing through an L3 (IP or Layer 3) forwarding function. We refer to links (point-to-point or shared segments) that interconnect nodes within an L2 domain simply as LAN segments (more precise definitions are included in an accompanying document available from ftp://ftpeng.cisco.com/fred/sbm/sbmrules.ps).

We assume that a Designated SBM (DSBM) exists for each LAN segment (also called a Managed Segment or an MS) and services reservation requests (manages bandwidth) for that segment. A procedure for dynamically electing the DSBM is described in Appendix A. The proposal makes no assumptions about the number of SBMs within a LAN; a single DSBM may manage a shared LAN segment whereas a separate SBM may run on each switch in a switched LAN topology (Section 3 discusses this point in greater detail).
In the following, we use the term "DSBM client" to refer to an L3 (layer 3) entity (host or router) that communicates with a DSBM for the purpose of establishing a QoS reservation.

2.1 Basic Algorithm

[Figure 1 - An Example of a Managed Segment: hosts A, B, and C, routers R1 and R2, and an SBM, all attached to a single shared LAN]

Figure 1 shows an example topology consisting of hosts and routers interconnected across a LAN. For the purpose of this discussion, we ignore the actual physical topology of the LAN and a single SBM is assumed to be the DSBM for the entire LAN. Section 3 describes how an SBM operates over different LAN topologies. The basic SBM algorithm works as follows:

1. DSBM Initialization: As part of its initial configuration, DSBM obtains information such as the maximum bandwidth that can be reserved on each LAN segment under its control. Configuration is likely to be static with the current Ethernet devices. Future work may allow for dynamic discovery of this information.

2. DSBM Client Initialization: At the start, a DSBM client first verifies that a DSBM exists in its L2 domain (see Appendix A) so that it can communicate with the DSBM for admission control purposes.

3. RSVP-based Admission Control: To reserve LAN bandwidth, DSBM clients (RSVP-capable L3 devices such as hosts and routers) follow these steps:

   a) When a DSBM client sends or forwards a PATH message over a managed interface, it sends the PATH message to its DSBM instead of sending it to the destination address (as is done in conventional RSVP processing).
      After processing (and possibly updating an ADSPEC), the DSBM will forward the PATH message toward its destination address. As part of its processing, the DSBM builds and maintains a path state for the session and notes the previous hop that sent it the PATH message.

      For example, if the sender to a session is outside the LAN and router R1 (see Figure 1) is on the path to the receivers, R1 will forward a PATH message from the sender to its DSBM. The DSBM processes the PATH message and forwards the PATH message towards the RSVP session address (see the message processing rules for details on how the message is forwarded). In the process, the DSBM builds the PATH state, remembers the sender (router R1 in Figure 1) as the previous hop for the session, puts its own IP address in the PHOP object, and effectively inserts itself as an intermediate node between the sender (or R1 in Figure 1) and the receivers (or routers) on the LAN.

   b) When a receiver (say, host A) wishes to make a reservation request for the session, it follows standard RSVP rules and sends an RSVP RESERVE message to the previous hop address obtained from the PHOP object in the previously received PATH message.

   c) The DSBM processes the RSVP RESERVE message based on the bandwidth available and returns an RSVP_ERROR to the requester (host A) if the request cannot be granted. The admission control algorithm at the DSBM ensures that sufficient bandwidth is available on managed segments (MS) between the NHOP (requester) and the PHOP (sender/router). In the case of a successful reservation, the DSBM forwards the RESERVE message towards the PHOP(s) based on the contents of the RESERVE message and its local path state for the session. The DSBM merges reservation requests for the same session as and when possible using rules similar to the conventional RSVP processing.
   d) The RESERVE message eventually reaches the original PHOP (sender/router) on that MS if all reservation requests within the MS succeed.

2.2 Changes to conventional RSVP operation

The addition of a DSBM for admission control over managed segments results in changes to the RSVP message processing rules at a DSBM client. These changes are summarized below and detailed rules are described in a separate document available from ftp://ftpeng.cisco.com/fred/sbm/sbmrules.ps :

-  Normal RSVP forwarding rules apply at a DSBM client when it is not forwarding an outgoing PATH message over a managed segment. However, outgoing PATH messages on a managed segment are sent to the DSBM for the corresponding MS (managed segment).

-  In conventional RSVP processing over point-to-point links, RSVP nodes (hosts/routers) use NHOP and PHOP objects to keep track of the next hop (and the previous hop) nodes on the path between a sender and a receiver. Over a shared LAN such as an Ethernet, we introduce a DSBM (Designated Subnet Bandwidth Manager) as a logical entity that performs admission control over the LAN. Such a LAN may span multiple switched or shared segments between an RSVP PHOP node and an RSVP NHOP node. For the purpose of Layer 3 routing and retaining RSVP semantics between two Layer 3 entities, however, the entire LAN acts as a logical segment and the connection between RSVP PHOP and NHOP nodes must be maintained for the correct operation and routing of RSVP messages. Therefore, we introduce a new RSVP object called LAN_NHOP that keeps track of the next L3 hop as the PATH message traverses an L2 domain between the RSVP PHOP and NHOP nodes. In addition, we introduce a LAN_PHOP object that is used solely to avoid indefinite looping of PATH messages for a multicast session.
-  When a DSBM client (a host or a router acting as the originator of a PATH message) sends out a PATH message over a managed segment, it needs to include a LAN_NHOP object in the message. The LAN_NHOP object specifies the destination IP address of the PATH message. In the case of a unicast destination, the LAN_NHOP address specifies the destination address or the address of the next hop router towards the destination.

-  When a DSBM receives an RSVP PATH message, it processes the PATH message according to the PATH processing rules described in the RSVP specification. In particular, the DSBM remembers the PHOP address from which it received the message (call it X) and forwards the PATH message with the PHOP object modified to reflect its own IP address (and, thus, the DSBM inserts itself as an intermediate hop in the RSVP chain of nodes).

-  The path state in a DSBM is used for forwarding subsequent RESV messages for the same session. When the DSBM receives a RESV message, it processes the message and forwards it to appropriate PHOP(s) based on its path state.

-  Because a DSBM inserts itself as a hop between a source of traffic on the LAN (sender or router) and the receiver, all RSVP related messages (such as PATH, PATH_TEAR, RESV, RESV_CONFM, RESV_TEAR, and RESV_ERR) now flow through the DSBM. In particular, a PATH_TEAR message is routed exactly as its corresponding PATH message (and a PATH_TEAR message is forwarded after the local PATH state has been cleaned up).

-  When the RSVP session address is a multicast address, it is possible for an entity (a DSBM client or an SBM) that forwarded a PATH message to receive one or more copies of the PATH message when a DSBM on the path to the destination forwards it (i.e., multicasts it). To facilitate detection of such loops, we use a LAN_PHOP object.
   All entities (except the DSBMs reflecting a multicast PATH message) overwrite the LAN_PHOP object (add the object if it is not there) in the PATH message with their own unicast IP address. Thus, a DSBM client or an SBM can discard the duplicates based on the contents of the LAN_PHOP object (that is, when the object lists one of its own interface addresses).

3. Various LAN Topologies

Our goal is to offer an admission control solution that works with the existing, shared segment LANs as well as newer, switched LAN topologies. In the following, we consider two different LAN topologies and describe how our solution works in each case.

3.1 Shared LAN segments

Figure 1 shows a sample topology where the entire IP subnet spans a single shared segment. The actual physical topology in this case may consist of multiple physical segments interconnected by hubs. However, for practical purposes, such a LAN acts as if all hosts are attached to a single shared segment. In this case, a single DSBM manages shared bandwidth for the entire subnet.

3.2 Switched LAN segments

Figure 2 shows a sample topology where an IP subnet spans a switched Ethernet consisting of one or more switches interconnecting all the end-systems within the subnet. In this case, PATH and RESV messages between two end-systems on the subnet will propagate hop-by-hop from one DSBM to another DSBM as they travel across the switches along the path.

We assume that at least three types of entities co-exist in such an environment, namely, DSBM client, DSBM, and an SBM. The DSBM clients are L3 devices that communicate with DSBM for admission control. SBMs typically run on switches and routers. An SBM is capable of managing resources on a segment, but does not manage resources unless it is elected to be the DSBM for a particular segment. DSBM is an elected SBM responsible for managing a segment.
When an SBM runs on a switch, it is possible that it acts as the DSBM for some segments attached to the switch whereas it is only an SBM for other attached segments. In the latter case, it may only act as an RSVP node that forwards RSVP messages.

Given a switched segment, it is possible to have two SBMs running at two ends of such a segment (e.g., the "segment b" between switches S1 and S3 in Figure 2 has two SBMs, namely, S1 and S3). However, only one of the two SBMs (say, S1) is the designated SBM for the "segment b" and is responsible for admission control on the segment. The other SBM (S3) only acts as a forwarding node in the RSVP processing. In Figure 2, we assume that S1 manages the segments "a" and "b", S2 manages the segments "c" and "d", and the router R2 manages the shared segment "f".

For example, let us assume that a unicast RSVP session exists with traffic addressed to the host H5 in Figure 2 (RSVP session address includes H5's address). Let us also assume that the sender in the session resides outside the LAN shown in Figure 2 and the router R1 forwards a PATH message towards the receiver. When a PATH message for the session arrives at R1, the following sequence of events will happen (we omit the treatment of the LAN_PHOP object at SBMs and SBM clients here to simplify the discussion):

1. R1 forwards the PATH message to its DSBM (S1 in Figure 2). The PHOP object in the PATH message contains R1's address and R1 adds a LAN_NHOP object to the PATH message that contains the address of Router R2 (Layer 3 next hop on the way to the host H5).

2. S1 processes the PATH message, builds the path state, and then forwards the PATH message over the next L2 hop towards the LAN_NHOP address (does not change the contents of this object). S1 determines the forwarding interface for the PATH message from the MAC address table at S1.

3.
When the next L2 hop (SBM S3) receives the PATH message, S3 first looks at the LAN_NHOP object (Router R2's address) and discovers that it must forward the PATH message onto segment f to reach router R2. Because S3 is not the DSBM for segment f, it processes the PATH message according to the standard RSVP forwarding rules and forwards it to the DSBM for the segment (router R2).

4. R2 will perform the PATH processing and forward the PATH message after stripping off the LAN_NHOP and LAN_PHOP objects from the PATH message.

5. In the case of a multicast session address (with hosts H5 and H3 being the members of the destination group), the SBMs forward the PATH message hop-by-hop until it reaches a managed segment where the destination resides. An important difference is that the LAN_NHOP object carries the destination multicast address (and not the unicast R2 address) until it reaches R2. At that point, the DSBM for the managed segment (R2) will multicast the PATH message directly onto the segment f and will also forward it to H5. Another difference is that DSBMs pass LAN_PHOP objects unchanged in this case. The accompanying document (available at ftp://ftpeng.cisco.com/fred/sbm/sbmrules.ps ) provides more details of the message processing rules.

6. When the host H5 decides to make a reservation for the session, it looks up its path state to determine the previous hop(s) for the session and sends the RSVP RESV message to its PHOP (R2). If the admission control at R2 succeeds, it will forward the RESV message to the PHOP (S3) according to its path state. S3 then forwards the RESV to S1 (its previous hop) and, finally, the RESV message lands at the router R1 if admission control at intermediate DSBMs (R2 and S1) succeeds.

7.
Any admission control failure results in a RESV_ERROR being sent to the requester and the RESV state at intermediate nodes is removed.

[Figure 2 - An example of a switched topology: router R1, switches S1, S2, and S3, router R2, and hosts H2, H3, H4, and H5, interconnected by segments a through f]

6. References

[1] R. Braden, L. Zhang, S. Berson, S. Herzog, J. Wroclawski, "Resource Reservation Protocol", Internet Draft draft-ietf-rsvp-spec12.txt, May 1996.

[2] S. Shenker, "Specification of General Characterization Parameters", draft-ietf-intserv-charac-00.txt, Nov 1995.

APPENDIX A

DSBM Election Algorithm

A.1. Introduction

We assume that an IP subnet corresponds to a single L2 (Layer 2) domain. An L2 domain is defined to be a set of nodes and links interconnected without passing through an L3 (IP or Layer 3) forwarding function. We refer to links (point-to-point or shared segments) that interconnect nodes within an L2 domain simply as LAN segments. We assume that a Designated SBM (DSBM) exists for each LAN segment (also called a Managed Segment or an MS) and services reservation requests (manages bandwidth) for that segment. We use the term "SBM client" to refer to an entity (an edge device such as a host or a router) that communicates with a DSBM for the purpose of establishing a QoS reservation.

To simplify the rest of this discussion, we will assume that there is a single DSBM for the entire L2 domain (an IP subnet).
We also assume that the DSBM is a Layer 3 entity that uses UDP for communication with its peers and clients. Section A.11 describes how more than one DSBM may exist within an L2 domain where each DSBM is responsible for separate portions (LAN segments) within an L2 domain.

To allow for quick recovery from the failure of a DSBM, we assume that additional SBMs may be active in an L2 domain for fault tolerance. When more than one SBM is active in an L2 domain, the SBMs use an election algorithm to elect a DSBM for the L2 domain. After the DSBM is elected and is operational, other SBMs remain passive in the background to step in to elect a new DSBM when necessary. The protocol for electing and discovering DSBM is called the "DSBM election protocol" and is described in the rest of this document.

Once elected, a DSBM periodically multicasts an I_AM_DSBM message to indicate its presence. The message is sent every "refresh interval" (e.g., every 5 seconds; the RefreshInterval timer value is a configuration parameter). Absence of such a message over a certain time interval (called "SBMDeadInterval"; another configuration parameter typically set to a multiple of RefreshInterval) indicates that the DSBM has failed or terminated and triggers another round of the DSBM election protocol.

The SBM clients always listen for periodic DSBM advertizements and send their PATH/RESV (or other) messages to the DSBM. When an SBM client detects the failure of a DSBM, it waits for a subsequent I_AM_DSBM advertizement before resuming any communication with its DSBM. During the recovery period, an SBM client may forward outgoing PATH messages using the standard RSVP forwarding rules.

The exact message formats and addresses used for communication with (and among) SBM(s) are described in Appendix B.

A.2.
Overview of the DSBM Election Procedure

When an SBM first starts up, it listens for incoming DSBM advertizements for some period to check whether a DSBM already exists in its L2 domain. If one already exists (and no new election is in progress), the new SBM stays quiet in the background until an election of DSBM is necessary. If no DSBM exists, the SBM initiates the election of a DSBM by sending out a DSBM_WILLING message that lists its IP address as a candidate DSBM and its SBM priority. Each SBM is assigned a priority (a configuration parameter assigned by the network administrator) to determine its relative precedence. When more than one SBM candidate exists, the SBM priority determines who gets to be the DSBM based on the relative priority of candidates. If there is a tie based on the priority value, the tie is broken using the IP addresses of tied candidates (one with the highest IP address in the lexicographic order wins). The details of the election protocol start in Section A.4.

A.2.1 Summary of the Election Algorithm

For the purpose of the algorithm, an SBM is in one of the four states (SteadyState, DetectDSBM, ElectDSBM, I_AM_DSBM).

An SBM (call it X) starts up in the DetectDSBM state and waits for a ListenInterval for incoming I_AM_DSBM (DSBM advertizement) or DSBM_WILLING messages. If an I_AM_DSBM advertizement is received, the SBM notes the current DSBM (its IP address and priority) and enters the SteadyState state. If a DSBM_WILLING message is received from another SBM (call it Y), then X enters the ElectDSBM state. Before entering the new state, X first checks to see whether it itself is a better candidate than Y and, if so, sends out a DSBM_WILLING message and then enters the ElectDSBM state.
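The start-up behavior summarized above can be sketched in Python. This is only an illustration under stated assumptions, not part of the protocol specification: the names Sbm, AddrInfo, on_message, and on_listen_timeout are hypothetical, the "sent" list stands in for the network, and the tie-break on "highest IP address" is read here as a numeric comparison of IPv4 addresses.

```python
import ipaddress
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    DETECT_DSBM = "DetectDSBM"
    STEADY_STATE = "SteadyState"
    ELECT_DSBM = "ElectDSBM"
    I_AM_DSBM = "I_AM_DSBM"


@dataclass(frozen=True)
class AddrInfo:
    address: str   # IPv4 address in dotted-quad form
    priority: int  # SBM priority configured by the administrator

    def rank(self):
        # Assumption: higher priority wins; ties go to the numerically
        # higher IPv4 address.
        return (self.priority, int(ipaddress.IPv4Address(self.address)))


@dataclass
class Sbm:
    own: AddrInfo
    state: State = State.DETECT_DSBM
    current_dsbm: Optional[AddrInfo] = None
    sent: list = field(default_factory=list)  # hypothetical stand-in for the LAN

    def on_message(self, kind, sender):
        """Handle a message received while in the DetectDSBM state."""
        if self.state is not State.DETECT_DSBM:
            return
        if kind == "I_AM_DSBM":
            # A DSBM already exists: note it and stay passive.
            self.current_dsbm = sender
            self.state = State.STEADY_STATE
        elif kind == "DSBM_WILLING":
            # Announce our own candidacy only if we are the better candidate.
            if self.own.rank() > sender.rank():
                self.sent.append(("DSBM_WILLING", self.own))
            self.state = State.ELECT_DSBM

    def on_listen_timeout(self):
        """ListenInterval expired with no DSBM seen: start an election."""
        if self.state is State.DETECT_DSBM:
            self.sent.append(("DSBM_WILLING", self.own))
            self.state = State.ELECT_DSBM
```

For instance, an SBM with priority 5 at 10.0.0.2 that hears a DSBM_WILLING message from a priority-5 SBM at 10.0.0.1 would, under these assumptions, announce its own candidacy before entering ElectDSBM.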
When an SBM (call it X) enters the ElectDSBM state, it sets a timer (called ElectionIntervalTimer that is typically set to a value at least equal to the DeadIntervalTimer value) to wait for the election to finish and to discover who is the best candidate. In this state, X keeps track of the best candidate seen so far (including itself). Whenever it receives another DSBM_WILLING message, it updates its notion of the best candidate based on the priority (and tiebreaking) criterion. During the ElectionInterval, if X considers itself the best candidate so far, it sends out a DSBM_WILLING message every RefreshInterval to (re)assert its candidacy.

At the end of the ElectionInterval, X checks whether it is the best candidate so far. If so, it declares itself to be the DSBM (by sending out the I_AM_DSBM advertizement) and enters the I_AM_DSBM state; otherwise, it enters the SteadyState.

An SBM is in SteadyState state when no election is in progress and the DSBM is already elected (and happens to be someone else). In this state, it listens for incoming I_AM_DSBM advertizements and uses a DSBMDeadInterval timer to detect the failure of DSBM. Every time the advertizement is received, the timer is restarted. If the timer fires, the SBM goes into the DetectDSBM state to prepare to elect the new DSBM. If an SBM receives a DSBM_WILLING message from the current DSBM in this state, the SBM enters the ElectDSBM state after sending out a DSBM_WILLING message (to announce its own candidacy).

In the I_AM_DSBM state, the DSBM sends out I_AM_DSBM advertizements every refresh interval. If the DSBM wishes to shut down (gracefully terminate), it sends out a DSBM_WILLING message (with SBM priority value set to zero) to initiate the election procedure.
The priority value zero effectively removes the outgoing DSBM from the election procedure and makes way for the election of a different DSBM.

A.3. Recovering from DSBM Failure

When a DSBM fails (DSBMDeadInterval timer fires), all the SBMs enter the ElectDSBM state and start the election process.

At the end of the ElectionInterval, the elected DSBM sends out an I_AM_DSBM advertizement and the DSBM is then operational.

A.4. DSBM Advertizements

The I_AM_DSBM advertizement contains the following information:

1. DSBM address information -- contains the IP address of the DSBM and its SBM priority (a configuration parameter -- priority specified by a network administrator). The priority value is used to choose among candidate SBMs during the election algorithm. Higher integer values indicate higher priority and the value is in the range 0..255. The value zero indicates that the SBM is not eligible to be the DSBM.

2. refresh interval -- contains the value of the refresh interval in seconds. Value zero indicates the parameter has been omitted in the message. Receivers may substitute their own default value in this case.

3. SBMDeadInterval -- contains the value of the SBMDeadInterval in seconds. If the value is omitted (or value zero is specified), a default value (from initial configuration) should be used.

A.5. DSBM_WILLING Messages

When an SBM wishes to declare its candidacy to be the DSBM during an election phase, it sends out a DSBM_WILLING message. The DSBM_WILLING message contains the following information:

1. DSBM address information -- contains the SBM's own address, if it wishes to be the DSBM. Also contains the priority of the SBM whose address is given above.

A.6.
SBM State Variables

For each network interface, an SBM maintains the following state variables related to the election of the DSBM for the L2 domain on that interface:

a) LocalDSBMAddrInfo -- current DSBM's IP address (initially, 0.0.0.0) and priority. All IP addresses are assumed to be in network byte order.

b) OwnAddrInfo -- SBM's own IP address for the interface and its own priority (a configuration parameter).

c) DSBM RefreshInterval in seconds. When the DSBM is not yet elected, it is set to a default value specified as a configuration parameter.

d) DSBMDeadInterval in seconds. When the DSBM is not yet elected, it is initially set to a default value specified as a configuration parameter.

f) ListenInterval in seconds -- a configuration parameter that decides how long an SBM spends in the DetectDSBM state (see below).

g) ElectionInterval in seconds -- a configuration parameter that decides how long an SBM spends in the ElectDSBM state when it has declared its candidacy.

Figure 1 in Section A.9 shows the state transition diagram for the election protocol and the various states are described below. A complete description of the state machine is provided in Section A.10.

A.7. DSBM Election States

DOWN -- SBM is not operational.

DetectDSBM -- typically, the initial state of an SBM when it starts up. In this state, it checks to see whether a DSBM already exists in its domain.

SteadyState -- SBM is in this state when no election is in progress and it is not the DSBM. In this state, SBM passively monitors the state of the DSBM.

ElectDSBM -- SBM is in this state when a DSBM election is in progress.

IAMDSBM -- SBM is in this state when it is the DSBM for the L2 domain.

A.8.
_E_v_e_n_t_s _t_h_a_t _c_a_u_s_e _s_t_a_t_e _c_h_a_n_g_e_s StartUp -- SBM starts operation. ListenInterval Timeout -- The ListenInterval timer has fired. This means that the SBM has monitored its domain to check for an exist- ing DSBM or to check whether there are candidates (other than itself) willing to be the DSBM. DSBM_WILLING message received -- This means that the SBM received a DSBM_WILLING message from some other SBM. Such a message is sent when an SBM wishes to declare its candidacy to be the DSBM. I_AM_DSBM message received -- SBM received a DSBM advertizement from the DSBM in its L2 domain. SBMDeadInterval Timeout -- The SBMDeadInterval timer has fired. This means that the SBM did not receive even one DSBM advertizement during this period and indicates possible failure of the DSBM. RefreshInterval Timeout -- The RefreshInterval timer has fired. In the I_AM_DSBM state, this means it is the time for sending out the next DSBM advertizement. In the ElectDSBM state, the event means that it is the time to send out another DSBM_WILLING message. ElectionInterval Timeout -- The ElectionInterval timer has fired. This means that the SBM has waited long enough after declaring its candidacy to determine whether or not it succeeded. CONTINUED ON NEXT PAGE draft-yavatkar-sbm-ethernet-03.txt [Page 17] INTERNET-DRAFT SBM (Subnet Bandwidth Manager) February, 1997 _A._9. _S_t_a_t_e _T_r_a_n_s_i_t_i_o_n _D_i_a_g_r_a_m (_F_i_g_u_r_e _1) +-----------+ +--<--------------<-|DetectDSBM |---->------+ | +-----------+ | | | | | | | | +-------------+ +---------+ | +->---| SteadyState |--<>---|ElectDSBM|--<--+ +-------------+ +---------+ | | | +-----------+ | +---| IAMDSBM |-<-+ | +-----------+ | | +-----------+ +>>-| SHUTDOWN | +-----------+ _A._1_0. _E_l_e_c_t_i_o_n _S_t_a_t_e _M_a_c_h_i_n_e Based on the events and states described above, the state changes at an SBM are described below. 
Each state change is triggered by an event and is typically accompanied
by a sequence of actions. The state machine is described assuming a
single-threaded implementation (to avoid race conditions between state
changes and timer events) with no timer events occurring during the
execution of the state machine.

The following routines are used frequently in the description of the
state machine:

ComparePrio(FirstAddrInfo, SecondAddrInfo)
    -- determines whether the entity represented by the first parameter
    is better than the second entity using the priority information and
    the address information. If either address is zero, that entity
    automatically loses; otherwise priorities are compared first and
    addresses (assumed to be in network byte order) are used for
    breaking ties. Returns TRUE if the first entity is a better choice,
    FALSE otherwise.

SendDSBMWillingMessage()
Begin
    Send out DSBM_WILLING message listing myself as a candidate for
    DSBM (copy MyAddr and priority into appropriate fields)
    start RefreshIntervalTimer
    goto ElectDSBM state
End

AmIBetterDSBM(OtherAddrInfo)
Begin
    if (ComparePrio(MyAddrInfo, OtherAddrInfo))
        return TRUE

    change LocalDSBMInfo = OtherAddrInfo
    return FALSE
End

UpdateDSBMInfo()
/* invoked in an assignment such as LocalDSBMInfo = OtherAddrInfo */
Begin
    update LocalDSBMInfo such as IP addr, DSBM priority,
    RefreshIntervalTimer, DSBMDeadIntervalTimer
End

A.10.1 State Changes

State:     DOWN
Event:     StartUp
New State: DetectDSBM
Action:    Initialize the local state variables (LocalDSBMAddr and
           LocalDSBMAddrInfo set to 0). Start the ListenIntervalTimer.

State:     DetectDSBM
Event:     I_AM_DSBM message received
New State: SteadyState
Action:    set LocalDSBMAddrInfo = IncomingDSBMAddrInfo
           start DSBMDeadInterval timer
           goto SteadyState

State:     DetectDSBM
Event:     ListenIntervalTimer fired
New State: ElectDSBM
Action:    Start ElectionIntervalTimer
           SendDSBMWillingMessage();

State:     DetectDSBM
Event:     DSBM_WILLING message received
New State: ElectDSBM
Action:    Cancel any active timers
           Start ElectionIntervalTimer
           /* am I a better choice than this dude? */
           If (ComparePrio(MyAddrInfo, IncomingDSBMInfo)) {
               /* I am better */
               SendDSBMWillingMessage()
           } else {
               Change LocalDSBMAddrInfo = IncomingDSBMAddrInfo
               goto ElectDSBM state
           }

State:     SteadyState
Event:     DSBMDeadInterval timer fired
New State: ElectDSBM
Action:    start ElectionIntervalTimer
           SendDSBMWillingMessage()

State:     SteadyState
Event:     I_AM_DSBM message received
New State: SteadyState
Action:    /* first check whether anything has changed */
           if (!ComparePrio(LocalDSBMInfo, IncomingDSBMInfo))
               change LocalDSBMInfo to reflect new info
           endif
           restart DSBMDeadIntervalTimer;
           continue in current state;

State:     SteadyState
Event:     DSBM_WILLING message received
New State: Depends on action (ElectDSBM or SteadyState)
Action:    /* check whether it is from the DSBM itself (shutdown) */
           if (IncomingDSBMAddr == LocalDSBMAddr) {
               cancel active timers
               Change LocalDSBMAddrInfo = MyAddrInfo
               Start ElectionIntervalTimer
               SendDSBMWillingMessage() /* goto ElectDSBM state */
           } else {
               /* ignore it */
               continue in current state
           }

State:     ElectDSBM
Event:     ElectionIntervalTimer fired
New State: depends on action (I_AM_DSBM or SteadyState)
Action:    If (LocalDSBMAddrInfo == MyAddrInfo) {
               /* I won */
               send I_AM_DSBM message
               start RefreshIntervalTimer
               goto I_AM_DSBM state
           } else {
               /* someone else won */
               start DSBMDeadInterval timer
               goto SteadyState
           }

State:     ElectDSBM
Event:     I_AM_DSBM message received
New State: SteadyState
Action:    set LocalDSBMAddrInfo = IncomingDSBMAddrInfo
           Cancel any active timers
           start DSBMDeadInterval timer
           goto SteadyState

State:     ElectDSBM
Event:     DSBM_WILLING message received
New State: ElectDSBM
Action:    Check whether it is a loopback and, if so, discard it and
           continue;
           if (!AmIBetterDSBM(IncomingDSBMInfo)) {
               Change LocalDSBMAddrInfo = IncomingDSBMAddrInfo
           }
           continue in current state

State:     ElectDSBM
Event:     RefreshIntervalTimer fired
New State: ElectDSBM
Action:    SendDSBMWillingMessage()

State:     I_AM_DSBM
Event:     DSBM_WILLING message received
New State: I_AM_DSBM
Action:    send I_AM_DSBM message /* reassert myself */
           restart RefreshIntervalTimer

State:     I_AM_DSBM
Event:     RefreshIntervalTimer fired
New State: I_AM_DSBM
Action:    send I_AM_DSBM message
           restart RefreshIntervalTimer

State:     I_AM_DSBM
Event:     I_AM_DSBM message received
New State: depends on action (I_AM_DSBM or SteadyState)
Action:    /* check whether other guy is better */
           If (ComparePrio(MyAddrInfo, IncomingAddrInfo)) {
               /* I am better */
               send I_AM_DSBM message
               restart RefreshIntervalTimer
               continue in current state
           } else {
               Set LocalDSBMAddrInfo = IncomingAddrInfo
               cancel active timers
               start DSBMDeadInterval timer
               goto SteadyState
           }

State:     I_AM_DSBM
Event:     Want to shut myself down
New State: DOWN
Action:    send DSBM_WILLING message with my address filled in, but
           priority set to zero
           goto DOWN state

A.10.2 Suggested Values of Interval Timers

To avoid long DSBM outages, to ensure quick recovery from DSBM failures,
and to avoid timeout of PATH and RESV state at the edge devices, we
suggest the following values for various timers.
Assuming that the RSVP implementations use a 30 second timeout for PATH
and RESV refreshes, we suggest that the RefreshIntervalTimer should be
set to about 5 seconds with the DSBMDeadIntervalTimer set to 15 seconds
(K * RefreshInterval, with K = 3). The DetectDSBMTimer should be set to
a random value between DeadIntervalTimer and 2 * DeadIntervalTimer. The
ElectionIntervalTimer should be set to at least the value of
DeadIntervalTimer to ensure that each SBM has a chance to have its
DSBM_WILLING message (sent every RefreshInterval in the ElectDSBM state)
delivered to others.

A.11 DSBM Election over point-to-point links in a switched LAN

The election algorithm works as described before in this case, except
that each SBM-capable L2 device restricts the scope of the election to
its local segment. As described in Section B.1 below, all messages
related to the DSBM election are sent to a special multicast address
(AllSBMAddress). AllSBMAddress (its corresponding MAC multicast address)
is configured in the permanent database of SBM-capable layer 2 devices
so that all frames with AllSBMAddress as the destination address are not
forwarded and are instead directed to the SBM management entity in those
devices. Thus, a DSBM can be elected separately on each point-to-point
segment in a switched topology. For example, in Figure 2, the DSBM for
"segment a" will be elected using the election algorithm between R1 and
S1, and none of the election-related messages on this segment will be
forwarded by S1 beyond "segment a". Similarly, a separate election will
take place on each segment in this topology.
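
The ComparePrio routine of Section A.10 and the timer relationships
suggested in Section A.10.2 can be sketched as follows. This is only an
illustrative sketch, not part of the protocol definition. The names
AddrInfo, compare_prio, elect, and suggested_timers are hypothetical, the
sketch handles IPv4 addresses only, and it assumes that a tie in priority
is won by the numerically higher address, which the text above does not
state explicitly.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class AddrInfo:
    """Address/priority pair, as held in LocalDSBMAddrInfo or OwnAddrInfo."""
    addr: IPv4Address
    priority: int  # 0..255; higher values indicate higher priority


# 0.0.0.0 is the initial (unset) value of LocalDSBMAddrInfo.
UNSET = IPv4Address("0.0.0.0")


def compare_prio(first: AddrInfo, second: AddrInfo) -> bool:
    """Return True if `first` is a better DSBM choice than `second`."""
    # An entity whose address is zero automatically loses.
    if first.addr == UNSET:
        return False
    if second.addr == UNSET:
        return True
    # Priorities are compared first.
    if first.priority != second.priority:
        return first.priority > second.priority
    # ASSUMPTION: ties go to the numerically higher address.
    return int(first.addr) > int(second.addr)


def elect(candidates: list[AddrInfo]) -> AddrInfo:
    """Pick the best candidate heard during an ElectionInterval."""
    best = AddrInfo(UNSET, 0)
    for candidate in candidates:
        # Priority zero means "not eligible to be the DSBM".
        if candidate.priority == 0:
            continue
        if compare_prio(candidate, best):
            best = candidate
    return best


def suggested_timers(refresh_interval: int = 5, k: int = 3) -> dict:
    """Derive the suggested timer values (seconds) from RefreshInterval."""
    dead_interval = k * refresh_interval
    return {
        "RefreshInterval": refresh_interval,
        "DSBMDeadInterval": dead_interval,
        # DetectDSBMTimer is drawn at random from this range.
        "DetectDSBMTimerRange": (dead_interval, 2 * dead_interval),
        # ElectionIntervalTimer should be at least this long.
        "MinElectionInterval": dead_interval,
    }
```

In this sketch, an SBM that still holds the initial LocalDSBMAddrInfo
(address 0.0.0.0) loses every comparison, so the first eligible
DSBM_WILLING candidate it hears replaces the unset entry.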

APPENDIX B
Message Encapsulation and Formats

To minimize changes to existing RSVP implementations and to ensure quick
deployment of an SBM in conjunction with RSVP, all communication to and
from a DSBM will be performed using messages constructed using the
current rules for RSVP message formats. For more details on the RSVP
message formats, refer to the RSVP specification
(draft-ietf-rsvp-spec-14.ps). No changes to the RSVP message formats are
proposed, but new message types and new LAN_NHOP and LAN_PHOP objects
are added to the RSVP message formats to accommodate DSBM-related
messages. These additions are described below.

B.1 Message Addressing

For communication among peer SBMs, DSBM, and DSBM clients, we define two
IP multicast addresses, AllSBMAddress and DSBMLogicalAddress. These two
addresses are selected to be IP addresses with local scope (message
propagation restricted to within the originating L2 domain). In
addition, these addresses are chosen such that their corresponding MAC
multicast addresses (especially in the case of Ethernet) can be used for
easy filtering of RSVP/SBM messages at the switch interfaces. These MAC
multicast addresses will be configured in the permanent database of
layer 2 devices (which are SBMs and DSBMs) so that incoming RSVP/SBM
control messages are easily directed to the resident SBM software agents
WITHOUT requiring the devices to explicitly snoop within IP datagrams.

AllSBMAddress is used as the destination address when sending out both
DSBM_WILLING and I_AM_DSBM messages. In addition, a DSBM uses this
address as the destination address when forwarding a PATH message to
DSBM clients and SBMs on a path.

When SBM clients (and SBMs) need to forward an outgoing PATH message to
their DSBM for the corresponding segment, they use the
DSBMLogicalAddress as the destination address.

Thus, a DSBM client (which is not itself an SBM) monitors the
AllSBMAddress to receive I_AM_DSBM advertisements and PATH messages sent
by its DSBM.

B.2. Message Sizes

Each message must occupy exactly one IP datagram. If it exceeds the MTU,
such a datagram will be fragmented by IP and reassembled at the
recipient node. As a consequence, a single message may not exceed the
maximum IP datagram size, approximately 64K bytes.

B.3 RSVP-related Message Formats

All RSVP messages directed to and from a DSBM may contain various RSVP
objects defined in the RSVP specification, and messages continue to
follow the formatting rules specified in the RSVP specification. In
addition, an RSVP implementation must also recognize two new object
classes, LAN_NHOP and LAN_PHOP, that are described below.

5.2 LAN_NHOP and LAN_PHOP Objects

Both LAN_NHOP and LAN_PHOP objects have the same structure as the
RSVP_HOP object, but are identified as separate object classes to
distinguish them from each other and from the RSVP_HOP objects.

LAN_NHOP objects use object class = 40; IPv4 LAN_NHOP object uses
<C-Type = 1> and IPv6 LAN_NHOP object uses <C-Type = 2>.

IPv4 LAN_NHOP object: class = 40, C-Type = 1

+---------------+---------------+---------------+---------------+
|                     IPv4 Next Hop Address                     |
+---------------+---------------+---------------+---------------+
|                   Logical Interface Handle                    |
+---------------+---------------+---------------+---------------+

IPv6 LAN_NHOP object: class = 40, C-Type = 2

+---------------+---------------+---------------+---------------+
|                                                               |
+                                                               +
|                                                               |
+                     IPv6 Next Hop Address                     +
|                                                               |
+                                                               +
|                                                               |
+---------------+---------------+---------------+---------------+
|                   Logical Interface Handle                    |
+---------------+---------------+---------------+---------------+

LAN_PHOP objects use class = 41; IPv4 LAN_PHOP and IPv6 LAN_PHOP objects
use C-Types 1 and 2 respectively. These objects have the same structure
as those shown above.

5.3 RSVP PATH Message Format

As specified in the RSVP specification, an RSVP_PATH message contains
the RSVP Common Header and the relevant RSVP objects. For the RSVP
Common Header, refer to the RSVP specification
(draft-ietf-rsvp-spec-14.ps). Changes to an RSVP_PATH message include
addition of the LAN_NHOP and the LAN_PHOP objects as specified below.

    ::= [INTEGRITY>] []

If the INTEGRITY object is present, it must immediately follow the RSVP
common header. The LAN_NHOP object must always precede the SESSION
object.

5.4 RSVP RESV Message Format

As specified in the RSVP specification, an RSVP_RESV message contains
the RSVP Common Header and relevant RSVP objects. No changes to the RESV
message format are needed with the SBM protocol.

B.4 Additional RSVP message types to handle SBM interactions

New RSVP message types are introduced to allow interactions between a
DSBM and an RSVP node (host/router) for the purpose of discovering and
binding to a DSBM.
New RSVP message types needed are as follows:

    RSVP Msg Type (8 bits)    Value
    DSBM_WILLING              66
    I_AM_DSBM                 67

All SBM-specific messages are formatted as RSVP messages with an RSVP
common header followed by SBM-specific objects.

    ::=

where

    ::= []

For each SBM message type, there is a set of rules for the permissible
choice of object types. These rules are specified using Backus-Naur Form
(BNF) augmented with square brackets surrounding optional sub-sequences.
The BNF implies an order for the objects in a message. However, in many
(but not all) cases, object order makes no logical difference. An
implementation should create messages with the objects in the order
shown here, but accept the objects in any permissible order. Any
exceptions to this rule will be pointed out in the specific message
formats.

DSBM_WILLING Message

    ::=

I_AM_DSBM Message

    ::=

All I_AM_DSBM messages are multicast to the well known DSBM_GROUP
address. The default priority of an SBM is 1 and higher priority values
represent higher precedence. The priority value zero indicates that the
SBM is not eligible to be the DSBM.

Relevant Objects

DSBM IP ADDRESS objects use object class = 42; IPv4 DSBM IP ADDRESS
object uses <C-Type = 1> and IPv6 DSBM IP ADDRESS object uses
<C-Type = 2>.

IPv4 DSBM IP ADDRESS object: class = 42, C-Type = 1

+---------------+---------------+---------------+---------------+
|                     IPv4 DSBM IP Address                      |
+---------------+---------------+---------------+---------------+

IPv6 DSBM IP ADDRESS object: class = 42, C-Type = 2

+---------------+---------------+---------------+---------------+
|                                                               |
+                                                               +
|                                                               |
+                     IPv6 DSBM IP Address                      +
|                                                               |
+                                                               +
|                                                               |
+---------------+---------------+---------------+---------------+

PRIORITY Object: class = 43, C-Type = 1

+---------------+---------------+---------------+---------------+
|     ////      |     ////      |     ////      |   priority    |
+---------------+---------------+---------------+---------------+

The timer intervals are specified as interval objects with an integer
value in the range 0..255 seconds.

DSBM Refresh Interval Object: class = 44, C-Type = 1

+---------------+---------------+---------------+---------------+
|     ////      |     ////      |     ////      |  Timer value  |
+---------------+---------------+---------------+---------------+

DSBM Dead Interval Object: class = 44, C-Type = 2

+---------------+---------------+---------------+---------------+
|     ////      |     ////      |     ////      |  Timer value  |
+---------------+---------------+---------------+---------------+

ACKNOWLEDGEMENTS

The authors are thankful to John ("JJ") Krawczyk of Bay Networks for his
constructive comments on the SBM design and the earlier versions of this
draft.

6. Authors' Addresses

    Raj Yavatkar
    Intel Corporation
    MS: JF3-206
    2111 N.E. 25th Avenue,
    Hillsboro, OR 97124
    USA
    phone: +1 503-264-9077
    email: yavatkar@ibeam.intel.com

    Don Hoffman
    Sun Microsystems, Inc.
    MS: UMPK14-305
    2550 Garcia Avenue
    Mountain View, California 94043-1100
    USA
    phone: +1 503-297-1580
    email: don.hoffman@eng.sun.com

    Yoram Bernet
    Microsoft
    1 Microsoft Way
    Redmond, WA 98052
    USA
    phone: +1 206 936 9568
    email: yoramb@microsoft.com

    Fred Baker
    Cisco Systems
    519 Lado Drive
    Santa Barbara, California 93111
    USA
    phone: +1 408 526 4257
    email: fred@cisco.com

    Rick Kennedy
    Cabletron Systems
    40 Continental Blvd.
    Merrimack, NH 03054
    USA
    phone: +1 603-337-5163
    email: rkennedy@ctron.com