Core Based Trees (CBT) - Scalable Multicast Routing - by A. Ballardie (UCL), P. Tsuchiya (Bellcore), J. Crowcroft (UCL) Status of this Draft This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts. Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a ``working draft'' or ``work in progress.'' Please check the 1id-abstracts.txt listing contained in the internet-drafts Shadow Directories on nic.ddn.mil, nnsc.nsf.net, nic.nordu.net, ftp.nisc.sri.com, or munnari.oz.au to learn the current status of any Internet Draft. Abstract Multicasting is a technique used which allows for group communication either locally or on a wide-area scale. Most local-area networks such as Ethernet and Token Ring provide a multicast service, which has since been exploited by many applications and distributed systems. More recently, multicast capability has been extended to the internetwork using a combination of a distance-vector routing algorithm, a host-to-router group-membership reporting protocol, and the computation of sender-based multicast trees. This, and other approaches to internetwork multicasting, are not scalable to large groups, especially so for large numbers of groups. This document provides a proposal for a new multicast routing algorithm which provides multicast capability in a datagram internetwork. It is scalable, low-cost and efficient - properties lacking in current internetwork multicast routing protocols. Distribution of this draft is unlimited. A mailing list is in place in idmr@cs.ucl.ac.uk (subscribe to idmr-request@cs.ucl.ac.uk) to discuss this and other inter-domain multicast routing issues. (NOTE: DIAGRAMS HAVE BEEN OMITTED FROM THIS .txt VERSION OF CBT. HOWEVER, THEY ARE INCLUDED IN THE .ps VERSION OF THIS DRAFT WHICH IS AVAILABLE IN THE "internet-drafts" DIRECTORIES UNDER THE SAME NAME). CBT Expires April 6th, 1993. i Contents 1 Introduction. . . . . . . 1 2 CBT Design Goals . . . . . . 2 3 Assumptions of the CBT Model . . . . 4 4 CBT Design Overview. . . . . . 5 4.1 Some Terminology . . . . . 5 4.2 Packet Types . . . . . 7 4.3 CBT Introduction . . . . . 10 4.4 CBT Algorithm . . . . . 13 4.5 Some Examples . . . . . 16 5 The LAN Issue . . . . . . 20 5.1 Further Details on the LAN Issue . . . 22 5.2 Summary . . . . . . 25 6 More on how a Router becomes a Major Core . . . 26 7 Core Assignment and Optimality . . . . 27 7.1 Dynamically Improving Optimality . . . 27 8 Major Cores and Minor Cores: The core-join request . 29 9 Core Tree Maintenance Protocol . . . . 30 9.1 The quit-request . . . . . 34 10 core-state-notification (CSN) packet . . . 35 10.1 How the CSN Packet Works . . . . 35 11 core-list packet . . . . . . 37 12 Acknowledgements . . . . . . 42 A Authors' Addresses . . . . . 43 CBT - Internet DRAFT, November 1992 1 1 Introduction Multicast transmission is an increasingly important function in data networks. For instance, SMDS specifies multicast as part of its service, and multicast is an important element for ATM networking, particularly for, but not limited to, video broadcast. Multicast is also increasingly used in the IP internet. One of the central problems in multicast transmission is forming the multicast tree - the collection of nodes and links that the multicast packet traverses. In spite of the fact that substantial work has been done in the area of algorithms for forming multicast trees, significant problems remain to be solved. Paramount among these is the problem of scaling. Current multicast techniques either scale well but form grossly suboptimal trees, or they form good trees but scale poorly. (By scale poorly, we mean consume too many memory, bandwidth, or processing resources.) For example, the multicast model that has been adopted for internetwork multicasting is based onformulating a sender-based multicast spanning tree for each member of a group to which a host is subscribed, using information extracted from the distance-vecto routing algorithm. This imposes considerable storage costs on participating routers. These costs include storing parent and child attributes (and corresponding next-hop link information) in existing route table entries, as well as some other protocol-dependent information. Furthermore, all routers in the internet that have multicast capability are required to participate in establishing a multicast tree for each group present in the internet, whether they are on the multicast tree or not1. This is clearly not scalable as the number of groups to which a host subscribes, and the number of hosts per group, increases. The number of applications taking advantage of the internetwork multicast service is growing rapidly. Example applications include video-conferencing, audio-conferencing, distributed database updating/querying, and location mechanisms for distributed network services. It is therefore important that any internetwork multicast routing algorithm be scalable, efficient (in terms of multicast packet forwarding), low-cost (in terms of computational overhead and storage requirements), and be adaptable to security features - an important issue so far not addressed in multicasting. This draft proposes a new algorithm for multicast group communication in the global internetwork, called CBT (Core Based Trees). It satisfies all of the requirements stated above. ---------------------------- 1. This overhead can be reduced by using an appropriate TTL value in the IP header, especially so for localised groups CBT - Internet DRAFT, November 1992 2 2 CBT Design Goals 1. The design of any multicast routing algorithm should satisfy the requirements of the Host Group Model as specified in [1]. These mostly have to do with the flexibility of multicasting, for example: o Senders need not be members. o Groups may have any number of members. o There are no topological restrictions on group membership. o Membership is dynamic and autonomous. o Host groups may be permanent or transient. 2. Scalability is of paramount importance in any multicast routing algorithm. Good multicast routing algorithm scaling characteristics include low memory, low bandwidth, and low processing resouce requirements. Furthermore, the less routers need to know about the location of remote group members, the better. 3. Robust Multicast Routing. CBT uses a core-based approach to building the multicast spanning tree. Most groups will be small at the beginning of their lifetime, and so the multicast spanning tree for a group will be single-core in the beginning. The destination address in all multicast packets will be the address of the core. As groups become larger and more widespread, additional core(s) may be assigned to the group. Core placement will be chosen so as to enhance the optimality and robustness of the multicast delivery tree. 4. Invisible Multicast Routing and Addressing to routers not on the CBT tree. CBT allows for multicast routing and addressing to be completely invisible to underlying unicast routing and addressing to routers not on the CBT tree. No separate multicast routing table is required by routers (although minor extensions to the unicast routing table are required), and looking at the IP address field alone, multicast addresses are indistinguishable from IP unicast addresses. CBT - Internet DRAFT, November 1992 3 Multicast addresses are only recognised as such if the router in question is a member of the same group as the group identified in an incoming multicast packet, i.e. it is a router on the CBT tree for that group. The router in question can establish this by matching its group identifier with that present in the IP Options field. The group identifier is 32 bits long. If there is no match, then the packet is routed as a normal unicast packet towards the core addressed in the packet. We considered this addressing flexibility an important design goal since multicast addresses are no longer restricted to class D addresses, the administrative overhead of assigning them is reduced, and we see a degree of multicast address and routing transparency introduced not seen before in multicasting. Furthermore, by having a separate name space it is concievable to apply additional semantics to the name. Routers that are part of a CBT tree need only know which routers are its parent (any router has only one parent per CBT tree) and which are its children (the exception to this rule is for certain cores on the core tree which need to know about each other although they are not directly connected neighbours). Thus, CBT allows for a considerable amount of information hiding when compared with other network-level multicast approaches. 5. Routing Algorithm/Protocol Independence. CBT's flexible addressing method allows CBT to be installable anywhere, even across different routing algorithms. Furthermore, it is easy to incorporate into new protocols that may not have a separate address space set aside for multicasting (for example, ATM with E.164 addresses). This is obviously advantageous for inter-domain multicast routing, and also means it can be applied to technologies such as SMDS and ATM networks 6. Independence from any new IP addressing scheme. One of the biggest headaches currently facing the internet research community is that the 32-bit IP address space is fast exhausting. A considerable number of researchers are frantically trying to find the right addressing solution to satisfy future growth expectations and beyond. As far as we can foresee, CBT should function independent of addressing method. This is certainly the case for the proposals so far reviewed by the authors. CBT - Internet DRAFT, November 1992 4 3 Assumptions of the CBT Model o CBT is discussed here in the context of a connectionless datagram internetwork, although as we have already stated, the design of CBT should allow it to run over heterogeneous networks. o CBT uses both ICMP and IGMP protocols, and although the both are standard parts of IP, the latter is present in many, but not all, implementations. We assume its presence for CBT. o CBT requires certain messages not originating on the CBT tree, to traverse only tree branches once encountering the tree. Furthermore, messages originating in certain parts of the tree must only traverse certain branches, and any acknowledgements to these messages must similarly traverse the same branch(es) on the return-path. We have not assumed symmetric routes and have therefore made explicit provisions to create symmetric routes where necessary. o CBT routers are not dedicated to CBT multicast routing. They are normal IP routers with normal IP routing capabilities. A router may be a CBT router for more than one group at any one time. CBT functionality then, is considered an extension to normal IP routing functionality. Ideally, all routers in the internetwork will have CBT functionality and each will make itself available to become a core for a group(s) at any time. o Last, but by no means least, we assume the presence of a multicast core-assignment service, or core server. The specification of such a core-assignment service is outside the context of this draft. However, we make numerous assumptions/suggestions as to how it would/should operate and some of the services it should provide. We also include some of the messages that might be exchanged between such a service and CBT in order to provide a clearer picture, but these message are NOT part of CBT. Whether a dedicated core-assignment service is required, or directory services should assume this role, is open to debate, but clearly a core-assignment service would require global topology information in order to be able to assign cores reasonably optimally. For simplicity, we will refer to a dedicated core-assignment service throughout this draft as multicast directory service. We feel some kind of dedicated multicast service is necessary not only for CBT, but for all multicast protocols. If such a service were presently in operation, maybe it could assign the class D addresses required by current internet multicast routing protocols, which at present, come from ``thin air''. CBT - Internet DRAFT, November 1992 5 4 CBT Design Overview We will first of all provide a description of some of the terminology used in CBT. This is followed by a subsection on CBT describing what it is, what it aims to provide, and briefly how it works. Finally, a more detailed description of the algorithm is given. CBT encompasses a rather more complicated protocol for maintaining the connectivity of the backbone of the CBT tree (core-tree), called the Core-tree Maintenance Protocol. This may be mentioned in the description of the algorithm, but we have dedicated a complete section to it later on because of its importance and for reasons of clarity. 4.1 Some Terminology CBT (core-based tree) is a multicast routing algorithm which results in a centre-based (core-based) multicast delivery tree. There is one CBT tree per group. A core-based tree can be of two types: o single-core o multi-core Single-core CBT trees are for small, geographically local groups. As a group becomes larger and more widespread, additional cores may be assigned to the group concerned, thus making it a multi-core CBT tree. We distinguish between three different types of router that make up a CBT tree: o major core - these are the only routers to be assigned by directory service. They become the centre (or part of the centre in the case of multi-core trees) of a multicast CBT spanning tree. They are assigned so as to be strategically placed to form as near an optimal multicast spanning tree as possible for the group. All multicast packets contain, as their destination address, the (unicast) address of a major core. o minor core - router(s) which form a virtual link between major core routers on a CBT tree. These routers are only present in multi-core CBT trees. When an additoinal major core is assigned to a single-core CBT tree, that major core must make a special core-join request to the existing major core. The path of routers the core-join request-ACK traverses become minor cores on the CBT tree. CBT - Internet DRAFT, November 1992 6 o non-core router - all routers part of the CBT tree that are not major or minor cores are non-core routers. The part(s) of the CBT spanning tree that link major core routers together, i.e. the branches of the tree that only contain major and minor cores, is known as the core tree. o major child - the major core that has sent (and received and ACK for) a core-join request to another major core on the core tree. o major parent - the major core which has received and ACK'd a core-join request from another major core, now on the core tree. CBT - Internet DRAFT, November 1992 7 4.2 Packet Types The following is a list of all CBT and CBT-related control packet types. Brief descriptions are given here, with fuller implementation details available in the appropriate later sections. o CBT control packet types. 1. send-join request - sent from a host to its local router requesting that router to formulate and send a join request. It is an indication of the host's firm intention to join a particular group2. 2. send-join-ACK - positive response from the local router to the host, formulated and sent by it after forwarding a join-request to the major core as specified in the send-join request. The host's group membership is pending. 3. join-request - originated by (if received a send-join request from host on attached subnet), or forwarded by, a router that is not already part of the CBT tree for the group identified by group-id inside the packet. Non-core router status is pending. 4. join-ACK - formulated by any router on the CBT tree that receives a join-request. All routers traversed by a join-ACK become non-core routers on the CBT tree for the group identified by group-id. 5. core-join request - formulated and sent by a major core router that has been newly-assigned to a group. It will have been assigned as such by directory service, which also supplies it with the address of an existing major core to which the core-join request must be forwarded. Alternatively, a core-join request is formulated and sent by an existing major core to another existing major core when reachability between the two has been lost, or one of the major core's parents has become unavailable. Both scenarios amount to a core-tree partition. 6. core-join-ACK - formulated and sent by either a major core (in the case it actually received the core-join request), or by a minor core (if that minor core terminated the core-join CBT - Internet DRAFT, November 1992 8 request), towards the source of the core-join request. All routers traversed by a core-join-ACK on its way to its destination become minor core routers, if are not already such. 7. termination-notification - formulated by a minor core that has received (and thus terminated) a core-join request. It is forwarded to the major core addressed as the target of the of the core-join request. 8. termination-notification-ACK - formulated by the major core that was the target of a core-join request terminated by a minor core. It is forwarded to that minor core. 9. core-state-notification - originated by a major core when that major core has sent either a core-join-ACK, or a termination-notification-ACK. Disemminated on all core-tree interfaces. Alternatively, a major core will generate a core-state-notification when it loses reachability to a virtual neighbour. 10. core-list - generated by the originating major core of a core-state-notification packet. It is the result of a core-state-notification packet having stabilized. The core-list is multicast on the CBT tree. ---------------------------- 2. This will incur minor host modifications. Any host modifications are obviously undesirable, but we feel that if we are going to solve the problem of inter-domain multicasting for the long-term future to provide a scalable and simple approach, host modifications will be eventually necessary. The minor host changes will be with respect to IGMP CBT - Internet DRAFT, November 1992 9 11. quit-request - can be composed by all three types of CBT router. In the case of major cores, a quit-request is sent when a new core-join request arrives from a currently unreachable major child via a different interface than that leading to the child previously. In the case of minor cores, a quit-request is sent only when one is received on a child core-tree interface AND there are no member hosts on any attached subnets. In the case of non-core routers, a quit-request is sent whenever the non-core router's parent changes and the path leading to the new parent is different from the old, or a quit request is received from a downstream non-core AND the current router has no member hosts on any attached subnets. Note: All CBT control packets will be sent encapsulated in IP datagrams. o ICMP. 1. ICMP echo request - sent both ways between major child and major parent, traversing only the core-tree, to test the ``liveness'' of each. 2. ICMP echo reply - a positive response to the above, again, sent only via the core-tree. o IGMP. 1. group-membership query - composed by routers and sent locally to attached subnets to monitor the presence of group members. Failure to receive a response for a particular group query after a certain timeout period results in bf non-core router sending a quit-request to its parent. 2. group-membership report - composed in response to receiving a group-membership query from the local designated router, to indicate that the group in question still has membership on this subnet. o CBT - Directory Service. 1. group-initiation request - composed by a host that wishes to create a group for a particular multicast application. It contains the address(es) of the other parties with which it wishes to communicate, and is forwarded to directory service as a request for a major core to be assigned. It expects the address of a new core to be returned. 2. group-join request - composed by a host that wishes to join an existing group. It is a request for a major core address. 3. core-request - a request from directory service aimed at a router for it to become a major core for a group. CBT - Internet DRAFT, November 1992 10 4.3 CBT Introduction The aim of CBT is to provide a core-based tree for each multicast group. This is a centre-based shortest-path spanning tree, with branches emanating from the centre. The tree will normally start out by being a single-core tree but as the group becomes larger and more widespread, additional core(s) will be assigned to the tree for reasons of routing efficiency and robustness. All cores are assigned by a multicast directory service, the operation of which is outside the context of this draft. When more than a single core is present in a CBT tree, the branch(es) linking the cores together form the core-tree. The assigned cores for a particular CBT tree, known as major core routers, are likely to be geographically distant apart, and so the core tree which links these cores is likely to contain other routers, known as minor core routers. For reasons of brevity, we describe first the single-core case. All branches of the tree consist of point-to-point links. When a host on a subnet somewhere in the internet, wishes to join an existing group, it sends a request to its local router to send a request to join the group. This join-request contains as its destination address, the address of the major core for the group, and is forwarded using the underlying unicast routing algorithm to the next hop on its way to the core. Every router that is traversed en-route will either be a router that is already part of the CBT tree, known as non-core routers, or will not be part of the CBT tree. If the join-request encounters a non-core router, that router terminates the join-request, and sends a join-ACK back to the source. All routers traversed by the join-ACK become non-core routers on receiving it. The join-ACK continues to be forwarded until it reaches the original source. Routers that receive a join-request that are not non-cores for the CBT tree will forward it on its way to the destination core using normal unicast routing, and then await a join-ACK before becoming non-cores for the CBT tree (group). Whenever a join-ACK is sent or received via an interface, that interface is marked as being a multicast interface for the corresponding CBT tree (group). In this way, branches are formed on the CBT tree. The diagram over illustrates a single-core CBT tree. CBT - Internet DRAFT, November 1992 11 Figure 1: Single-core CBT Tree Should the group become larger or/and more widespread, an additional core(s) may be assigned by directory service. This new major core must make a special join-request to the existing major core, to become part of the core-tree. This request is called a core-join request. All routers traversed en-route by the core-join request will become minor core routers on receiving a core-join-ACK. Existing minor cores on a CBT core-tree are obliged to terminate core-join requests and send a core-join-ACK back to the source, which, on receipt of it, becomes a new major core for the group. The new and existing major cores become (virtual) neighbours (major neighbours), the latter being the major parent and the former being the major child. On receiving the core-join-ACK, both parent and child majors can begin exchanging Core-tree Maintenance Protocol messages. The diagram over illustrates a multi-core CBT tree. CBT - Internet DRAFT, November 1992 12 Figure 2: Multi-core CBT Tree Should an existing minor core terminate a core-join request, it must send a termination-notification packet to the target of the join. When it receives a termination-notification-ACK in response, it may then generate a core-join-ACK which is forwarded to the source of the core-join request. The termination-notification packet indicates to the major parent that it has a new major child. Multicast data packets can originate from members of the group or non-members outside the group that are not part of the CBT tree. In the former case, multicast data packets will originate from a member host on a CBT router's attached subnet and be disemminated on reciept by that local on all outgoing multicast interfaces identified by group-id. Eventually, the data packets will reach all nodes on the CBT tree. In the latter case, multicast data packets will be forwarded as normal data packets, with the destination being a major core for the group, using the underlying unicast routing algorithm. The multicast data packet will first encounter the CBT tree either when the destination is reached, i.e. the major core, or it will hit another router (minor or non-core) that is part of the CBT tree. In either case, the multicast data packet will be forwarded on all outgoing interfaces associated with the group-id present in the data packet(s). In this way, the multicast data packets will eventually reach all nodes on the CBT tree. CBT - Internet DRAFT, November 1992 13 4.4 CBT Algorithm As specified above, it is assumed that hosts and routers are running the Internet Group Management Protocol - routers running the router part, and hosts the host part [1]. Whenever a host anywhere in the internet wants to initiate a group, there must be at least one other entity at the initial stage with which it wishes to communicate. Taking as an example, two hosts in different parts of the internet have agreed, by some external means (e.g. e-mail), that they wish to partake in a video conference. On group initiation, one host must make a group-initiation-request to multicast directory service for a group initiation. If, as in this case, there is no CBT tree yet present for the application concerned, the host makes its group initiation request supplying all the host addresses it currently knows about with which it wishes to take part in the group communication. In this way, directory service can ``magically'' select a major core which is optimally placed between those participants and supply the address of that major core, together with the group-identifier for that group, to all hosts identified in the initial reqest. If the requesting host knows, by some external means, that a group with which it wishes to communicate already exists, then it makes a group-join reqest to directory service which reqires no arguments. Directory service replies with the address of a major core and group-identifier (group-id) for the group. Once the requesting host has received the address of a major core for the group, it must now send a send-join-reqest, which includes the group-id and major core address, to its local router. There are two scenarios to consider: o The local router is not already a member for the group identified by the group-id. In this case, the local router formulates a join-request containing as its destination address, the address of the core supplied in the send-join request and also containing the group-id. This is forwarded using normal unicast routing on the next hop to its destination. At this point, the local router sends a send-join-ACK back to the local host, indicating to it that the join-request has been sent and group membership is pending. When a join-ACK is received by the local router, it is forwarded to the local host that initiated the join-request. o The local router is already a member for the group identified by the group-id. In this case, the send-join request is not actually honoured, and the local router formulates a send-join-ACK and forwards it locally back to the host. Following this, the router formulates a ``bogus'' join-ACK (since no join-request was actually sent) and forwards it to the local host. CBT - Internet DRAFT, November 1992 14 Only at this point, can an end-system host consider itself a member for the group, and as such can send/receive multicast packets for that group, as well as send positive replies to IGMP group membership queries. Design decision: We decided a member host should not be permitted to forward multicast packets to the CBT tree for a group until it receives the join-ACK. Here we trade-off between reliablility and low join-latency. By forwarding multicast packets to an upstream router before being ``ACK'd'' would further reduce the degree of delivery probability associated with a connectionless datagram network. When a router receives a join-request from a neighbour router, there are two scenarios to consider: o Scenario 1: The router is already part of the CBT tree for the group identified by group-id, but it is not the major core to which the join was addressed. Thus, the router encountered is either a minor core or a non-core router on the tree. At this point, the join-request terminates. The join-request is terminated at this point because there is already a shortest-path from this router to the core tree. So, the interface over which the join-reqest arrives is marked as a child interface for the corresponding CBT tree, and all CBT packets for the group arriving over it are forwarded on that router's one and only interface leading towards the core tree. If the router receiving the join-request is indeed the major core to which it is addressed, then the major core must be the first router encountered on the CBT tree for that group. In this case, the major core simply returns a join-ACK which may traverse other routers before it reaches its destination. All routers that forward/originate a join-request that recieve/forward a join-ACK for that request, set the following flags: 1. Router status flag - if not already set, is set to non-core in the flag table. Other router status flags are major core and minor core, but these are set under different circumstances. Note: Only one router status flag can be set at any one time for any one CBT tree in the flag table. The setting of one flag may however, lead to the clearing of another. 2. parent-branch - this flag is set in routing table, and there can only be one branch for which this flag is set. It is set against the interface indicating the parent branch in a particular CBT tree identified by group-id. The routing table is a slightly modified unicast routing table. CBT - Internet DRAFT, November 1992 15 3. child-branch - this flag is also set in the routing table against the interface indicating a child branch in a particular CBT tree identified by group-id. For any one group, this flag may be set on multiple interfaces. The flag table is illustrated below: ++++++++++++++++++++++++++++++++++++++++ + ROUTER STATUS FLAG + ++++++++++++++++++++++++++++++++++++++++ + Non-core Minor-core Major-core + + -------- ---------- ---------- + + group-id - - + + - group-id + + - group-id - + ++++++++++++++++++++++++++++++++++++++++ o Scenario 2: The join-request encounters a router that is not part of the CBT tree identified by group id. In this case the router forwards the join-request on the next hop towards the addressed core using the underlying unicast routing algorithm. As the join-ACK passes back along the point-to-point links that the join-request traversed, the routers encountered become non-cores and the appropriate flags are set in the flag table/routing table, as described above. Note: All join-request's contain an identifier in addition to the group-id to help distinguish between different join's for the same group, and match join-request's with join-ACK's. When a join-request is received by a router not already on the CBT tree, a copy of the join-request is cached, together with the incoming interface address. When it is forwarded, the outgoing interface address is added to the information contained in the cache. This is to ensure that the join-ACK is received and sent out on the correct interfaces. CBT - Internet DRAFT, November 1992 16 4.5 Some Examples The diagram below shows a CBT tree for a group with group-id 12345. We will demonstrate two examples. The first will show the messages generated when a host, shown as ``X'' next to the bottom right router in the diagram, wishes to become a member of the group 12345. Figure 3: Example Multi-Core CBT Tree The second example demonstrates how a new major core becomes part of the core-tree. The part of the diagram relevant for this is the top centre indicated by thick dotted lines. CBT - Internet DRAFT, November 1992 17 o Example 1: This example demonstrates the sequence of events required for host ``X'' (bottom right in diagram), when it wishes to become a member for the group 12345. The diagram illustrates the present state of the CBT tree for group 12345, but the dashed lines indicate the branches not yet part of the tree. The routers on the dotted lines are those traversed by any messages exchanged with the tree. We assume the message exchange with directory service has already taken place, and has supplied major core B's address. The sequence of events is as follows: 1. Host ``X'' composes a send-join request containing the address of major core B, and the group-id 12345. It then forwards it to the local router shown in the diagram. 2. The local router composes a join-request with major core B as the destination address, and forwards it on the next-hop towards the core. There are two further hops to traverse (shown on the dotted line on bottom right of diagram) before the join-request hits the CBT tree. These two routers forward the join-request as a normal unicast packet, i.e. CBT is transparent to them as long as they are not part of the group identified by group-id in the OPTIONS field of the IP header. 3. After the local router has forwarded the join-request, it returns a send-join-ACK to the local host. This indicates to the local host that group membership is pending. 4. The non-core router at the end of the dashed line receives the join-request. It processes the IP header which includes the group-id in the OPTIONS field of the header, and realises it is a CBT packet pertaining to group 12345, of which it is a current member. 5. Since the non-core router already has a shortest-path branch to the core tree, it terminates the join-request and composes a join-ACK. This is forwarded back over the interface the join was received, towards the originator of the join-request. The current non-core router sets another child interface flag in its routing table, for the new branch of the CBT tree. 6. Each router that receives the join-ACK sets the parent/child flags in their routing tables, indicating the CBT tree interfaces. It also sets the non-core flag in the flag table. CBT - Internet DRAFT, November 1992 18 7. When the originator receives the join-ACK, it marks the appropriate parent interface in its routing table, sets its non-core flag in its flag table, and forwards the join-ACK to the local host that requested membership. 8. From this point the host can send/receive multicast packets for the group identified by group-id. The host will also begin to answer group-membership queries as part of IGMP. CBT - Internet DRAFT, November 1992 19 o Example 2: This example demonstrates the sequence of events required for a newly-assigned major core router to become part of the CBT core-tree for the group 12345. The CBT tree for group 12345 is the same as one used in Example 1 above. We assume the initial exchange with directory service has been completed, and the major core has been supplied with either the address of one existing major core (in this case, B), or it has been supplied with the complete list of all current major cores for the group, and makes its own choice (in this case, B). The path between the newly-assigned major core and major core B is shown by the thick dotted line. The sequence of events is as follows: 1. The newly-assigned major core C now knows the address of major core B, to which it must forward a core-join request. It composes the request, and forwards it on the next hop towards B. The sequence of hops before encountering the CBT tree is shown by the thick dotted line. 2. The next-hop recieves the core-join request. It forwards it using normal unicast routing, since it does not recognise CBT packets unless it is a router on the CBT tree for that group. Similarly for the next two hops, since they are not part of the CBT tree (yet) either. 3. The core-join request hits the core-tree on its next-hop. All minor cores terminate core-join requests. The minor core caches the received core-join request and forwards a termination-notification to major core B. When the minor core receives a termination-notification-ACK it can then compose a core-join-ACK and forwards it on the next-hop to the new major core. 4. Once major core B has sent the termination-notification-ACK it marks the interface over which it was sent in its routing table as the interface for major child C. It can proceed to send ICMP echo request/replies over that interface, but first waits to receive an echo request before itself sending one. 5. The routers traversed by the core-join-ACK record over which interface it was received, set the appropriated parent flag in the routing table, set the minor core flag in the flag table, then set the child flags in the routing table after forwarding the join-ACK. 6. When the new major core receives the join-ACK, it records the interface over which it was received, together with its new major parent, in the parent/child table. It then proceeds to send an ICMP echo request on the interface leading to its new major parent. A new core-tree branch is now complete. CBT - Internet DRAFT, November 1992 20 5 The LAN Issue Multicast packets will nearly always originate from a host on some local area network (LAN). For LAN's with only one router attached, there are no apparent problems with receiving/forwarding CBT multicast packets onto the LAN on thier way to their destination. Today however, multi-access links are becoming more common on LAN's, i.e. LAN's now tend to have more than one router attached to them. Although all routers on a multi-access link will be exchanging EGP/IGP messages to establish which are the unicast shortest-cost paths to destinations, there is often more than one valid route to a destination, for example, for different types of service (ToS) or/and policy considerations for inter-domain routing. The latter points are of relevance to CBT; consider what would happen if a join-request from a particular major core from downstream (in relation to that core), arrived at a router on the multi-access LAN. To which router should the join-request be forwarded if there is more than one with a shortest-path to the core? If the router receiving the join-request were to select (or elect) a forwarding router on the LAN for that and all subsequent multicast packets, the problem is not solved, as we can see from the scenario that could then follow; consider another join-request for the same group arriving from downstream (in relation to the core tree) at a different router on the LAN (which is a perfectly valid scenario). To exacerbate the problems of our example, the join-request is for the same group, but addressed to a different major core. Further assume that the router at which the join-request arrived, selects/elects a different router on the LAN to the one previously chosen by the router that received the first join-request. The LAN topology could be as shown in the diagram over .... CBT - Internet DRAFT, November 1992 21 Figure 4: The Multi-Access LAN Problem This obviously cannot be allowed to happen since not only would it cause ``cycles'' in the tree (resulting in looping), but would also result in duplicate multicast packets arriving on the LAN from upstream, and duplicates being sent from the LAN upstream. The problem can be alleviated rather simply since we can take advantage of the way CBT uses unicast addressing to perform multicasting. This would certainly not be the case if CBT used class D addresses, since modern router interfaces are programmed to pick up all multicast packets. What we must do is imagine a multi-access LAN as one large node (router) on the CBT tree. Any node on the tree has only one interface leading to/from the core-tree, and zero, one or more interfaces leading away from the core-tree. The two problems described in our example are solved as follows: Problem 1: A router has received a join-request from downstream (in relation to the core tree) and must decide between two or more routers on the LAN which each have equal-cost paths to the addressed major core. The problem is solved by electing the router with the lowest address from those with equal-cost paths. Then, the join-request and all subsequent multicast packets pertaining to that group that arrive from downstream are encapsulated at the link level and addressed to the elected router. Problem 2: As we have already explained, it is perfectly valid for a second join-request for the same group to arrive at a different router on the LAN than the previous one, and be addressed to a different major core for the same group. If we imagine the LAN to be one large node (router) on the CBT tree, the solution should be obvious. As we explained in section 4.4, when a join-request arrives at a node that is already on the tree, whichever major core for the group the join-request is addressed to, the join is terminated at that node, and the path towards the core tree is via that node's parent interface for the tree. Thus, that node can ACK the join-request. More details on the solutions to the above problems are given in the following subsection. CBT - Internet DRAFT, November 1992 22 5.1 Further Details on the LAN Issue As we described in the preceding section, we must imagine a multi-access LAN as one large node (router) on the CBT tree, with one interface leading to the core from the LAN, and from the core to the LAN. There may be zero, one or more legitimate LAN interfaces (routers) leading away from the LAN for any particular group tree. We now describe the details of two scenarios with respect to multi-access LAN CBT multicasting: Scenario 1: There are no members for CBT group g on multi-access LAN l, and some host(s) on the LAN is/are sending to group g. In this case, there are no complications involved and the CBT multicast packet will be encapsulated as normal at the link level and addressed to some local router that has a shortest-path to the core. This router then forwards the packet on its way to the addressed major core. CBT - Internet DRAFT, November 1992 23 If the chosen router happens not to have the shortest-path to the addressed core, that router will nevertheless forward the packet, and then send an ICMP-Redirect message to the originating host so that subsequent packets will be addressed to the more appropriate router3. Scenario 2: This is the more complicated scenario where a join-request is received by a router on a multi-access LAN. As we have stressed, there must be only one LAN interface to/from the core. Thus, when a join-request arrives at a router on a multi-access LAN, that router must take immediate action to avoid the problems occurring that we outlined in the previous section. Exactly what happens is described below. ---------------------------- 3. The router to which packets are sent for forwarding maybe DOWN. In the absence of a router-discovery protocol, hosts may never dynamically discover that the packets they are sending to it are disappearing ``into a black hole''. CBT - Internet DRAFT, November 1992 24 On receipt of a join-request, router r performs the following: o elects a router on the LAN that has the shortest-path to the addressed major core. If there is a tie, the router with the lowest address is elected as designated core multicast router4. o Router r sends a special CBT LAN broadcast containing the following information: 1. group-id contained in the join-request 2. designated core multicast router IP address o The join-request is forwarded to the designated router. All routers on the LAN are expected to store this information, together with a timer (of say, three minutes) for reference purposes, in case another join-request is received before the previous join-request has been ACK'd. When the join-ACK arrives from upstream, it is forwarded onto the LAN by the designated router, and is eventually picked up by the sending LAN router, which may forward it which may or may not forward it downstream, depending on whether it received a join, or originated one. CBT LAN broadcast messages are sent by a single router on the LAN say, once every three minutes. They are stored by all LAN routers with a timer. Thus, the router responsible for broadcasting must continue to do so once within the timeout period, as long as it is a node on the CBT tree. Information in successive broadcasts overwrites previous information with respect to the group identified by group-id provided it is broadcast by the same source. If another router on the LAN recieves a join-request for the same group, it may not have received a previously broadcast message from some other router on the LAN w.r.t. the same group (however, this is considered highly unlikely given LAN media bandwidths and reliability). If this second router begins broadcasting the same or different information w.r.t. the router on the LAN with the shortest-path to the core, all other routers would receive it and recognise it as being generated by a different source than the first. In this (again unlikely) case, the two broadcasting routers will choose between themselves, which is going to elect the designated core multicast router and be responsible for LAN broadcasts. The diagram over shows a LAN with a designated router: ---------------------------- 4. The designated core multicast router will be responsible for operating the ``Host Membership Protocol'' (IGMP) CBT - Internet DRAFT, November 1992 25 Figure 5: The Solution to the Multi-Access LAN Problem If a quit-request is received between broadcasts, and there is more than one downstream LAN interface for the CBT tree, then multicasts received via those interfaces will continue to use the designated core multicast router until the entry expires for that group. This, and the case where the designated router fails, are now considered: Scenario 1: A downstream CBT tree router interface has received a quit-request. It will encapsulate the quit-request and address it to the designated router. Since the designated router is also responsible for running IGMP, it will know if there are any more CBT routers on the LAN pertaining to that group. If there are, the quit-request is not forwarded further. If there are not, then it forwards the quit-request upstream and removes itself from the tree. Note: If the quit-request has been sent by the router responsible for CBT LAN broadcasting, then assuming there are still members remaining on the LAN, it assumes responsibility for subsequent broadcasts. Scenario 2: The router responsible for CBT broadcasts is DOWN. This could potentially happen at any time. It is a fairly certain assumption that a dynamic routing protocol will be operating between the routers on the LAN which should indicate the status of the CBT LAN broadcasting router. If it is indeed DOWN, the designated router on the LAN assumes responsibility for that role. 5.2 Summary In summary then, once a join-request has been received by a router on a multi-access LAN, all other routers on that LAN are informed of the group-id and the designated router, which is considered the only LAN interface to the core tree. After the join-ACK has been received and forwarded, all routers on the LAN have the information necessary to allow the LAN to simulate a single node on the CBT tree, with one interface leading to/from the core, and (potentially) several interfaces leading away from the core. CBT - Internet DRAFT, November 1992 26 6 More on how a Router becomes a Major Core Any available IP router can be requested to become a major core at any time by directory service. It is assumed that directory service will have a full list of most recently available routers (subject to routing update period), each of which it can request to become a major core for a particular group(s) at any time. Directory service make such a request by means of a core-request. Network managers have the authority to configure their routers to reject core-request's if they so wish. When a core-request is successfully accepted (by means of an ACK as opposed to a N-ACK), directory service makes an entry in its group table. This contains a link list for every group present in the internetwork. As long as a group is in existence it will have at least one link (for the case of a single-core tree) containing the address of the major core. As more major core's become assigned to a group, their addresses are added to the corresponding link list. An entry in the group table of directory service would be as follows: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + Group-id || core1_addr | core2_addr | core3_addr | NULL + +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ On receipt of a successful ACK to a core-request, directory service will send a packet containing the address of an existing major core (or alternatively all the core addresses) that the new core should join. If this packet is empty, the new core assumes it is the core for a single-core tree and awaits join-requests that will be pending, since cores are only assigned as such when some end-system requested the presence of the group. As the group becomes larger and more widespread, directory service can assign further major cores for the group. The assignment of these is left at the discretion of directory service. Additional core assignment becomes necessary when potential members request the address of a major core for the group but are geographically distant from present members. Alternatively, a new major core may be assigned for reasons of robustness. Any newly-assigned core must send a core-join request to the major core determined by directory service (or alternatively determined by itself from a list supplied by directory service). This is described in the section 8. CBT - Internet DRAFT, November 1992 27 7 Core Assignment and Optimality It is the assumption of this draft that major cores will be optimally placed (or rather, near optimally placed) with respect to group members, whether groups be small or large. We have left this ``magical assignment'' to multicast directory service, but have not said how directory service should/could achieve near-optimal core placement. There is no doubt that such a specification is outside the context of a multicast routing protocol, but it is clearly a subject of future work. However, in our opinion, the heuristic required to achieve near optimal core placement is likely to be simpler rather than the result of complex mathematics or graph theory. We may well propose a solution before a CBT implementation is complete. Even with optimal core placement, it should be obvious that CBT does not always provide shortest paths between a multicast source and all destinations on the tree, as opposed to source-based multicast. There are some for whom this will be unacceptable, but we must focus on the problems at hand; firstly, it is highly probable, given optimal or near-optimal core placement, that the paths traversed by CBT multicast packets will be mostly shortest-paths, and so will compare favourably with source-based multicast trees. We have yet to carry out experimentation to prove this, but we don't expect an optimality margin any greater than 10-15%. Second, we have to think of the problems at hand. As far as multicast routing is concerned, we should first and foremost ensure scalability, as well as simplicity, robustness and information hiding. Thirdly (and rather more cynically), most routing algorithms/protocols are in one way or another sub-optimal, whether it be how they handle congestion, or react to topology changes. What, in the area of routing, is optimal in all cases! Finally, there is the interesting case of locating network services distributed throughout the internetwork. So called resource discovery is a topic of current research, but there is no doubt that multicast is an important aspect of it, with for example, one multicast group linking together all locations relevant to a particular service. Such an application would require permanent group trees to be maintained. Again, assuming reasonably optimal core placement between the members of such a group, a client need only send a request by means of a CBT multicast packet, and unicast it towards a core for that particular tree. The first available service receiving the request on the tree would reply by means of a unicast to the sender. Used in this way, CBT must surely offer one of the most attractive solutions to the resource discovery problem. 7.1 Dynamically Improving Optimality It may well be that when a join-request arrives at a router that is not already on the tree, that router's most optimal path to the addressed core is temporarily not available. In order to keep the join latency low, we decided that all routers must immediately forward CBT - Internet DRAFT, November 1992 28 join-requests, and in the above case, the join would be forwarded on the next-best path, which may be considerably less optimal. We have made a provision for the tree to be able to re-configure itself so that the paths traversed to the core tree are mostly optimal. The overhead required to do this is negligible and the packet loss resulting from the transition may well be zero. Consider the following, where router X (not yet on the tree) must forward an incoming join-request to next-hop Y on its way to major core A, since it's best next-hop, Z, is temporarily not available. Figure 6: CBT branches may change dynamically In the above case, the branch formed will be X, Y, A. However, the best path is X, Z, A. When Z becomes available again, X can send a second join-request to Z, which forwards it on towards A. A will then send a join-ACK back along the path A, Z, X. When X receives it, it can ``keep it on hold'' before making a new entry (parent interface in routing table) for that CBT tree. In an instant when all the of X's CBT interfaces are silent, it could then make the transition to shortest-path by simply replacing the its parent entry in it's routing table. After the transtion it sends a quit-request out on the old interface. CBT - Internet DRAFT, November 1992 29 8 Major Cores and Minor Cores: The core-join request A core-join request will always be forwarded by non-core routers. However, minor cores will always terminate a core-join request. There are 5 scenarios to consider when a core-join request is sent to an existing major core router: o Scenario 1: There is a point-to-point link between the sending major core and the recipient major core. In this case the core-join request arrives directly at the destination major core. The recipient core sends a core-join-ACK back to the originating core, and proceed to exchange Core Tree Maintenance Protocol messages. o Scenario 2: The core-join request encounters a router that is not any part of the CBT tree for the group identified by group-id in the core-join request. The receiving router will cache the request (which contains additional identifying information) and the address of the incoming interface, forward it using normal unicast routing to the next hop on its way to its destination (major core), then add the outgoing interface address to the cached information. When/if a matching core-join-ACK is received, the cached information is used to ensure that it arrived on the correct interface. If so, the appropriate flags are set in the flag table/routing table, and a core-join-ACK is forwarded on the interface as specified in the cached information. After a core-join-ACK has been forwarded, it becomes a minor core on the core-tree for that group. Minor cores are obliged to forward Core Tree Maintenance Protocol messages. If, for any reason, a check fails, the packet is dropped. Note: The interfaces over which a core-join-ACK is received/sent are recorded in the routing table by flags, one for each interface that corresponds to the core tree. It is necessary for a minor core to know its parent core-tree interface and child(ren) core-tree interface(s), and these are marked as such in the routing table. o Scenario 3: The core-join request encounters a non-core router on the CBT tree. Non-core routers are obliged to forward core-join requests to an upstream neighbour that is the next-hop on the CBT tree. Once a core-join request hits a non-core router, irrespective of the major core it is addressed to, it is forwarded on the parent interface of that non-core router towards. Thus, the core-join request may no longer be destined for the major core addressed in the IP header. The core-join request continues it's journey on a branch of the CBT tree for the appropriate group, and will eventually hit either a minor core or a major core. CBT - Internet DRAFT, November 1992 30 If a major core is encountered, whether it is the one addressed or not, it is ACK'd by it. The ACK traverses the same branch back to the source and all routers along the way change their status to minor core, except the final recipient router, which now becomes a new major core. In this way, a new branch of the core tree is formed. Design decision: In this way, a core-join request will only traverse existing branches of the tree and in doing so prevent core-join requests ``crossing over'' non-core routers, which could potentially lead to ``cycles'' in the tree. Note: The core-join ACK must contain an explicit field as to the true identity of the parent major core. The reason should be apparent from scenario 5. o Scenario 4: The first router encountered by a core-join request is a major core different from that in the destination field of the request. This is a somewhat unlikely scenario since it would indicate that directory service had made an error in its choice of new major core. Major cores are expected to be geographically isolated from each other, and so it is unlikely that directory service would assign a router to become a major core so near to an existing major core, for any one group. However, if the first router encountered were indeed a different major core than the one addressed in the core-join request, it will terminate it and ACK it. The source address of the ACK is the true identity of the parent. o Scenario 5: The core-join request encounters is a minor core for the group. The core-join request is terminated at this point. The recipient minor core next formulates a termination-notification packet which contains among other things, the address of the source (major core), and forwards it over the core tree interface which leads to the addressed core (as indicated in its routing table). By doing this, Core Tree Maintenance Protocol message exchanges may be more evenly distributed. The termination-notification packet serves to indicate to the addressed major core that it has a new major child with which to exchange Core Tree Maintenance Protocol messages. On receipt of the termination-notification-ACK, the minor core formulates a core-join-ACK and forwards it on the interface over which it received the core-join request. As already mentioned, any routers on or off the tree that the core-join-ACK traverses becomes a minor core for the group, so the necessary updates must be made in the flag table of each router. In this way, new branches of the core-tree are formed, and the core-tree branch interfaces are recorded by each minor core by setting the appropriate flags in the routing table. Note: The core-join-ACK formulated by the minor core must have a field containing the identity of the true major parent of the new child. By assuming the source address of the ACK would be clearly incorrect. CBT - Internet DRAFT, November 1992 31 9 Core Tree Maintenance Protocol An introduction to the Core Tree Maintenance Protocol is given below, but the actual detailed working of the protocol is made clear in the sections on the core-state-notification packet and core-list packet. ICMP messages form the basis of the Core Tree Maintenance Protocol. However, other types of message are generated as part core-tree maintenance, so we shall refer to these messages collectively as Core Tree Maintenance Protocol messages. Only major cores are ever the source/destination of Core Tree Maintenance Protocol messages. More precisely, only major parents and thier respective major child(ren) exchange Core Tree Maintenance Protocol messages. For any group, say group A, the Core Tree Maintenance Protocol messages must traverse only the core-tree for group A. Only in this way will core-tree partitions be noticable. Design decision: This is an example of why it is necessary to distinguish between minor cores and non-cores. The latter will never receive Core Tree Maintenance Protocol messages, and so need not know how to deal with them. Whenever a major core sends/receives a core-join request, it records the destination address present in the core-join-ACK (or the source address in the case of receiving a core-join-ACK) and the interface over which it forwards the ACK (or received it). This information is recorded by major cores in the major parent-child table. This not only gives information on who a major core's major parent/major child(ren) are, but also the interfaces over which they are reachable, since major parent/child(ren) will not be directly connected neighbours. The interfaces will obviously be the CBT core-tree interfaces. Note: If a major core receives a termination-notification packet from some minor core, and replies with a termination-notification-ACK, address and interface information is similarly recorded, as above, in the major parent-child table. The major parent-child table is illustrated below: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + Group-id || major-parent = i/f | major_child1 = i/f | major_child2 = i/f + ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ A core-list packet contains the addresses of the major cores for a CBT tree that are considered to be currently available. A core-list packet is only generated in response to the addition or removal of major cores and is the result the result of a core-state-notification packet having stabilized. A core-list packet is multicast by one major core. This process will be more fully explained in the sections 10 and 11. Non-core routers cache and forward core-list packets on all outgoing CBT tree interfaces, i.e. multicast. Non-core routers need take no action on receiving a core-list packet unless its parent is a major CBT - Internet DRAFT, November 1992 32 core and has become unavailable. The parent may have become unavailable since the generation of the core-list, or the core-list was generated as a result of that major core being removed. In either case, it is the responsibility of the non-core router to send a new join-request to the ``nearest'' major core listed in the core-list packet. If a major core, to which a non-core router is attached becomes unreachable, it is the responsibility of that non-core router to send a new join-request towards another major core selected from its current core-list packet. In this way, only the non-core router in the immediate vacinity of a CBT partition need take action to ensure that the branch becomes re-attached to the tree. Thus, the partition remains transparent to other non-cores not in the direct vacinity of the partition. Minor cores ignore core-list packets. Minor cores play no direct part in repairing partitions of a core-tree, but do serve to help prune the core-tree after a failure has occured. Major cores cache the most recently received core-list packet. A major core must use its cached core-list if it establishes that it's parent is DOWN, in order to restore core tree connectivity. When reachability is lost between major child and major parent because of an intermediate link failure, it is the responsibility of the major child only, to re-establish a (virtual) link with its major parent. The core-list is not required for this. Design decision: Indeed, if both major cores did try to re-establish connectivity with each other, each would become the parent (and child) of the other. Major-parent and major-child(ren) constantly test each others ``liveness'' by exchanging ICMP echo request/replies. If a reply is not received after a certain timeout period, one of two kinds of failure is assumed: o Scenario 1: Major core failure. o Scenario 2: Minor core failure (or some link between parent and child major cores is down. Scenario 1: Major core failure - A major parent (or child) can establish to a high degree of probability the reason for not receiving a response to a ping message. It can assume failure after trying to reach the parent/child via all other possible routes. Again, if no reply is received within a small timeout period, the parent/child is assumed DOWN. Only the major child(ren) of the failed major core is responsible for re-establishing full connectivity of the core-tree. Each child does so by making core-join request to another major core as described in the section 11 on the core-list packet. After a major core has received the core-join-ACK, its new parent-core generates a core-state-notification packet which is disemminated on all core-tree interfaces for that group. When the core-state-notification packet stabilizes, i.e. the packet sent CBT - Internet DRAFT, November 1992 33 out on a core-tree interface is the same as that received, the contents of it are copied to a core-list packet and multicast on the CBT tree. Only the major core that originated the core-state-notification packet generates and multicasts the core-list packet. Scenario 2: Minor core (or link) failure - if, on the other hand, an ICMP reply is received via a route not on the core tree, then the core tree link is assumed broken. As in the previous case, it is the responsiblility of the child to re-establish core tree connectivity. The underlying unicast routing algorithm will provide a new route to the parent major core. The new route may or may not include hops that were part of the old core-tree. What follows is described in the next section on the quit-request. CBT - Internet DRAFT, November 1992 34 9.1 The quit-request If it has been decided that a core tree link is broken, the major child sends a new core-join request towards its major parent. If the first next-hop of the new route is different from the old, it must also send a quit-request to the old next-hop which formed part of the original core-tree. Quit-requests can have a cascading effect in that whenever any router on the CBT tree receives one from a child router, it too sends one to its parent iff it has no further children and no member hosts on attached subnets. The exception to this is when a quit-request arrives at a non-core router from a downstream non-core router, and the directly connected parent is either a minor or major core. In both cases the quit-request is sent to the parent which removes the child flags as appropriate from the routing table, but no further quit-requests are generated from this point. A quit-request is only permitted in the direction child to parent, and never in the reverse direction. Furthermore, quit-requests are always terminated at a major core. CBT - Internet DRAFT, November 1992 35 10 core-state-notification (CSN) packet A core-state-notification packet is generated only by major cores as the result of a major core having accepted a new core-join request, or by a major core as the result of that core sending a quit-request. The CSN packet contains a unique identifier as well as the group-id and originator's address, both of which remain constant until the CSN packet stabilizes. Once a new major core has been ACK'd by its major parent (either explicitly or by having received a core-join-ACK from a minor core in response to the minor core having received a termination-notification-ACK packet), the parent generates a core-state-notification packet containing a list of its major core children. Alternatively, a major core has sent one of it's children a quit-request. In this case too, the sending core generates a CSN packet. The CSN packet is distributed on each core-tree interface for the group. Exactly how the CSN packet works is described in the next section. 10.1 How the CSN Packet Works In a multi-core CBT tree, all major cores have at least one major core (virtual) neighbour. When the state of the core tree changes as the result of a new core addition or core removal, a mechanism is needed to inform the rest of the CBT tree of the new state. This mechanism is the core-state notification packet exchange which results in a core-list packet being multicast on the tree. It works as follows; whenever a major core accepts a new child, or sends a quit-request to remove one, it must take action on reporting it. The parent to the new addition/removal generates a core-state notification packet reporting itself and its children that are UP. This CSN packet is broadcast on all core tree interfaces, but not on other CBT tree interfaces. The aim of the CSN packet is for it to arrive in a state where each child is terminated by a ``NULL'' - this ``NULL'' entry can only be made by a child if it has itself no children. If a major core has no children, then a ``NULL'' follows it's own entry. When a major core cannot append information to a broadcast CSN packet is it forwarded on outgoing interfaces only if it is as large or larger than the CSN packet the node currently knows about. If it is not the CSN packet arriving is dropped. In this way, a slow responder cannot flood the the core tree with out-of-date packets. If however, a broadcast arrives which is as big or bigger than the one it currently knows about AND can modify it, the resulting CSN packet is broadcast on all outgoing core tree interfaces. The CSN packet is considered to have stabilized when all child entries in it are suffixed with ``NULL'', and when childless parent entries are also suffixed with ``NULL''. A stabilized CSN packet is copied to a core-list packet and multicast on the CBT tree by the originator of the first CSN packet. A diagram is given in section 11. CBT - Internet DRAFT, November 1992 36 A diagram may have been appropriate here, but because of the potential for each major core to be slow/faster than the others, and as broadcasts can arrive in different orderings because of faster/slower nodes, we thought a diagram may have been misleading, so we hope the explanation suffices. CBT - Internet DRAFT, November 1992 37 11 core-list packet As far as major cores are concerned, the core-list packet is crucial to core-tree maintenance, allowing major cores, where appropriate, to re-establish core-tree connectivity after it has failed. The structure of the core-list packet is simple, but is catalytic in repairing core-tree connectivity. As we do not envisage the number of major cores to be more than a handful for very large groups, the core-list is uncomplicated and remains even in the worst case, quite small. It consists of a concatenation of major cores and their major core children in the order they joined the parent.For example, take the following (large for demonstration purposes) core-tree: Figure 7: CBT Core Tree The numbers only serve to illustrate the example and indicate the order in which major cores were assigned to the group. The core-list packet for this group will be as shown over: CBT - Internet DRAFT, November 1992 38 Figure 8: CBT core-list packet If a major core fails and it has no major core children, the failed major core is a leaf on the CBT core-tree. In this case, no action is necessary since core-tree connectivity is maintained. If there are non-core router children hanging off it, it will be the responsibility of the non-core whose parent has failed, to make a join-request elsewhere. Assuming that the failed major core has at least one major child, it may be necessary for the child/children to restore core-tree connectivity. The first major child that joins a major core is tagged to indicate it is the oldest child for that core. While there is only one child, it is also the youngest, so it is also tagged as such. The oldest child will obviously retain the oldest tag (as long as it is alive), but as new major children join, the youngest tag changes. Returning to the example of the failed major core. Its youngest major child must send a core-join request to the next-oldest, which then does the same. This continues until the oldest child has been reached. The major core this child sends a core-join request to depends on whether the failed parent was the child of any other major core. There are two scenarios: o If the failed major core has a parent, the youngest child must join that parent. o If the failed major core has no parent, then the youngest child of the failed core need not make a core-join request to any other major core, since original core-tree connectivity will have been re-established simply by having linked the children together, if indeed, there were any present. Design decision: When a core-tree partition occurs, simply having a major core in the vacinity of the partition send a new core-join request to any of the other major cores listed in the core-list packet would NOT ensure full core-tree connectivity, since a request could potentially be made to a major core on the same side of the partition. The method described above ensures that full core-tree connectivity is restored after a partition. CBT - Internet DRAFT, November 1992 39 Let us take the previous example again to show this: Figure 9: CBT Core Tree When 5 fails: From the core-list, major core 5 has 3 major core children (6, 7, and 8). Sequence of events: o 8 joins 7, 7 joins 6. o 6 scans the core-list to establish whether 5 is the child of any other major core. o There is a hit. 3 is the parent to 5. o 6 joins 3 to re-establish core-tree connectivity. The new core-tree is shown over: CBT - Internet DRAFT, November 1992 40 Figure 10: Repaired CBT Core Tree Another example is given over .... CBT - Internet DRAFT, November 1992 41 Another example. Assume major core 1 fails. Sequence of events: o 4 joins 3, 3 joins 2. o 2 scans the core-list to establish whether 1 is the parent of any other major core. o There is no hit. Thus, no action is taken, and core-tree connectivity is re-established. The resulting core-tree is as shown over: Figure 11: Repaired CBT Core Tree CBT - Internet DRAFT, November 1992 42 12 Acknowledgements There are several people whom we would especially like to thank for their valuable contributions to CBT. Firstly, Ian Wakeman (UCL) whose suggestion it was to separate the group identifier from the IP address and place it in the options field of the IP header. We considered this an important design issue offering several advantages over the original design. Secondly, Joel Halpern (Network Systems Corporation) and Steve Deering (Xerox PARC). Joel identified many of the problems encountered when CBT multicast packets traverse or originate on a multi-access LAN. Both he and Steve offered valuable insights into the LAN multicast problems and made many comments and suggestions as to how these problems could be solved. Finally, we would like to thank all those on the IDMR mailing-list who contributed to, or/and made comments on, CBT, however small these may have been. We consider these, and all ongoing comments and suggestions important in devoloping a standard inter-domain multicast routing protocol that will cater for the requirements of the internet community for the forseeable future and beyond. We encourage continued participation via IDMR on CBT and all other inter-domain multicast routing issues. CBT Expires April 6th, 1993. CBT - Internet DRAFT, November 1992 43 A Authors' Addresses Anthony J. Ballardie, Department of Computer Science, University College London, Gower St., London WC1E 6BT, ENGLAND. e-mail: A.Ballardie@uk.ac.ucl.cs Office Tel: ++44 (0)71 387 7050 ext. 3701 Paul F. Tsuchiya, Bellcore, 445 South St. 2L-281, Morristown, New Jersey 07960, U.S.A. e-mail: tsuchiya@thumper.bellcore.com Office Tel: 1-201-829-4484 Jon Crowcroft, Department of Computer Science, University College London, Gower St., London WC1E 6BT, ENGLAND. e-mail: J.Crowcroft@uk.ac.ucl.cs Office Tel: ++44 (0)71 380 7296 REFERENCES [1] S. E. Deering. Multicast Routing in a Datagram Internetwork. PhD thesis, Stanford University, California, U.S.A., 1991.