Internet Engineering Task Force Jon Crowcroft INTERNET DRAFT UCL draft-crowcroft-apex-multicast-00.txt Ken Carlberg February 2001 UCL MAPEX Application level multicast architectural requirements for APEX Status of this memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This is a memo that presents an approach to do application layer multicast. Our intention is to present a design that produces a tree distribution structure within the existing design structure of APEX and BEEP. It is not our intention to present a design that replaces multicast at the network layer. Where the need exists (and in scenarios where it would seem symbiotic), we would find it constructive to integrate the two technologies and perspective (i.e., application later and network layer multicast). Crowcroft Expires August 2001 [Page 1] Internet Draft MAPEX February 2001 Assumptions We are using APEX on top of BEEP on top of TCP, as the basic "hop". The general initial service requirement is for a one-to-many eventually leading to a higher level many-to-many service. The many- to-many service, if needed, may be constructed by configuring/associating a set of one-to-many APEX. Considerations must include: Group identifiers and management (allocation/revokation/distribution) Group dynamics long or short lived (we think long lived) high or low rate change of membership (we think potentially high rate, but aggregatable within domains and thus low rate with respect to APEX) Topology likely "hop" count hierarchy of hops into clusters? managed or self organised clusters? sparse or dense tree depth/breadth metrics (delay, throughput, fanout) Extensibility/Flexibility Reprogrammable Transparency Aware of lower level topology? (we think no) Aware of lower level IP multicast capability? (not usually -- or at least not dependant on it) Aware of lower level considerations like AS boundaries? not yet - could possibly be brought in indirectly Aware of lower level considerations like NAT (not directly) Introduction APEX [1] provides a very high level abstraction for a "datagram" service. The purpose of this document is to capture the architectural requirements for extending this service to provide "multicast". Since this is effectively an application level service, we make no assumpotions at all about the network layer (i.e. whether Crowcroft Expires August 2001 [Page 2] Internet Draft MAPEX February 2001 or not IP multicast is available). This is separated from any consideration here by at least 2 levels -- BEEP and the transport layer [2]. If IP multicast was present, then BEEP might use a one- to-many reliable multicast transport, such as PGM [4], in place of TCP for "local" communication between neighbouring APEX services [3] instead of a set of TCP connections. This is for later exploration. We assume that initially we are building up from one-to-one, to one- to-many, and eventually to many-to-many -- the latter two representing a multicast type of distribution model. There are various goals for a multicast topology construction and maintenance protocol. Scaling servers Simplification of Application implementation. Reduction in "network/link" traffic The order here is significant. Since we are operating way about the network level, we do not really have (much) knowledge of the network topology or link performance, so the main goal here is to provide a "natural" interface for group application programmers, and to provide something to allow massive scalability of group communication by effectively combining this with a load balancing scheme. e.g. a group of a million APEX end points cannot be serviced simultaneously by 1 APEX end point, but 1000 APEX relays can distribute messages easily to many more if organised correctly. Group Identification and its management communication usually requires a group name which has, usually in IP level multicast systems (except in Express/SSM [5]), been devoid of topological significance. We have no need to retain that idea. In our model, we will explicitly attach a group name to a domain - this makes the allocation task (basically collision avoidance) simple. An APEX end point entity is enhanced to add a group name allocation facility - group@domain (to be discussed further). This is then distributed over a "well known" apex session channel. To all APEX relays, end points indicate their interest in the channel by "joining" the channel by sending a message to the multicast service in the endpoint's administrative domain; e.g., apex=multicast@example.com, which would act as the entity acting as the nearest relay. Note that the apex core requires that apex services be co-resident with the relays. Topology Construction Crowcroft Expires August 2001 [Page 3] Internet Draft MAPEX February 2001 Normally, to date, IP multicast has been built by reversing paths taken from the Unicast routing table. DVMRP [6] uses its own distance vector (RIP basically:-) protocol to build the forward table and then uses a flood and prune, data driven approach to building the tree. PIM DM [7] is similar, except that it relies on the underlying unicast routing, which we don't have in APEX. But in any case, neither scale on an interdomain basis. MOSPF [8] adds membership reports to link state messages - this does not look to silly here given the likely size of the APEX relay network. We'll look at this in more detail in a moment. CBT [9] and PIM-bidir [10] are aimed at many-to-many applications with a low average delay per given source, but both rely on managers knowing where to place the core or RP. This is mainly concerned with link traffic optimisation aspect of ip multicast, and is of low importance in APEX. PIM SSM provides a source based channel and looks like a very nice approach for APEX, except that it requires the underlying SPF calculation that was done for unicast, to provide the shortest path from receiver to sender as the way to graft a path from the existing tree rooted at a sender, downstream towards the receiver. In other words, with respect to APEX, it is using an upstream perspective to instantiate a downstream tree. Further, it is bound to the (typical) single metric unicast routing available at the network layer. Since APEX is application level oriented, its routing perspective has the potential to be more versatile because of the smaller scale of nodes participating. So what we need to do here is look at: a) how we learn the topology of APEX servers (an OSPF like "link- state" protocol may be appropriate, and one that combines group identifier distribution with state flooding might be quite neat). b) what metrics we use then to buid a "shortest" path There are three pieces to this, and we probably want to keep this programmable (i.e. extensible) but provide some default behaviour. Neighbourhoods An APEX relay knows about some other relays - lets call this a neighbourhood. We can "flood" neighbourhood information to all APEX servers. This gives everyone an APEX Routing Information Base. Top Down External configuation gives local and neighbourhood, and global scope constraints - e.g. we need preferences (a la MX) for delay, throughput and fanout - APEX relay fanout is locally controlled. Crowcroft Expires August 2001 [Page 4] Internet Draft MAPEX February 2001 Delay is an end user preferences -- throughput is measured by a collection of APEX servers and is therefore somewhere between. Fanout, and promoting a high degree towards stub domains or APEX end points, would be considered a network (or source APEX end point) preference with respect to attempting to minimize overall state in the network. Different APEX users may desire a path to a group that minimises delay AND maximiss throughput - this is NP hard, but there are approximations. For now, we are thinking in terms of a single APEX- relay-hop metric for this part of the scheme. This can be built on later by discovery means. Bottom Up YAM like [11,12] -- an APEX end user receivers could graft by doing a YAM like one-to-many join to a list of APEX relays (potentially in more than one neighbourhood) - each APEX relay looks at the join parameters and decides to respond/bind depending on the match, plus its own (e.g. fanout) constraints. We would consider this approach an optimization effort by downstream endpoints subsequent to an initial top down construction of the tree. Regardless of whether a YAM like algorithm is used, our intention would be to segment optimization of the tree in order to allow different algorithms to be developed, as well as to retain the simplicity of the initial top down construction The implication of such a design is that we don't have to flood several metrics, and the corresponding changes in their condition. We only flood the relatively stable one of hop count for the top down construction, and then 'discover' on-demand the condition of other metrics (if so desired). Summary, Conclusions, and Comments We need some APEX topology and group management protocol elements 1) join/leave messages, with metrics 2) group distribution/revokation 3) "link" state advertisement/flood - top down on hops, bottom up on other metrics We need some tree building code: basically Dijkstra (or incremental Dijkstra if you prefer ), plus RPF code. Our preference is to retain a measure of simplicity in the initial tree construction that allows us to take advantage of the alternative paths available from a link- state algorithm. Crowcroft Expires August 2001 [Page 5] Internet Draft MAPEX February 2001 We may want to allow for native ip multicast in stub domains. The natural question that arises is: what is the protection against loops between native multicast at the network level and application level multicast. This may have additional implications with respect to scoping -- possibly setting TTL scopes for intra-domain distribution, and yet assigning a set of multicast addresses for inter-domain APEX distribution. Another issue that probably needs to be addressed is the issue of NATs. It may be a case of using NAT boxes as APEX gateways between native (TTL scoped) multicast at a source/destination domain, and TCP unicast distribution of APEX. Questions to Consider: Is MAPEX aware of lower level topology? Right now, we'd say no. just aware of the server topology Is Mapex aware of lower level considerations like AS boundaries? At the moment, No. we'd say that APEX is naturally "aware" of the domain boundaries, but not the more abstract (or aggregated) level of AS boundaries. Other than a combination of link-state advertisements and YAM, are there other application level-like approaches to consider? Yes, Yallcast [15] and Application Level Active Networking (ALAN) [16,17]. Acknowledgements We would like to thanks Marshall Rose for some clarifications of an earlier draft. References [1] M. Rose, D. Crocker, G. Klyne, "The APEX Presence Service", Work in Progress, Internet-Draft, draft-mrose-apex-core-02.txt, 2/20/2001 [2] M. Rose, "The Blocks eXtensible eXchange Protocol Framework", Crowcroft Expires August 2001 [Page 6] Internet Draft MAPEX February 2001 Work in Progress, Internet-draft, draft-mrose-bxxp-framework-01. txt, 6/16/2000 [3] M. Rose, "Mapping the BXXP Framework onto TCP", Work in Progress, Internet Draft, draft-mrose-bxxp-tcpmapping-01.txt, 7/13/2000 [4] T. Speakman, et al, "PGM Reliable Transport Protocol", Work in Progress, Internet-Draft, draft-speakman-pgm-spec-06.txt, 2/13/2001 [5] H. Holbrook, B. Cain, "Source-Specific Multicast", Work in Progress, Internet-Draft, draft-holbrook-ssm-arch-01.txt, 11/24/2001 [6] T. Pusateri, "Distance Vector Multicast Routing Protocol", Work in Progress, Internet-Draft, draft-ietf-idmr-dvmrp-v3-10.txt, 9/2000 [7] Deering, et. al, "Protocol Independent Multicast Version 2 Dense Mode Specification, Work In Progress, Internet Draft, draft-ietf-pim-v2-dm-01.txt, November 1998 [8] J. Moy, "Multicast Extensions to OSPF", Request For Comments 1584, IETF, March, 1994 [9] A. Ballardie, "Core Based Trees (CBT) Multicast Routing Architecture", Request For Comments 2201, IETF, Sep. 1997 [10] M. Handley, I. Kouvelas, L. Vicisano, "Bi-directional Protocol Independent Multicast (BIDIR-PIM)", Work in Progress, Internet- Draft, draft-ietf-pim-bidir-01.txt, 11/23/2000 [11] K. Carlberg, J. Crowcroft, "Building Shared Trees Using a One-to-Many Joining Mechanism", ACM Computer Communications Review, Jan. 1997 [12] M. Faloutsos, et. al., "QoSMIC: Quality of Service sensitive Multicast Internet Protocol (QoSMIC), Proceedings of ACM SIGCOMM'98, Sept. 1998. [13] D. Frigioni, et. al., "Fully Dynamic Algorithms for Maintaining Shortest Paths Trees", Journal of Algorithms, vol. 34, (2000), pp. 351-381. [14] S. Cicerone, et. al.,, "A Fully Dynamic Algorithm for Distributed Shortest Paths", Proceedings of the Latin American Theoretical INformatics (LATIN2000). Lecture Notes in Computer Science, 1776, pp. 247-256. Crowcroft Expires August 2001 [Page 7] Internet Draft MAPEX February 2001 [15] Yallcast, Presentation by P. Francis at on Reliable Multicast workshop, May, 1999 [16] Application level multicast CMU:-Narada/RDP, http://www.cs. cmu.edu/~hzhang/multicast/other/endsystem-index.html (see ref to degree bounded k-spanner, constraint based tree formation) [17] M. Fry, A. Ghosh, "Application Level Active Networking", Fourth International Workshop on High Performance Protocol Architectures (HIPPARCH '98), June 1998. Other reference to consider... Application level multicast CMU:-Narada/RDP, http://www.cs.cmu. edu/~hzhang/multicast/other/endsystem-index.html (see esp. ref to degree bounded k-spanner, constraint based tree formation!) Crowcroft Expires August 2001 [Page 8]