zerouter BoF                                                  R. Perlman
Internet-Draft                                          Sun Microsystems
Expires: December 12, 2003                                   A. Williams
                                                                Motorola
                                                           June 13, 2003


                      Design for a Routing Bridge
                 draft-perlman-zerouter-rbridge-00.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on December 12, 2003.

Copyright Notice

   Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

   This design provides the ability to have an entire campus, with
   multiple physical links, look to IP like a single subnet. This
   capability is often provided today with bridges. Bridges have the
   advantage of being plug-and-play. However, they have disadvantages:
   routing is confined to a spanning tree, the header on which the
   spanning tree forwards has no hop count, spanning tree forwarding in
   the presence of loops spawns exponential copies of packets, nodes can
   have only a single point of attachment, and the spanning tree, in
   order to avoid temporary loops, is slow to start forwarding on new
   ports. The design in this paper avoids those disadvantages of
   bridges.

   The basic design is layer 3-independent, and is a design for bridging
   with a shortest-path routing algorithm (instead of spanning tree
   paths), and with more robust forwarding. Then the design is extended
   to provide IP-specific optimizations.

Table of Contents

   1.   Introduction
   2.   Details of the Rbridge Scheme
   2.1  Rbridge Addresses, parameters, and constants
   2.2  The routing algorithm
   2.3  The envelope
   2.4  Link Cache
   2.5  The Spanning Tree
   2.6  Data Packet handling
   3.   Optimization for IP
   3.1  Introduction
   3.2  IP Data Packet handling
   3.3  Handling ARPs
   3.4  Keeping the link cache up-to-date
   4.   Alternatives
   5.   Conclusions
   6.   Security Considerations
   7.   IANA Considerations
   8.   Intellectual Property Notice
        Normative References
        Informative References
        Authors' Addresses
        Full Copyright Statement

1. Introduction

   Bridges can transparently glue many physical links into what appears
   to IP to be a single LAN.
   However, routing via the spanning tree concentrates traffic onto
   selected links and is slow to bring new connectivity on-line, because
   temporary loops are a disaster (with no hop count in the header and
   exponential proliferation of packets during loops). In addition,
   nodes must have a distinct layer 2 address for each point of
   attachment.

   There have been proposals for having routers within a campus
   automatically number links with distinct IP subnet numbers. Although
   this makes a campus plug-and-play, it requires a large number of IP
   subnet numbers, a node must change its address if it moves to a
   different link, and addresses of nodes might fluctuate as the
   topology changes and links must be renumbered.

   The first concept is to use routing for bridging (where "bridging"
   means forwarding according to the layer 2 header, and making no
   assumption about endnode behavior beyond layer 2).

   Let us refer to the region that should appear to be a single LAN (and
   a single IP subnet) as "the campus". We'll call the devices that
   will implement what is in this paper "Rbridges" (routing bridges).

   It is possible, within a campus, to mix bridges with Rbridges. A
   true router terminates the campus (i.e., is on the boundary). A
   bridge is internal to what Rbridges see as a link. Two Rbridges are
   neighbors if they are connected via a bridged LAN. Rbridges, like
   routers, terminate a LAN and do not participate in the bridge
   spanning tree or bridge forwarding.

   The basic idea behind this proposal is that the Rbridges within the
   campus run a link state routing protocol such as IS-IS among
   themselves, so that they at all times compute an optimal path from
   themselves to each other Rbridge. Rbridges also compute a spanning
   tree among themselves, on which packets for unknown destinations will
   be forwarded. This is a different spanning tree, and differently
   computed, from the spanning tree that bridges compute.
   Rbridges will not forward regular bridge spanning tree messages, or
   participate in any other way as a bridge. Instead, like routers,
   Rbridges terminate LANs. But unlike routers, Rbridges will glue many
   links together into what would appear to layer 3 routing to be a
   single subnet.

   When data packets are travelling between Rbridges within the campus,
   they will be encapsulated in an additional header, which will specify
   the destination Rbridge and a hop count. This header will be
   inserted by the source Rbridge, and removed by the destination
   Rbridge. We call this extra header the "envelope".

   Since there might be bridges on the path between two Rbridges, there
   must be an additional layer 2 header on top of the envelope. This
   layer 2 header will contain the transmitting and next hop Rbridge
   addresses (or when the packet is intended for all Rbridges on a LAN,
   a multicast address), and a new Ethertype that indicates that inside
   is an Rbridge envelope. We'll call that Ethertype "Rbtype".

   Note that the extra layer 2 header is inserted and deleted on an
   Rbridge-hop basis, so there is no possibility of confusing bridge
   learning. If the original source MAC address were seen by bridges in
   an outer layer 2 header, bridge learning would be confused, since
   this scheme allows packets to be routed along non-spanning tree
   paths.

   IS-IS selects, for each link, a "Designated Router" (DR). This
   election must be per-port, so if a router R has two ports onto the
   same bridged LAN, at most one of the ports will be elected DR. An
   alternate way of looking at it is that R must notice, because of the
   Designated Router election, that two of its ports, pa and pb, are on
   the same link, and R must never forward packets between ports pa and
   pb. Also, since pa and pb are equivalent, the link cache should
   combine the learning seen on pa and pb.
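   The encapsulation described above (an outer layer 2 header, then the
   envelope, then the original frame) can be sketched as follows. This
   is only an illustrative sketch, not part of the design: the numeric
   RBTYPE value is a placeholder, since "Rbtype" has not been assigned;
   network byte order for the hop count is an assumption; the function
   names are hypothetical. The envelope layout (6-byte destination
   Rbridge, 2-byte hop count) and the default hop count of 20 are taken
   from Sections 2.1 and 2.3.

```python
import struct

RBTYPE = 0x88B5             # placeholder only; "Rbtype" is not yet assigned
DEFAULT_HOP_COUNT = 20      # recommended default from Section 2.1
UNKNOWN_RBRIDGE = bytes(6)  # dest Rbridge "0": travel via the spanning tree


def encapsulate(naked_frame, dest_rbridge, src_mac, next_hop_mac,
                hop_count=DEFAULT_HOP_COUNT):
    """Wrap a naked frame: outer layer 2 header, envelope, original frame."""
    # Outer Ethernet header: destination MAC, source MAC, Ethertype.
    outer = next_hop_mac + src_mac + struct.pack("!H", RBTYPE)
    # Envelope: 6-byte destination Rbridge ID followed by 2-byte hop count
    # (network byte order is assumed here; the draft does not specify it).
    envelope = dest_rbridge + struct.pack("!H", hop_count)
    return outer + envelope + naked_frame


def decapsulate(packet):
    """Return (dest_rbridge, hop_count, naked_frame), or None if not Rbtype."""
    (ethertype,) = struct.unpack("!H", packet[12:14])
    if ethertype != RBTYPE:
        return None
    dest_rbridge = packet[14:20]
    (hop_count,) = struct.unpack("!H", packet[20:22])
    return dest_rbridge, hop_count, packet[22:]
```

   Because the outer header is rebuilt at every Rbridge hop, a
   forwarding Rbridge would call decapsulate(), decrement the hop count,
   and call encapsulate() again with its own MAC address as source.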
   Only the DR on a link is allowed to learn the membership of the local
   link based on observing "naked packets" (packets without the extra
   envelope), and only the DR is allowed to delete an envelope and
   forward the resulting naked packet onto the local link.

   A DR, R, learns the endnode membership on its local link, and
   includes in its link state information a list of the MAC addresses
   for which packets should be sent to R. This enables other Rbridges
   to know what destination Rbridge to address a packet to, for a given
   MAC address.

   The DR maintains a "link cache" of (link, node address) pairs for
   endnodes on links for which that Rbridge is DR. For an Rbridge R1
   that is distant from destination D, it is only relevant that packets
   for D must be sent to R2. However, R2 must know which of its links D
   resides on.

   If the DR R sees a packet, without an envelope, it looks at the
   destination address D.

   a) If R's endnode cache indicates (D,R) (i.e., R itself owns D), then
      R forwards the packet as specified in the link cache (possibly not
      forwarding it at all if D resides on the link from which the
      packet was received). R adds no envelope in this case since the
      packet is going directly from the source link to the destination
      link.

   b) Else, if R's endnode cache indicates (D, R1), then R attaches an
      envelope to the packet indicating R1 as destination Rbridge, and
      forwards the packet towards R1. In addition to the envelope, R
      must attach an additional layer 2 header, putting its own MAC
      address on that link as source address, the MAC address of the
      neighbor Rbridge to which the packet is being forwarded as the
      destination, and the Ethertype "Rbtype".

   c) Else (destination is unknown), R attaches an envelope to the
      packet indicating the special value "0" as destination Rbridge.
      This indicates that the packet should be sent through the spanning
      tree.
      Each Rbridge forwards such a packet along the spanning tree, and
      additionally, if the Rbridge is a DR, it removes the envelope and
      forwards the packet onto each link for which that Rbridge is DR.
      The additional layer 2 header will contain the source address R,
      the Ethertype Rbtype, and as destination the (to be assigned)
      layer 2 multicast address "All-Rbs".

2. Details of the Rbridge Scheme

2.1 Rbridge Addresses, parameters, and constants

   Each Rbridge needs a unique ID within the campus. The simplest such
   address is a unique 6-byte ID, since such an ID is easily obtainable
   as any of the EUI-48's owned by that Rbridge. IS-IS already requires
   each router to have such an address.

   One parameter is the value to which the hop count in the envelope is
   initially set. Recommended default=20.

   An Ethertype must be assigned as "Rbtype".

   A layer 2 multicast address must be assigned for All-Rbs.

2.2 The routing algorithm

   IS-IS, without modifications, will compute paths between all routers
   within the campus, using EUI-48's as the unique IDs. In addition, a
   TLV value needs to be added for reporting MAC addresses of local
   endnodes.

2.3 The envelope

   The information in the envelope is:

      +--------------+-----------+
      | dest Rbridge | hop count |
      |  (6 bytes)   | (2 bytes) |
      +--------------+-----------+

   The value "0" for "dest Rbridge" indicates the destination Rbridge is
   unknown, and the packet should travel via the spanning tree.

   If dest Rbridge=0, then next Rbridge is also 0. If dest Rbridge is
   not 0 (it is a specific Rbridge), then "next Rbridge" indicates the
   neighbor Rbridge to which the packet is being forwarded.

2.4 Link Cache

   The link cache is kept by a DR, and is populated based on observing
   packets without envelopes. It consists of the mapping between S and
   the specific link from which a packet from endnode S was received.
   These caches are refreshed based on seeing data, and entries are
   timed out and deleted if some time has gone by without seeing data
   from that endnode.

2.5 The Spanning Tree

   Packets for unknown destinations, or packets for link level
   multicast/broadcasts (such as ARP packets) are sent through the
   spanning tree, with an envelope indicating destination Rbridge=0.

   There is no need to run an additional protocol for computing the
   spanning tree. Instead, the link state database is used. The
   Rbridge R with the lowest EUI-48 is chosen as the root of the
   spanning tree, and shortest paths from R are computed through the
   normal IS-IS Dijkstra algorithm. Links on that shortest path tree
   are in the spanning tree. It is vital that all Rbridges calculate
   the same spanning tree. Therefore there must be a well-defined
   tie-breaker in the case of equal cost paths. The tie-breaker is
   that, when attaching Rbridge R3 to the tree, if R3 has equally
   minimal cost paths using parent R1 or R2, the parent with the smaller
   ID (say, R1) is chosen. If there are multiple links between R3 and
   R1, the choice among them matters only to R3 and R1. Such parallel
   links can actually be considered to be part of a fatter pipe, and
   packets can be load split across those links, or any of those links
   can be chosen.
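   The spanning tree calculation described in this section can be
   sketched as follows. This is a minimal sketch under stated
   assumptions, not a definitive implementation: the link state database
   is modelled as a dictionary of per-Rbridge neighbor costs, Rbridge
   IDs are compared as plain integers, link costs are positive, and the
   function name spanning_tree is hypothetical. Parallel links between
   the same pair of Rbridges are assumed to have been collapsed into one
   entry already.

```python
import heapq


def spanning_tree(links):
    """Compute the Rbridge spanning tree from a link state database.

    links maps each Rbridge ID to {neighbor ID: positive link cost}.
    Returns {rbridge: parent}, with the root mapped to None.
    """
    root = min(links)            # lowest ID is the root of the tree
    dist = {root: 0}
    parent = {root: None}
    done = set()
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue             # stale heap entry
        done.add(u)
        for v, cost in links[u].items():
            if v in done:
                continue
            nd = d + cost
            if v not in dist or nd < dist[v]:
                dist[v], parent[v] = nd, u
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v] and u < parent[v]:
                # Equal cost paths: choose the parent with the smaller ID
                # so that every Rbridge computes the same tree.
                parent[v] = u
    return parent
```

   With positive costs, every equal-cost candidate parent is finalized
   before the child is, so the tie-breaker sees all candidates and the
   result does not depend on the order in which neighbors are listed.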
2.6 Data Packet handling

   If a data packet without an envelope is received by R on link L with
   (layer 2) destination D:

   a) if D=R, then R should process the packet (R is the destination)

   b) else, if R is not DR on L, drop the packet

   c) else (R is DR on L):

      c1) if D is in R's endnode cache on link L1, then forward the
          packet onto link L1 (unless L=L1, in which case drop the
          packet)

      c2) else (D is not local), if the link state database indicates D
          is local to R1, then add an envelope indicating dest
          Rbridge=R1, add an extra layer 2 header indicating R's MAC
          address as source and the next-hop Rbridge towards R1 as
          destination, and forward the packet

      c3) else (D is not local, and is unknown), then add an envelope
          indicating dest Rbridge=0 and forward the packet on the
          spanning tree, as well as forwarding the naked packet onto all
          other links for which R is DR. Add to the enveloped packet an
          additional layer 2 header with R's MAC address on the link to
          which the packet is being forwarded as source, the layer 2
          multicast address "All-Rbs" as destination, and Ethertype
          Rbtype.

   If a data packet with an envelope is received by R on link L, with
   layer 2 destination address All-Rbs or R's MAC address on that link
   (otherwise the packet will be discarded):

   a) If destination Rbridge in envelope=0:

      a1) if the packet was received on a non-spanning tree link, drop
          the packet

      a2) else, forward the packet onto all links in the spanning tree,
          decrementing the hop count in the envelope, and adding the
          extra layer 2 header with destination=All-Rbs. Also, for each
          link on which R is DR, remove the envelope and forward the
          packet onto the link.

   b) If destination Rbridge in envelope=R, then remove the envelope,
      and if D is locally attached on link L1, forward the naked packet
      onto L1. If D is not locally attached, drop the packet.

   c) If destination Rbridge in envelope=Ri, not equal to 0 or R:

      c1) forward the packet towards Ri, decrementing the hop count, and
          adding a new layer 2 header with source address=R's MAC
          address on the link to which R is forwarding, and destination
          address the MAC address of the Rbridge which is the next hop
          towards Ri.

3. Optimization for IP

3.1 Introduction

   With the design above, IP would work on top of such an Rbridged
   campus. However, there are some optimizations possible.

   To make optimizations for IP, Rbridges look beyond the layer 2
   header. For IP packets, the DR (in addition to learning the source
   layer 2 address) learns the source IP address. This information (IP
   address, MAC address) of the source of the packet is sent around in
   link state information. This optimization allows:

   a) a local Rbridge to answer ARP queries for destination IP addresses
      that have been learned through the link state information. This
      keeps ARP traffic from being flooded throughout the campus.

   b) more timely keep-alives of IP addresses on the local link, since
      IP provides some mechanisms such as ARP that the DR can use to
      ensure that the IP address still resides on the link

3.2 IP Data Packet handling

   If a naked packet is received by R, and R is the DR, then in addition
   to learning the source MAC address, R checks to see if the layer 2
   protocol type indicates "IP", and if so, also learns the location of
   the source IP address, assuming that the source address is within the
   campus's IP prefix.

3.3 Handling ARPs

   Only the DR (and the real destination, if it's on that link) will
   answer an ARP query.
   If R is DR, and sees an ARP query for D:

   a) if D is unknown, R creates its own ARP query (indicating itself as
      the querying source), using an envelope indicating destination
      Rbridge=0, and forwards the ARP query along the spanning tree

   b) if D is known to reside on another link for which R is DR, R
      responds to the ARP query with the MAC address of D

   c) if D is known to reside on the same link, R drops the ARP query
      and lets D respond

   d) if D is known (through the link state database) as being attached
      to R1, with the mapping (D,d), then R responds to the ARP query
      with D's MAC address "d".

   If R receives a response to its ARP query from D, and D is not in the
   link state database, then R responds to the original ARP query with
   the MAC address indicated in the received ARP response.

3.4 Keeping the link cache up-to-date

   To ensure that IP addresses remain in the link cache if the endnode
   is still attached, the DR, once it learns that S is on the link,
   periodically issues ARP queries to S on that link. This cuts down on
   flooded ARP queries from the campus for S, since S will remain in the
   link state database as long as it is alive. It also allows quick
   learning that S is down, so that it can be removed from the link
   state database (and be reachable at its new location, if it has
   moved).

4. Alternatives

   Instead of passing around MAC addresses and IP addresses in link
   state information, this information could be learned by all Rbridges
   based on seeing data traffic. This could be accomplished by adding
   an additional field to the envelope consisting of "source Rbridge".
   When any router R1 observes a data packet with source Rbridge=R, R1
   should copy the inside layer 2 source address, and (if it's an IP
   packet) the inside IP address, and make a mapping indicating that
   that layer 2 address (and that layer 3 address) should be routed to
   R.
   This alternative, although elegant, has the following disadvantages:

   a) it increases the size of the envelope for all packets (since the
      field "source Rbridge" must be included)

   b) it forces more processing on enveloped data packets by all
      Rbridges, since every such packet must be examined to find the
      inner source layer 2 and layer 3 (if it's an IP packet) addresses

   c) it does not allow the tighter mapping of link location possible by
      having the DR on the link explicitly poll the endnode to see if it
      is still alive. Therefore, caches of routers would be slower to
      remove incorrect entries when an endnode moves.

5. Conclusions

   This design allows a plug-and-play campus to appear as a single IP
   subnet, with a stable routing protocol and robust forwarding header
   (as opposed to spanning tree, where routes are suboptimal, the header
   does not contain a hop count and packets can proliferate
   exponentially when being forwarded, and to avoid temporary loops a
   timer must expire before new links can start being used for
   forwarding).

   There is a possibility of one-hop suboptimality, if the DR is not the
   optimal entrance point to the destination LAN. However, given that
   most topologies are switched LANs, this would be rare. There is also
   the possibility of an additional one-hop suboptimality at the source
   LAN, since the DR might not be the optimal exit point from that LAN,
   and the DR might forward to R1, on the same LAN. It is possible to
   eliminate this one-hop suboptimality by having R1 know this, through
   the routing algorithm, or by being explicitly delegated to by the DR
   for this destination, and having R1 forward the packet directly.
   This optimization is not trivial to implement, and given today's
   topologies of switched LANs, it's not necessarily worth implementing.

   This solution has all the advantages of a bridged LAN, and is
   considerably more stable, and allows optimal routing.

6. Security Considerations

   With this design, an endnode could transmit a packet with a forged
   source address and confuse the Rbridge learning, but this can be done
   with today's bridged LANs. If instead the campus were implemented as
   separate IP subnets, with routers instead of bridges, endnodes would
   have addresses specific to their links, so an endnode on one link
   could not as easily subvert routing to another endnode.

   TBD. Check rpsec for list of requirements

7. IANA Considerations

   No known IANA considerations arise from this document.

8. Intellectual Property Notice

   Sun Microsystems may claim intellectual property rights over portions
   of the design described in this document. Some of the design may be
   covered by intellectual property from Digital Equipment Corporation.

Normative References

Informative References

Authors' Addresses

   Radia Perlman
   Sun Microsystems
   One Network Drive
   Burlington, MA 01803
   USA

   Phone: +1 781 442 0086
   EMail: Radia.Perlman@sun.com


   Aidan Williams
   Motorola
   Australian Research Centre
   Locked Bag 5028
   Botany, NSW 1455
   Australia

   Phone: +61 2 9666 0500
   EMail: Aidan.Williams@motorola.com
   URI:   http://www.motorola.com.au/marc/

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such as
   by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the procedures
   for copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.