Internet Draft Feb 2006 Network Working Group Joel M. Halpern Internet Draft Manav Bhatia Riverstone Networks Paul Jakma Sun Microsystems Expires: August 2006 February 10, 2006 Advertising Equal Cost Multipath routes in BGP draft-bhatia-ecmp-routes-in-bgp-02.txt Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet draft will expire on August 2006 Copyright Notice Copyright (C) The Internet Society (2006). Abstract This document describes an extensible mechanism that will allow a BGP [BGP4] speaker to advertise ECMP routes for a destination to its peers. It uses an existing, Multi Hop Capability [M-HOP], to advertise multiple paths that it is using to its peers. Halpern, Bhatia and Jakma [Page 1] Internet Draft Feb 2006 The mechanism described in this document is only applicable to routers that have the ability to inject multiple routing entries in their forwarding table. Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED","MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [KEYWORDS] Table of Contents 1. Introduction...................................................2 2. Some BGP ECMP Scenarios........................................3 3. Requirements for the ECMP algorithm............................5 4. ECMP Capability................................................6 5. Operation when both peers are Multipath capable................6 6. Advertisement of Multipath BGP routes..........................6 7. Procedures for the Receiving Speaker...........................7 8. Working with Non Multipath capable/EBGP peers..................7 9. Configuring BGP ECMP support...................................9 10. Working with Multipath capable IBGP peers.....................9 11. Aggregation vis-à-vis Multi-Hop Solution.....................10 12. Security Considerations......................................11 13. Acknowledgements.............................................11 14. IANA Considerations..........................................11 15. Appendix A...................................................11 15.1 Constructing AS_PATHs....................................11 15.2 Advertising synthetic AS_PATHs...........................12 16. References...................................................12 17. Author's Addresses...........................................13 18. Intellectual Property Notice.................................14 19. Disclaimer of Validity.......................................14 20. Copyright Statement..........................................14 21. Acknowledgment...............................................14 1. Introduction Most of the current BGP implementations upon receiving multiple equal cost BGP routes from different peers can insert all of them (or a subset depending upon the local policies) in their forwarding table. This can be done to locally split the traffic across several paths. However, because BGP in its current state can only advertise one path to its peers, an implementation MUST choose from one of the best paths that it is using for the advertisement. This has implications for the BGP peers that receive such advertisements from ECMP capable BGP speakers. In the worst case it can lead to potential loops if the entire path information is not Halpern, Bhatia and Jakma [Page 2] Internet Draft Feb 2006 advertised to the peers. The best that can be currently done is to aggregate all the contributing paths and advertise the aggregated route. Doing this leads to a loss in the path length. This imposes a severe limitation on the kind of load balancing that a BGP peer can do. The use of BGP ECMP routes is most prevalent inside an AS to identify its local BGP routes that represent load balanced links. This is useful for applications that want to use the BGP protocol as a mechanism for propagating this information for load balancing across multiple IBGP paths. This document refers to all the candidate paths that remain after the tie breaking procedure, described in sec. 9.1.2.2, reaches step (f) as "ECMP BGP paths/routes". It should be noted that these paths shall have the same AS_PATH length, though the individual AS_PATH segments could differ. Advertising individual BGP ECMP paths with different NEXT_HOPs holds value in IBGP scenarios, as each IBGP peer can reach the NEXT_HOP on its own. However in case of EBGP, individual BGP paths MAY only be advertised if the contributing routers are on the same subnet. If the peers don’t lie on the same subnet, then information about each individual NEXT_HOP is not useful. This is because (i) the receiving EBGP peer is not be aware of the NEXT_HOP information inside some other ASes and, (ii) the EBGP speaker always resets the NEXT_HOP to itself when advertising routes. Thus, individual BGP ECMP paths for a destination are only advertised to IBGP peers or EBGP peers that lie on the same subnet. In other cases, only one path is announced which is an “aggregate” of all the individual equal cost paths for that destination. However, care must be taken to ensure that the AS_PATH length in this aggregated path is retained and enough information is there in the AS_PATH to enable the receiving peer (and the downstream peers) to detect AS loops. 2. Some BGP ECMP Scenarios o Load Splitting when receiving BGP routes Halpern, Bhatia and Jakma [Page 3] Internet Draft Feb 2006 A (AS X) \ \ \ C (AS Y) / / / B (AS Z) A, B and C are BGP speakers in AS X, Y and Z respectively. Assume that C is peered up with similar sized ISPs A and B and accepts entire Internet feed from each one of these. It is common in such scenarios for the ISP C to receive multiple routes of equal cost from both A and B. Ordinarily, C can use only one "best" route learnt from either of A or B. Configuring C for load balancing involves a lot of prepending, modifying routes, splitting prefixes received from A and B, etc. Even then, it is difficult to have a 50/50 split across A and B as the load can only be statically split. The best and the most obvious solution is to let C install ECMP BGP routes in its FIB and to advertise information about all these ECMP BGP routes to its downstream peers, in a manner that lets each one of them run its loop detection algorithm. o Load Splitting when advertising BGP routes. B(X) / \ / \__ / \ A(W) D(Z) \ _/ \ / \ / C(Y) A, B, C and D are BGP speakers in AS W, X, Y and Z respectively. The dashed lines show EBGP peering. Scenario 1: Traditional BGP without ECMP capabilities Assume that A has a block of 10.0.0.0/20 that it needs to advertise to both B and C. For the purpose of load balancing it splits this block into two smaller blocks of 10.0.0.0/21 and 10.0.8.0/21 and advertises each of these to B and C. Router A somehow needs to ensure that one block is more preferred in B and the other in C. This can be Halpern, Bhatia and Jakma [Page 4] Internet Draft Feb 2006 done by configuring policies that can prepend AS numbers, MEDs, etc. when advertising the BGP route associated with these prefixes. Say, 10.0.0.0/21 is prepended with AS numbers when advertised to B and the 10.0.8.0/21 when advertised to C. This ensures that all the traffic is split between B and C, as this is what is advertised to router D. The problems with this approach are: [1] Number of routes advertised increase. [2] Entries in FIB increase. [3] Complex policies need to be implemented at A to ensure that incoming traffic is balanced. [4] Load balancing is static and cannot ensure 50/50 split. Scenario 2: BGP with ECMP capabilities support Router A now does not need to split the original block of 10.0.0.0/20 into two smaller blocks. It can advertise 10.0.0.0/20 to both B and C. With B and C equidistant from D, the latter will insert BGP multipath route for 10.0.0.0/20 with NEXT_HOPs as B and C and will advertise this to its downstream peers. Whatever traffic comes to D will be split across routers B and C. The load balancing this way is dynamic and not static. Entries in the FIB remain under control and Router A need not implement those complex policies. Router D needs to advertise information about the paths it is using to its downstream peers, so that they can run the loop detection algorithm. o Suboptimal Routing in Route Reflector clients Route Reflection can result in suboptimal routing due to the client not having full visibility to all the BGP paths in the AS. This is because the RR selects the best path and reflects only that best path to its clients. In case the RR has equal cost BGP routes, then it shall select the one based on the lower Router ID. As a result, the clients do not receive the full view of the available paths, or atleast the paths that are equidistant from the RR. This is bad, as this can result in suboptimal routing from the client's perspective. A client may have selected a different best path if more paths had been made visible to it. With BGP ECMP, the RR can advertise all the equal cost BGP routes that it has to its client, giving the client more options to choose from. The extensions proposed in this draft provide provision for the RR to reflect all the routes to its clients. 3. Requirements for the ECMP algorithm (a) An ECMP capable peer must be able to advertise each individual BGP path to other ECMP capable IBGP peers. Halpern, Bhatia and Jakma [Page 5] Internet Draft Feb 2006 (b) When advertising routes to non ECMP capable peers it must preserve as much of the AS Path structure as possible. This includes, the individual AS elements and the path length. We in this document propose an algorithm to advertise BGP ECMP routes, to both routers capable of ECMP and the ones without, so as to honor each of the requirements mentioned above. 4. ECMP Capability A router that wishes to express its capability to advertise ECMP routes uses the multi-hop capability [M-HOP]. We define a new bit, ECMP bit, in the bitmask of flags. This is set if the router intends to install ECMP routes. +-------+---+---+---+---+---+---+----+----+ | Bit: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | +-------+---+---+---+---+---+---+----+----+ | flag: | R | R | R | R | R | R |ECMP| AE | +-------+---+---+---+---+---+---+----+----+ ECMP and AE flags are mutually exclusive and both of them MUST not be set together. By advertising the Multi-hop capability to a peer with the ECMP bit set (multipath capable), the BGP Speaker conveys to its peer that the speaker is capable of receiving and properly handling the BGP ECMP updates from that peer. 5. Operation when both peers are Multipath capable In the following sections, "Local speaker" refers to a router which is advertising the BGP multipath routes, and the "Receiving Speaker" refers to a router that peers with the former to accept multiple BGP routes for a destination. Consider that the capability with the proper flags has been exchanged between the Local speaker and the Receiving speaker, and a BGP session between them is established. The following sections detail the procedures that shall be followed by the Local speaker as well as the Receiving speaker once the capability has been exchanged, and the local speaker wants to advertise some BGP multipath routes. 6. Advertisement of Multipath BGP routes To advertise BGP ECMP paths, the speaker uses MULTIPLE_HOP path attribute. Refer to [M-HOP] for details regarding how this path attribute is used. Halpern, Bhatia and Jakma [Page 6] Internet Draft Feb 2006 7. Procedures for the Receiving Speaker The Receiving Speaker upon receiving the MULTIPLE_HOP attribute will understand that the Local Speaker has advertised multipath BGP routes. In a single UPDATE message all the prefixes will have identical attributes, except for the next-hops, which will be carried in the MULTIPLE_HOP attribute. It will run the modified decision process as explained in the Section 1 and depending upon the result will either - inject multiple routes into Local-RIB and advertise multiple paths to its peers OR - inject a single route which has better path attributes than the other routes that it has just received. If the Receiving Peer receives some withdrawn routes along with the other path attributes and MULTIPLE_HOP attribute then it shall understand that some of the previously advertised multipath BGP routes have been removed and an implementation MUST proceed with removing all such paths. 8. Working with Non Multipath capable/EBGP peers This section discusses how BGP ECMP routes are advertised to non multipath capable routers or EBGP peers lying on different subnets. This is relevant only when the local speaker has installed BGP ECMP routes in its FIB and wants to convey this information to other peers that are not capable of understanding these capabilities. At this point the local speaker needs to aggregate this information and it can follow any algorithm that preserves as much of the available AS path structure as possible. A multipath capable speaker that has installed multiple BGP paths in its forwarding table MUST follow an algorithm to advertise the AS_PATH for each one of these routes in a manner that makes it possible for the downstream peers to run the AS Path loop detection algorithm. However, the algorithm MUST ensure that the AS_PATH length remains unaltered. There can exist multiple ways to do the above and we recommend one such algorithm which preserves most of the AS path structure and information. When crossing the multipath capable boundary to non-multipath capable, the speaker MUST send out a synthetic AS_PATH to the latter, Halpern, Bhatia and Jakma [Page 7] Internet Draft Feb 2006 where each element of the synthetic AS_PATH is an AS SET, built with the AS values corresponding to each segment in each contributing AS_PATH. This is possible since each of the contributing AS_PATHs are of the same length. The above procedure is described as a pseudo code. Note that the pseudo-code shown was chosen for clarity, not efficiency. It is not intended to specify any particular implementation. BGP implementations MAY use any algorithm which produces the same results as those described here. Given two AS_PATHs X and Y, each with N number of segments, which we wish to merge into a new combined AS_PATH, Z of N number of segments: Expand every AS_SEQUENCE segment in X and Y which contains multiple AS values into single-valued segments, such that the number of segments is equivalent to the path length, with order preserved. for every segment, n, from 0 to N create a segment Z(n) of type AS_SET for every AS value in segments X(n) and Y(n) add the AS value into Z(n) Resulting AS_PATH Z will consist of n AS_SETs, each AS_SET segment having all AS values in segments X(n) and Y(n). To cite an example, consider a BGP speaker (say in AS A1) having the following paths for a destination D1: Path 1: AS_PATH "a b c", Origin IGP, MED 10, NEXT_HOP N1 Path 2: AS_PATH "x y z", Origin IGP, MED 20, NEXT_HOP N2 It inserts these two paths in its forwarding table, and announces the following to its non-ecmp capable peer: AS_PATH: AS_SET (a,x) AS_SET (b,y) AS_SET (c,z) Origin IGP, MED 10, NEXT_HOP [N1 or N2] The AS_PATH constructed for advertisement to an EBGP peer is AS_PATH: AS_SEQUENCE "A1" AS_SET (a,x) Halpern, Bhatia and Jakma [Page 8] Internet Draft Feb 2006 AS_SET (b,y) AS_SET (c,z) Refer to Appendix A for more complex AS_PATH scenarios. The beauty of this new AS_PATH structure is that it retains the AS_PATH length of the original contributing paths and also retains enough AS information for the receiving peer to do loop prevention. For instance, if the router that has been used in the above example is peered up with another router R2 in AS "c", then R2 will reject all UPDATEs that have AS "c" in the AS_PATH. The Multipath capable router when advertising routes to a non- multipath capable IBGP peer can pick up any one of the NEXT_HOPs from the available list. This will not create any problems because this NEXT_HOP will actually fall within the AS_PATH set-sequence that is being advertised. For EBGP peers, the ECMP capable router, will as usual, put itself as the NEXT_HOP. 9. Configuring BGP ECMP support An implementation MUST provide a configuration option to set and unset this feature. The administrator should ensure that the maximum number of multipath routes that all the routers install in their FIB, remains consistent inside an AS. 10. Working with Multipath capable IBGP peers This section explains as to how ECMP feature will work in the normal scenarios. Assume that the two IBGP speakers A and B exchange this capability. Consider a case where A receives multiple updates for NLRI N' with Nexthops N0, .. Ni, .. Nm. Say it runs its decision process and finds that routes with the Nexthops Nj, Nk and Nl are equal and that it needs to advertise all three of them to B. Also assume that Nj and Nk share the same path attributes (Origin, AS Path, Local Pref, etc). A makes an UPDATE message and uses the MULTIPLE_HOP path attribute. It puts the AFI, number of next-hops as 2, length of the first next- hop (Nj), network address of Nj, length of Nk and the network address of Nk. When this UPDATE message is received by B, it looks at the MULTIPLE_HOP path attribute and understands that there are multiple Halpern, Bhatia and Jakma [Page 9] Internet Draft Feb 2006 routes to reach N'. It inserts two routes for N' with the next-hops as Nj and Nk. A also needs to announce N' with some other path attributes and the next-hop Nl. It makes an UPDATE message, puts the path attributes, and puts the MULTIPLE_HOP path attribute. It fills the AFI, number of next-hops as 1, length of the first next-hop Nl and the network address of Nl. This UPDATE message is sent to B. When B receives this UPDATE message it knows that this is not an implicit WITHDRAW from N' as it comes with the MULTIPLE_HOP path attribute. It simply appends this new route in its BGP database, runs the decision process, and proceeds as normal. Assume that at some point later, A needs to withdraw the route associated with the tuple [N', Nk]. It makes an UPDATE message, puts N' in the unfeasible routes and inserts path attributes and the MULTIPLE_HOP path attribute, keeping the next-hop inside as Nk. When B receives this UPDATE message it understands that A now wants to remove a route associated with N'. It looks at MULTIPLE_HOP and finds the next-hop as Nk. It thus removes, only the route associated with Nk. 11. Aggregation vis-à-vis Multi-Hop Solution Another approach to do ECMP is via advertising all the contributing routes through aggregation. In this approach any router that installs multiple BGP routes in its FIB can aggregate the contributing routes and advertise the aggregated route to its peers. The problem with this approach is that the AS_PATH length is not preserved. A clever implementation could prepend its own AS the required number of times to preserve the path length when advertising routes to EBGP peers, but this will not work when advertising routes to IBGP peers. Moreover, lumping everything in one AS_SET hides information and makes it harder for the downstream peers to make out as to what is actually happening. Our method of constructing the synthetic AS_PATH is somewhat complex but preserves maximum information and is regular expression friendly. Another problem with aggregating the information at the router that is doing ECMP is that this router cannot advertise individual contributing routes to its other ECMP capable peers. Advertising individual routes is desirable in some cases as different BGP peers make their path decisions based on their local policies and advertising them just one aggregated route makes it impossible for them to do so. Halpern, Bhatia and Jakma [Page 10] Internet Draft Feb 2006 Moreover, by providing them with the NEXT_HOP information each peer can choose its best IGP route to reach those NEXT_HOPs. Last but not the least, putting all the ASes inside one set can cause the SET to overflow. In that case, the path length would increase by 1. 12. Security Considerations This extension to BGP does not change the underlying security issues inherent in the existing BGP. 13. Acknowledgements The authors would like to thank Tony Li and Curtis Villamizar for their valuable comments and suggestions. 14. IANA Considerations This document uses an attribute type to indicate additional next-hops for the BGP paths. This must be assigned by IANA as per RFC 2842. 15. Appendix A 15.1 Constructing AS_PATHs This section deals with some scenarios that could occur. Consider that ecmp capable Router R1 has received multiple paths for a destination D1 and it is connected to a non-ecmp capable/EBGP router R2. R1 thus cannot use the MULTIPLE_HOP attribute to announce these routes. [Scenario 1] Say, R1 has the following BGP paths that it installs in its FIB. Path 1: AS_PATH: AS_SEQ "a b", AS_SET "p1 p2", AS_SET "p3 p4", NEXT_HOP N1 Path 2: AS_PATH: AS_SEQ "x y", AS_SET "q1 q2 q3", AS_SET "q4", NEXT_HOP N2 It is to be noted in this case when R1 runs its decision process, AS Path lengths are the same because when counting this number, an AS_SET counts as 1, no matter how many ASes are in the SET. The AS_PATH that R1 thus constructs when announcing the UPDATE to R2 (assuming R2 is an IBGP peer) is: Halpern, Bhatia and Jakma [Page 11] Internet Draft Feb 2006 AS_PATH: AS_SET "a x", AS_SET "b y", AS_SET "p1 p2 q1 q2 q3", AS_SET "p3 p4 q4" It will create a new AS_SEQ segment if R2 is an EBGP peer. [Scenario 2] Say, R1 has the following BGP paths that it installs in its FIB. Path 1: AS_PATH: AS_SEQ "a b c", AS_SET "p1 p2", NEXT_HOP N1 Path 2: AS_PATH: AS_SEQ "x y", AS_SET "q1 q2 q3", AS_SET "q4", NEXT_HOP N2 The AS_PATH that R1 thus constructs when announcing the UPDATE to R2 (assuming R2 is an IBGP peer) is: AS_PATH: AS_SET "a x" AS_SET "b y" AS_SET "c q1 q2 q3" AS_SET "p1 p2 q4" 15.2 Advertising synthetic AS_PATHs This section discusses some optimizations/cleanups that can be done by a BGP speaker when constructing the synthetic AS_PATH, to advertise to a non-ecmp capable or an EBGP peer. X(n) and Y(n) are two AS_PATHs, each with N number of segments, which will be merged into a new combined synthetic AS_PATH, Z of N number of segments: - If X(n) and Y(n) are both of type AS_SEQUENCE and contain the same value, the type of Z(n) MUST be set to AS_SEQUENCE, the value being the single as value concerned common to X(n) and Y(n). - Duplicate AS values MUST be removed from Z(n) once all values have been added to it, if the implementation has not already discarded duplicate values while iterating through X(n) and Y(n) when constructing segment Z(n) 16. References [M-HOP] Bhatia, M., Halpern, J. and Jakma, P., “Advertising Multiple Nexthop Routes in BGP”, Halpern, Bhatia and Jakma [Page 12] Internet Draft Feb 2006 draft-bhatia-idr-multiple-hops-00.txt, February 2006, Work in Progress [BGP4] Rekhter, Y., Li, T. and Hares, S., "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, March 1995 [BGP-CAP] Chandra, R. and J. Scudder, "Capabilities Advertisement with BGP-4", RFC 3392, November 2002 [MED] D. McPherson, V, Gill, D. Walton, and A. Retana, "Border Gateway Protocol (BGP) Persistent Route Oscillation Condition", RFC 3345, August 2002. [BGP-IPv6]Marques, P. and F. Dupont, "Use of BGP-4 Multiprotocol Extensions for IPv6 Inter-Domain Routing", RFC 2545, March 1999 [KEYWORDS]Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [AFI] http://www.iana.org/assignments/address-family-numbers [SAFI] http://www.iana.org/assignments/safi-namespace [CONFED] McPherson, D., Scudder, J., and P. Traina, "Autonomous System Confederations for BGP", draft-ietf-idr-rfc3065bis-05 (work in progress), October 2005. [MPBGP] Bates, T., R. Chandra, D. Katz, and Y. Rekhter, Multiprotocol Extension for BGP-4", RFC 2858, June 2000. 17. Author's Addresses Joel M. Halpern Email: joel@stevecrocker.com Manav Bhatia Riverstone Networks, Inc. Email: manav@riverstonenet.com Paul Jakma Sun Microsystems Email: paul.jakma@sun.com Halpern, Bhatia and Jakma [Page 13] Internet Draft Feb 2006 18. Intellectual Property Notice The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf- ipr@ietf.org. 19. Disclaimer of Validity This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 20. Copyright Statement Copyright (C) The Internet Society (2006). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. 21. Acknowledgment Funding for the RFC Editor function is currently provided by the Internet Society. Halpern, Bhatia and Jakma [Page 14] Internet Draft Feb 2006 Halpern, Bhatia and Jakma [Page 15]