idr                                                           B. Dickson
Internet-Draft                                       Afilias Canada, Inc
Expires: August 8, 2008                                 February 5, 2008


 Enhanced BGP Capabilities for Exchanging Second-best and Back-up Paths
                draft-dickson-idr-second-best-backup-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 8, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2008).

Dickson                  Expires August 8, 2008                 [Page 1]

Internet-Draft         BGP Second-Best and Back-up         February 2008

Abstract

   This Internet-Draft describes an enhanced way to exchange prefix
   information, to permit multiple copies of a prefix with different
   paths to be announced and withdrawn.

   This negotiated capability provides faster local (intra-AS) and
   global (inter-AS) convergence, reduces path-hunting, and improves
   route-reflector behaviour, including eliminating both persistent
   oscillations and BGP "wedgies".

   Additional prefix instances have new optional transitive BGP
   attributes, to control path selection.  Withdrawal of prefixes will
   require path attributes.

   Benefits are seen both when deployed intra-AS, and on inter-AS
   peering.

Author's Note

   This Internet-Draft is intended to result in this draft, or one or
   more related drafts, being placed on the Standards Track for idr.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [5].

   Intended Status: Proposed Standard.

Table of Contents

   1.  Background
     1.1.  Localized Information
     1.2.  The Withdrawal Problem
     1.3.  The Uniqueness Problem
   2.  Proposed Changes
     2.1.  New Negotiated Option: USE_SECOND_BEST
     2.2.  New Optional Path Attribute: SECOND_BEST
     2.3.  New Negotiated Option: USE_BACKUP_ONLY
     2.4.  New Optional Path Attribute: BACKUP_ONLY
     2.5.  New Update Format
     2.6.  New Withdraw Format
   3.  Modifications to BGP Behavior
     3.1.  Changes to Path Selection Rules
     3.2.  Second Best - Basic Method
     3.3.  Second Best - Route Reflector
     3.4.  Second Best - Inter-AS Hybrid Method
     3.5.  Backup Only - Basic Method
     3.6.  Backup Only - Route Reflector
     3.7.  IBGP vs EBGP
   4.  Acknowledgements
   5.  References
     5.1.  Normative References
     5.2.  Informative References
   Appendix A.  Path-Hunting Examples
   Appendix B.  Persistent Oscillation Examples
   Appendix C.  BGP Wedgie Examples
   Author's Address
   Intellectual Property and Copyright Statements

1.  Background

   Even when all the best current practices are observed, operational
   problems may be experienced when running a BGP network.  These
   include slow convergence due to path-hunting, persistent oscillations
   [1], and BGP "wedgies" [2].

   Standardization of MRAI timers and RFC 5004 [4] help to mitigate
   these problems.  The cited RFCs identify the above issues as needing
   further work.

1.1.  Localized Information

   The problems listed above occur as a result of additional information
   not being available (either on a transient basis, or permanently).

   In the case of "path hunting", the information needed for achieving a
   stable final state is eventually received, but until it is,
   sub-optimal forwarding will occur, and possibly even transient
   routing loops.

   The "problem" mechanisms involved are:

   o  the suppression of announcement of "second-best" paths, because of
      IBGP-received "best" paths;

   o  the suppression, by route-reflectors, of IBGP non-best paths (i.e.
      those normally seen directly by IBGP peers);

   o  the suppression of announcement of "second-best" paths, because of
      EBGP-received "best" paths;

   o  the lack of an explicit global mechanism for de-preferring
      announcements via "back-up" providers.

   When a prefix+path received is better than the local "best", the new
   "best" is normally sent.  However, once a new "best" is received, the
   side-effect is to force the speaker to WITHDRAW the previous best
   path within the same "regime" (IBGP mesh or EBGP peers).

   When the extra (e.g. suppressed) information is made available, with
   special rules on what to send and how to treat it, the specified
   problems may go away, or be reduced in scope, duration, or
   likelihood.

1.2.  The Withdrawal Problem

   When a prefix (plus path) is withdrawn, the desired stable state is
   for the next-best path for that prefix (if one exists) to be chosen
   at each BGP speaker per its local policy.

   If that second-best path is already on hand, the delay and
   intermediate states can be reduced or entirely avoided.  This is
   especially true for both intra-AS and inter-AS "path hunting".

   To avoid inconsistent behavior, routing loops, and
   routing-information loops, the second-best path received from a
   neighbor should not be selected as a best path locally while that
   neighbor's best path is still present.  The second-best path from a
   neighbor MUST NOT be considered as a candidate for best path unless
   the previous best path from that neighbor has been withdrawn.  When
   this occurs, the path in question is promoted to "best" status.

1.3.  The Uniqueness Problem

   Currently, for each prefix, only one path for that prefix is ever
   announced from one peer to another (except in the instance of Route
   Reflectors).

   Because of this uniqueness property, a withdrawal of a prefix does
   not require path information.  This also means that a change of best
   path is accomplished via an update for a prefix with the new path
   information.

   If, however, more than one path for a given prefix were sent, then
   any attempt to withdraw a prefix+path would require that the specific
   path for the prefix being withdrawn be supplied in the withdrawal
   update message.

   In an environment where multiple paths per prefix are possible, but
   only one path per prefix is maintained, two steps would be involved
   in changing the "best" path.  In no particular order, these would be
   the withdrawal of the old prefix+path, and the announcement of the
   new prefix+path.
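
   The uniqueness argument above can be illustrated with a minimal
   sketch.  It assumes a per-peer table keyed by prefix that may hold
   several paths; the class name AdjRibIn, its methods, and the use of
   an AS_PATH tuple to identify a path are hypothetical and are not
   defined by this document.  With one path per prefix a bare withdrawal
   is unambiguous; with two, the withdrawal has to name the path.

```python
from typing import Dict, Optional, Set, Tuple

# Illustrative assumption: a path is identified by its AS_PATH alone.
AsPath = Tuple[int, ...]


class AdjRibIn:
    """Paths learned from one peer, allowing several paths per prefix."""

    def __init__(self) -> None:
        self._paths: Dict[str, Set[AsPath]] = {}

    def announce(self, prefix: str, path: AsPath) -> None:
        self._paths.setdefault(prefix, set()).add(path)

    def paths(self, prefix: str) -> Set[AsPath]:
        return set(self._paths.get(prefix, set()))

    def withdraw(self, prefix: str, path: Optional[AsPath] = None) -> None:
        held = self._paths.get(prefix, set())
        if path is None:
            # A bare withdrawal only works while the prefix is unique.
            if len(held) > 1:
                raise ValueError("ambiguous withdrawal: path required")
            held.clear()
        else:
            # Withdraw exactly the named prefix+path instance.
            held.discard(path)
        if not held:
            self._paths.pop(prefix, None)


rib = AdjRibIn()
rib.announce("192.9.200.0/25", (4, 1))
rib.announce("192.9.200.0/25", (2, 1))
rib.withdraw("192.9.200.0/25", (4, 1))   # path-qualified withdrawal
```

   After the last call only the path (2, 1) remains, so a subsequent
   withdrawal without path information would again be unambiguous.
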

2.  Proposed Changes

2.1.  New Negotiated Option: USE_SECOND_BEST

   This is a new BGP Capabilities value, which can be optionally
   included in the capabilities negotiation.  The specific value is a
   code-point to be assigned by IANA.

2.2.  New Optional Path Attribute: SECOND_BEST

   This is a new BGP Path Attribute type.  It may only be used if the
   USE_SECOND_BEST capability has been negotiated.  It will generally be
   required for up to half of the Update messages sent to a peer with
   whom USE_SECOND_BEST has been negotiated.  The type value is a new
   code point to be assigned by IANA.  This is an Optional,
   Non-Transitive, Non-Extended, Non-Partial attribute: of the "attr
   flag bits" (from BGP [3]), only the Optional bit is set, and the
   Transitive, Partial, and Extended Length bits are zero.  The length
   is 1, and the value is 1.

2.3.  New Negotiated Option: USE_BACKUP_ONLY

   This is a new BGP Capabilities value, which can be optionally
   included in the capabilities negotiation.  The specific value is a
   code-point to be assigned by IANA.

2.4.  New Optional Path Attribute: BACKUP_ONLY

   This is a new BGP Path Attribute type.  The type value is a new code
   point to be assigned by IANA.  This is an Optional, Transitive,
   Non-Extended, Non-Partial attribute, with the "attr flag bits" (from
   BGP [3]) set accordingly.  The length is 1, and the value is 1.

2.5.  New Update Format

   Update messages are identical to the existing format, with the
   exception of the new optional Path Attributes, SECOND_BEST and/or
   BACKUP_ONLY.

   If the BGP capability USE_SECOND_BEST has been negotiated, any Update
   MAY have a Path Attribute which includes SECOND_BEST.  Likewise, if
   the BGP capability USE_BACKUP_ONLY has been negotiated, any Update
   MAY have a Path Attribute which includes BACKUP_ONLY.

   More than one instance of a given prefix, with distinct values of
   Path Attributes, MAY be sent between BGP speakers.
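
   As an illustration of what such an Update would carry, the attribute
   encodings of Sections 2.2 and 2.4 might be sketched as below.  The
   flag bit positions follow BGP [3]; the sketch assumes that "Optional"
   means the Optional flag bit is set.  The type codes 0xF0 and 0xF1 are
   placeholders invented for this example, since the real code points
   are to be assigned by IANA, and the function names are hypothetical.

```python
import struct

# Attribute flag bits as defined for BGP path attributes.
FLAG_OPTIONAL = 0x80
FLAG_TRANSITIVE = 0x40

# Placeholder type codes (assumption): the real values are to be
# assigned by IANA and are not known here.
SECOND_BEST_TYPE = 0xF0
BACKUP_ONLY_TYPE = 0xF1


def encode_marker_attribute(flags: int, type_code: int) -> bytes:
    """Encode flags, type, a one-octet length of 1, and the value 1."""
    return struct.pack("!BBBB", flags, type_code, 1, 1)


def encode_second_best() -> bytes:
    # Optional, non-transitive.
    return encode_marker_attribute(FLAG_OPTIONAL, SECOND_BEST_TYPE)


def encode_backup_only() -> bytes:
    # Optional, transitive.
    return encode_marker_attribute(
        FLAG_OPTIONAL | FLAG_TRANSITIVE, BACKUP_ONLY_TYPE
    )
```

   Each attribute occupies four octets, because the Extended Length bit
   is clear and the length field therefore takes a single octet.
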
   At most four instances may be sent, specifically one for each
   combination of with/without SECOND_BEST and BACKUP_ONLY: one with
   neither, one with SECOND_BEST only, one with BACKUP_ONLY only, and
   one with both SECOND_BEST and BACKUP_ONLY.

   Two prefix paths are considered identical if they differ only in the
   presence or absence of either of the new attributes (SECOND_BEST
   and/or BACKUP_ONLY).  An Update containing a path that differs from
   an existing path only in either or both of these attributes will
   result in the path information for the prefix being modified.

2.6.  New Withdraw Format

   Since it is no longer possible to identify which instance of a prefix
   is affected by an update containing a withdrawal, all withdrawals
   SHOULD contain path information.  Where a withdrawal would not be
   ambiguous, implementations MAY send withdrawals without path
   information.  Where a withdrawal would be ambiguous, implementations
   MUST send path information with the withdrawal.  Withdrawal Update
   messages do not require either the SECOND_BEST or the BACKUP_ONLY
   attribute, and if either is present, it MUST be ignored.

3.  Modifications to BGP Behavior

3.1.  Changes to Path Selection Rules

   The path selection rules for BGP (section 9.1.2.2 of BGP4 [3]) are
   changed as follows:

   o  The following rule is placed before step (a): If paths with and
      without BACKUP_ONLY are both available, those with BACKUP_ONLY are
      eliminated.

   o  The following rule is a modification to step (c): Step (c) is
      first performed INCLUDING paths with SECOND_BEST.  If, at the end
      of the first attempt at step (c), only paths with SECOND_BEST
      remain, re-run step (c), this time EXCLUDING the paths with
      SECOND_BEST.  After this modified version of step (c), the
      remaining paths MUST NOT have the SECOND_BEST attribute.  In other
      words, step (c) MUST remove any SECOND_BEST paths.

   o  The remainder of the usual BGP path selection rules is applied as
      normal.

   o  If the final path selected has the BACKUP_ONLY attribute, that
      attribute is preserved.

   The path selection rules for the "Second Best" path are as follows:

   o  The already-selected "best" path is removed from the set of paths
      to compare.

   o  The same rules are applied as for the "best" path.

   o  The selected path is advertised with the attribute SECOND_BEST
      applied.

   o  If the selected path had the BACKUP_ONLY attribute, that attribute
      is preserved.

   The prefix instances for consideration of the second-best path are
   the REMAINDER of the non-SECOND_BEST instances, and the SECOND_BEST
   instance received on the in-RIB from which the best path was selected
   (if one exists).  Only one received SECOND_BEST instance may be
   considered for the local (and out-RIB) SECOND_BEST path.

3.2.  Second Best - Basic Method

   Once the capability for doing so has been negotiated between a pair
   of BGP speakers, each sends the best two paths for each prefix.  The
   path information will include the additional SECOND_BEST attribute on
   the second best path.

   When the current "best" path is withdrawn, the withdrawal MAY be
   propagated without having to perform a full BGP table path selection.
   The current "second best" path in the local-RIB is promoted to
   "best".  This is because the alternate candidates have already been
   evaluated and "second-best" has already been selected.

   Whenever an AS consists of a mesh of BGP speakers who have negotiated
   this capability, the withdrawal will propagate through the entire AS.
   This will either have no effect, or result in a change in "best"
   without requiring non-local information to choose the new "best"
   path.

3.3.  Second Best - Route Reflector

   The "best" and "second best" are reflected.  The same mechanism is
   used for determining both best and second-best per prefix.
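
   The selection and promotion behavior described in Sections 3.1 to 3.3
   can be sketched as follows.  This is a minimal illustration, not a
   definitive implementation: the rank() function is a simplified
   stand-in for the BGP decision process (higher LOCAL_PREF, then
   shorter AS_PATH), the modified step (c) and the BACKUP_ONLY rule are
   omitted, and the names Path and PrefixState are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Path:
    as_path: Tuple[int, ...]
    local_pref: int = 100
    second_best: bool = False  # True when SECOND_BEST is attached


def rank(path: Path) -> Tuple[int, int]:
    """Simplified stand-in for the decision process (assumption):
    higher LOCAL_PREF first, then shorter AS_PATH."""
    return (-path.local_pref, len(path.as_path))


class PrefixState:
    """Best and second-best path for one prefix (hypothetical helper)."""

    def __init__(self, candidates: List[Path]) -> None:
        # Candidates are evaluated once and kept in preference order.
        self.candidates = sorted(candidates, key=rank)

    @property
    def best(self) -> Optional[Path]:
        return self.candidates[0] if self.candidates else None

    @property
    def second(self) -> Optional[Path]:
        # Same rules, applied with the already-selected best removed.
        return self.candidates[1] if len(self.candidates) > 1 else None

    def advertise(self) -> List[Path]:
        """Paths to send to a peer that negotiated USE_SECOND_BEST."""
        out: List[Path] = []
        if self.best is not None:
            out.append(self.best)
        if self.second is not None:
            out.append(replace(self.second, second_best=True))
        return out

    def withdraw_best(self) -> Optional[Path]:
        """Drop the best path; the stored second-best becomes best."""
        if self.candidates:
            self.candidates.pop(0)
        return self.best
```

   Keeping the candidates in preference order is one possible design; it
   makes the promotion on withdrawal a constant-time step, which
   reflects the point that the alternates have already been evaluated.
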
   Updates must be reflected whenever the choice of either or both of
   the "best" or "second best" changes.  Withdrawals may be propagated
   immediately.

3.4.  Second Best - Inter-AS Hybrid Method

   When a withdrawal of the current best path is received from a peer
   doing USE_SECOND_BEST, and the rules for sending updates require that
   an update for this prefix be sent to a peer who does not support
   USE_SECOND_BEST, the current second-best instance of the prefix is
   sent to that peer in an Update.  The neighbor does not need the
   withdrawal, since the new path replaces the old path.

3.5.  Backup Only - Basic Method

   The main reason for establishing the BACKUP_ONLY attribute is to
   permit the global implementation of actual "backup only"
   announcements.  It is not intended to facilitate changes of policy,
   or to circumvent local policies; rather, it is intended to make
   possible the implementation of policies that have been negotiated by
   two or more parties.

   Currently, there are several documented scenarios in the "Wedgies"
   RFC [2] where the mutually desired policy is either unable to be
   implemented, or does not deterministically reach the desired state.

   Use of the BACKUP_ONLY attribute on announcements sent to a backup
   provider permits these problems to be resolved.

   The same prefix is announced to both the primary and backup provider.
   When announced to the primary provider, the BACKUP_ONLY attribute is
   NOT set.  When announced to the backup provider, the BACKUP_ONLY
   attribute IS set.

   The propagation of the BACKUP_ONLY instance will be limited by the
   availability of multiple paths and the use of SECOND_BEST peerings.
   In Figure 6 (of Appendix C), the BACKUP_ONLY instance will be seen by
   the backup provider, and be passed with both SECOND_BEST and
   BACKUP_ONLY to the backup provider's transit provider.
   The latter will prefer any other instance without BACKUP_ONLY, even
   if it has applied a LOCAL_PREF to the received prefix instance.
   Should the other instance be withdrawn, the BACKUP_ONLY instance will
   be selected and subsequently propagated.  The withdrawal will also
   eventually result in an Update with the BACKUP_ONLY attribute but
   WITHOUT the SECOND_BEST attribute (since the prefix will now only be
   reachable via the backup provider).

3.6.  Backup Only - Route Reflector

   Route Reflectors operate the same as always.  The BACKUP_ONLY
   attribute MUST be preserved during reflection.  Thus, if "Second
   Best" is in operation, then the BACKUP_ONLY attribute of both best
   and second-best MUST be preserved on both instances.  And, if "Second
   Best" is not in use, then the selected "best" prefix, if it has
   BACKUP_ONLY set, must be reflected with BACKUP_ONLY as well.

3.7.  IBGP vs EBGP

   The same rules apply for EBGP->EBGP, EBGP->IBGP, IBGP->EBGP, and
   IBGP->IBGP.  If a particular peering has had USE_SECOND_BEST
   negotiated, then whenever an update for a particular prefix results
   in a new selection of either or both of best and second-best, the new
   selections (and any withdrawals of old selections) are sent to the
   appropriate peers.

   If a particular peering has had USE_BACKUP_ONLY negotiated, then
   updates which have BACKUP_ONLY MAY be sent.

4.  Acknowledgements

   The author wishes to acknowledge the helpful guidance of Joe Abley,
   Tony Li, and Yakov Rekhter.

   The author also wishes to acknowledge the insight gained from his
   Scottish Deerhound, Skylar, winning a Reserve Best-in-Show.  (The
   selection method of "second best" comes from the Reserve system used
   at the group and best-in-show levels of dog shows.)

5.  References

5.1.  Normative References

   [1]  McPherson, D., Gill, V., Walton, D., and A.
        Retana, "Border Gateway Protocol (BGP) Persistent Route
        Oscillation Condition", RFC 3345, August 2002.

   [2]  Griffin, T. and G. Huston, "BGP Wedgies", RFC 4264,
        November 2005.

   [3]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway Protocol 4
        (BGP-4)", RFC 4271, January 2006.

   [4]  Chen, E. and S. Sangli, "Avoid BGP Best Path Transitions from
        One External to Another", RFC 5004, September 2007.

5.2.  Informative References

   [5]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

Appendix A.  Path-Hunting Examples

   (These will be included in a subsequent version of this ID.)

Appendix B.  Persistent Oscillation Examples

   Consider the example in Figure 1 where

   o  R1, R2, R3, R4, and R5 belong to one AS.

   o  R1 is a route reflector with R2 and R3 as its clients.

   o  R4 is a route reflector with R5 as its client.

   o  The IGP metrics are as listed.

   o  External paths (a), (b), and (c) are as described in Figure 2.

         +----+      1      +----+
         | R1 |-------------| R4 |
         +----+             +----+
           |   \               |
           |    \              |
          3|     \ 2           | 6
           |      \            |
           |       \           |
         +----+   +----+     +----+
         | R2 |   | R3 |     | R5 |
         +----+   +----+     +----+
           |        |          |
          (a)      (b)        (c)

                        Figure 1

      Path   AS_PATH   MED
      a      1 3       10
      b      2 3       1
      c      2 3       0

                        Figure 2

   With the addition of "second best", we have:

   R1 has the following:

      Path   AS_PATH   MED   IGP-metric
      a      1 3       10    3   (received:best) (best)
      b      2 3       1     2   (received:best)
      c      2 3       0     7   (received:best)
                                 (second_best - not sent)

   R4 has the following:

      Path   AS_PATH   MED   IGP-metric
      a      1 3       10    4   (received:best) (best - not sent)
      c      2 3       0     6   (received:best) (second_best)

   This results in R1 having:

      Path   AS_PATH   MED   IGP-metric
      a      1 3       10    3   (received:best) (best)
      b      2 3       1     2   (received:best)
      c      2 3       0     7   (received:second_best)
                                 (second_best - not sent)

   By including the second_best in the best path calculation, the
   persistent oscillation problem is resolved.

Appendix C.  BGP Wedgie Examples

   The following examples from RFC 4264 [2] show the effects of the
   proposed changes, in resolving "wedgie" issues.

         +----+                +----+
         |AS 3|----------------|AS 4|
         +----+ peer      peer +----+
           |provider             |provider
           |                     |
           |customer             |
         +----+                  |
         |AS 2|                  |
         +----+                  |
           |provider             |
           |                     |
           |customer             |customer
           +-------+  +----------+
             backup|  |primary
                  +----+
                  |AS 1|
                  +----+

                        Figure 6

   In Figure 6 above, the announcement via the backup link is sent with
   BACKUP_ONLY.

   o  AS 4 sends the "best" (the direct link to AS 1) to AS 3.

   o  AS 2 sends its "best", which is the BACKUP_ONLY path from AS 1, to
      AS 3, also with BACKUP_ONLY (since it is a transitive attribute).

   o  AS 3 and AS 4 exchange their respective "best" paths.

   o  AS 3 prefers the path "4 1" over "2 1" because "2 1" is
      BACKUP_ONLY.

   o  AS 3 sends a revised BACKUP_ONLY update to AS 4 as SECOND_BEST.

   o  AS 3 sends the new "best" to AS 2.

   o  AS 2 sends a revised BACKUP_ONLY update to AS 3 as SECOND_BEST.

   This state will be reached regardless of the sequence of disconnects
   and reconnects.

   Link failures will also result in propagation of withdrawals of
   "best", and the SECOND_BEST promotions will result in immediately
   correct behavior.

         +----+                +----+
         |AS 3|----------------|AS 4|
         +----+ peer      peer +----+
           |provider             |provider
           |                     |
           |customer             |customer
         +----+                +----+
         |AS 2|                |AS 5|
         +----+                +----+
           |provider             |provider
           |                     |
           |customer             |customer
           +-------+  +----------+
             backup|  |primary   for 192.9.200.0/25
            primary|  |backup    for 192.9.200.128/25
                  +----+
                  |AS 1|
                  +----+

                        Figure 7

   In Figure 7 above, the announcements via the backup links will work
   the same as in Example 1.

         +----+                +----+
         |AS 3|----------------|AS 4|
         +----+ peer      peer +----+
           ||provider            |provider
           |+-----------+        |
           |customer    |customer|
         +----+       +----+     |
         |AS 2|-------|AS 5|     |
         +----+ peer  +----+     |
           |provider    |provider|
           |            |        |
           |customer  +-+customer|customer
           +-------+  |+---------+
             backup|  ||primary
                  +----+
                  |AS 1|
                  +----+

                        Figure 8

   In Figure 8 above, the announcements via both backup links will
   result in:

   o  AS 2 selecting its best path via "3 4 1" (the only path it hears
      from AS 3)

   o  AS 2 hearing two paths from AS 5:

      *  its "second best" path "5 3 4 1"

      *  another path marked SECOND_BEST and BACKUP_ONLY with path "5 1"

   o  AS 2 hearing a BACKUP_ONLY directly from AS 1

   Any announcement that AS 3 hears from AS 2 or AS 5 will always be
   marked BACKUP_ONLY.  Thus, any combination of break/restore on any
   links, in any order, will always result in the desired state being
   reached.
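
   The decision made by AS 3 in Figure 6 can be sketched as below.  This
   is an illustration under stated assumptions: the LOCAL_PREF values
   (200 for the customer-learned path, 100 for the peer-learned path)
   are invented for the example, the tie-breaking after the BACKUP_ONLY
   rule is reduced to LOCAL_PREF and AS_PATH length, and the names Path
   and select_best are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Path:
    as_path: Tuple[int, ...]
    local_pref: int
    backup_only: bool = False  # True when BACKUP_ONLY is attached


def select_best(candidates: List[Path]) -> Path:
    """Pick the best path from a non-empty candidate list.

    Paths with BACKUP_ONLY are eliminated first whenever at least one
    path without it is available; only then is LOCAL_PREF compared.
    """
    primary = [p for p in candidates if not p.backup_only]
    pool = primary if primary else candidates
    # Simplified remainder of the decision process (assumption).
    return min(pool, key=lambda p: (-p.local_pref, len(p.as_path)))


# AS 3's view in Figure 6 (LOCAL_PREF values are illustrative).
via_as2 = Path(as_path=(2, 1), local_pref=200, backup_only=True)
via_as4 = Path(as_path=(4, 1), local_pref=100)

normal = select_best([via_as2, via_as4])   # primary link up
failover = select_best([via_as2])          # "4 1" withdrawn
```

   Because the BACKUP_ONLY rule runs before LOCAL_PREF is compared, the
   higher preference that AS 3 gives to its customer does not pull
   traffic onto the backup link while the primary path exists.  Once
   "4 1" is withdrawn, "2 1" is selected and keeps its BACKUP_ONLY
   marking.
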

Author's Address

   Brian Dickson
   Afilias Canada, Inc
   4141 Yonge St, Suite 204
   North York, ON  M2P 2A8
   Canada

   Email: brian.peter.dickson@gmail.com
   URI:   www.afilias.info

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).