Network Working Group P. Francois Internet Draft O. Bonaventure Expiration Date: January 2006 M. Shand S. Previdi S. Bryant July 2005 Loop-free convergence using ordered FIB updates draft-francois-ordered-fib-00.txt Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than a "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This draft proposes a mechanism to prevent transient loops during non-urgent topology changes by ordering the FIB updates on routers, in networks which use link state routing protocols. This mechanism can be used in case of graceful link shutdowns, metric changes and when a link is enabled. The solution can also be used in conjunction with an IPFRR mechanism wich turns a sudden link failure in a non- urgent change, by ensuring a local protection of the link. After a non-urgent topology change, each routers computes a rank that defines the time at which it can safely update its FIB. A completion message mechanism is also proposed in order to speed up the convergence process. Francois et al. Expires January 2006 [Page 1] Internet Draft Ordered FIB Updates July 2005 1. Introduction With link-state protocols [ISIS,OSPF], when the network topology changes, some routers need to modify their Forwarding Information Base (FIB) to take into account the new topology. Each topology change causes a convergence phase. During this phase, routers may transiently have inconsistent FIBs, which leads to packet loops and losses, even if the reachability of the destinations is not compromised after the topology change. When the link state change is a metric update and when a new link is brought up in the network, there is no direct loss of connectivity, but transient loops during the convergence phase may still cause packet loss. The same observation can be made in the case of a link failure. If the link is protected [IPFRR,MPLSFRR], a loop free convergence mechanism may also be used in place of the usual fast, best effort approach. The remainder of this document is organised as follows. We first show in section 2 that there is an ordering for the FIB updates when a single link fails, is activated or when an IGP metric changes. We also present an ordering for the cases of node failure and node activation. In section 3, we present an implementation of the ordering schemes. In section 4, a mechanism aimed at speeding up the loop-free convergence process is presented. In section 5, we present additional shortcuts to the ordering scheme. 2. Ordering of the FIB updates In this section, we briefly present an ordering of the FIB updates that allows to avoid transient forwarding loops in the network in the case of topology changes. A more precise analysis of the rerouting dynamics and correctness proofs of the mechanism can be found in [PFOB05]. 2.1 Link down / metric increase We first consider the failure of a link or the increase of the metric associated with one directed link. In this case, to ensure the transient consistency of the forwarding tables of the routers, a router R should update its FIB AFTER all the routers that used R BEFORE the event, to reach the affected link. Let us show that this rule ensures the transient consistency of the routers. We assume that the link X->Y goes down and is protected, or its metric is increased. Firstly, once a packet reaches a router R with an outdated FIB for its destination, it will only traverse routers with an outdated FIB, according to the rule. It thus reaches X, whose protection will Francois et al. Expires January 2006 [Page 2] Internet Draft Ordered FIB Updates July 2005 ensure that the packet reaches its destination. In the case of a metric increase, the packet will simply be forwarded along X->Y and reach its destination. Secondly, a packet cannot loop while being exclusively forwarded by routers with an updated FIB, so that any packet is proved to never enter a loop. In other words, when the ordering is respected, a packet entering the network that follows a path composed of updated or unaffected routers cannot loop. If it reaches an outdated router, it cannot reach an updated router anymore, so that it will not loop. 2.2 Link up / metric decrease In the case of link up events or metric decreases, to ensure the transient consistency of the forwarding tables of the routers, a router R should update its FIB BEFORE the routers that WILL use R to reach the affected link. Let us show that this rule ensures the transient consistency of the routers. We assume that the link X->Y goes up or that its metric is decreased. Firstly, when a packet reaches a router R that has updated its FIB, all the routers on the path from R to X have updated their FIB, so that the packet will reach X and be forwarded along X->Y, to finally reach its destination. Secondly, a packet could not loop between routers that have not yet updated their FIB, so that any packet is proved to never enter a loop. 2.3 Router down An ordering of the FIB updates can also be performed to avoid transient loops in the case of a router down event. There are multiple implementation choices to advertise the failure of a node. Firstly, a router being shut down can send a new link-state packet that describes the failure of all its links, by setting the age of each link to MAX_AGE. Secondly, the router can purge its own link- state packet. Finally, and only in IS-IS, the router can send its link-state packet by setting the overload-bit. All those solutions will let the other routers of the network know about the failure by receiving one single message. The other routers of the network can avoid loops while adapting to the node failure by respecting a rule that is very similar to the one Francois et al. Expires January 2006 [Page 3] Internet Draft Ordered FIB Updates July 2005 mentioned in the case of a link down event. A router R should update its FIB AFTER all the routers that use R to reach the router that is shutdown. Let us show that this ordering ensures the transient consistency of the forwarding. We assume that the router X is shut down. Firstly, if a packet to a destination D reaches a router that has not updated its FIB and that used X to reach D, it will only traverse routers that have not updated their FIB, to finally reach X. X will then forward the packet to its nexthop for D, and the packet will reach its destination. This nexthop could not have X in its path(s) to D before the event, as the contrary would mean that there was a loop before the event. Secondly, a packet that does not reach a router with an outdated FIB cannot loop. 2.4 Router up In IS-IS, when a router X is brought up, it should firstly send its link-state packet with an overload bit set. After a while, each of its neighbours will have sent a link-state packet describing their links to X as being up. At that moment, X can send a link-state packet with an overload bit unset, so that the processing of this link-state packet by the other routers will make them consider all the links connected to X as being up, in one single FIB update. In OSPF, the overload-bit does not exists. An OSPF router that is brought up should thus wait for a while before sending its own LSA, or firstly send its LSA with a MAX-METRIC assigned to each of its links, and only send a LSA with the real metrics after a while. By proceeding like this, we have the opportunity to order those FIB updates and avoid forwarding inconsistencies during the convergence. The ordering of the FIB updates follows a rule that is very similar to the rule mentioned in the case of a link up event. A router R should update its FIB BEFORE all the routers that WILL use R to reach the router that is brought up. Let us show that this ordering ensures the transient consistency of the forwarding. We assume that the router X comes in the network. Firstly, when a packet with destination D reaches a router R that has already updated its FIB, and that uses X to reach D, the packet will reach X as it will traverse routers that have already updated their FIB. X will then forward the packet to its nexthop for destination D. Francois et al. Expires January 2006 [Page 4] Internet Draft Ordered FIB Updates July 2005 The path from this nexthop to D cannot contain X, so that the packet is proved to be consistently forwarded to D. Secondly, a packet that exclusively traverses routers with an outdated FIB cannot loop. 3. Implementation of the ordering In this section, we show how the proposed ordering can be applied by routers that have to adapt to a topology change. 3.1 Link down / metric increase To respect the proposed ordering, routers can compute a rank that will be used to determine the time at which they can perform their FIB update. In the case of the failure of link X->Y or an increase of the metric of link X->Y, router R will compute the reverse Shortest Path Tree (rSPT) of Y. This rSPT gives the shortest paths to reach Y. The branch of the reverse SPT that is below R corresponds to the set of shortest paths to R that are used by the routers that reach X->Y via R. The rank of router R is defined as the depth (in number of hops) of this branch. To avoid confusion in the case of ECM paths, we can say that the the maximum depth must be taken into account. Router R will update its FIB at time : T0 + rank * MAX_FIB where T0 is the arrival time of the link-state packet containing the topology change and MAX_FIB a network-wide constant that reflects the maximum time required to update a FIB. The value of MAX_FIB is implementation specific and is out of the scope of this document. This value must be agreed by all the routers of the network. This agreement can be done by using a capability TLV as defined in [SLFTV]. All the routers that use R to reach X->Y will compute a lower rank than R, so that the order will be respected. It should be noted that only the routers that used X->Y before the event have to compute their rank. 3.2 Link up / metric decrease In the case of a link up or a link metric decrease affecting link X->Y, a router R must have a rank that is higher than the rank of the routers that it will use to reach X->Y, according to the rule mentioned in 2.2. The rank of R is thus equal to the number of hops between R and X in its Shortest Path Tree. When R has multiple equal cost paths to X, the rank is the length of the longest path in hops Francois et al. Expires January 2006 [Page 5] Internet Draft Ordered FIB Updates July 2005 to X. Router R will update its FIB at time : T0 + rank * MAX_FIB where T0 is the arrival time of the link-state packet containing the topology change and MAX_FIB a network-wide constant that reflects the maximum time required to update a FIB. It should be noted that only the routers that use X->Y after the event have to compute a rank, i.e. only the routers that have X->Y in their SPT after the link-state change. 3.3 Router down When a router X is shutdown, the rank that has to be applied by a router R is equal to the depth of the branch under R in rSPT(X). In the case of ECM paths to R, the maximum of the length must be taken into account. Note that X will only be shutdown once all the routers have updated their FIB. The rank that is computed by X thus gives the time at which X can be effectively shut down. 3.4 Router up When a router X comes in the network, the rank that has to be applied by a router R is equal to the maximum length of its ECM paths to X, in its updated Shortest Path Tree. Note that X will be the first to update its FIB, as it will have a rank equal to 0. This translates the fact that X must have built its FIB before any router can use it to forward packets. 4. Completion messages As noted in the previous sections, only the routers that are currently using a link affected by a topology change need to update their FIB. The FIB of all the other routers does not need to be updated at all. Furthermore, the routers that need to update their FIB do not necessarily need MAX_FIB seconds to perform their FIB update [CCRJULY]. In this section, we show how completion messages can be used to speed up the convergence after a non-urgent topology change by shortcuting the computed rank while respecting the transient consistency of the routers. To do this, we adapt the mechanism proposed in [INFOCOM]. Routers maintain a waiting list, that lists the neighbours from which they have to wait for a completion message before being allowed to update their own FIB. Upon reception of a completion message from a Francois et al. Expires January 2006 [Page 6] Internet Draft Ordered FIB Updates July 2005 neighbour, a router removes this neighbour from its waiting list. Once its waiting list becomes empty, the router is allowed to update its FIB. The next sections explain how the waiting list must be built for each type of event, and when completion messages are sent by the routers. 4.1 Link down / metric increase Let us assume that the link X->Y goes down, or its metric is increased. A router R that currently has X->Y in its SPT computes rSPT(Y) to determine its rank. By doing this, it computes the set of routers (and thus implicitly the set of its neighbours) that use it to reach link X->Y. The waiting list of R is equal to the set of neighbours that use it to reach X->Y. These are the neighbours below R in rSPT(Y). The current SPT of R gives the set of neighbours that it uses to reach X->Y. These neighbours have to wait for R, and R is thus present in their waiting list. This set of neighbours is the set of neighbours to which R will send its completion messages, once it has updated its FIB. 4.2 Link up / metric decrease Let us assume that the link X->Y goes up, or its metric is updated. A router R will compute its new SPT (SPT_new(R)). If X->Y is in present in SPT_new(R), R will use link X->Y and thus needs to compute its rank. For this, R computes the set of routers (and implicitly its set of neighbours) that it uses to reach X->Y. The nexthops for X in SPT_new(R). R must reroute after those routers to respect the ordering. Its waiting list is thus equal to those nexthops. When R has updated its FIB, it can send a completion message to its neighbours, so that a neighbour that has R in its waiting list will be allowed to remove R from it. 4.3 Router down Let us assume that a router X goes down. A router R will compute rSPT(X) to determine its rank. When doing this, it also computes the set of its neighbours that uses R to reach X. The waiting list of R is equal to this set. The current SPT of R gives the set of neighbours that will have R in their waiting list. Once R has updated its FIB, it will send its completion messages to those neighbours. Francois et al. Expires January 2006 [Page 7] Internet Draft Ordered FIB Updates July 2005 4.4 Router up Let us assume that a router X comes in the network. A router R will compute its new SPT. It thus also computes its nexthops for X. This set of nexthops (more than one in the case of equal cost paths to X) is the waiting list that R will use. Once R has updated its FIB, it sends a completion message to its neighbours, so that a neighbour that has R in its waiting list will be allowed to remove R from it. 5. Ranking Shortcuts. In the case of a link (X->Y) down or metric increase, a router R can sometimes shortcut its rank because it can decide, when computing the rSPT, which neighbours are not affected by the event. These are the neighbours that did not use the link X->Y, so that their paths to all the destinations do not change. Let us assume that link X->Y goes down. While R computes rSPT(Y), it can mark the routers that use the link X->Y. These are the routers that are below X->Y in the rSPT. If a neighbour N of R is not marked, all the FIB updates that will let packets be forwarded to N by R can be performed directly. Note that if N uses Y->X, R can still reroute to N because R and N are not affected for the same set of destinations, so that the destinations affected in R, using X->Y, are never affected in a router N using Y->X. If a FIB update contains an ECMP entry for a destination d, and not all the nexthops for d were not marked, then the normal ranking can be applied for this destination, or the FIB update will have to be modified in order to only reroute to the unmarked nexthops. In this case, a new FIB update will have to be performed for d when the rank timer has elapsed or all the necessary completion messages have arrived. The choice between these two solutions is a trade-off between an increased number of FIB updates and a faster convergence. References [OSPF] J. Moy, OSPF Version 2. RFC 2328. Apr. 1998. [IS-IS] "Intermediate system to Intermediate system routeing information exchange protocol for use in conjunction with the Protocol for providing the Connectionless-mode Network Service (ISO 8473)," ISO/IEC 10589:2002, Second Edition. Francois et al. Expires January 2006 [Page 8] Internet Draft Ordered FIB Updates July 2005 [IPFRR] M. Shand, S. Bryant, IP Fast Reroute Framework, draft-ietf-rtgwg-ipfrr-framework-03.txt, June 2005 [MPLSFRR] Pan, P. et al, "Fast Reroute Extensions to RSVP-TE for LSP Tunnels", RFC 4090 [PFOB05] P. Francois, O. Bonaventure. Avoiding transient loops during IGP convergence. IEEE INFOCOM 2005, March 2005, Miami, Fl., USA [CCRJULY] P. Francois, C. Filsfils, J. Evans, O. Bonaventure. Achieving sub-second IGP convergence in large IP networks, ACM SIGCOMM Computer Communication Review, July 2005 [SLFTV] A. Atlas, S. Bryant, M. Shand, Synchronization of Loop Free Timer Values, draft-atlas-bryant-shand-lf-timers-00.txt Appendix A : Pseudo-codes Figure 1 shows the pseudocode for the router reaction in case of non- urgent link failure or metric increase. Pseudo-code for link (X->Y) down/metric increase in router R : If(X->Y in SPT(R)){ //R is affected by the event //Compute rspt(Y) rspt = rSPT(Y); //compute the rank from rspt (can also be done directly //during rspt computation) worst_case_update_delay = depth(R,rspt) * MAX_FIB //Build the waiting list : these are the neighbours below R in rspt // WaitingList = getNeighboursUpstream(R,rspt); //Build the list of neighbours to which R will send a //completion message : //these are the nexthops for X in the current SPT of R completionMessageSendingList = nexthops(X,SPT(R)); //Wait for the waiting list to be empty or the rank time //to elapse. while(not WaitingList.empty() || worst_case_update_delay.elapsed()){ Francois et al. Expires January 2006 [Page 9] Internet Draft Ordered FIB Updates July 2005 if(completionMessageReceived()){ //New completion message received WaitingList.remove(completionMessage.sender); } } //Rank time has elapsed or waiting list is empty UpdateFIB(); //FIB has been updated, send completion messages ForEach(N in completionMessageSendingList){ send(N,"FIB Updated for X->Y"); } } else{ //R is not affected by the event, nothing to do } Figure 1 : Pseudocode for link down/metric increase Figure 2 provides the pseudocode for the reaction of a router to a non-urgent link up or metric decrease event. Pseudo-code for link (X->Y) up/metric decrease in router R : //Compute the new SPT. newSPT = computeSPT(R) If(X->Y in newSPT){ //R is affected by the event //compute the rank of the router. This is the length (in hops) //from R to X in the new spt. (note that it is equal to the //length from R to X in the old spt. worst_case_update_delay = PathLength(R,X,newSPT) * MAX_FIB //Build the waiting list : these are the nexthops for X in the new SPT. WaitingList = nexthops(R,X,newSPT); //Wait for the waiting list to be empty or the rank time to elapse. while(not WaitingList.empty() || worst_case_update_delay.elapsed()){ if(completionMessageReceived()){ //New completion message received WaitingList.remove(completionMessage.sender); } } //Rank time has elapsed or waiting list is empty UpdateFIB(); //Fib updated, send completion messages to the neighbours. ForEach(N in Neighbours(R)) Send(N,"Fib Updated X->Y"); Francois et al. Expires January 2006 [Page 10] Internet Draft Ordered FIB Updates July 2005 } else{ //R is not affected by the event, nothing to do } Figure 2 : Pseudocode for link up/metric decrease Authors Addresses Pierre Francois Universite catholique de Louvain Place Sainte Barbe, 2 B-1348, Louvain-La-Neuve Belgium Email: pfr@info.ucl.ac.be Olivier Bonaventure Universite catholique de Louvain Place Ste Barbe, 2 B-1348, Louvain-La-Neuve Belgium Email: Bonaventure@info.ucl.ac.be Mike Shand Cisco Systems, 250, Longwater Avenue, Green Park, Reading, RG2 6GB, United Kingdom Email: mshand@cisco.com Stefano Previdi Cisco Systems Via Del Serafico 200 00142 Roma - Italy Email : sprevidi@cisco.com Stewart Bryant Cisco Systems, 250, Longwater Avenue, Green Park, Reading, RG2 6GB, United Kingdom Email: stbryant@cisco.com Francois et al. Expires January 2006 [Page 11] Internet Draft Ordered FIB Updates July 2005 Intellectual Property Considerations The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf- ipr@ietf.org. Full Copyright Statement Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Francois et al. Expires January 2006 [Page 12]