Network Working Group P. Lapukhov
Internet-Draft E. Nkposong
Intended status: Informational Microsoft Corporation
Expires: March 05, 2014 September 01, 2013

Centralized Routing Control in BGP Networks Using Link-State Abstraction
draft-lapukhov-bgp-sdn-00

Abstract

Some operators deploy networks consisting of multiple BGP Autonomous-Systems (ASNs) under the same administrative control. There are also implementations which use only one routing protocol, namely BGP, as in [I-D.lapukhov-bgp-routing-large-dc], for example. In such designs, inter-AS traffic engineering is commonly implemented using BGP policies, by configuring multiple routers at the ASN boundaries. This distributed policy model is difficult to manage and scale due to its dependency on complex routing policies and the need to develop and maintain a model for per-prefix path preference signaling. One example of such models could be standard BGP community-based (see [RFC1997]) signaling, which requires careful documentation and consistent configuration. Furthermore, automating such policy configuration changes for the purpose of centralized management requires additional efforts and is dependent on a particular vendor's configuration management (CLI extensions, NetConf [RFC6241] etc).

This document proposes a method for inter-AS traffic engineering for use with the kind of deployment scenarios outlined above. No protocol changes or additional features are required to implement this method. The key to the proposed methodology is a new software entity called "BGP Controller" - a special purpose application that peers with all eBGP speakers in the managed network. This controller constructs live state of the underlying BGP ASN graph and presents multi-topology view of this graph via a simple API to third-party applications interested in performing network traffic engineering. An example application could be an operational tool used to drain traffic from network devices. In response to changes in the logical network topology proposed by these applications, the controller computes new routing tables, and pushes them down to the network devices via the established BGP sessions.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on March 05, 2014.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


Table of Contents

1. Introduction

BGP was intentionally designed as a path-vector protocol, since efficiently distributing link-state information for Internet-sized graph is virtually impossible. However, some network deployments leverage multiple BGP ASN to separate IGP domains, or simply use BGP as the only routing protocol. See, for example [I-D.lapukhov-bgp-routing-large-dc] which proposes using a BGP AS either per network device or "horizontal" device group, within a data-center. In such cases, the number of BGP ASNs is very small when compared to the Internet - on the order of few thousands in the largest case.

Under these assumptions, it becomes possible to build and maintain a link-state graph of the complete inter-AS topology and compute network paths based on this link-state information. In accomplishing this, it is desirable to avoid adding any protocol extensions so that current implementations can leverage the proposed method, such as those described, for example in [RWHITE2005]. Instead, this document proposes the use of a centralized agent (referred to as "BGP Controller" or simply "the controller") that peers with all eBGP speakers in the underlying network. The BGP Controller is responsible for constructing an up-to-date link-state view of the BGP inter-AS graph and pushing down routing information (prefixes and their associated next-hops) to the network devices via BGP updates. The new routing information reflects the results of link-state path computations performed by the controller. Such routing information push is possible because BGP supports the next-hop attribute that could be recursively resolved via either IGP or BGP. Notice that while the controller pushes routing information to the device, the underlying BGP processes also compute the best-paths for the same prefixes using the path-vector logic in the regular way. However, the BGP Controller could override this information by manipulating BGP attributes of injected routes, such as LOCAL_PREF to make its own advertisements more preferred.

Third party applications can influence routing computations by creating logical alternations of the network link-state graph, e.g. changing the cost of the links from the BGP Controllers point of view. This document will refer to those constructs as "alternate topologies" (or simply "topologies" for short), while the original, unaltered, link-state graph will referred to as the "default topology". The controller would use these alternate topologies to make routing decisions different from those that BGP would have made based on available information. It is possible to create multiple alternate topologies and associate different prefixes with every topology, with the restriction that each prefix maps to one and only one topology. Once this mapping is defined, the BGP Controller would perform autonomously, detecting network faults and reacting by re-computing routing information as needed based on the effect that the failure has across all instantiated topologies.

In many aspects, the proposed method was inspired by and is similar to the "Routing Control Platform" [RCP], but differs in the fact that link-state discovery is done using BGP mechanics only, and overall BGP is the only protocol used to build the system.

2. Overview

2.1. Use Cases

Primary intended use case of the BGP Controller is inter-AS traffic engineering. This includes, but is not limited, to the following:

The main benefit of the proposed approach is centralized control of the above functions. There is no need to configure policies on multiple devices, as all routing changes could be done using the uniform light-weight API to the controller. This ensures ease of automation and consistent changes. Furthermore, such a centralized model should be deployed to augment the classical distributed routing policy configuration. The advantage is that centralized control could be disabled at any time, falling the network back to the "traditional" BGP decision model, thus allowing for a safe state to roll-back to. Next, knowing the link-state of the network may allow avoiding the BGP path-hunting problem, and improve global BGP convergence timing in a large group of heavily meshed ASNs. Additionally, to avoid the phenomena of routing micro-loops the controller could enforce certain ordering for the network device programming sequence. Specifically, every time a link-state change is proposed to the controller, the devices in the network are programmed starting with those farther away from the change in terms of the metric of the existing graph. The same logic applies to link-down conditions detected by the controller via the health probing mechanism described below.

2.2. Architectural Assumptions

Firstly, the devices in the network are assumed (but not required) to have minimal BGP policy applied, enough for them to exchange routing information and compute best-paths based on shortest AS_PATH lengths. This means that the configured policy should not override best-path selection process using LOCAL_PREF or any other BGP attributes for enforcing a custom routing policy. The assumption of the "minimal policy" allows for making the BGP Controllers update logic less intrusive, as described further in the section Section 4.7. Next, every device is assumed to advertise a locally bound prefix into BGP for the purpose of BGP peering with the controller. That is, the controller peers "inband" with the devices it controls - either by initiating iBGP sessions to all devices or by passively accepting the sessions from the devices. As will be shown in the Section 5, inband peering requirement is important to avoid inconsistencies between multiple controllers programming the same network.

Another major assumption is how the link-state graph vertices are defined. From the BGP Controller perspective, there are two type of vertices: Figure 1 illustrates this concept:

The following

  
            Legend
            ------- eBGP 
            ....... iBGP   

                                  eBGP Peering
                                        |
                                  +-----+-----+
                                  |     |     |
                                  |   +-+-+   |
                                  |   |R3 |   |
                                  |   +-+-+   |
                                  |     |     |
                                  +-----+-----+
                                        |

                                   eBGP Peering                          

                                  |            |                         
                        +---------+------------+----------+              
                        |         |     AS1    |          |              
                        |       +-+-+        +-+-+        |                   
                        |       |R1 |        |R2 |        |                
                        |       +-+-+        +-+-+        |              
                        |         |            |          |              
                        +---------+------------+----------+              
                                  |            |                               

                    Type 1: Each device is individual graph vertex 
                          (three vertices, each with two edges).

                                  |           |
                             +----+-----------+----+
                             |    |           |    |
                             |  +-+-+       +-+-+  |
                             |  |R1 |.......|R2 |  |
                             |  +---+.     .+---+  |
                             |    .    .  .   .    |
                             |    .     .     .    |
                             |    .    .  .   .    |
                             |    .  .     .  .    |
                             |  +---+       +---+  |
                             |  |R3 |.......|R4 |  |
                             |  +-+-+       +---+  |
                             |    |           |    |
                             +----+-----------+----+
                                  |           |

                    Type 2: All devices below are grouped into 
                           single vertex with four edges.
          

Figure 1: Graph Vertices

Routing information could be associated with a graph vertex either by means of static binding or dynamic discovery: this process is described in details in sections Section 4.3. When programming the network prefixes into the devices, the controller does not inject a prefix back in the vertex the prefix is associated with.

The BGP Controller decision logic is independent of the address family, and could apply to both IPv4 and IPv6 prefixes equally. It is possible to run two independent controllers, one for each address family. This allows for full "fate decoupling" between the address families, though may result in duplication of the link state information.

The edges of the constructed link-state graph may have two attributes: metric, which is additive, and capacity (bandwidth) that is non-additive. The former is used to compute shortest paths, and the latter could be used to compute ECMP weight values in case where multiple equal-cost paths exist to the same vertex. For every ECMP path, the minimum capacity value that occurs along that path will be used as its weight by the controller, if the underlying network supports weighted ECMP functionality.

2.3. BGP Controller

The Figure 2 demonstrates the BGP Controller peering with the network devices. Multiple managed devices peer via eBGP following the traditional BGP design. For simplicity, we assume that every device belongs to it's own ASN - see Section 4.6 for more information on handling the "compound" Type-2 vertices consisting of multiple BGP speakers interconnected with iBGP mesh. Prefixes P3, P4 and P5 are associated with the devices (vertices) in ASNs 3, 4, and 5 respectively using techniques described in Section 4.3. The other remaining vertices are assumed to be purely transit for the purpose of this discussion.

These devices exchange routing information in the usual manner and the BGP Controller establishes iBGP peering sessions with every device. It uses the technique described in section Section 3.1 to build the inter-AS link-state graph. For now, it is sufficient to say that the discovery process uses special "beacon" prefixes dynamically injected into the network and relayed back to the controller to discover the state of the links interconnecting the graph vertices.

    
            Legend:

            ------- iBGP (controller to network)
            ....... eBGP (ASN to ASN)

                                 BGP Controller
                                   +-------+
                                   |       | 
                                   +-------+
                                    || | ||
                                    || | ||     
                      +-------------+| | |+--------------+     
                      |         +----+ | +----+          |                 
                      |         |      |      |          |
                      |         v      |      v          |
                      |       +---+    |    +---+        |         
                      |       |AS1|....|....|AS2|        |      
                      v       +---+    |    +---+        v 
                    +---+       .      |     . .       +---+
                 P3 |AS3|........      |     . ........|AS4| P4                           
                    +---+              |     .         +---+
                      .                V     .           .
                      .              +---+....           .
                      ...............|AS5|................        
                                     +---+ 
                                       P5
          

Figure 2: BGP Controller

At this point, the BGP Controller has knowledge of the link-state graph as well as the prefixes associated with every vertex, and can now run Dijkstra's SPF algorithm to compute shortest paths between vertices. A result of this computation would be routing tables built for every vertex. The Section 3.2 below demonstrates the adjacency list built by the controller for the above topology, as well as routing-tables computed for every vertex. The next-hops in the routing tables presented in the figure are simply the vertices to send the packets to. When programming the network devices, the actual IP addresses of the next-hops are computed as described in Section 4.1 section. This routing state corresponds to the unaltered (default) topology.

3. Link-State Abstraction and Multiple Topologies

This section provides detailed information on the link-state abstractions used by the controller and how those are used to perform traffic engineering in the underlying network.

3.1. Link-State Discovery Process

The network devices that the controller peers with establish eBGP peering sessions with each other. The fact that there is one-to-one correspondence between eBGP sessions and underlying IP link allows using the state of the eBGP session as the indication of the IP link health. Specifically, this is accomplished by injecting special "beacon" prefixes into every vertex (which could be a device or collection of devices interconnected with iBGP mesh) and expecting those beacons to be re-advetised back to the controller by every vertex adjacent to the point of injection. If a particular BGP session is down, the injected prefix will not be re-advertised by the affected peer back to the controller, allowing us to conclude that the corresponding link is down.

The Figure 3 demonstrates this process. For simplicity, we assume that every device belongs to its own BGP ASN. The BGP controller injects prefix X into device R1 and expects to hear this prefix from device R2. At the same time, it is desirable to prevent this prefix from leaking any farther than one hop away from R1, i.e. make sure it is not re-advertised to R3. To accomplish this, prefix X could be tagged with a special community value, which is replaced with the well-known community "no-export" when advertising over eBGP session. Because of this policy, the prefix will be announced back to the controller as it uses iBGP session for peering, but not any further to eBGP peers of router R2 in our case. An alternative to using the standard BGP communities could be leveraging the wide-communities limiting the scope of the announced prefixes - see [I-D.raszuk-wide-bgp-communities] for more details on this technique.