Network Working Group                                   C. Filsfils, Ed.
Internet-Draft                                               D. Cai, Ed.
Intended status: Informational                                S. Previdi
Expires: April 5, 2017                                             Cisco
                                                           W. Henderickx
                                                          Alcatel-Lucent
                                                               R. Shakir
                                                                      BT
                                                               D. Cooper
                                                             F. Ferguson
                                                                  Level3
                                                                  S. Lin
                                                               Microsoft
                                                              T. LaBerge
                                                                   Cisco
                                                             B. Decraene
                                                                  Orange
                                                                L. Jalil
                                                                 Verizon
                                                             J. Tantsura
                                                                Ericsson
                                                         October 4, 2016


       Interconnecting Millions Of Endpoints With Segment Routing
           draft-filsfils-spring-large-scale-interconnect-03

Abstract

   This document describes an application of Segment Routing to scale
   a network to support hundreds of thousands of network nodes and
   tens of millions of physical underlay endpoints.  This use case can
   be applied to the interconnection of massive-scale DCs and/or large
   aggregation networks.  Forwarding tables of midpoint and leaf nodes
   only require a few tens of thousands of entries.

Filsfils, et al.          Expires April 5, 2017                 [Page 1]

Internet-Draft              Segment Routing                 October 2016

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on April 5, 2017.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Simplified BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Reference Design
   3.  Control Plane
   4.  Illustration of the scale
   5.  Optional Designs
   6.  Deployment Model
   7.  Benefits
   8.  IANA Considerations
   9.  Manageability Considerations
   10. Security Considerations
   11. Acknowledgements
   12. References
   Authors' Addresses

1.  Introduction

   This document describes how SR can be used to interconnect millions
   of endpoints.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

   Term      Definition
   -------   ------------------------------------------------
   Agg       Aggregation
   BGP       Border Gateway Protocol
   DC        Data Center
   DCI       Data Center Interconnect
   ECMP      Equal Cost MultiPathing
   FIB       Forwarding Information Base
   LDP       Label Distribution Protocol
   LFIB      Label Forwarding Information Base
   MPLS      Multi-Protocol Label Switching
   PCE       Path Computation Element
   PCEP      Path Computation Element Protocol
   PW        Pseudowire
   SR        Segment Routing
   TI-LFA    Topology Independent - Loop Free Alternative

2.  Reference Design

   +-------+ +--------+ +--------+ +-------+ +-------+
   A       DCI1       Agg1       Agg3      DCI3      Z
   |  DC1  | |   M1   | |   C    | |  M2   | |  DC2  |
   |       DCI2       Agg2       Agg4      DCI4      |
   +-------+ +--------+ +--------+ +-------+ +-------+

   For example, an operator could do the following:

   -  Independent ISIS-OSPF/SR instance in the core (C).

   -  Independent ISIS-OSPF/SR instance in Metro1 (M1).

   -  Independent ISIS-OSPF/SR instance in Metro2 (M2).

   -  BGP/SR in DC1.

   -  BGP/SR in DC2.

   -  Agg routes are redistributed from C to the M domains and from
      the M domains to the DC domains.  Nothing else is redistributed.

   -  The same homogeneous SR Global Block (SRGB) is used throughout
      the domains (e.g. 16000-23999).

   -  Unique SRGB sub-ranges are allocated to each metro and core
      domain: 16000-16999 to the core, 17000-17999 to metro1,
      18000-18999 to metro2.  Specifically, Agg3 is 16003 and the
      anycast SID for (Agg3, Agg4) is 16006.  DCI3 is 17003 and the
      anycast SID for (DCI3, DCI4) is 17006.

   -  The same SRGB sub-range is re-used for each DC: e.g.
      20000-23999.  Specifically, A and Z are both 20001.

3.  Control Plane

   How the SRTE policies are computed and programmed at the source
   nodes is out of the scope of this document.  This section provides
   a high-level description of an implemented control plane.

   The service orchestration programs A with a PW to a remote next-hop
   Z with a given SLA contract (e.g. a low-latency path, disjointness
   from a specific core plane, disjointness from a different PW
   service).

   A automatically detects that it does not have reachability to Z.
   It then automatically sends a PCEP request to an SR PCE for an SRTE
   policy that provides reachability to Z with the requested SLA.

   The SR PCE is made of two components: a multi-domain topology and a
   compute block.  The multi-domain topology is continuously refreshed
   from BGP-LS feeds from each domain.  The compute block implements
   TE algorithms designed specifically for SR path expression.

   Upon receiving the PCEP request, the SR PCE computes the solution
   (e.g. {16003, 16005, 18001}) and provides it to A.  The SR PCE logs
   the request as a stateful query and hence recomputes another
   solution upon any multi-domain topology change that invalidates the
   previous solution.

   A receives the PCEP reply with the solution.  A installs the
   received SRTE policy in the dataplane.  A automatically steers the
   PW on that SRTE policy.

4.  Illustration of the scale

   The illustration assumes 1 core domain and 100 leaf domains.

   The core domain has 200 core nodes: two core nodes per leaf domain,
   assuming that a core node connects to only one leaf domain.  With a
   specific node segment per core node and an anycast segment per pair
   of core nodes, this gives 300 prefix segments in total (200 node
   segments plus 100 anycast segments).

   Each leaf domain has 6,000 leaf node segments.  Each leaf node has
   500 endpoints attached, thus 500 adjacency segments.  In total,
   there are 3M endpoints per leaf domain.

   Network-wide scale:

      6,000 x 100 = 600,000 nodes
      6,000 x 100 x 500 = 300M endpoints

   Per-node segment scale:

      Leaf node segment scale: 6,000 (leaf node segments) + 300 (core
      node segments) + 500 (adj segments) = 6,800

      Core node segment scale: 6,000 (leaf domain segments) + 300
      (core domain segments) = 6,300

   The above calculation does not count the link adjacency segments,
   which are local to each node.  Typically, there are fewer than 100
   of them per node.

   Note that, depending on the leaf node FIB capability, we could
   split the leaf domain into multiple smaller domains.
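
   The arithmetic of this section can be reproduced with a short
   Python sketch.  The class, field and method names (ScaleModel,
   leaf_domains, split, and so on) are illustrative assumptions and
   not part of the design.  The split parameter models dividing a leaf
   domain into equal smaller domains while keeping the 300 core prefix
   segments unchanged, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleModel:
    """Inputs of the scale illustration (names are illustrative)."""

    leaf_domains: int = 100
    core_nodes_per_leaf_domain: int = 2
    leaf_nodes_per_domain: int = 6_000
    endpoints_per_leaf_node: int = 500

    def core_prefix_segments(self) -> int:
        # One node segment per core node, plus one anycast segment per
        # group of core nodes serving a leaf domain.
        node_segments = self.leaf_domains * self.core_nodes_per_leaf_domain
        anycast_segments = self.leaf_domains
        return node_segments + anycast_segments

    def total_leaf_nodes(self) -> int:
        # Leaf nodes only; the core nodes are not counted here.
        return self.leaf_nodes_per_domain * self.leaf_domains

    def total_endpoints(self) -> int:
        return self.total_leaf_nodes() * self.endpoints_per_leaf_node

    def leaf_node_segments(self, split: int = 1) -> int:
        # Assumption: splitting a leaf domain into `split` equal parts
        # only shrinks the leaf node segment count; the core prefix
        # segment count is left unchanged.
        leaf_segments = -(-self.leaf_nodes_per_domain // split)  # ceiling
        return (leaf_segments
                + self.core_prefix_segments()
                + self.endpoints_per_leaf_node)

    def core_node_segments(self) -> int:
        # Link adjacency segments local to the node are not counted.
        return self.leaf_nodes_per_domain + self.core_prefix_segments()


if __name__ == "__main__":
    model = ScaleModel()
    print("leaf nodes:", model.total_leaf_nodes())
    print("endpoints:", model.total_endpoints())
    print("segments per leaf node:", model.leaf_node_segments())
    print("segments per core node:", model.core_node_segments())
```

   Changing a field of the model shows how the per-node segment counts
   move with the domain size, independently of the total number of
   endpoints.
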
   In the above example, we can split each leaf domain into 6 smaller
   leaf domains, so that each leaf node only needs to learn 1,000
   (leaf node segments) + 300 (core node segments) + 500 (adj
   segments) = 1,800 segments.

5.  Optional Designs

5.1.  SRGB size

   In the simplified illustrations of this document, we picked a small
   homogeneous SRGB range of 16000-23999.  In practice, a large-scale
   design would use a bigger range such as 16000-80000, or even
   larger.

5.2.  Redistribution of Agg routes

   The operator might choose not to redistribute the Agg routes into
   the Metro/DC domains.  In that case, more segments are required to
   express an inter-domain path.

   For example, A would use an SRTE policy {DCI1, Agg1, Agg3, DCI3, Z}
   to reach Z instead of {Agg3, DCI3, Z} in the reference design.

5.3.  Sizing of the domains and number of Tiers

   The operator is free to choose among a small number of larger leaf
   domains, a large number of small leaf domains or a mix of small and
   large domains.

   The operator is free to use a 2-tier design (Core/Metro) or a
   3-tier design (Core/Metro/DC).

5.4.  Local Segments to Hosts/Servers

   Local segments can be programmed at any leaf node (e.g. Z) in order
   to identify locally-attached hosts (or VMs).  For example, if Z has
   bound a local segment 40001 to a local host ZH1, then A uses the
   following SRTE policy to reach that host: {16006, 17006, 20001,
   40001}.  Such a local segment could represent the NID (Network
   Interface Device) in the context of the SP access network, or a VM
   in the context of the DC network.

5.5.  Compressed SRTE policies

   Consider that A reaches Z with a low-latency SLA contract via the
   SRTE policy {16001, 16002, 16003, 17006, 20001}.

   The control-plane solution can install an SRTE policy {16002,
   16003, 17006} at Agg1, collect the Binding SID allocated by Agg1 to
   that policy (e.g.
   4001) and hence program A with the compressed SRTE policy {16001,
   4001, 20001}.

   From A, 16001 leads to Agg1.  Once at Agg1, 4001 leads to the DCI
   pair (DCI3, DCI4) via a specific low-latency path {16002, 16003,
   17006}.  Once at that DCI pair, 20001 leads to Z.

   Binding SIDs allocated to "intermediate" SRTE policies allow
   "end-to-end" SRTE policies to be compressed.  {16001, 4001, 20001}
   expresses the same path as {16001, 16002, 16003, 17006, 20001} but
   with 2 fewer segments.

   Binding SIDs also provide inherent churn protection.  When the core
   topology changes, the control plane can update the low-latency SRTE
   policy from Agg1 to the DCI pair to DC2 without updating the SRTE
   policy from A to Z.

6.  Deployment Model

   This design is expected to be deployed both as a green field and in
   interworking (brown field) with the Seamless MPLS design
   [draft-ietf-mpls-seamless-mpls].

7.  Benefits

7.1.  Inter-domain interconnection of millions of endpoints

   We have illustrated how millions of endpoints across different
   domains can be interconnected.

7.2.  Simplified operation

   We have eliminated two protocols (LDP, RSVP-TE) and have not added
   any.  The design leverages the core IP protocols: ISIS, OSPF, BGP
   and PCEP, with straightforward SR extensions.

7.3.  Inter-domain SLA

   We leverage TI-LFA sub-50msec FRR upon Link/Node/SRLG failure.

   We leverage the optional use of Anycast SIDs for further
   availability improvement.

   We have shown how inter-domain SLAs can be delivered: e.g. latency
   vs cost optimized path, disjointness from backbone planes,
   disjointness from other services, disjointness between primary and
   backup paths.

   We note that the existing inter-domain solutions (Seamless MPLS) do
   not provide any support for SLA contracts.  They just provide
   best-effort reachability across domains.

7.4.  Scale

   We have eliminated two protocols and not added any.
   We have eliminated midpoint states on a per-service basis.

7.5.  ECMP

   Each policy (intra or inter-domain, with or without TE) is
   expressed as a list of segments.  As each segment is optimized for
   ECMP, the entire policy is optimized for ECMP.  The ECMP gain of
   anycast prefix segments should also be considered (e.g. 16001
   load-shares across any gateway from the L1 leaf domain to the Core
   and 16002 load-shares across any gateway from the Core to the L2
   leaf domain).

8.  IANA Considerations

   None.

9.  Manageability Considerations

   TBD.

10. Security Considerations

   TBD.

11. Acknowledgements

   We would like to thank Giles Heron, Alexander Preusche and Steve
   Braaten for their contribution to the content of this document.

12. References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

12.2.  Informative References

   [draft-ietf-mpls-seamless-mpls]
              Leymann, et al., "Seamless MPLS Architecture",
              draft-ietf-mpls-seamless-mpls-07 (work in progress),
              July 2015.

   [draft-francois-spring-segment-routing-ti-lfa-01]
              Francois, P., et al., "Topology Independent Fast Reroute
              using Segment Routing",
              draft-francois-spring-segment-routing-ti-lfa-01 (work in
              progress), April 2015.

Authors' Addresses

   Clarence Filsfils (editor)
   Cisco Systems, Inc.
   Brussels
   BE

   Email: cfilsfil@cisco.com


   Dennis Cai (editor)
   Cisco Systems, Inc.
   170, West Tasman Drive
   San Jose, CA  95134
   US

   Email: dcai@cisco.com


   Stefano Previdi
   Cisco Systems, Inc.
   Via Del Serafico, 200
   Rome  00142
   Italy

   Email: sprevidi@cisco.com


   Wim Henderickx
   Alcatel-Lucent

   Email: wim.henderickx@alcatel-lucent.com


   Rob Shakir
   BT

   Email: rob.shakir@bt.com


   Dave Cooper
   Level 3

   Email: Dave.Cooper@Level3.com


   Francis Ferguson
   Level 3

   Email: Francis.Ferguson@level3.com


   Tim LaBerge
   Cisco

   Email: tlaberge@cisco.com


   Steven Lin
   Microsoft

   Email: slin@microsoft.com


   Bruno Decraene
   Orange

   Email: bruno.decraene@orange.com


   Luay Jalil
   Verizon
   400 International Pkwy
   Richardson, TX  75081

   Email: luay.jalil@verizon.com


   Jeff Tantsura
   Ericsson

   Email: jeff.tantsura@ericsson.com