BESS WorkGroup                                                S. Mohanty
Internet-Draft                                                  M. Ghosh
Intended status: Informational                                A. Sajassi
Expires: May 5, 2020                                       Cisco Systems
                                                               S. Breeze
                                                                Claranet
                                                               J. Uttaro
                                                                     ATT
                                                        November 2, 2019


          BGP EVPN Flood Traffic Optimization at EVPN Gateways
                   draft-mohanty-bess-evpn-bum-opt-01

Abstract

   In EVPN, Broadcast, Unknown Unicast and Multicast (BUM) traffic is
   sent to all the routers participating in the EVPN instance.  In a
   multi-homing scenario, when more than one PE shares the same
   Ethernet Segment (ES), i.e., there is more than one PE in a
   redundancy group, only the PE that is the Designated Forwarder (DF)
   for the ES forwards a BUM packet on the access interface, whereas
   all non-DF PEs drop it.  In deployments such as EVPN Gateways (EVPN
   GW) or Data Center Interconnect (DCI) routers, this can be quite
   wasteful.  This is especially true when many EVPN GW or DCI PEs
   participate in the same sets of ESes and vESes.  This document
   describes the problem and proposes solutions to it.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 5, 2020.

Mohanty, et al.            Expires May 5, 2020                  [Page 1]
Internet-Draft            BGP BUM Optimization             November 2019

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Requirements Language and Terminology
   2.  Introduction
   3.  Problem Description
   4.  Solutions
     4.1.  DF Election per-mcast-flow
     4.2.  Suppress the advertisement of the IMET route
     4.3.  Advertisement of the IMET route from the BDF
   5.  Protocol Considerations
   6.  Operational Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Contributors
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Requirements Language and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   o  ES: Ethernet Segment
   o  vES: Virtual Ethernet Segment
   o  EVI: EVPN Instance; this corresponds to a MAC-VRF
   o  IMET: Inclusive Multicast Ethernet Tag route
   o  DF: Designated Forwarder
   o  BDF: Backup Designated Forwarder
   o  DCI: Data Center Interconnect Router

2.  Introduction

   EVPN [RFC7432] describes a solution for disseminating MAC addresses
   over an MPLS core via the Border Gateway Protocol (BGP).  In EVPN,
   data plane learning is confined to the access, and control plane
   learning happens via BGP in the core.  This prevents unnecessary
   flooding in the data plane, as traffic is directed to where the
   destination was learnt from.  However, in the case of Broadcast,
   Unknown Unicast and Multicast (BUM) traffic, the PE needs to flood
   to all the other PEs in the domain.

   PEs elect a Designated Forwarder (DF) amongst themselves, for a
   given ES, by exchanging Type-4 routes via BGP.  The role of the DF
   is to forward BUM traffic received from the core towards its
   access-facing interface.  A PE in a non-DF role drops flood traffic
   received on its core-facing interface.  Note that the DF election
   process is confined to the set of PEs that host the same Ethernet
   Segment.  Remote PEs are not interested in Type-4 routes for
   Ethernet Segments that they do not host.  Hence, remote PEs are
   unaware of the DFs for segments that are not local to them.
   Consequently, when a remote PE floods BUM traffic using ingress
   replication, it floods the frames to all participating PEs,
   irrespective of whether they are DFs or not.  The key to building
   the list of PEs to flood to is the Inclusive Multicast Ethernet Tag
   (IMET) route, which is described below.

   The IMET route (Type-3) in EVPN advertises the BUM label for the EVI
   to all the other PEs that are interested in the same EVI.  For
   ingress replication, the label is carried in the PMSI attribute.
   The label is used to encapsulate the BUM traffic at the ingress
   entity.  This label is inserted just above the split-horizon label
   in the BUM frame.
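
   The way a flood list follows from received IMET routes under
   ingress replication can be illustrated with a minimal Python
   sketch.  It is not part of any specification: the ImetRoute record,
   its field names, the helper functions and the label values are all
   hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ImetRoute:
    """Hypothetical, simplified view of a received Type-3 route."""
    originator: str  # originating PE (illustrative name)
    evi: int         # EVPN instance the route belongs to
    bum_label: int   # label from the PMSI attribute (IR case)


def build_flood_list(local_pe: str, evi: int,
                     routes: List[ImetRoute]) -> Dict[str, int]:
    """Return {remote PE: BUM label} for one EVI.

    Every PE that advertised an IMET route for the EVI is included,
    whether or not it is DF for any segment: the ingress PE has no
    way of knowing.
    """
    return {r.originator: r.bum_label
            for r in routes
            if r.evi == evi and r.originator != local_pe}


def replicate(frame: bytes,
              flood_list: Dict[str, int]) -> List[Tuple[str, int, bytes]]:
    """Produce one (next hop, label, frame) copy per flood-list entry."""
    return [(pe, label, frame)
            for pe, label in sorted(flood_list.items())]
```

   In this sketch, a PE that has received IMET routes for one EVI from
   three gateways holds three flood-list entries and therefore sends
   three copies of every BUM frame, regardless of which gateway is the
   DF.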
   When the BUM packet is received by a PE that is multi-homed to the
   same Ethernet Segment as the PE that originated the BUM packet, and
   that is the DF for that (EVI, ES) pair, the receiving PE pops the
   transport label and checks whether the split-horizon label is its
   own.  If so, it drops the packet if no other ES is configured;
   otherwise, it forwards the frame on all other segments that are
   part of the same EVI.  If the PE is not the DF, it drops the packet
   immediately.

                    ____              ____
                 __/    \__       ___/    \___
                /          \     /            \
     CE1+--+-+VTEP1         DCI1               PE1+---+CE10
           | |              |  |               |
           | |              |  |               |
     CE2+--+-+VTEP2  EVPN   DCI2      EVPN     |
             |       VXLAN  |  |      MPLS     |
             |       FWD    |  |      FWD      |
     CE3+----+VTEP3         DCI3               |
             |              |  |               |
             |              |  |               |
             |              |  |               |
     CEn+----+VTEPk         DCIj              /
                \__      ___/    \___      __/
                   \____/            \____/

   An EVPN Datacenter network with VXLAN forwarding joined to a
   traditional EVPN network with MPLS forwarding.  Adjoining DCI
   routers are said to be EVPN GWs.  A DCI will have a single vES (ESI)
   per BD, with multiple VTEP next-hops.

                                Figure 1

3.  Problem Description

   In Figure 1 above, DCI1, DCI2 and DCI3 are all multi-homed EVPN GWs
   for multiple VTEPs serving the same vES, say vES1.  PE1 has a single
   host which is not multi-homed.  The same EVPN instance
   (Bridge-Domain) exists on all the PEs and DCIs.  For this EVPN
   instance, DCI1 is the Designated Forwarder on vES1 and DCI2 is the
   backup DF [RFC8584].  When PE1 sends BUM traffic, the flooded frames
   are received by DCI1, DCI2, DCI3 and so on up to DCIj.  DCI1
   forwards the flood traffic on its vES towards all VTEPs
   participating in vES1.  DCI2, DCI3 and all other DCIs up to DCIj
   drop the flooded frames that they receive from the core.  It is
   therefore wasteful for DCI2 through DCIj to receive the flooded
   frames at all.

   Whilst the majority of deployments have two DCIs in the redundancy
   group, in some cases there may be more than two on the same vES.
   One example is when the capacity demands on the DCIs are close to
   their hardware limits.  In this scenario, operators may choose to
   protect their investments and increase their resilience by
   installing additional DCIs, instead of replacing the existing ones
   or further segmenting the datacenter network.  Further, increasing
   the number of DCIs results in more efficient load-balancing across
   VNIs.

   We can now formally describe the issue.  In general, consider an
   EVPN instance, EVIi, that exists in a DCI, say DCIj.  As per
   existing EVPN behavior, even if DCIj is not the DF for any of its
   virtual Ethernet Segments and there are no single-homed Ethernet
   Segments that are part of EVIi in DCIj, DCIj still receives BUM
   traffic meant for EVIi from a remote PE, PEk.  This traffic is
   simply dropped, as DCIj is not the DF for any of these virtual
   Ethernet Segments.

   1.  This is an unnecessary use of bandwidth in the EVPN core.

   2.  DCIj receives traffic only to drop it, which is a non-optimal
       use of its L2 forwarding engine.

   3.  PEk replicates a copy of the Ethernet frame to DCIj only for it
       to be dropped.  This consumes cycles at PEk.

   In this document we address the above problem and give possible
   solutions.

4.  Solutions

4.1.  DF Election per-mcast-flow

   Reducing bandwidth usage in the EVPN core is an operator's primary
   concern.  Given that the majority of BUM traffic volume comes from
   large multicast flows, adopting the mechanisms described in
   [I-D.ietf-bess-evpn-per-mcast-flow-df-election] not only improves
   the distribution of multicast traffic amongst DCI1...DCIj for a
   given vES; techniques such as not advertising the SMET route from a
   non-DF DCI also ensure that only the DCIs that have won the election
   for a group receive multicast traffic for that group.  This solution
   explicitly requires IGMP snooping in the BD where the vES resides.
   This solution does not solve the problem of unnecessary Broadcast
   and Unknown Unicast traffic being replicated to non-DFs, but it
   addresses the most prominent part of the bandwidth problem.

4.2.  Suppress the advertisement of the IMET route

   The next solution is for a DCI not to advertise the IMET route if
   the only outcome would be to drop the flooded traffic:

   o  DCIj needs to advertise the Inclusive Multicast Ethernet Tag
      route (Type-3 route) for an EVPN Instance, EVIi, if and only if
      EVIi is configured on at least one Ethernet Segment that also has
      a presence on another DCI (i.e., is multihomed) and DCIj is the
      DF for that specific Ethernet Segment.

   o  The Type-3 route SHOULD also be advertised if there is a
      single-homed Ethernet Segment in the EVI.

   o  When a DCI becomes DF for its first vES in an EVPN Instance, the
      IMET route should be advertised; on the last DF to non-DF
      transition, it should be withdrawn.

   In Figure 2, the same EVPN instance exists in DCI1, DCI2, DCI3, DCIj
   and PE1.  However, only DCI1 and PE1 advertise the IMET route, so
   PE1 sends the flood traffic to DCI1 only.

                    ____              ____
                 __/    \__ - - ->___/    \___
                /          \     /            \
     CE1+--+-+VTEP1         DCI1               PE1+---+CE10
           | |              |  |               |
           | |              |  |               |
     CE2+--+-+VTEP2  EVPN   DCI2      EVPN     |
             |       VXLAN  |  |      MPLS     |
             |       FWD    |  |      FWD      |
     CE3+----+VTEP3         DCI3               |
             |              |  |               |
             |              |  |               |
             |              |  |               |
     CEn+----+VTEPk         DCIj              /
                \__      ___/    \___      __/
                   \____/            \____/

                           An EVPN GW Network

                                Figure 2

   With this approach, if the DF (DCI1) fails, BUM traffic is dropped
   until the IMET route from the newly elected DF (one of DCI2 through
   DCIj) is received at PE1.  Note, however, that with present
   behaviour BUM traffic is also dropped until the peering PEs process
   the withdrawal of the Type-4 route.  Compared with the existing
   procedures, the convergence delay under this proposal is MAX[Type-4
   propagation delay, Type-3 propagation delay] after the new DF is
   elected.  This leads to the extension in the next section, intended
   for deployments where convergence cannot be traded off against
   bandwidth optimization.
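
   The advertisement rule of this section can be sketched in a few
   lines of Python.  This is a minimal illustration under stated
   assumptions and not a normative procedure: the Segment record, its
   fields and the function names are hypothetical, and the DF state is
   assumed to be already known from the DF election.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    """Hypothetical local view of one ES or vES attached to an EVI."""
    name: str
    multihomed: bool  # segment is also present on another DCI/PE
    is_df: bool       # this node is DF for the segment in this EVI


def should_advertise_imet(segments: List[Segment]) -> bool:
    """Advertise the Type-3 route for the EVI when a single-homed
    segment exists, or when this node is DF for a multihomed one."""
    return any((not s.multihomed) or s.is_df for s in segments)


def imet_action(before: List[Segment],
                after: List[Segment]) -> Optional[str]:
    """Map a change of DF state to 'advertise', 'withdraw' or None."""
    was = should_advertise_imet(before)
    now = should_advertise_imet(after)
    if now and not was:
        return "advertise"  # first DF role (or single-homed ES) appears
    if was and not now:
        return "withdraw"   # last DF to non-DF transition
    return None             # no change to the advertised route
```

   For a DCI whose only segment in the EVI is a multihomed vES, the
   sketch yields "advertise" when the DCI becomes DF and "withdraw"
   when it loses the DF role, and it takes no action while the role is
   unchanged.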

4.3.  Advertisement of the IMET route from the BDF

   1.  Multihomed PEs can easily compute the Backup DF (BDF), based on
       the DF election mode in operation.

   2.  Extending the previous solution, we propose that a PE advertise
       the Type-3 route for an EVI if and only if at least one of the
       following conditions holds:

       *  It has a single-homed Ethernet Segment in the EVI

       *  It is DF for at least one ES or vES in that EVI

       *  It is BDF for at least one ES or vES in that EVI

   This would mean that, in Figure 2, in addition to the IMET route
   advertised by DCI1, DCI2 also advertises the IMET route, since it is
   the BDF.

   It can be seen from the above example that, even as the number of
   multi-homed PEs sharing the same vESes increases, only two DCIs
   advertise the IMET route for an EVI.  Of course, if there are some
   single-homed hosts, there may be some additional IMET
   advertisements.  The real benefit, however, is in the data plane:
   DCIs that do not need BUM traffic, but would nevertheless have
   received it under the existing EVPN procedures, no longer receive
   it.

   It is important to note that the solutions involving suppression of
   the IMET route should be limited to the following use cases:

   1.  BUM traffic for Ingress Replication (IR) cases

   2.  BDs with no IGMP/MLD/PIM proxy

   3.  BDs with no OISM or IRBs

   4.  BDs with a vES associated with overlay tunnels and no other ACs

   With these caveats, suppressing the IMET route at EVPN GWs that are
   neither DF nor BDF provides complete control over BUM traffic
   distribution per vES (per BD).

5.  Protocol Considerations

   This proposal conforms to the existing EVPN documents that deal with
   BUM handling, [RFC7432] and [I-D.ietf-bess-evpn-igmp-mld-proxy].

   Additionally, to take DF Type 4, as explained in
   [I-D.ietf-bess-evpn-per-mcast-flow-df-election], into consideration
   along with the other conditions specified in Sections 4 and 5, the
   PE should advertise the IMET route if and only if there is at least
   one (S,G) for which it is DF.  For all other DF Types, no additional
   considerations are required.

6.  Operational Considerations

   None

7.  Security Considerations

   This document raises no new security issues for EVPN.

8.  Acknowledgements

   The authors would like to thank Jorge Rabadan, John Drake and Eric
   Rosen for discussions related to this draft.

9.  Contributors

   Samir Thoria
   Cisco Systems
   US

   Email: sthoria@cisco.com


   Sameer Gulrajani
   Cisco Systems
   US

   Email: sameerg@cisco.com

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015.

   [RFC8584]  Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,
              J., Nagaraj, K., and S. Sathappan, "Framework for
              Ethernet VPN Designated Forwarder Election
              Extensibility", RFC 8584, DOI 10.17487/RFC8584, April
              2019.

10.2.  Informative References

   [I-D.ietf-bess-evpn-igmp-mld-proxy]
              Sajassi, A., Thoria, S., Patel, K., Yeung, D., Drake, J.,
              and W. Lin, "IGMP and MLD Proxy for EVPN", draft-ietf-
              bess-evpn-igmp-mld-proxy-04 (work in progress), September
              2019.

   [I-D.ietf-bess-evpn-per-mcast-flow-df-election]
              Sajassi, A., Mishra, M., Thoria, S., Rabadan, J., and J.
              Drake, "Per multicast flow Designated Forwarder Election
              for EVPN", draft-ietf-bess-evpn-per-mcast-flow-df-
              election-01 (work in progress), March 2019.

   [RFC4364]  Rosen, E. and Y.
              Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)",
              RFC 4364, DOI 10.17487/RFC4364, February 2006.

Authors' Addresses

   Satya Ranjan Mohanty
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA  95134
   USA

   Email: satyamoh@cisco.com


   Mrinmoy Ghosh
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA  95134
   USA

   Email: mrghosh@cisco.com


   Ali Sajassi
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA  95134
   USA

   Email: sajassi@cisco.com


   Sandy Breeze
   Claranet
   21 Southampton Row
   London  WC1B 5HA
   United Kingdom

   Email: sandy.breeze@eu.clara.net


   Jim Uttaro
   ATT
   200 S. Laurel Avenue
   Middletown, NJ  07748
   USA

   Email: uttaro@att.com