Bidirectional Forwarding Detection (BFD) on Link Aggregation Group (LAG) Interfaces
draft-mmm-bfd-on-lags-01

Abstract

This document proposes a mechanism to run BFD on Link Aggregation Group (LAG) interfaces. It does so by running an independent BFD session on every LAG member link.

A dedicated well-known multicast IP address for both IPv4 and IPv6 is introduced as the destination IP address of the BFD packets when running BFD on the member links of the LAG.

There is currently also no standard that describes how BFD runs on a LAG interface as a whole. This draft proposes a definition for this problem too while taking into consideration existing implementations.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http:/⁠/⁠datatracker.ietf.org/⁠drafts/⁠current/⁠.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on June 16, 2012.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http:/⁠/⁠trustee.ietf.org/⁠license-⁠info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction
2. BFD over LAG with a single session
2.1. BFD over Big Pipe
2.2. Examples of existing implementations
3. BFD on LAG member links
3.1. BFD BLM session
3.2. BFD micro session
3.3. Concluded BFD state
4. BFD on LAG members and layer-3 applications
5. Application example: LMM using BLM
6. Security Consideration
7. IANA Considerations
8. Acknowledgements
9. References
9.1. Normative References
9.2. Informative References
Appendix A. IETF discussion status
Appendix A.1. Unicast vs. Multicast IP address
Appendix A.2. BFD packet details (Ethernet encaps)
Authors' Addresses

1. Introduction

The Bidirectional Forwarding Detection (BFD) protocol [RFC5880] provides a mechanism to detect faults in the bidirectional path between two forwarding engines, including interfaces, data link(s), and to the extent possible the forwarding engines themselves, with potentially very low latency.

BFD can be used for detecting failures of the path between two network devices. Typically the application clients are not aware of any inner structure of the underlying interface, being layer 3 applications themselves like Open Shortest Path First (OSPF) [RFC2328] or Border Gateway Protocol (BGP)[RFC4271]. While this works for interfaces like Ethernet and Packet Over SONET (POS), it causes problems for bundled interfaces like LAG.

A LAG is used to bind together several physical ports between two adjacent nodes so they appear to higher-layer protocols as a single, higher bandwidth "virtual" pipe. A LAG interface thereby allows aggregation of multiple network interfaces as one virtual interface for the purpose of providing fault-tolerance and higher bandwidth.

The problem with running BFD over a LAG is that with a single BFD session and without internal knowledge of the LAG structure it is impossible for BFD to guarantee a detection of anything but a full LAG shutdown within the BFD timeout period. The LAG shutdown is typically initiated by some LAG module, which we will refer to as the LAG Management Module (LMM) in the rest of the document. LAG timers are typically multiple times slower than the BFD detection timers (multiple 100msec of LMM vs. multiple 10msec of BFD). There is thus a need to bring some sort of determinism in how BFD runs over a LAG. There is also a need to detect member link failures much faster than what Link Aggregation Control Protocol (LACP) allows.

The document proposes establishing a BFD session over every member link the LAG is built upon. BFD can combine these information to provide fast detection for layer-3 applications.

While there are native Ethernet mechanisms to detect failures (802.1ax, .3ah) that could be used for LAG, the solution proposed in this document enables operators who have already deployed BFD over different technologies (e.g. IP, MPLS) to use a common failure detection mechanism.

2. BFD over LAG with a single session

2.1. BFD over Big Pipe

The simplest approach to run BFD on a LAG interface is to ignore the internal structure and treat the LAG as one "big pipe". To differentiate this mode of operation we call it "BFD over Big Pipe" or "BBP" for short. It corresponds to section 7.1 in RFC 5882 [RFC5882].

We need to standardize this approach. The following requirements define what it means to treat a LAG interface as a single interface with no additional structure:

BFD can send packets on any member link
BFD must accept packets from every member link
the Rx/Tx link can change any time and/or regularly with every change pattern without causing BFD to fail

The BFD session on the LAG interface then follows RFC 5880 and RFC 5881 in all details.

2.2. Examples of existing implementations

Because there is no standard, vendors have implemented their own proprietary mechanisms to run BFD over LAG interfaces. Two examples are shown here. Both satisfy the requirements in Section 2.1

Some implementations send BFD packets only over one member link . Others spray BFD packets over all member links of the LAG. There are issues with each of these approaches.

In the first approach, BFD sends packets onto the LAG and the LAG load balance algorithm will select a member port, which may be the same port for all the packets of this BFD session. BFD will remain up as long as this "primary" port is alive. It will go down once the primary port goes down till another port is selected as the primary. Problems arise with this design as BFD is oblivious to the presence of other member links in the LAG. If a non-primary link goes down, the BFD session remains unaffected as it can still send and receive BFD packets over the primary link. This results in all traffic sent over the failed member link getting dropped, till the LMM removes the failed link from the LAG.

In the second approach, BFD packets are sprayed over all the member links of a LAG. This is done naively via round-robin, where each BFD packet is sent using the subsequent member link, in a round-robin fashion. It solves the problem of BFD going down because of the primary port going down, but it still does not solve the problem of traffic getting lost when one of the member link goes down. This is because, when a member link goes down, BFD remains up and traffic continues to go over the link that has failed till a higher layer protocol detects this and removes the offending link from the LAG.

It is RECOMMENDED to implement this second approach due to it's deterministic behaviour.

3. BFD on LAG member links

The mechanism proposed for a fast detection of LAG member link failure is running BFD sessions on every LAG member link. We name this mode of BFD operations "BFD on LAG members" or "BLM" for short. It corresponds to section 7.3 in RFC 5882 [RFC5882].

3.1. BFD BLM session

The overall BLM session consists of the LAG interface, i.e. the aggregated link, a set of BFD sessions running on the member links and a new BFD state for the LAG; this state is explained in more detail in Section 3.3. We name the member-link sessions also micro BFD sessions; their details are discussed in Section 3.2.

The set of micro sessions is such that we have one micro session per member link. This set can change over the lifetime of a BLM session. E.g. BFD receives updates for the micro session set when links are physically added or removed from the LAG and will accordingly create or delete micro BFD sessions.

The details how the update happens are implementation specific and outside the document's scope. For example the client requesting the BLM session could provide these updates.

Per BLM session request only one address family can be used. I.e. the set of micro sessions belonging to the BLM session MUST either all use IPv4 or all use IPv6.

Multiple BLM session requests for the same LAG interface result in a shared BLM session. The set of micro sessions finally used is the superset of the individual micro session sets. If conflicting session parameters are requested then it is a local issue as to how to resolve the parameter conflicts, as explained in RFC 5882, Section 2.

3.2. BFD micro session

A single micro BFD session runs on every member link of the LAG. These micro BFD sessions follow RFC 5880 [RFC5880].

Only asynchronous mode is considered in this document. The echo function is outside the document's scope. At least one system MUST take the Active role (possibly both). The micro BFD sessions on the member links are independent BFD sessions. They use their own unique, local discriminator values, maintain their own set of state variables and have their own independent state machine. Timer values MAY be different, even among the micro sessions belonging to the same LAG, although it is expected that micro sessions belonging to the same LAG use the same timer values.

The demultiplexing of a received packet is solely based on the Your Discriminator field, if this field is nonzero. For the initial Down packet of a micro session this value may be zero. In this case demultiplexing MUST be based on some combination of other fields which MUST include the interface information of the member link.

When receiving a BFD packet for a micro session with a valid, non-zero Your Discriminator then a check MUST be done if the packet was received on the correct member link interface. If the check fails then the packet MUST be discarded. This test needs to be done before state variables for the micro sessions are updated by the received packet.

The BFD packets for the micro session are IP/UDP encapsulated as defined in RFC 5881 [RFC5881]. Control packets for each micro BFD session use a well-known link-local multicast IP address (224.0.0.X for IPv4, FF02::X for IPv6, to be assigned by IANA).

On Ethernet-based LAG member links the corresponding destination multicast MACs will be 01:00:5e:00:00:XX for IPv4 and 33:33:00:00:00:XX for IPv6. Each member link uses its own MAC address as the source MAC address.

3.3. Concluded BFD state

An additional state variable is introduced for BFD on LAG members: the concluded state. The state values are Down, Up and AdminDown. This state is not part of the micro session state machine. Instead it describes the overall state of the LAG. It is a local state and does not appear (directly) in any BFD packet on any link.

The concluded state may be set to AminDown for administrative purpose, to keep the BLM and the micro sessions indefinitely down. When the concluded state is entering AdminDown then all micro sessions belonging to the BLM MUST enter the AdminDown state as well.

A function must be defined, which evaluates all the states of the micro sessions that belong to the BLM. This function has two output values Down and Up and the concluded state is updated with the last evaluation result, unless it is already in AdminDown state. The evaluation takes place whenever a micro session is added, removed or is changing state.

The details of the evaluation function are outside the scope of the document. The function could for example test for a minimum number of micro sessions in Up state. The function could even be "outsourced" and e.g. the decision logic of the LMM module could be used.

The concluded state is important for layer-3 clients requesting BFD sessions over the LAG or over Vlans on the LAG. Details will be discussed in Section 4.

4. BFD on LAG members and layer-3 applications

Layer 3 protocols like e.g. OSPF may use BFD on LAG members in one of the following ways:

The session request from the client creates a virtual session. This virtual session is not sending actual BFD packets. Instead the state, which is reported to the layer-3 client, is based on the concluded state.

Implementations compliant to this standard MUST support this mode. This is the default mode in which BFD over LAG works.
The session request from the client creates a BBP session, as described in Section 2.1. BFD SHOULD update the state of the BBP session with the concluded state of the corresponding LAG in the following way:
- when the concluded state is Down then the BBP session state is transitioning to Down as well
- for a concluded state of Up or AdminDown the BBP session state is unaffected
This state update allows BBP session to run with more relaxed timer values as the more intense liveliness detection is done by the micro BFD sessions.

Compliant implementations MUST support this mode.

An implementation MUST provide a configuration knob which lets the user select the mode if both modes are supported.

5. Application example: LMM using BLM

There are certainly many ways to use BLM. Here is one example envisioned by the authors.

The LAG Management Module (LMM) could be envisaged as a client of BFD, requesting micro BFD sessions for all member links of the LAG. I.e. LMM is requesting a BLM session from BFD and takes responsibility to update the set of micro sessions when interfaces are administratively added or removed.

LMM uses BFD, instead of or in parallel with LACP, to monitor the health of the individual members links of the LAG. BFD takes a precedence over LACP in deciding the fate of individual member links when both are run in parallel.

A member link of the LAG is not used anymore for data forwarding when the associated micro BFD session running over that link goes down. The member link MUST be removed from the LAG. The BFD session for the link remains, i.e. it is not deleted.

To add a member link to the LAG, LMM MAY wait for the BFD session on the link to come Up. There may be a deadlock situation since the link interface not being active (e.g., layer 3 protocol down) may prevent BFD packets, including other control protocols packets (e.g. ARP) that are tightly coupled with the status of the interface, to be transmitted between the pair of interfaces, thus failing to bring up the interfaces.

To avoid the deadlock, BFD packets SHOULD NOT be blocked by the layer N protocol status of the interface when the application depends on the BFD status to enable layer N of the interface. If this cannot be achieved then the BFD status MUST be ignored by the application when bringing up an interface. The BFD status can then be used afterwards to bring the interface down.

The behaviour of the LMM MUST be configurable if waiting for BFD status of Up to add a member link is supported, to allow an alternative mode of adding the member link irrespective of the BFD state for interoperability purpose.

6. Security Consideration

This document does not introduce any additional security issues and the security mechanisms defined in [RFC5880] apply in this document.

Routers compliant to this standard will now need to process packets addressed to a new multicast address. This however, should not open any new attack vector as it is a link local multicast and the attacker would have to be on the same link as the router to launch such packets.

7. IANA Considerations

The IANA is requested to assign a well-known link-local multicast IP address: "224.0.0.XXX" for IPv4 and FF02::X for IPv6.

8. Acknowledgements

Most of the text for this document came originally from draft-chen-bfd-interface-00.

We would like to thank Dave Katz, Alexander Vainshtein, Greg Mirsky and Jeff Tantsura for their comments on this draft.

We would also like to thank the members of the BFD WG who expressed strong support about the need to run BFD on all the member links of a LAG.

9. References

9.1. Normative References

[RFC2119]	Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC5880]	Katz, D. and D. Ward, "Bidirectional Forwarding Detection (BFD)", RFC 5880, June 2010.
[RFC5881]	Katz, D. and D. Ward, "Bidirectional Forwarding Detection (BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881, June 2010.
[RFC5882]	Katz, D. and D. Ward, "Generic Application of Bidirectional Forwarding Detection (BFD)", RFC 5882, June 2010.

9.2. Informative References

[RFC1042]	Postel, J. and J. Reynolds, "Standard for the transmission of IP datagrams over IEEE 802 networks", STD 43, RFC 1042, February 1988.
[RFC2328]	Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.
[RFC4271]	Rekhter, Y., Li, T. and S. Hares, "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, January 2006.

Appendix A. IETF discussion status

[this section will finally go away. It documents some of the discussions and decisions made recently on the BFD group list]

Appendix A.1. Unicast vs. Multicast IP address

The destination IP address for the BFD control packets for the micro BFD sessions can be Unicast or Multicast. Each has its set of advantages and disadvantages.

Advantages with using a Unicast IP destination address:

Minimal code changes to support micro BFD sessions per member link. A new UDP port number can be used to differentiate BFD control packets associated with the micro BFD sessions and the regular BFD sessions.

Disadvantages with using a Unicast IP destination address:

It is configuration intensive. Each LAG needs to be configured with the remote end's IP address for BFD to boot strap. Similarly, a change in the IP address of the interface would need all LAGs to be reconfigured. While one could minimize the amount of human intervention required, it cannot be completely eliminated.
ARP needs to be resolved for sending Unicast packets. This means that ARP must be resolved even before the first control packet is sent to bring up the micro BFD session. There are multiple ways to achieve this. The most logical approach is to mandate LACP on this LAG. This way., LACP will bring up the links so that ARP resolution can begin. However, this necessitates the need to run LACP along with BFD on all member links. The other option is to allow ARP processing even when the port state is down. This means that implementations would have to allow all packets with broadcast MAC and port MAC to be sent to CPU for processing. This violates the basic tenets of IP layering and opens a hole for a DoS attack. This also requires a huge change to the IP stack to allow packet Rx and Tx on ports that are down.
Not possible to support unnumbered IP interfaces.

Advantages with using a Multicast IP destination address:

No additional configuration is required, and the micro BFD sessions are set up automatically. It remains independent of the LAG IP addressing scheme. The member links get added to the LAG as soon as the micro BFD sessions come up.
This involves minimal modifications to data plane and L3 stack. Currently, ports that are down do process packets coming with certain well known L2 MAC addresses. This solution requires such ports to process packets addressed to another well known L2 MAC address (derived from the multicast IP address assigned by IANA).
Can support unnumbered IP interfaces.

Disadvantages with using a Multicast IP destination address:

Unfamiliar. Some bit of the data plane and the source code would need to be modified to accept BFD control packets that are multicast.
Need to allocate a new link local multicast address from IANA.

Based on the above analysis, we decided to go with multicast IP addressing scheme for the micro BFD sessions.

Appendix A.2. BFD packet details (Ethernet encaps)

[Either this section or the corresponding parts of Section 3.2 should remain in the final document. There is no intention to support both encapsulations]

The BFD packet is directly encapsulated into the Ethernet frame. The frame has the following format: Ethernet 802.3 header, then either:

Type/Length field set to 46 octets, then LLC/SNAP according to RFC 1042 [RFC1042] (can we reuse it for BFD?) with an Ethertype field in the SNAP header for BFD encapsulated in Ethernet [TBD], then the BFD packet itself.
Type/Length field set to the type of "BFD encapsulated in Ethernet", followed by the BFD packet

In both cases the Ethernet payload must be padded with zeros to reach 46 bytes if the size is not already larger.

When receiving an Ethernet frame the payload, without any potential LLC/SNAP header, is used for further BFD processing. Additional padding data MUST be ignored if it was required to reach the minimum payload length of 46 bytes.

IANA needs to assign a L2 multicast address 01-80-C2-XX-XX-XX that would be used as the destination MAC for all control packets in the micro BFD sessions.

A new Ethertype must be assigned by the IEEE Registration Authority to the BFD over Ethernet protocol that will be used for all micro BFD sessions.

[Needs more detailed work TBD.]

Authors' Addresses

Manav Bhatia Alcatel-Lucent Bangalore, 560045 India EMail: manav.bhatia@alcatel-lucent.com

Mach(Guoyi) Chen Huawei Technologies Co., Ltd Q14 Huawei Campus, No. 156 Beiqing Road, Hai-dian District Beijing, 100095 China EMail: mach@huawei.com

Zuliang Wang Huawei Technologies Co., Ltd Q15 Huawei Campus, No. 156 Beiqing Road, Hai-dian District Beijing, 100095 China EMail: liang_tsing@huawei.com

Liang Guo China Telecom Guangzhou, China EMail: guoliang@gsta.com

Marc Binderberger Lausanne, Switzerland EMail: marc@sniff.de