Detection of Primary Server Failure in DHCPv6 Failover
draft-zhang-dhc-dhcpv6-failure-detection-02

Abstract

In DHCPv6 failover or other multi-server deployment scenarios, an automatic failure detection capability may be desirable. This document describes a method by which the secondary server can detect a link failure between the primary server and clients. This document does not define any protocol details.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 1, 2018.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction
2. Requirements Language
3. Problem Statement and Applicability
4. Detection of Primary Server Failure
5. Security Considerations
6. IANA Considerations
7. Normative References
Authors' Addresses

1. Introduction

[RFC7031] describes the requirements for DHCPv6 failover, and [RFC6853] discusses simpler redundancy deployment considerations for DHCPv6. Both scenarios employ multi-server deployments to improve the reliability and availability of DHCPv6. In such scenarios, two categories of DHCPv6 servers, primary and secondary, serve the clients in the domain. Both servers should provide essential DHCPv6 service and maintain consistent configuration and lease information. The primary server should be responsible for answering clients' requests, while the secondary server is expected to become responsive if the primary server fails.

Popular implementations of failover and redundancy designs generally allow one server to detect its partner's failure. This can be achieved through various mechanisms, such as timer-based solutions. However, such failure detection methods are not sufficient, because they do not cover the situation in which the connection between the primary and secondary servers is normal while the link between the primary server and clients is down. Under these circumstances, it would be desirable for the secondary server to detect such a failure automatically and take over responsibility for providing DHCPv6 service.

This document describes a method for the secondary server to detect such a failure between the primary server and clients in an ordinary multi-server deployment. It also considers the potential preference conflict between the secondary server, once it becomes responsive, and the primary server.

2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3. Problem Statement and Applicability

[RFC3315] allows multiple servers to operate in one domain for high availability and other benefits. One of the main purposes of multi-server DHCPv6 deployment and failover is to solve the single point of failure problem. Server failures can be divided into two categories: the first is a failure between the primary server and the secondary server, and the second is a failure between the primary server and clients. Existing failover implementations have focused mostly on the former and have already produced several automatic detection methods.

A common scenario of the second kind of failure is a (physical) link failure between the primary server and clients. Such a link failure may not harm the primary server itself but can make the primary server unreachable for clients. If the secondary server is not able to detect such a failure, it will assume that everything is working and will not provide DHCPv6 service for redundancy.

Section 5.1.1 of [RFC7031] illustrates the first kind of server failure and states that the secondary server can easily detect such a failure from the lack of responses from the primary server. However, this method does not work for the second kind of failure discussed in this document, because the primary server keeps responding to the secondary server. Thus, this document proposes a new method to automatically discover a failure between the primary server and clients.

4. Detection of Primary Server Failure

The failure detection method described in this document is based on the following assumptions.

The secondary server is reachable by clients while the primary server is not (at least for some of the clients).
The primary server is not down, and the link between the primary and secondary servers is normal.

Based on the assumptions above, if the primary server is not reachable for a client, the client may keep retransmitting SOLICIT or REQUEST messages (if stateless DHCPv6 is used, the client may keep sending INFORMATION-REQUEST messages).
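To give a feel for how often a secondary server might see such repeated messages, the following sketch computes a client's retransmission schedule. It is a simplified illustration, not part of the method: the function name `retransmission_times` is hypothetical, the doubling-and-cap rule follows the retransmission behavior of [RFC3315], and the random factor RAND that [RFC3315] applies to each timeout is deliberately ignored so that the result is deterministic.

```python
from typing import List


def retransmission_times(irt: float, mrt: float, count: int) -> List[float]:
    """Return the times (seconds after the first transmission) of the
    first `count` retransmissions, ignoring the RAND factor."""
    times: List[float] = []
    rt = irt          # first retransmission timeout
    elapsed = 0.0
    for _ in range(count):
        elapsed += rt
        times.append(elapsed)
        rt = 2 * rt   # timeout doubles after each retransmission
        if mrt > 0 and rt > mrt:
            rt = mrt  # capped at the maximum retransmission timeout
    return times


SOL_TIMEOUT = 1.0   # initial Solicit timeout in [RFC3315], seconds
SOL_MAX_RT = 120.0  # maximum Solicit timeout in [RFC3315], seconds
print(retransmission_times(SOL_TIMEOUT, SOL_MAX_RT, 8))
```

Under these simplifying assumptions, a client that receives no answer retransmits its SOLICIT message five times within the first 31 seconds, after which the interval grows until it reaches the cap. An operator could use such an estimate as one input when choosing detection parameters.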

To achieve automatic detection, the secondary server should implement an internal counter. This counter is incremented each time the secondary server receives a duplicated message (e.g., a SOLICIT message) from the same client. A threshold value and a time period should also be configured on the secondary server. If the count exceeds the threshold within the configured time period, and the secondary server cannot find anything wrong with the primary server (i.e., responses from the primary server are regular), it will conclude that there is a failure between the primary server and clients. If the count does not reach the threshold within the time period, the counter is cleared. The threshold and time period may differ between deployments; thus, their specific values and the detailed implementation of the counter are out of scope of this document.
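The counting logic described above could be sketched as follows. This is only one possible reading, since this document leaves the implementation of the counter open: the class name `DuplicateCounter`, the per-client keying by an identifier string, the `primary_healthy` flag, and the explicit `now` timestamps (used to keep the example deterministic) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DuplicateCounter:
    """Per-client count of duplicated messages within a time period.

    `threshold` and `period` are deployment-specific; the values in the
    usage note are illustrative assumptions only.
    """
    threshold: int
    period: float  # seconds
    # client id -> (start of current period, duplicates counted so far)
    _state: Dict[str, Tuple[float, int]] = field(default_factory=dict)

    def record_duplicate(self, client_id: str, now: float,
                         primary_healthy: bool) -> bool:
        """Count one duplicated message from `client_id`.

        Returns True when a failure between the primary server and
        clients is suspected.
        """
        start, count = self._state.get(client_id, (now, 0))
        if now - start > self.period:
            # Threshold not reached within the period: clear the counter.
            start, count = now, 0
        count += 1
        self._state[client_id] = (start, count)
        # Only report this kind of failure when the primary server itself
        # looks fine; otherwise ordinary partner-failure handling applies.
        return primary_healthy and count > self.threshold
```

For example, with `DuplicateCounter(threshold=3, period=10.0)`, the fourth duplicate received from the same client within ten seconds makes `record_duplicate` return True, provided the caller passes `primary_healthy=True`. How the caller decides that a message is a duplicate and that the primary server is healthy is left outside this sketch.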

The detection method described in this document is likely to lead to a situation in which both the primary server and the secondary server are responsive, at least for the clients whose link to the primary server is not down. The reason is that the primary server cannot detect that there is a failure between itself and some of the clients. Thus, it will continue to provide DHCPv6 service, which may conflict with the secondary server. As a result, some clients may receive responses from both servers and be unable to decide which one to use.

One possible solution is that every time the secondary server decides to take responsibility for being the responsive server and providing DHCPv6 service, it should inform the primary server. Such a notification should be sent regardless of whether the primary server is available or not, since the purpose is to make sure that two servers do not offer service at the same time.

Once the failure has been detected and the notification process has finished, the secondary server may start to act as the responsive server, or it may just report the condition and take no further action.
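The order of these steps (notify first, then either take over or only report) could be sketched as below. This document defines no notification protocol, so `notify_primary` is a hypothetical callback supplied by the implementation, and the `Policy` values, the function name, and the log strings are likewise assumptions for illustration.

```python
from enum import Enum
from typing import Callable, List


class Policy(Enum):
    TAKE_OVER = "take_over"      # start answering clients
    REPORT_ONLY = "report_only"  # report the condition, stay passive


def handle_detected_failure(notify_primary: Callable[[], bool],
                            policy: Policy,
                            log: List[str]) -> bool:
    """Run the post-detection steps; return True if the secondary
    server becomes responsive."""
    # The notification is always attempted, whether or not the primary
    # server turns out to be reachable.
    try:
        delivered = notify_primary()
    except OSError:
        delivered = False
    log.append(f"notification delivered={delivered}")
    if policy is Policy.TAKE_OVER:
        log.append("secondary is now responsive")
        return True
    log.append("failure reported; secondary stays passive")
    return False
```

In this sketch the secondary server proceeds once the notification attempt has finished, even if delivery failed. Whether a deployment should instead wait for an acknowledgement from the primary server is a design choice that this document does not settle.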

5. Security Considerations

A malicious client can perform a kind of DoS attack by flooding the network with SOLICIT messages, thus making the secondary server become responsive while the primary server is actually still responsive to the other clients.

Further security considerations are TBD.

6. IANA Considerations

This document does not include an IANA request.

7. Normative References

[RFC2119]	Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC3315]	Droms, R., Bound, J., Volz, B., Lemon, T., Perkins, C. and M. Carney, "Dynamic Host Configuration Protocol for IPv6 (DHCPv6)", RFC 3315, DOI 10.17487/RFC3315, July 2003.
[RFC6853]	Brzozowski, J., Tremblay, J., Chen, J. and T. Mrugalski, "DHCPv6 Redundancy Deployment Considerations", BCP 180, RFC 6853, DOI 10.17487/RFC6853, February 2013.
[RFC7031]	Mrugalski, T. and K. Kinnear, "DHCPv6 Failover Requirements", RFC 7031, DOI 10.17487/RFC7031, September 2013.

Authors' Addresses

Lanshan Zhang BUPT University Beijing University of Posts and Telecommunications (BUPT) Beijing, 100876 P.R. China Phone: +86-13146885878 EMail: zls326@sina.com

Wendong Wang BUPT University Beijing University of Posts and Telecommunications (BUPT) Beijing, 100876 P.R. China EMail: wdwang@bupt.edu.cn

Yuchi Chen Tsinghua University Beijing, 100084 P.R. China Phone: +86-10-6278-5822 EMail: chenycmx@gmail.com

Linhui Sun BUPT University Beijing University of Posts and Telecommunications (BUPT) Beijing, 100084 P.R. China EMail: sunlinhui@bupt.edu.cn