Dynamic Host Configuration (DHC)                            T. Mrugalski
Internet-Draft                                                       ISC
Intended status: Standards Track                              K. Kinnear
Expires: September 6, 2012                                         Cisco
                                                           March 5, 2012

                         DHCPv6 Failover Design
              draft-mrugalski-dhc-dhcpv6-failover-design-00

Abstract

DHCPv6, defined in [RFC3315], does not offer server redundancy. This document defines a design for DHCPv6 failover, a mechanism for running two servers on the same network with the capability for either server to take over clients' leases in case of server failure or network partition. This is a DHCPv6 failover design document; it is not a protocol specification. It is the second document in a planned series of three. DHCPv6 failover requirements are specified in [requirements]. A protocol specification document is planned to follow this document.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 6, 2012.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1.  Requirements Language
2.  Glossary
3.  Goals
  3.1.  Additional Requirements
4.  Protocol Overview
  4.1.  NORMAL State Overview
  4.2.  COMMUNICATION-INTERRUPTED State Overview
  4.3.  PARTNER-DOWN State Overview
  4.4.  RECOVERING State Overview
5.  Connection Management
  5.1.  Creating Connections
  5.2.  Endpoint Identification
6.  Resource Allocation
  6.1.  Proportional Allocation
  6.2.  Independent Allocation
  6.3.  Determining Allocation Approach
    6.3.1.  IPv6 Addresses
    6.3.2.  IPv6 Prefixes
7.  Failover Mechanisms
  7.1.  Time Skew
  7.2.  Time expression
  7.3.  Lazy updates
  7.4.  MCLT concept
    7.4.1.  MCLT example
  7.5.  Unreachability detection
  7.6.  Sending Data
    7.6.1.  Required Data
    7.6.2.  Optional Data
  7.7.  Receiving Data
    7.7.1.  Conflict Resolution
    7.7.2.  Acknowledging Reception
8.  Endpoint States
  8.1.  State Machine Initialization
  8.2.  NORMAL State Operation
  8.3.  COMMUNICATION-INTERRUPTED State Operation
  8.4.  PARTNER-DOWN State Operation
  8.5.  RECOVERING State Operation
  8.6.  State Transitions
9.  Proposed extensions
  9.1.  Active-active mode
10. Dynamic DNS Considerations
11. Reservations and failover
12. Protocol entities
  12.1.  Failover Protocol
  12.2.  Protocol constants
13. Open questions
14. Security Considerations
15. IANA Considerations
16. Acknowledgements
17. References
  17.1.  Normative References
  17.2.  Informative References
Authors' Addresses

1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Glossary

This is a supplemental glossary that should be combined with the definitions in Section 3 of [requirements].

o  Failover endpoint - The failover protocol allows for there to be a unique failover 'endpoint' per partner per role per relationship (where role is primary or secondary and the relationship is defined by the relationship-name). This failover endpoint can take actions and hold unique states. Typically, there is one failover endpoint per partner (server), although there may be more. 'Server' and 'failover endpoint' are synonymous only if the server participates in only one failover relationship. However, for the sake of simplicity 'Server' is used throughout the document to refer to a failover endpoint unless to do so would be confusing.

o  Failover transmission - all messages exchanged between partners.

o  Independent Allocation - a prefix allocation algorithm to split the available pool of resources between the primary and secondary servers that is particularly well suited for vast pools (i.e. when available resources are not expected to deplete). See Section 6.2 for details.

o  Primary Server

o  Proportional Allocation - a prefix allocation algorithm to split the available free leases between the primary and secondary servers that is particularly well suited for more limited resources. See Section 6.1 for details.

o  Resource - an IPv6 address or an IPv6 prefix.

o  Secondary Server

o  Server - A DHCPv6 server that implements DHCPv6 failover.
   'Server' and 'failover endpoint' are synonymous only if the server participates in only one failover relationship.

3.  Goals

The failover protocol design provides a means for cooperating DHCPv6 servers to work together to provide a DHCPv6 service with availability that is increased beyond that which could be provided by a single DHCPv6 server operating alone. It is designed to protect DHCPv6 clients against server unreachability, including server failure and network partition.

It is possible to deploy exactly two servers that are able to continue providing a lease on an IPv6 address or on an IPv6 prefix without the DHCPv6 client experiencing lease expiration or a reassignment of a lease to a different IPv6 address in the event of failure by one or the other of the two servers.

This protocol defines active-passive mode, sometimes also called the hot standby model. This means that during normal operation one server is active (i.e. actively responds to clients' requests) while the second is passive (i.e. it does receive clients' requests, but does not respond to them and only maintains a copy of the lease database and is ready to take over incoming queries in case of primary server failure). Active-active mode (i.e. both servers actively handling clients' requests) is currently not supported for the sake of simplicity. Such a mode may be defined as an extension at a later time.

The failover protocol is designed to provide lease stability for leases with lease times beyond a short period. Due to the additional overhead required, failover is not suitable for leases shorter than 30 seconds. The DHCPv6 Failover protocol MUST NOT be used for leases shorter than 30 seconds.

This design attempts to fulfill all DHCPv6 failover requirements defined in [requirements].

3.1.  Additional Requirements

The following requirements are not related to the failover mechanism in general, but rather to this particular design.

1.  Minimize Asymmetry - while there are two distinct roles in failover (primary and secondary server), the differences between those two roles should be as small as possible. This will yield a simpler design as well as a simpler implementation of that design.

4.  Protocol Overview

The DHCPv6 Failover Protocol is defined as a communication between failover partners with all associated algorithms and mechanisms. Failover communication is conducted over a TCP connection established between the partners. The protocol reuses the framing format specified in Section 5.1 of DHCPv6 Bulk Leasequery [RFC5460], but uses different message types. Additional failover-specific message types will be defined. All information is sent over the connection as typical DHCPv6 Options, following the format defined in Section 22.1 of [RFC3315].

After initialization, the primary server establishes a TCP connection with its partner. The primary server sends a CONNECT message with initial parameters. The secondary server responds with a CONNECTACK message.

Depending on the failover state of each partner, they MUST initiate one of the binding update procedures. Each server MAY send an UPDREQ message to request its partner to send all updates that have not been sent yet (this case applies when the partner has an existing database and wants to update it). Alternatively, a server MAY choose to send an UPDREQALL message to request a full lease database transmission including all leases (this case applies when booting up a new server after installation, corruption or complete loss of the database, or other catastrophic failure).

Servers exchange lease information by using BNDUPD messages. Depending on the local and remote state of a lease, a server may either accept or reject the update.
Reception of lease update information is confirmed by responding with a BNDACK message with an appropriate status. The majority of the messages sent over a failover TCP connection consists of BNDUPD and BNDACK messages.

A subset of available resources (addresses or prefixes) is reserved for secondary server use. This is required for handling a case where both servers are able to communicate with clients, but unable to communicate with each other. After the initial connection is established, the secondary server requests a pool of available addresses by sending a POOLREQ message. The primary server assigns a pool to the secondary by transmitting a POOLRESP message and then sending a series of BNDUPD messages. The secondary server may initiate such a pool request at any time while it maintains communication with the primary server.

Failover servers use a lazy update mechanism to update their failover partner about changes to their lease state database. After a server performs any modifications to its lease state database (assign a new lease, extend an existing one, release or expire a lease), it sends its response to the client's request first (performing the "regular" DHCPv6 operation) and then informs its failover partner using a BNDUPD message. This BNDUPD message SHOULD be sent soon after the response is sent to the DHCPv6 client, but there is no specific requirement of a minimum time in which to do so.

The major problem with the lazy update mechanism is the case when the server crashes after sending a response to the client, but before sending the lazy update to its partner (or when communication between partners is interrupted). To solve this problem, a concept known as the Maximum Client Lead Time (MCLT) (initially designed for DHCPv4 failover) is used.
The MCLT is the maximum amount of time that one server can extend a lease for a client's binding beyond the time known by its failover partner. See Section 7.4 for a detailed description of how the MCLT affects assigned lease times.

Servers verify each other's availability by periodically exchanging CONTACT messages. See Section 7.5 for a discussion of detecting a partner's unreachability.

A server that is being shut down transmits a DISCONNECT message, closes the connection with its failover partner and stops operation. A server SHOULD transmit any pending lease updates before transmitting the DISCONNECT message.

4.1.  NORMAL State Overview

During normal operation when two partners are communicating, both remain in NORMAL state. All incoming requests are processed by the primary server and the secondary server receives appropriate updates.

While operating in NORMAL state, a server must switch to COMMUNICATIONS-INTERRUPTED if communication with its partner is severed. If its partner closes the connection using a DISCONNECT message, the server moves immediately to either COMMUNICATIONS-INTERRUPTED state or to PARTNER-DOWN state, as configured by the operator.

4.2.  COMMUNICATION-INTERRUPTED State Overview

When a server discovers that its partner is not reachable, it switches into COMMUNICATIONS-INTERRUPTED state. In that state a server MUST NOT extend any lease time more than the MCLT beyond the lease time known by its failover partner. A server will extend leases that it previously assigned using the regular RENEW mechanism, as clients will send their communications to this server (using a multicast RENEW message with the server's DUID or using a unicast RENEW message if configured). A server MUST also extend leases assigned by its partner. This is accomplished by replying to clients' REBIND messages.
Again, a server MUST NOT extend a lease by more than the configured MCLT value beyond the time known by its partner.

While in COMMUNICATIONS-INTERRUPTED state, each server MUST assign new leases only from its own pool.

If a server is operating in COMMUNICATION-INTERRUPTED state and establishes a connection with its partner (either by successfully completing a periodic connection attempt or by receiving an incoming connection from its partner), the server moves either into RECOVERING state or NORMAL state, depending on the state that its failover partner is in. When a server moves into NORMAL state, it automatically sends all updated lease information to its failover partner.

A server may also be administratively switched to PARTNER-DOWN state from COMMUNICATIONS-INTERRUPTED state.

4.3.  PARTNER-DOWN State Overview

While in PARTNER-DOWN state, a server has a guarantee that its partner is not serving any leases. In such a case, it MUST extend existing leases that it knows about and may assign new leases from its own pool or the pool assigned to its partner. Since it knows that its partner is not extending any leases and does not assign new leases, it may extend leases by times longer than the MCLT. It MUST NOT reallocate any existing IP address to a new client until that lease has expired and the server has waited the MCLT beyond the lease's expiration.

This mode of operation is similar to the operation of a stand-alone DHCPv6 server and it does not offer any redundancy.

In this state, a server SHOULD periodically attempt to connect to its failover partner. Once a connection with its partner is established, the partner will switch to RECOVERING state. After the partner finishes its recovery and moves to RECOVER-DONE state, both servers will move to NORMAL state.

4.4.  RECOVERING State Overview

This transitional state represents a state in which two servers have established a connection, but one server needs to be updated with information prior to resuming normal operation. Upon entering the state, each server transmits UPDREQ or UPDREQALL (depending on the state of its local lease database). Once all lease information is exchanged, the recovering server moves to RECOVER-DONE state and then both servers switch to NORMAL state. While in RECOVERING state, a server is not allowed to respond to clients.

5.  Connection Management

5.1.  Creating Connections

Every server implementing the failover protocol SHOULD attempt to connect to all of its partners periodically, where the period is implementation dependent and SHOULD be configurable. In the event that a connection has been rejected by a CONNECTACK message with a reject-reason option contained in it or by a DISCONNECT message, a server SHOULD reduce the frequency with which it attempts to connect to that server, but it SHOULD continue to attempt to connect periodically.

When a connection attempt succeeds, if the server generating the connection attempt is a primary server for that relationship, then it MUST send a CONNECT message down the connection. If it is not a primary server for the relationship, then it MUST just drop the connection and wait for the primary server to connect to it.

When a connection attempt is received, the only information that the receiving server has is the IP address of the partner initiating the connection. It also knows whether it has the primary role for any failover relationships with the connecting server. If it has any relationships for which it is a primary server, it should initiate a connection of its own to the partner server, one for each primary relationship it has with that server.
If it has any relationships with the connecting server for which it is a secondary server, it should just await the CONNECT message to determine which relationship this connection is to serve. If it has no secondary relationships with the connecting server, it SHOULD drop the connection.

To summarize -- a primary server MUST use a connection that it has initiated in order to send a CONNECT message. Every server that is a secondary server in a relationship attempts to create a connection to the server which is primary in the relationship, but that connection is only used to stimulate the primary server into recognizing that the secondary server is ready for operation. The reason behind this is that the secondary server has no way to communicate to the primary server which relationship a connection is designed to serve.

A server which has multiple secondary relationships with a primary server SHOULD only send one stimulus connection attempt to the primary server.

Once a connection is established, the primary server MUST send a CONNECT message across the connection. A secondary server MUST wait for the CONNECT message from a primary server. If the secondary server doesn't receive a CONNECT message from the primary server in an installation dependent amount of time, it MAY drop the connection and send another stimulus connection attempt to the primary server.

Every CONNECT message includes a TLS-request option, and if the CONNECTACK message does not reject the CONNECT message and the TLS-reply option says TLS MUST be used, then the servers will immediately enter into TLS negotiation. Once TLS negotiation is complete, the primary server MUST resend the CONNECT message on the newly secured TLS connection and then wait for the CONNECTACK message in response.
The TLS-request and TLS-reply options MUST NOT appear in either this second CONNECT or its associated CONNECTACK message as they had in the first messages.

The second message sent over a new connection (either a bare TCP connection or a connection utilizing TLS) is a STATE message. Upon the receipt of this message, the receiver can consider communications up.

A secondary server MUST NOT respond to the closing of a TCP connection with a blind attempt to reconnect -- there may be another TCP connection to the same failover partner already in use.

5.2.  Endpoint Identification

The proper operation of the failover protocol requires more than the transmission of messages between one server and the other. Each endpoint might seem to be a single DHCPv6 server, but in fact there are situations where additional flexibility in configuration is useful.

A failover endpoint is always associated with a set of DHCPv6 prefixes that are configured on the DHCPv6 server where the endpoint appears. A DHCPv6 prefix MUST NOT be associated with more than one failover endpoint.

The failover protocol SHOULD be configured with one failover relationship between each pair of failover servers. In this case there is one failover endpoint for that relationship on each failover partner. This failover relationship MUST have a unique name.

There is typically little need for additional relationships between any two servers, but there MAY be more than one failover relationship between two servers -- however each MUST have a unique relationship name.

Any failover endpoint can take actions and hold unique states.

This document frequently describes the behavior of the protocol in terms of primary and secondary servers, not primary and secondary failover endpoints.
However, it is important to remember that every 'server' described in this document is in reality a failover endpoint that resides in a particular process, and that several failover endpoints may reside in the same server process.

It is not the case that there is a unique failover endpoint for each prefix that participates in a failover relationship. On one server, there is (typically) one failover endpoint per partner, regardless of how many prefixes are managed by that combination of partner and role. Conversely, on a particular server, any given prefix will be associated with exactly one failover endpoint.

When a connection is received from the partner, the unique failover endpoint to which the message is directed is determined solely by the IP address of the partner, the relationship-name, and the role of the receiving server.

6.  Resource Allocation

Currently there are two allocation algorithms defined for resources (addresses or prefixes). Additional allocation schemes may be defined as future extensions.

1.  Proportional Allocation - This allocation algorithm is a direct application to DHCPv6 of the algorithm defined in [dhcpv4-failover]. Available resources are split between the primary and secondary server. Released resources are always returned to the primary server. Primary and secondary servers may initiate a rebalancing procedure when the disparity between resources available to each server reaches a preconfigured threshold. Only resources that are not leased to any clients are "owned" by one of the servers. This algorithm is particularly well suited for scenarios where the amount of available resources is limited, as may be the case for prefix delegation. See Section 6.1 for details.

2.  Independent Allocation - This allocation algorithm assumes that available resources are split between primary and secondary servers as well.
    In this case, however, resources are assigned to a specific server for all time, regardless of whether they are available or currently used. This algorithm is much simpler than proportional allocation, because resource imbalance doesn't have to be checked and there is no rebalancing for independent allocation. This algorithm is particularly well suited for scenarios where there is an abundance of available resources, which is typically the case for DHCPv6 address allocation. See Section 6.2 for details.

6.1.  Proportional Allocation

In this allocation scheme, each server has its own pool of available resources. Note that a resource is not "owned" by a particular server throughout its entire lifetime. Only a resource which is available is "owned" by a particular server -- once it has been leased to a client, it is not owned by either failover partner. When it finally becomes available again, it will be owned initially by the primary server, and it may or may not be allocated to the secondary server by the primary server.

So, the flow of a resource is as follows: initially a resource is owned by the primary server. It may be allocated to the secondary server if it is available, and then it is owned by the secondary server. Either server can allocate available resources which they own to clients, in which case they cease to own them. When the client releases the resource or the lease on it expires, it will again become available and will be owned by the primary.

A resource will not become owned by the server which allocated it initially when it is released or the lease expires because, in general, that server will have had to replenish its pool of available resources well in advance of any likely lease expirations.
Thus, having a particular resource cycle back to the secondary might well put the secondary more out of balance with respect to the primary instead of enhancing the balance of available addresses or prefixes between them.

TODO: Need to rework this v4-specific vocabulary to v6, once we decide how things will look like in v6.

When they are used, these proportional pools are used for allocation in every state except PARTNER-DOWN state. In PARTNER-DOWN state a failover server can allocate from either pool. This allocation and maintenance of these address pools is an area of some sensitivity, since the goal is to maintain a more or less constant ratio of available addresses between the two servers.

TODO: Reuse rest of the description from section 5.4 from [dhcpv4-failover] here.

6.2.  Independent Allocation

In this allocation scheme, available resources are split between servers. Available resources are split between the primary and secondary servers as part of initial connection establishment. Once resources are allocated to each server, there is no need to reassign them. This algorithm is simpler than proportional allocation since it requires less initial communication and does not require a rebalancing mechanism, but it assumes that the pool assigned to each server will never deplete. That is often a reasonable assumption for IPv6 addresses (e.g. servers are often assigned a /64 pool that contains many more addresses than existing electronic devices on Earth). This allocation mechanism SHOULD be used for IPv6 addresses, unless the configured address pool is small or is otherwise administratively limited.

Once each server is assigned a resource pool during initial connection establishment, it may allocate assigned resources to clients. Once a client releases a resource or its lease expires, the resource returns to the pool of the same server.
Resources never change servers.

During COMMUNICATION-INTERRUPTED events, a partner MAY continue extending existing leases when requested by clients. A healthy partner MUST NOT lease resources that were assigned to its downed partner and later released by a client unless it is in PARTNER-DOWN state.

6.3.  Determining Allocation Approach

6.3.1.  IPv6 Addresses

6.3.2.  IPv6 Prefixes

7.  Failover Mechanisms

This section lays out an overview of the communication between partners and other mechanisms required for failover operation. As this is a design document, not a protocol specification, high level ideas are presented without implementation specific details (e.g. on-wire formats are not given). Implementation details will be specified in a separate draft.

7.1.  Time Skew

Partners exchange information about known lease states. To reliably compare a known lease state with an update received from a partner, servers must be able to reliably compare the times stored in the known lease state with the times received in the update. Although a simple approach would be to require both partners to use synchronized time, e.g. by using NTP, such a service may become unavailable in some scenarios that failover expects to cover, e.g. network partition. Therefore a mechanism to measure and track relative time differences between servers is necessary.

To do so, each message MUST contain an FO_TIMESTAMP option that contains the timestamp of the transmission in the time context of the transmitter. The transmitting server MUST set this as close to the actual transmission as possible. The receiving partner MUST store its own timestamp of the reception event as close to the actual reception as possible. The received timestamp is then compared with the local timestamp.
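As an illustration only, the timestamp comparison above, smoothed with the moving average that the following text introduces, might be sketched as shown here. The class name, the `to_local` helper, and the window size are assumptions made for this sketch; the draft names the window TIME_SKEW_PKTS_AVG but does not give it a value in this section. Each measured difference also includes the one-way transit delay, which this sketch does not try to separate out.

```python
from collections import deque


class TimeSkewEstimator:
    """Hypothetical helper: tracks the clock offset to a failover partner."""

    def __init__(self, window=10):
        # `window` stands in for TIME_SKEW_PKTS_AVG; 10 is an assumed value.
        self._samples = deque(maxlen=window)

    def observe(self, sent_ts, received_ts):
        """Record one message.

        sent_ts     -- timestamp carried in the message (partner's clock)
        received_ts -- local timestamp taken at reception (local clock)
        """
        self._samples.append(received_ts - sent_ts)

    @property
    def skew(self):
        """Moving average of the recorded differences, in signed seconds."""
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)

    def to_local(self, partner_ts):
        """Translate a time in the partner's clock into the local clock."""
        return partner_ts + self.skew


est = TimeSkewEstimator(window=3)
for sent, received in [(100, 105), (200, 207), (300, 303), (400, 409)]:
    est.observe(sent, received)

# Only the last three differences (7, 3, 9) remain in the window.
print(est.skew)
print(est.to_local(1000))
```

A bounded deque keeps the window fixed without explicit bookkeeping: once it is full, each new sample displaces the oldest one.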
To account for packet delay variation (jitter), the measured difference is not used directly; instead, the moving average of the time differences of the last TIME_SKEW_PKTS_AVG packets is calculated. This averaged value is referred to as the time skew.

Note that the time skew algorithm allows cooperation between servers with completely desynchronized clocks as well as those whose desynchronization itself is not constant.

7.2.  Time expression

Timestamps are expressed as the number of seconds since midnight (UTC), January 1, 2000, modulo 2^32. Note that this is the same approach as used in the creation of DUID-LLT (see Section 9.2 of [RFC3315]).

Time differences are expressed in seconds and are signed.

7.3.  Lazy updates

Lazy update refers to the requirement placed on a server implementing a failover protocol to update its failover partner whenever the binding database changes. A failover protocol which didn't support lazy update would require the failover partner update to complete before a DHCPv6 server could respond to a DHCPv6 client request. The lazy update mechanism allows a server to allocate a new lease or extend an existing lease and then update its failover partner as time permits.

Although the lazy update mechanism does not introduce additional delays in server response times, it introduces other difficulties. The key problem with lazy update is that when a server fails after updating a client with a particular lease time and before updating its partner, the partner will believe that a lease has expired even though the client still retains a valid lease on that address or prefix.

7.4.  MCLT concept

In order to handle the problem introduced by lazy updates (see Section 7.3), a period of time known as the "Maximum Client Lead Time" (MCLT) is defined and must be known to both the primary and secondary servers.
Proper use of this time interval places an upper bound on the difference allowed between the lease time provided to a DHCPv6 client by a server and the lease time known by that server's failover partner.

The MCLT is typically much less than the lease time that a server has been configured to offer a client, and so some strategy must exist to allow a server to offer the configured lease time to a client. During a lazy update the updating server typically updates its partner with a potential expiration time which is longer than the lease time previously given to the client and which is longer than the lease time that the server has been configured to give a client. This allows that server to give a longer lease time to the client the next time the client renews its lease, since the time that it will give to the client will not exceed the potential expiration time acknowledged by its partner by more than the MCLT.

The fundamental relationship on which much of the correctness of this protocol depends is that the lease expiration time known to a DHCPv6 client MUST NOT under any circumstances be more than the maximum client lead time (MCLT) greater than the potential expiration time known to a server's partner.

The remainder of this section makes the above fundamental relationship more explicit.

This protocol requires a DHCPv6 server to deal with several different lease intervals and places specific restrictions on their relationships. The purpose of these restrictions is to allow the other server in the pair to make certain assumptions when the servers are unable to communicate.

The different times are:

desired valid lifetime: The desired valid lifetime is the lease interval that a DHCPv6 server would like to give to a DHCPv6 client in the absence of any restrictions imposed by the failover protocol.
Its determination is outside of the scope of this protocol. Typically this is the result of external configuration of a DHCPv6 server.

actual valid lifetime: The actual valid lifetime is the lease interval that a DHCPv6 server gives out to a DHCPv6 client. It may be shorter than the desired valid lifetime (as explained below).

potential valid lifetime: The potential valid lifetime is the potential lease expiration interval that the local server sends to its partner in a BNDUPD message.

acknowledged potential valid lifetime: The acknowledged potential valid lifetime is the potential lease interval that the partner server has most recently acknowledged in a BNDACK message.

7.4.1. MCLT example

The following example demonstrates the MCLT concept in practice. The values used are arbitrarily chosen and are not a recommendation for actual values. The MCLT in this case is 1 hour. The desired valid lifetime is 3 days, and its renewal time is half the valid lifetime.

When a server makes an offer for a new lease on an IP address to a DHCPv6 client, it determines the desired valid lifetime (in this case, 3 days). It then examines the acknowledged potential valid lifetime (which in this case is zero) and determines the remainder of the time left to run, which is also zero. To this it adds the MCLT. Since the actual valid lifetime cannot be allowed to exceed the remainder of the current acknowledged potential valid lifetime plus the MCLT, the offer made to the client is for the remainder of the current acknowledged potential valid lifetime (i.e., zero) plus the MCLT. Thus, the actual valid lifetime is 1 hour.

Once the server has sent the REPLY to the DHCPv6 client, it will update its failover partner with the lease information. However, the desired potential valid lifetime will be composed of one half of the current actual valid lifetime added to the desired valid lifetime.
Thus, the failover partner is updated with a BNDUPD with a potential valid lifetime of 3 days + 1/2 hour.

When the primary server receives a BNDACK to its update of the secondary server's (partner's) potential valid lifetime, it records that as the acknowledged potential valid lifetime. A server MUST NOT send a BNDACK in response to a BNDUPD message until it is sure that the information in the BNDUPD message has been recorded in its lease database. Thus, when the primary server receives a BNDACK message from the secondary server, it can be sure that the secondary server has recorded the potential lease interval in its stable storage.

When the DHCPv6 client attempts to renew at T1 (approximately half an hour after the start of the lease), the primary server again determines the desired valid lifetime, which is still 3 days. It then compares this with the remaining acknowledged potential valid lifetime (3 days + 1/2 hour) and adjusts for the time passed since the secondary was last updated (1/2 hour). Thus the time remaining of the acknowledged potential valid lifetime is 3 days. Adding the MCLT to this yields 3 days plus 1 hour, which is more than the desired valid lifetime of 3 days. So the client is renewed for the desired valid lifetime of 3 days.

When the primary DHCPv6 server updates the secondary DHCPv6 server after the DHCPv6 client's renewal REPLY is complete, it will calculate the desired potential valid lifetime as the T1 fraction of the actual client valid lifetime (1/2 of 3 days this time = 1.5 days). To this it will add the desired client valid lifetime of 3 days, yielding a total desired potential valid lifetime of 4.5 days.
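The arithmetic of this example can be reproduced with a short sketch. It is only an illustration under stated assumptions, not a definition of server behavior: the function names are hypothetical, all intervals are in seconds, and the partner is assumed to have acknowledged the first BNDUPD at the very start of the lease.

```python
HOUR = 3600
DAY = 24 * HOUR


def actual_valid_lifetime(desired: int, ack_potential_remaining: int, mclt: int) -> int:
    """Lifetime the server may give to the client, in seconds.

    It is the desired valid lifetime, capped at the remainder of the
    acknowledged potential valid lifetime plus the MCLT.
    """
    return min(desired, max(ack_potential_remaining, 0) + mclt)


def potential_valid_lifetime(desired: int, actual: int, t1_fraction: float = 0.5) -> int:
    """Potential valid lifetime to send to the partner in a BNDUPD.

    It is the time until the expected renewal (the T1 fraction of the
    actual valid lifetime) plus the desired valid lifetime.
    """
    return int(actual * t1_fraction) + desired


def worked_example() -> dict:
    """Replay the example: MCLT of 1 hour, desired valid lifetime of 3 days."""
    mclt, desired = 1 * HOUR, 3 * DAY

    # New lease: the partner has not acknowledged anything yet.
    first_actual = actual_valid_lifetime(desired, 0, mclt)
    first_potential = potential_valid_lifetime(desired, first_actual)

    # Renewal at T1, half of the first actual valid lifetime later.
    elapsed = first_actual // 2
    remaining = first_potential - elapsed
    renew_actual = actual_valid_lifetime(desired, remaining, mclt)
    renew_potential = potential_valid_lifetime(desired, renew_actual)

    return {
        "first_actual": first_actual,        # 1 hour
        "first_potential": first_potential,  # 3 days + 1/2 hour
        "renew_actual": renew_actual,        # 3 days
        "renew_potential": renew_potential,  # 4.5 days
    }
```

Calling worked_example() returns 3600, 261000, 259200 and 388800 seconds for the four values, matching the 1 hour, 3 days + 1/2 hour, 3 days and 4.5 days derived in the text.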
In this way, the primary attempts to have the secondary always "lead" the client in its understanding of the client's valid lifetime so as to be able to always offer the client the desired client valid lifetime.

Once the initial actual client valid lifetime of the MCLT is past, the protocol operates effectively like the DHCPv6 protocol does today in its behavior concerning valid lifetimes. However, the guarantee that the actual client valid lifetime will never exceed the remaining acknowledged partner server potential valid lifetime by more than the MCLT allows full recovery from a variety of failures.

7.5. Unreachability detection

Each partner maintains an FO_SEND timer for each partner connection. The FO_SEND timer is reset every time any message is transmitted. If the timer reaches the FO_SEND_MAX value, a CONTACT message is transmitted and the timer is reset. The CONTACT message may be transmitted at any time.

Discussion: Perhaps it would be more reasonable to use an echo-reply approach, rather than periodic transmissions?

7.6. Sending Data

Each server updates its failover partner about recent changes in lease states. Each update must include the following information:

1. resource type - non-temporary address or a prefix

2. resource information - actual address or prefix

3. valid life time requested by client

4. IAID - Identity Association Identifier used by the client while obtaining this lease. (Note 1: one client may use many IAIDs simultaneously. Note 2: IAIDs for IA, TA and PD are orthogonal number spaces.)

5. valid life time sent to client

6. potential valid life time

7. preferred life time sent to client

8. CLTT - Client Last Transaction Time, a timestamp of the last received transmission from a client

9. assigned FQDNs, if any (optional)

Discussion: Do we need T1 as well? Something like next expected client transmission?

Q: Maybe we could reuse IA_NA and IA_PD options here?
Yes.

Q: Do we care about preferred lifetime? (presumably no). Certainly not what was requested by the client.

Q: Do we care about IAID? (presumably yes) Yes.

7.6.1. Required Data

7.6.2. Optional Data

7.7. Receiving Data

7.7.1. Conflict Resolution

TODO: This is just a loose collection of notes. This section will probably need to be rewritten as a flowchart of some kind.

The server receiving a lease update from its partner must evaluate the received lease information to see if it is consistent with already known state and decide which information - previously known or just received - is "better". The server should take into consideration the following aspects: whether the lease is already assigned to a specific client, which server had contact with the client most recently, the start time of the lease, etc.

The lease update may be accepted or rejected. Rejection SHOULD NOT change the flag in a lease that says that it should be transmitted to the failover partner. If this flag is set, then the lease should be transmitted, but if it is not already set, the rejection of a lease state update SHOULD NOT trigger an automatic update of the failover partner sending the rejected update. The potential for update storms is too great, and in the unusual case where the servers simply can't agree, that disagreement is better than an update storm.

Discussion: There will definitely be different types of update rejections. For example, this will allow a server to treat the case of receiving a new lease that it has not seen before differently from the case where the partner sends an old version of a lease for which a newer state is known.

7.7.2. Acknowledging Reception

8. Endpoint States

The following sections contain a detailed description of all possible states of a failover endpoint.

8.1. State Machine Initialization

TODO

8.2. NORMAL State Operation

TODO

8.3. COMMUNICATION-INTERRUPTED State Operation

TODO

8.4.
PARTNER-DOWN State Operation

TODO

8.5. RECOVERING State Operation

TODO

8.6. State Transitions

TODO

9. Proposed extensions

The following section discusses possible extensions to the proposed failover mechanism. The listed extensions must be simple enough not to further complicate the failover protocol. Any proposals that are considered complex will be defined as stand-alone extensions in separate documents.

9.1. Active-active mode

A very simple way to achieve active-active mode is to remove the restriction that the secondary server MUST NOT respond to SOLICIT and REQUEST messages. Instead, it could respond, but it MUST have a lower preference than the primary server. Clients discovering available servers will receive ADVERTISE messages from both servers, but are expected to select the primary server as it has a higher preference value configured. The following REQUEST message will be directed to the primary server.

Discussion: Do DHCPv6 clients actually do this? DHCPv4 clients were rumored to wait for a "while" to accept the best offer, but to a first approximation, they all take the first offer they receive that is even acceptable.

The benefit of this approach, compared to the "basic" active-passive solution, is that there is no delay between primary failure and the moment when the secondary starts serving requests.

Discussion: The possibility of setting both servers' preference to an equal value could theoretically work as a crude attempt to provide load balancing. It wouldn't do much good on its own, as one (faster) server could be chosen more frequently (assuming that with equal preference set, clients will pick the first responding server, which is not mandated by DHCPv6). We could design a simple mechanism of dynamically updating preference depending on usage of available resources. This concept hasn't been investigated in detail yet.

10.
Dynamic DNS Considerations

TODO: Describe the challenges of DNS updates in a failover environment. They are nicely described in Section 5.12 of [dhcpv4-failover].

11. Reservations and failover

TODO: Describe how lease reservation works with failover. See Section 5.13 in [dhcpv4-failover].

12. Protocol entities

Discussion: It is unclear whether the following sections belong in the design draft or the protocol draft. They are currently kept here as a scratchpad listing things that will have to be defined eventually. Whether they will stay in this document or be moved to the protocol specification document is TBD.

12.1. Failover Protocol

This section enumerates the options that will be defined in the failover protocol specification. A rough description of the purpose and content of each option is given. The exact on-wire format will be defined in the protocol specification.

1. OPTION_FO_TIMESTAMP - conveys timestamp information. It is used by the time skew measurement algorithm (see Section 7.1).

12.2. Protocol constants

This section enumerates various constants that have to be defined in the actual protocol specification.

1. TIME_SKEW_PKTS_AVG - number of packets that are used to calculate the average time skew between partners. See Section 7.1.

13. Open questions

This is a scratchpad. This section will be removed once the questions are answered.

Q: Do we want to support temporary addresses? I think not. They are short-lived by definition, so clients should not mind getting new temporary addresses.

Q: Do we want to support CGA-registered addresses? There is currently work in DHC WG about this, but I haven't looked at it yet. If that is complicated, we may not define it here, but rather as an extension. [If it moves forward, we need to support it.]

14.
Security Considerations

TODO: Security considerations section will contain loose notes and will be transformed into consistent text once the core design solidifies.

15. IANA Considerations

IANA is not requested to perform any actions at this time.

16. Acknowledgements

This document extensively uses concepts, definitions and other parts of [dhcpv4-failover] document. Authors would like to thank Shawn Routher, Greg Rabil, and Bernie Volz for their significant involvement and contributions.

This work has been partially supported by Department of Computer Communications (a division of Gdansk University of Technology) and the Polish Ministry of Science and Higher Education under the European Regional Development Fund, Grant No. POIG.01.01.02-00-045/09-00 (Future Internet Engineering Project).

17. References

17.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2131] Droms, R., "Dynamic Host Configuration Protocol", RFC 2131, March 1997.

[RFC3074] Volz, B., Gonczi, S., Lemon, T., and R. Stevens, "DHC Load Balancing Algorithm", RFC 3074, February 2001.

[RFC3315] Droms, R., Bound, J., Volz, B., Lemon, T., Perkins, C., and M. Carney, "Dynamic Host Configuration Protocol for IPv6 (DHCPv6)", RFC 3315, July 2003.

[RFC3633] Troan, O. and R. Droms, "IPv6 Prefix Options for Dynamic Host Configuration Protocol (DHCP) version 6", RFC 3633, December 2003.

[RFC4704] Volz, B., "The Dynamic Host Configuration Protocol for IPv6 (DHCPv6) Client Fully Qualified Domain Name (FQDN) Option", RFC 4704, October 2006.

[RFC5460] Stapp, M., "DHCPv6 Bulk Leasequery", RFC 5460, February 2009.

17.2. Informative References

[I-D.ietf-dhc-dhcpv6-redundancy-consider] Tremblay, J., Brzozowski, J., Chen, J., and T.
Mrugalski, "DHCPv6 Redundancy Deployment Considerations", draft-ietf-dhc-dhcpv6-redundancy-consider-02 (work in progress), October 2011.

[RFC2136] Vixie, P., Thomson, S., Rekhter, Y., and J. Bound, "Dynamic Updates in the Domain Name System (DNS UPDATE)", RFC 2136, April 1997.

[dhcpv4-failover] Droms, R., Kinnear, K., Stapp, M., Volz, B., Gonczi, S., Rabil, G., Dooley, M., and A. Kapur, "DHCP Failover Protocol", draft-ietf-dhc-failover-12 (work in progress), March 2003.

[requirements] Mrugalski, T. and K. Kinnear, "DHCPv6 Failover Requirements", draft-ietf-dhc-dhcpv6-failover-requirements-00 (work in progress), October 2011.

Authors' Addresses

Tomasz Mrugalski
Internet Systems Consortium, Inc.
950 Charter Street
Redwood City, CA 94063
USA
Phone: +1 650 423 1345
Email: tomasz.mrugalski@gmail.com

Kim Kinnear
Cisco Systems, Inc.
1414 Massachusetts Ave.
Boxborough, Massachusetts 01719
USA
Phone: +1 (978) 936-0000
Email: kkinnear@cisco.com