Happy Earballs: Success with Dual-Stack SIP
draft-worley-sipcore-happy-earballs-00

Abstract

TBD: The Session Initiation Protocol (SIP) supports multiple transports running both over IPv4 and IPv6 protocols. In more and more cases, a SIP user agent (UA) is connected to network interfaces with multiple address families. In these cases sending a message from a dual stack client to a dual stack server may suffer from the issues described in [RFC6555] ("Happy Eyeballs"): the UA attempts to send the message using IPv6, but IPv6 connectivity is not working to the server. This can cause significant delays in the process of sending the message to the server. This negatively affects the user's experience.

TBD: This document builds on [RFC6555] by modifying the procedures specified in [RFC3263] and related specifications to require that a client ensure that communication targets are accessible before sending messages to them, to allow a client to contact targets out of the order required by other specifications, and to require a client to properly distribute the message load among targets over time.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on October 6, 2017.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction
2. Terminology
3. Structure of This Document

3.1. Scope of Applicability

4. Baseline Procedures

4.1. Target Ordering
4.2. Priority Nodes
4.3. Unordered Nodes
4.4. Combinations of Prioritization and Unordered Nodes
4.5. Load-Balancing Nodes

5. Primary Procedure Modifications

5.1. Permitted to Reorder Targets
5.2. Must Preserve Traffic Distribution

6. Additional Requirements

6.1. Address Family Preference
6.2. Address Selection
6.3. Vias
6.4. DNS Caching
6.5. Unused Flows
6.6. Debugging and Troubleshooting

7. Consequences of the New Requirements
8. Generic Algorithm

8.1. Policies and Implementor Latitude
8.2. Examples

8.2.1. Two Unordered Targets, Both Cached
8.2.2. Two Unordered Targets, Both Cached, One Slow
8.2.3. Two Unordered Targets, One Cached
8.2.4. Two Unordered Targets, Neither Cached
8.2.5. Three Unordered Targets, One Cached
8.2.6. Two Prioritized Targets, Both Cached
8.2.7. Two Prioritized Targets, the Second Cached
8.2.8. Two Prioritized Targets, the First Cached
8.2.9. Two Prioritized Targets, Neither Cached
8.2.10. Three Targets

9. Using a Simplified Data Structure
10. Security Considerations
11. IANA Considerations
12. History

12.1. Changes from draft-worley-sip-happy-earballs-01 to draft-worley-sipcore-happy-earballs-00
12.2. Changes from draft-worley-sip-happy-earballs-00 to draft-worley-sip-happy-earballs-01
12.3. Changes from draft-worley-sip-he-connection-01 to draft-worley-sip-happy-earballs-00
12.4. Changes from draft-worley-sip-he-connection-00 to draft-worley-sip-he-connection-01
12.5. Changes from draft-johansson-sip-he-connection-01 to draft-worley-sip-he-connection-00

13. References

13.1. Normative References
13.2. Informative References

Appendix A. Implementing Load Balancing
Acknowledgments
Author's Address

1. Introduction

                            Earballs -- n., another word for ears.
                             Made famous by the animated American TV
                             spy comedy, "Archer".

                             "Ow, my earballs!" -- Cheryl Tunt, "Archer"

                             -- from "Urban Dictionary"

The Session Initiation Protocol (SIP) [RFC3261] and the documents that extended it provide support for both IPv4 and IPv6. However, this support has problems with environments that are characteristic of the transitional migratory phase from IPv4 to IPv6 networks: During this phase, many server and client implementations run on dual-stack hosts. In such environments, a dual-stack host will likely suffer greater connection delay, and by extension an inferior user experience, than an IPv4-only host.

The difficulty stems from the reality that a device cannot predict whether apparent IPv6 connectivity to another device is usable; both devices may have IPv6 addresses and yet some transit network between the two may not transport IPv6. SIP requires a device that transmits a request to one destination address (e.g., the apparently useful IPv6 address) to wait for a response for a substantial period (usually 32 seconds) before transmitting the request to another destination address (the less-preferred IPv4 address). The result is that apparent IPv6 connectivity that is not functional can cause substantial delays in processing SIP requests. Especially when the requests are call setups (INVITE requests) this creates a very poor user experience.

The overall procedure for transmitting a SIP message is as follows: The specification of where a message is to be sent is called a "goal"; in the common case, the goal is the host-port part of a SIP URI. The SIP client uses specified algorithms (particularly [RFC3263]) to make DNS lookups to determine a prioritized set of "targets" (basically, destination IP address family, IP address, transport protocol, and port). Under these specifications, the client contacts the targets in order until one is contacted successfully. In order to contact a target, the client establishes a transport connection (if necessary), sends the message using the transport (possibly resending the message several times), and then (for requests) waits for a response (either provisional or final). The process ends successfully if the client receives a response. The process ends unsuccessfully if the client receives a permanent error from the transport layer or if a SIP timer (Timer B or Timer F in [RFC3261]) expires. Timeouts generally default to 32 seconds. If the user has to wait for even one timeout, this will seriously degrade the user experience. Thus, it is desirable to minimize the number of times the client has timeouts when sending requests.

If the target list contains both IPv6 addresses and IPv4 addresses, this procedure can degrade the user's experience in common situations. Typically, this problem arises when the client has an IPv6 interface, the server's preferred address is an IPv6 address, but the transit networks between the client and server do not carry IPv6. This can cause the client to attempt to send a SIP request to the IPv6 target and then wait for a timeout before it continues with an IPv4 target. This problem parallels a problem that was widely seen in web browsers. The web browser problem was cured by specifying that web browsers should use a "Happy Eyeballs" algorithm [RFC6555] to determine the order in which to contact target addresses.

This document builds on the concepts of [RFC6555] by amending these procedures, so that targets derived from a goal may be contacted in a different order than is specified by the previous specifications. By analogy with the name "Happy Eyeballs" for similar algorithms in web browsers, we label these algorithms "Happy Earballs" [UD].

Conceptually, the Happy Earballs algorithm is composed of a number of interlocking components:

For a given message, the set of targets is organized into a directed acyclic graph (or partial order) specifying which targets must be contacted before other targets.
Targets which are unresponsive, or which respond too slowly relative to other targets in the set, are moved to the end of the order.
Despite this change, the relative proportions of traffic sent to responsive targets remains as specified by the baseline specifications.
The client caches the known response times of targets.
The client does not send the message to a target before knowing the target's response time (unless this is the last target in the set). If needed, the client sends a "probe" transmission (which has no semantic effect) to determine the target's response time.

These components are assembled into an event-driven algorithm for sending the message to targets and, when necessary, sending probes to targets. Implementors have flexibility in choosing among targets that are not prioritized, and choosing when to send probes to targets. Implementors also have flexibility defining when one target is considered "too slow" compared to another target. (TBD: Is that true?) Nonetheless, all variants of this algorithm have the properties of distributing traffic among responsive targets as specified by the baseline specifications and not sending messages to non-responsive targets (unless all responsive targets have failed).

This document does not address the similar problem for media (RTP and RTCP) packets. Problems with media connectivity can be addressed using ICE [RFC5245], ALTC [RFC6947], or probing using RTP packets.

2. Terminology

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

baseline:: the prior specifications of the behavior of a client for sending a message to a goal. The baseline specifications are modified by this document.
cache:: (verb) to store temporarily information regarding the response time of a target so as to accelerate future message transmission; (noun) the collection of information so stored
client:: the device that must send a message
CRTT(T):: for a target T, the cached round-trip response time of T (if any is cached). There is a special value of "infinity" if T did not respond at all. Collectively, these values are called "CRTT() values". Contrasted with RTT(T), which is the actual round-trip response time (at some given moment).
dual-stack:: narrowly, a device with both IPv4 and IPv6 interfaces; broadly, a device with more than one interface, including those with interfaces that use two or more address families
flow:: a group of transmissions to a target which are considered related by a transport protocol. For connection-oriented protocols, is the data carried by a connection. For connectionless protocols, is all messages sent to a particular target (5-tuple). For protocols with security associations, is all messages sent within a particular security association.
goal:: the identification of a particular server. E.g., a URI, a TSAP, a via-para, or information provided by the context of the message.
graph:: a directed, acyclic graph whose nodes are a set of targets and whose arcs are requirements that one target must be sent to before another target is sent to. (Mathematically, the graph defines a partial order on the targets.) Graph arcs are drawn without arrowheads, and are to be read as the leftmost endpoint must be sent to before the rightmost endpoint. A graph is always relative to the sending of a particular message and may be modified during the processing of the message. A graph may be augmented with dummy nodes that do not represent targets in order to reduce the number of arcs that must be drawn to express the requirements. (Abstractly, dummy nodes may be considered targets to which sends immediately fail.)
initial:: a target which has no target prioritized before it (considered relative to all targets in the target set, or some subset or rearrangement of the target set, depending on the context)
Limit(t):: a function converting one time value into another. If RTT(T1) > Limit(RTT(T0)), then target T1 responds "too slowly" relative to the response time of target T0, and T1 is considered non-responsive. Depends on two parameters, "m" and "f". Limit(infinity) is considered to be infinity.
normal:: a target which is not slow (relative to a particular goal)
NSAP:: "Network Service Access Point", the identification of a network interface, which comprises an address family and an address
priority:: the situation where the client must send the message to a first target before it may send the message to a second target. The prioritization of a set of targets necessarily forms a partial order and can be represented by a directed acyclic graph.
probe:: a transport operation that attempts to determine if a target is responsive, without transmitting a message. Since a probe does not send a message, if the transmission fails, it does not commit the client to waiting a timeout period before sending the message to another target.
quick:: a target whose cached RTT() is less than Limit(0), and thus is never slow for any goal
RTT(T):: for a target T, the actual round-trip response time of T. There is a special value of "infinity" if T does not respond at all. Collectively, these values are called "RTT() values". Contrasted with CRTT(T), which is the RTT(T) measurement recorded in the cache.
responsive:: a property of a target T0 relative to the set of targets for a goal: the response time of the target T0 is sufficiently short when compared with the response time of other targets; RTT(T0) <= RTT(T1) for any T1 in the set of targets. See TBD Section 5 for discussion.
send:: (verb and noun) to attempt transmission of a (possibly modified copy of) a message to a target. Contrasted with "successfully send", which is when the message is received by the server or when the client detects that the message is received by the server. Does not include "probe" transport operations.
server:: the (conceptual) device to which a message is to be sent. May consist of multiple physical devices listening on multiple network interfaces.
slow:: a target T0 which appears to be "too slow" based on the cached RTT() information, i.e., there is another target T1 of the goal for which RTT(T0) > Limit(RTT(T1)) using the cached RTT() values. This includes the case where RTT(T0) is infinity, i.e., T0 does not respond at all. Opposite of "normal".
target:: the complete specification of a transport to be used to send a message from the client to the server. A target is commonly conceptualized as "protocol/address/port" (which is a TSAP), but the target also includes the TSAP that will be used as the source of the communication, and so "5-tuple" is more accurate. In many cases, the source TSAP is determined by the destination TSAP, so it is not mentioned. Perforce, the transport protocol and address family of the source and destination TSAPs are the same.
timeout:: (noun) the period of time after sending a message which a client must wait before it is permitted to send the message to another target without receiving positive indication of the failure of the first transmission. For sending requests, either Timer B or Timer F [RFC3261]. Refers either to the abstract time interval or a particular instance after a particular send operation. (verb) the event when the timeout period has expired.
TSAP:: "Transport Service Access Point", the identification of an endpoint of a transport flow, which usually comprises a transport protocol, an NSAP (or network address), and a port number
traffic:: the messages for a particular goal that are successfully sent to a particular target or set of targets; the number of such messages sent over a period of time; or the fraction of such messages relative to all messages for the goal

3. Structure of This Document

This document assumes that the context of the message provides a "goal", which is the specification of the device or collection of devices which are the server, and that there are existing "baseline" specifications which translate the goal into a set of "targets", and in what order(s) the client may send (possibly modified copies of) the message to the targets, until one of the send operations is successful.

Section 4 introduces a model of how the baseline specifications operate: in every instance, they produce a partial order or directed acyclic graph of targets specifying which targets must be sent to before sending to which other targets. This section also describes how "load-balanced" targets (SRV records with equal priority) are be handled within this model.

Section 5describes the major modifications of the procedures for how a client sends a message to a server.

This document relaxes the requirements on the client regarding the order(s) in which the message may be sent to the targets, that is, it permits additional orders, so that the client is less likely to have to wait for a timeout. On the whole, when network connectivity is imperfect, this allows clients to transmit the messages to servers more quickly than they would using the unmodified baseline specifications.

However, this document also places additional restrictions on the client's sending behavior to ensure that the overall traffic distribution among the targets converges over time to the distribution that would have resulted from obeying the baseline specifications.

Section 6 requires certain additional behaviors that ensure that the use of IPv6 is not disadvantaged in mixed IPv4/IPv6 networks as well as a number of miscellaneous requirements to optimize the behavior of clients.

Section 7 discusses some consequences of the new requirements, including what new orders of targets are permitted, what behaviors minimize the time needed to successfully send a message, techniques for probing a target (that is, determining if it is responsive without sending the message, and thus possibly committing to waiting for a timeout period), and suitable approaches for caching information about targets.

These consequences, when assembled with the overarching goal of minimizing the time to successfully send the message, produce a generic algorithm, which is succinctly described in event-driven terms in Section 8.

Section 8.2 applies the generic algorithm to a number of simple cases, including how it replicates "Happy Eyeballs" behaviors in analogous situations.

Section 9 constructs a simplified algorithm (which is a subset of the generic algorithm) that replaces the directed acyclic graph with a list of lists.

3.1. Scope of Applicability

This document modifies any SIP target selection processes that are defined now or may be defined in the future, excepting those that explicitly exempt themselves.

This document does not affect communications specified to be carried only by a single WebSocket transport, as in those contexts there is only one transport target (the WebSocket connection), and hence the target selection process is trivial.

Similarly, this document does not affect sending a message when it is specified to be carried by a particular existing connection, as when a client sends a response to a request that it received via a connection-oriented protocol, and that connection still exists.

A client MUST NOT consider the set of the target URIs of a "forking" operation to be a single goal to which the processes of this document apply. Instead, the modifications MUST be applied to each of those URIs as separate goals. This is because the decision of whether to send a request to a later forking target may be affected by the SIP response to an earlier transmission ([RFC3261] section 16). A forking proxy may, as part of its policy, apply some or all of these procedures to a forking operation. This may be useful when sending a request to a UAS that has registered multiple addresses for the same AoR (as may be discerned by examining the +sip.instance media feature tags in the registrations [RFC5626]). However, any such processing is outside the scope of this document.

4. Baseline Procedures

The situation that this document addresses is when a SIP device is required to send a message (which may be either a request or a response). This document uses the term "client" for the device which must send the message. The client is given a "goal", which is the specification of the "server", which is the (possibly composite) device to which the message is to be sent. (Both of these usages are broader than the usage in [RFC3261].) The purpose of client is to successfully send the message to the server.

(Note that in the case of a request, when the message is sent to a target, a Via header field will be added to the message, and that the added Via header field will be different for each target. If the client is a UAC, it may effectively use the SIP signaling path as a connectivity check for the media path by varying the SDP between requests sent to different targets. This document considers all of these versions of the message to be copies of the original message to be sent.)

If the message is a request, the goal is usually the hostport part of the URI in either the first Route header field or the request-line. If the message is a response, the goal can be specified by the first via-param in the first Via header field. If the message is to be sent to an outbound proxy as specified by a DHCP option ([RFC3319] or [RFC3361]), then the goal is the ordered list of addresses or domain names provided by the DHCP option. In other situations, the goal may be specified by other means.

Baseline specifications (e.g., [RFC3263], [RFC3319], [RFC3361], [RFC6724], [RFC7984]) prescribe the construction of a set "targets" which are potential transport destinations to which the message can be sent. Which specifications apply is determined by the context of the message. Targets are commonly conceptualized as protocol/address/port combinations, but in general they are the pairs of source and destination TSAPs that provide the full specification of a transport flow.

For example, the sending of an initial REGISTER message can involve six steps of expanding the goal into a list of targets:

Selecting one of the SIP domain names from the list provided by a DHCP option [RFC3319] and [RFC3361].
NAPTR translation from a SIP domain name to a server domain name [RFC3263].
Selecting a transport protocol (e.g., UDP or TCP), if there is no transport parameter in the URI.
SRV translation of a server domain name to server interface names [RFC2782].
A/AAAA translation of a server interface name to server addresses.
Selecting a source address to use to communicate with a server address [RFC6724].

The process of deriving a set of targets from a goal can be conceptualized as constructing a tree, with the root node being the goal and the leaf nodes being the targets (whether or not an implementation constructs such a representation). Each non-leaf node is expanded into zero or more child nodes by the application of the appropriate baseline specification.

For a particular node, the relevant baseline specification may prescribe relationships between the traffic volume sent to the subsets of targets that are descended from its children. E.g., a standard may prescribe prioritization, such that if any target descended from a higher-priority child is responsive, no traffic should be sent to any target descended from a lower-priority child. (SRV records and DHCP options can specify prioritization.) Similarly, a standard may prescribe load balancing, such that if there are responsive targets descended from two children, the ratio of traffic to the two subsets targets descended from the two children must be a particular (non-zero positive) number. (SRV records can specify load balancing.) Alternatively, a node may place no restrictions on the traffic to the subsets of targets descended from its children.

As always, the construction of the tree and the traffic restrictions incorporated into it may be modified by the local policy. In this document, we assume that all modifications are made to the tree that summarizes the requirements of the baseline specifications. This makes it easier to determine the the interaction of local policy with the procedure modifications of this document. And this assumption does not limit the generality of what local policy may do, since the local policy can remove any ordering restrictions from the tree, thus permitting almost any behavior by the modified procedures.

The targets as generated by the baseline specifications MAY be subsetted by deleting any targets that the client cannot access for reasons such as the client does not implement the protocol, or it does not have a network interface that supports the protocol, or it does not have a network interface that can communicate with the address. Removing these targets at an early stage of processing does not affect the on-the-wire behavior of either the baseline processes or the modified processes, since sends to such targets fail immediately.

What constitutes failure of a send depends on the situation, and may be a transport protocol failure, the absence of a timely 100 Trying response, or a 503 response ([RFC3261] section 21.5.4 and [RFC3263] section 4.3). For any particular message, either the message is successfully sent to exactly one target or the overall sending process fails.

In the worst-case situation, the process may require waiting for one or more transaction timeouts (e.g., Timer B or Timer F in [RFC3261]) before successfully transmitting the message to a target. As the timeouts are typically 32 seconds, such a wait severely impacts the user experience.

4.1. Target Ordering

The baseline specifications assume that the client will effectively generate an order in which to contact the targets, then the client will sequence through the list, sending the message to each target until one of the sends is deemed to be successful. (Each send may include retransmissions of the message.) This is because at any stage, the client's next action is determined only by the goal and the fact that sending to the previous targets has failed -- the first target in the order is the target that the client will choose first (which depends only on the goal), the second target in the order is the target that the client will choose if and when the send to the first target fails (which depends only on the goal and the identity of the first target), etc.

(In mathematical terms, the target order is a total ordering of the targets that is compatible with the partial ordering of the targets specified by the traffic restrictions.)

The order must be compatible with the traffic restrictions imposed by the specifications on the targets, and the traffic restrictions are coded into the nodes of the target-derivation tree. The following subsections discuss how the target order reflects these constraints for every non-leaf node type.

4.2. Priority Nodes

If a node specifies that its children are prioritized, all of the targets descended from a higher-priority child must precede all of the targets descended from a lower-priority child. Suppose the tree of targets has one interior node that specifies prioritization of three targets. This can result from these DNS records:

    _sip._udp.example.com.    SRV    1 1 5060 sip1.example.com
    _sip._udp.example.com.    SRV    2 1 5060 sip2.example.com
    _sip._udp.example.com.    SRV    3 1 5060 sip3.example.com
    sip1.example.com.         A      192.0.2.1
    sip2.example.com.         A      192.0.2.2
    sip3.example.com.         A      192.0.2.3

We show the tree with the targets from left to right from highest priority to lowest priority:

         |
    -priority--
    |    |    |
    A    B    C

We can then represent the traffic restrictions in a (directed, acyclic) graph. In our graphs, the arcs are to be read from left to right, requiring sending to the left endpoint before sending to the right endpoint. In this case, sending to A must be done before sending to B, and sending to B before sending to C. We can add dummy "start" and "finish" nodes, represented by "*", for convenience in later analysis. We can imagine them to be targets to which sending fails immediately.

    *----A----B----C----*

There is only one allowed target order:

    A B C

4.3. Unordered Nodes

If a node places no traffic restrictions on its children, there is no collective relationship between the targets descended from its children. The targets descended from different children may be appear in any order, and can even be interleaved. We show the tree of a simple example:

    sip.example.com.         A      192.0.2.1
    sip.example.com.         A      192.0.2.2
    sip.example.com.         A      192.0.2.3

         |
    -unordered-
    |    |    |
    A    B    C

We then represent the lack of traffic restrictions by a graph which has no lines between targets. We add dummy start and finish nodes to clarify that all three targets are part of one graph:

      A
     / \
    *-B-*
     \ /
      C

There are six allowed target orders:

    A B C

    A C B

    B A C

    B C A

    C A B

    C B A

4.4. Combinations of Prioritization and Unordered Nodes

A prioritized pair of hosts each with an unordered pair of targets can be specified with these DNS records:

    _sip._udp.example.com.    SRV    1 1 5060 sip1.example.com
    _sip._udp.example.com.    SRV    2 1 5060 sip2.example.com
    sip1.example.com.         A      192.0.2.1
    sip1.example.com.         AAAA   2001:DB8::1
    sip2.example.com.         A      192.0.2.2
    sip2.example.com.         AAAA   2001:DB8::2

These records result in this tree:

              |
        ---priority---
        |            |
    unordered    unordered
    |       |    |       |
    A       B    C       D

From this tree, we generate this graph. We can simplify the graph by adding a dummy target which must follow both A and B and must precede both C and D, in order to avoid writing four arcs A-C, A-D, B-C, and B-D.

      A   C
     / \ / \
    *   *   *
     \ / \ /
      B   D

There are four allowed target orders:

    A B C D

    A B D C

    B A C D

    B A D C

Now consider the situation of a client that has two IPv4 interfaces, 192.0.2.129 and 192.0.2.193 (on two separate /26 subnets), with the first prioritized before the second. If the client is configured to attempt to use every interface for every destination address (rather than being restricted by the source address selection rules Section 6.2), every destination address generates two targets, one with source address 192.0.2.129 and one with source address 192.0.2.193, with the first prioritized before the second.

In this situation, the following DNS records result in an unordered pair of hosts, each with a prioritized pair of targets.

    sip.example.com.         A      192.0.2.1
    sip.example.com.         A      192.0.2.65

This situation results in this tree (where we now have to pay attention to the source addresses in the targets):

                               |
                -----------unordered-----------
                |                             |
         ----priority----              ----priority----
         |              |              |              |
         A              B              C              D
    192.0.2.129    192.0.2.193    192.0.2.129    192.0.2.193
    192.0.2.1      192.0.2.1      192.0.2.65     192.0.2.65

and this graph:

      A----B
     /      \
    *        *
     \      /
      C----D

There are six allowed target orders:

    A B C D

    A C B D

    C A B D

    A C D B

    C A D B

    C D A B

When the client is allowed to select any of the available source addresses for a send, each source address (combined with the destination address) generates a separate target Section 6.2. Of the targets, the one selected by the default source address selection rules is preferred, and the remainder are unordered. If there is only one destination address, it results in a tree with this form:

         |
    --priority--
    |          |
    |      -unordered-
    |      |    |    |
    A      B    C    D

with this graph:

                B
               / \
    *----A----*-C-*
               \ /
                D

There are eight allowed target orders:

    A B C D

    A B D C

    A C B D

    A C D B

    A D B C

    A D C B

4.5. Load-Balancing Nodes

The practical difficulty with load-balancing nodes is to ensure that the right proportion of traffic is sent to the descendants of each child node without having to maintain long-term records of the amount of traffic that has been sent to each child's descendants. A simple way to accomplish this is to generate a new priority ordering of the children for each new message to be processed, with the orderings randomized to provide the correct traffic distribution among the children. A randomization algorithm that achieves the correct traffic distribution is described in Appendix A.

With this method, load-balancing nodes are processed for each message instance by generating a suitably randomized ordering of the children, and then processing them in the same way as the children of a priority node.

5. Primary Procedure Modifications

The following modifications are specified for all baseline specifications:

5.1. Permitted to Reorder Targets

A client MAY send the message to the targets in an order that is not permitted by the baseline specifications.

5.2. Must Preserve Traffic Distribution

To state the next requirement, we must define what it means to say that a target is "responsive". Intuitively, a target is responsive if its response time to a message is not "too much" longer than the response time of any other target for the goal. Note that the responsiveness of a target is always defined relative to the set of targets for a particular goal; hence, a target may be responsive for one goal at the same time that it is not responsive for another goal.

We define RTT(T) to mean the round-trip response time of a target, the time it takes to receive confirmation of the receipt of a message sent to the target. If the target does not respond at all, we consider RTT(T) to be "infinity", which is larger than any number. Note that RTT(T) is a fact about reality at some instant, not a measure of the client's current knowledge about reality at some instant.

We define a function "Limit" that converts one time value into another: if RTT(T1) > Limit(RTT(T0)), then target T1 responds "too slowly" relative to the response time of target T0, and T1 is considered non-responsive. We define Limit(t) = m*t + f, where "m" and "f" are parameters defined below.

The parameter "m" limits the range of response times that we will allow among targets we consider responsive. We set m to be 2. (TBD: Is this a good choice?)

The parameter "f" is the length of time that we consider to be insignificant when comparing the response time of targets. We set f to be 2*T1, where T1 is the value of that name used in the procedures of [RFC3261]. That value is commonly the round-trip time estimate of the relevant network, and defaults to 500ms. (TBD: Is this a good choice?)

The result is that Limit(t) is "twice t, plus a little more to account for the inherent delays in the network".

A target T0 is defined to be "responsive" if

its response time, RTT(T0), is less than the relevant timeout (often Timer B or Timer F), and
for any other target T1 of the goal, RTT(T0) < Limit(RTT(T1)).

TBD: The following wording must be set so that the client can move non-responsive targets to any place in the order (or at least, any later place in the order) without violating this condition. It is not clear this has been accomplished yet. Later: I think we've accomplished this now.

We are now ready to specify the major new constraint on the client's behavior: The client's procedures MUST, over time, distribute the traffic for any particular goal among the responsive targets in the same proportions as are required by the baseline specifications. Specifically,

If the set of targets for a particular goal does not change over a period of time,
for any two subsets of targets, both of which contain at least one target which is responsive for the entire period of time, and
if the baseline specifications prescribe a proportion between the traffic to the two subsets, then
the proportion between the traffic to the responsive subsets of the two subsets MUST converge to the proportion specified by the baseline specifications.

Note that this is a "God's-eye view" requirement; the client has no direct way to determine whether it is satisfying the requirement because it cannot measure RTT() for every possible target at every moment. Nonetheless, the client's procedures must ensure that the requirement is satisfied under all circumstances.

6. Additional Requirements

6.1. Address Family Preference

Unless overridden by user configuration or by network configuration: If the host has a policy of preferring one address family, the client MUST prefer it. If the host's policy is unknown or not obtainable, the client MUST prefer IPv6 over IPv4. Thus, usually clients must give preference to IPv6 over IPv4.

Such a preference MUST be implemented in the following way: Consider the "initial" targets, which are the targets which the baseline specifications do not prioritize after any other targets. The client must additionally prioritize the initial targets which are of the preferred address family before the other initial targets.

TBD: Is this sufficient? We don't require address family preference to affect non-initial targets. Alternatively, if the server has a lot of IPv6 addresses, none of which are responsive, the only way to quickly send to an IPv4 address is to send probes to all of the initial IPv6 addresses and one (almost-initial) IPv4 addresses. This is recommended in the probing heuristics, but might require a lot of probes. This parallels the advice in RFC 6555 section 5.4 "web site operators with native IPv6 connectivity SHOULD NOT offer multiple AAAA records".

6.2. Address Selection

Clients SHOULD provide a mechanism by which the address selection configuration [RFC6724] can be customized for the client independently of any other application.

Clients SHOULD implement the destination address selection mechanism specified in [RFC6724]. Note that this mechanism provides a priority order among the set of A/AAAA records for a single server host name, whereas [RFC3263] assumes that such sets of A/AAAA records are unordered.

Clients SHOULD implement rule 5.5 of section 5 of [RFC6724], preferring to use a source address with a prefix assigned by the selected next-hop. This requires that the IPv6 stack remembers which next-hops advertised which prefixes.

Clients SHOULD by default use the source address selection mechanism specified in [RFC6724], which chooses one source TSAP for any particular destination TSAP.

Clients SHOULD also be configurable to use an alternative mechanism, in which for any destination TSAP, targets are generated for each source TSAP that could possibly communicate with the destination TSAP, with the source TSAP selected by [RFC6724] prioritized over the other source TSAPs and the other source TSAPs being unordered among themselves.

The alternative policy is useful in situations where the source address selection table prioritizes an interface which does not forward SIP traffic to the destination address. (For an example, when the source address selection table routes almost all destinations to an organizational VPN which has restricted connectivity.)

6.3. Vias

A client MUST provide a via-param in the first Via header field in requests that properly routes from the server to the client, regardless of the presence of NATs in the transportation path. This is necessary even when the request is sent via a connection-oriented protocol, because the connection may be terminated before the response is sent back to the client and the server may need to reestablish a connection. The client SHOULD provide the "rport" parameter on the via-param [RFC3581].

Additionally, to assist tracing and diagnosis, a client SHOULD provide the source TSAP that it used in the via-param. TBD: Is this too strict? Is it actually useful?

6.4. DNS Caching

The information a client uses to determine a target set must be up-to-date. In particular, the construction of a target set MUST NOT utilize information from DNS whose TTL has expired. However, once a target set has been constructed, it may be retained until the message has either been sent or failed, without regard to the TTL of the information used to construct it.

TBD: Should we allow a client to cache target set computations somewhat longer than the TTL to minimize disruption and DNS traffic? Phone calls typically take 3 minutes, so we could allow 5 minutes grace and thus ensure that target sets rarely have to be recomputed during a call.

6.5. Unused Flows

A flow that is created as a probe but not subsequently used (either to send the message or to maintain a SIP Outbound flow) SHOULD be terminated unless there is reason to believe that it will be used in the near future. This includes flows that are connection-oriented protocols as well as non-connection-oriented flows with security associations. Minimizing the number of unused connections reduces the load on the server and on stateful middleboxes. Also, if the abandoned connection is IPv4, this reduces IPv4 address sharing contention in any intermediate NATs.

6.6. Debugging and Troubleshooting

Happy Earballs is aimed at ensuring a reliable user experience regardless of connectivity problems affecting any single transport. However, this naturally means that applications employing these techniques are by default less useful for diagnosing issues with a particular address family or interface. To assist in that regard, an implementation MAY provide a mechanism to disable its Happy Earballs behavior via a user setting, and SHOULD provide data useful for debugging (e.g., a log or way to review current preferences).

7. Consequences of the New Requirements

In this section we explore some of the consequences of these requirements and describe possible approaches for designing clients that satisfy the modified requirements and provide shorter transmission latency.

A client may send the message to the targets in an order that is not permitted by the baseline specifications, but it may not omit any targets from its ordering. Thus, the client is required to send it to all the targets before it may declare failure of the send process. (TBD: Would it be better to 408 the message faster?) For example, if the client has cached information which indicates that a target is unreachable, the client may move that target to the end of the order, but if sending to all other targets is unsuccessful, the client must send to that target before declaring failure.

A client may cache measured RTT() values for targets and use this information to optimize target orderings; the cached value of RTT(T) for a target T we name CRTT(T). Because a single target may appear in the target set for multiple goals, the client should cache RTT() for targets (rather than judgments of responsiveness), and then when sending to a goal use CRTT() to determine whether a target T is responsive relative to the other targets of that particular goal.

A client may determine a target T1 to be "slow" (relative to a given goal) if its CRTT(T1) is greater than Limit(CRTT(T2)) for some other target T2 in the goal's target set. A target that is not slow is labeled "normal". Note that a target being slow is determined by the client via a combination of the information in the cache and the state of the network at the moments that the cached information was recorded. As it were, the client thinks a slow target is non-responsive, but the target may or may not actually be non-responsive at that moment, depending on whether the CRTT() values agree with current reality, which is the RTT() values.

If there is be an upper bound on the length of time that a client retains CRTT() values, then the client may assume that any slow target is non-responsive, in that it may place the target after the normal targets in the order. For this reason, the client is required to have an upper bound on the length of time that CRTT() values remain in the cache, after which they are either deleted or replaced by values based on more recent observations of RTT().

The client may act on its CRTT() values in this way because in doing so, it will not violate the traffic distribution requirement: If a target is responsive over a long period, the eventual delete/refresh of cached values from before that period ensures that the client will eventually see the target as normal. (2) If a target is not continuously responsive over a long period, the traffic distribution requirement places no restriction on whether the client sends traffic to it or not, and repositioning it after all normal targets does not affect the traffic distribution among the normal targets.

Implementations have flexibility regarding the lifetime of cache entries. The only absolute restriction is that as long as the configuration is stable, there is a maximum lifetime that CRTT() values are kept before being deleted or refreshed. The advice in [RFC6555] suggests that the upper bound on the lifetime of cache entries should on the order of 10 minutes.

The length of time that different CRTT() values are cached may differ from each other. If the RTT() observation is from a periodic event (e.g., registration or NAT binding refresh), the cache lifetime should be similar to but longer than the period of the event, as the event period is likely to be configured based on knowledge of the interval over which the network is likely to remain stable.

When the state of a source address is changed, or the state of the interface it is assigned to changes, or when the network it is connected to is re-initialized, cached RTT() values for targets with that source address should be deleted. Interfaces can determine network re-initialization by a variety of mechanisms (e.g., [RFC4436] and [RFC6059]).

When a client processes a message, the ordering of targets that it sends to must be an ordering permitted by the baseline specifications, with the exception that slow targets may be moved after all normal targets.

Since the client's goal is to deliver the message as quickly as possible, a client should always move slow targets to the end of the order, after the normal targets. Note that if a probe transmission during message processing discovers a target to be slow, the target can be moved at that time to after all normal targets.

A client obtains RTT(T) for a target T whenever it sends to the target. But it can also obtain RTT(T) by a probe, which is any transmission to T which requires a response but does not involve sending the message (and hence does not commit the client to possibly waiting for a timeout before sending to another target). A client may send probes to several targets simultaneously.

Probe operations include:

establishing a transport connection to the target (without sending the message as data on the connection)
sending a keep-alive, as specified by the transport protocol
sending a CR-LF-CR-LF keep-alive on a SIP Outbound flow [RFC5626]
sending a STUN keep-alive message on a SIP Outbound flow [RFC5626]
sending an OPTIONS request with "Max-Forwards: 0".

(Note that a probe using an OPTIONS request can be used with any protocol. If the OPTIONS reaches the target, the target is required to respond with either a 200 or 483 response [RFC3261], without forwarding it to another entity, unless the target demands authentication, which would be indicated by an immediate 401 or 407 response. Conveniently, a server can respond to such a request statelessly, so such requests are low-overhead. (Although the SIP Outbound keep-alive methods have even lower overhead.))

Similarly, if a client has a connection to a target T, and the connection has been idle for long enough, the client will not have a CRTT(T) for T, reflecting the fact that the connection may have failed without the client's knowledge. The client can refresh CRTT(T) by performing a suitable probe operation within the connection.

The client should use a flow or connection that is established to a target instead of establishing a new flow or connection to that target for sending either a probe or a message.

If the client initiates a probe of a target T, it may be able to decide that T is slow without waiting to determine the actual value of RTT(T), which may take as long as the timeout period: the true value of RTT(T) is always at least as large as the elapsed time since the probe was sent, and the determination of slowness depends on whether RTT(T) exceeds Limit(CRTT(Tfastest)) (where Tfastest is the target with smallest CRTT() value of any target in the target set).

If the client establishes a connection to a target without simultaneously sending the message, the connection establishment is a probe of the target, and after initiating the connection but before sending the message, if that probe reveals that the target is slow, the client may move that target to the end of the ordering, and utilize another target.

In order to minimize the chance that the client must wait for a timeout before sending to another target, the client may send probes to targets, and the RTT() values revealed by those probes can change what target the client will send to next. Because of this, the client's procedures do not simply convert the tree of targets into an ordering of the targets, which the client then follows -- Information discovered during the sequence of sends can affect the order targets are sent to.

There is little value sending a probe to a target unless all targets of higher priority (1) have been sent the message (and failed), (2) have been sent a probe, or (3) have a CRTT() value.

A target T0 is "quick" if its CRTT(T0) is less than Limit(0), that is, if T0 will be normal regardless of the CRTT() values of any other target. If an allowed next target has a cached RTT() value, and it is "quick", then it is never slow for any goal.

If the client has reason to believe that it will soon be asked to send a message to a goal with a target T, and if CRTT(T) does not exist or is likely to expire before then, the client may decide to preemptively refresh the cached value by probing T.

Similarly, a client may be in a situation where it has advance notice that it is likely to need to send a message to a particular target, for instance, if the user of a UA begins dialing an outgoing call which will be routed through a particular outgoing proxy. In such a situation, the client should consider preemptively probing the target.

Note that the use of probes increases the non-message traffic to the targets, and thus has a cost. A client minimizes the expected transmission time by initially probing all of the targets, but that strategy maximizes the additional traffic. A client should weigh the tradeoff between improved user experience and increased traffic. In particular, the client should be aware of which messages require rapid service for good user experience (e.g., INVITE and BYE) and which do not (e.g., REGISTER and re-SUBSCRIBE).

A client should avoid sending to a target which does not have a CRTT() value (unless it is the last remaining target), because the target might be non-responsive forcing the client to wait for a timeout. Instead, the client should probe the target before sending the message to it.

8. Generic Algorithm

Based on these observations, we can construct an event-driven algorithm that satisfies the revised requirements and avoids sending the message to slow targets. This algorithm has a number of places where the implementor has latitude.

The primary state of the algorithm is a graph whose nodes are the targets which have not yet been sent to and failed.

Each target is annotated with its CRTT() value, or the absence thereof.
Each target has an indicator of whether a probe has been sent to it, and if so, when the probe was sent.
The graph is initialized with the graph of targets as constructed from the target construction tree.
The initial graph is assumed for convenience to have a unique rightmost (final) target named Fin, which may be a dummy. Targets determined to be slow will be moved to after target Fin.

The algorithm uses some data derived from the graph.

S, which is Limit(CRTT(Tfastest)), where Tfastest is the target (that is or was in the graph) with the smallest CRTT(). Thus, any target T with CRTT(T) > S is slow.
The set of initial targets, targets which (at this time) have no targets prioritized before them, and thus could be chosen as the next target to send to.

The algorithm can be most easily described in an event-driven manner, where particular situations trigger particular actions:

When a target T is slow, that is, CRTT(T) > S and it is before Fin, it is removed from its position in the graph and moved to a new position, after Fin. (Note that additional arcs may need to be inserted where T used to be; if A is before T and T is before B, A remains before B.) (Note that a target can be discovered to be slow because a probe has been outstanding longer than S without responding, even if S is less than the timeout period.)
When there is no outstanding send, the client may send to any initial target whose CRTT() is known.
When there is no outstanding send and only one target in the graph, the client may send to it.
When there is an initial dummy target, it may be removed from the graph.
When a target T has no CRTT() and all targets that precede T either have a CRTT() or have a probe outstanding, the client may send a probe to T.
When a probe or a send succeeds, fails, or times out, the client records CRTT() for that target.
When the startup of a protocol (e.g., TCP SYN) probes the target without sending the message, the client should perform the startup as a probe, and send the message as a separate action.
When a send to a target succeeds, sending to the goal has succeeded (and the algorithm terminates).
When a send to a target fails, the target is removed from the graph.
When the graph is empty (has no nodes), sending to the goal has failed (and the algorithm terminates).

8.1. Policies and Implementor Latitude

The generic algorithm allows the implementor a considerable degree of latitude in making the allowed choices.

If multiple initial targets have a CRTT(), the client can choose which to send to, e.g., the one with the smallest CRTT().
The client chooses which targets to probe. The client should balance the value of information to be obtained by probing targets with the cost of doing so. The naive aggressive strategy is to probe all initial targets that do not have a CRTT().
When a probe is sent and a response is not received for a time period, the client may consider it "lost" and send another probe. This time period will likely be longer than the typical network RTT but significantly less than the send timeout period.
The client may choose to probe targets to exercise the maximum diversity of network paths, covering the range of interfaces and address families in the target set. E.g., a non-initial target that provides diversity may be chosen to be probed (which may require first probing all of its preceding targets).
If all initial targets with CRTT() have "large" values, the client may choose not to send to any of them, and instead send probes to other targets which might respond faster. Note that in this situation, probing non-initial targets might be useful; if such a probe responds quickly, it might cause the initial targets to become slow and to be moved to the end, causing the probed target to become initial (and thus available for sending).
The client may add prioritizations between targets which are not required to have them due to the baseline specifications (as long as they are consistent with the prior prioritizations, i.e., do not specify both that A is before B and that B is before A). Doing this judiciously may allow the algorithm to use a simpler data structure for recording prioritization Section 9

8.2. Examples

This section shows examples of the generic algorithm (with naive choices for the various "policies") in situations involving various combinations of targets with particular properties.

8.2.1. Two Unordered Targets, Both Cached

Suppose there are two unordered targets, both of which have cached RTT() values less than S.

      A (cached)
     /          \
    *            Fin
     \          /
      B (cached)

First, the initial dummy node is removed.

      A (cached)
                \
                 Fin
                /
      B (cached)

The client chooses the target with the smallest CRTT(), which we assume is A, and sends to it.

If the send succeeds, the algorithm terminates successfully.

In the unlikely case that sending to A fails, it is deleted from the graph. (Sending to A will likely succeed, because there is a current cached record of successful communication.)

                 Fin
                /
      B (cached)

The client chooses B to send to.

If the send succeeds, the algorithm terminates successfully.

If sending to B fails, it is deleted from the graph.

Fin

The initial dummy node Fin is deleted from the graph. The graph becomes empty, and the algorithm terminates unsuccessfully.

8.2.2. Two Unordered Targets, Both Cached, One Slow

Suppose there are two unordered targets, both of which have cached RTT() values, but the value for target A is greater than S.

      A (cached)
     /          \
    *            Fin
     \          /
      B (cached)

First, the initial dummy node is removed.

      A (cached)
                \
                 Fin
                /
      B (cached)

Target A is examined and found to be slow, so it is moved after Fin.

                 Fin----A (slow)
                /
      B (cached)

The client chooses B to send to.

If the send succeeds, the algorithm terminates successfully.

If sending to B fails, it is deleted from the graph.

                 Fin----A (slow)

The initial dummy node is deleted from the graph.

                        A (slow)

The client sends to A.

8.2.3. Two Unordered Targets, One Cached

Suppose there are two unordered targets, only one of which has a CRTT():

      A (cached)
     /          \
    *            Fin
     \          /
      ----B-----

The initial dummy node is deleted from the graph.

      A (cached)
                \
                 Fin
                /
          B-----

Only target A is initial and has a CRTT(), so the client sends to it.

Since B is initial and has no CRTT(), the client sends a probe to it.

If the send to A succeeds, the algorithm terminates successfully.

If the send to A fails, it is deleted from the graph.

                 Fin
                /
          B-----

Since B is the only remaining (non-dummy) target, the client sends to it.

8.2.4. Two Unordered Targets, Neither Cached

Suppose there are two unordered targets, neither of which have CRTT() values:

      A
     / \
    *   Fin
     \ /
      B

The initial dummy node is deleted from the graph.

The client cannot send to either target. The client sends probes to both targets. When one of the probes returns, let us say it is the probe to A, the graph indicates that one target is cached:

      A (cached)
                \
                 Fin
                /
          B-----

Target A now is initial and has a CRTT(), so the client sends to it.

In the unlikely case that sending to A fails, A is deleted from the graph, and the client immediately sends to B because B is the only remaining target.

This situation parallels the standard "Happy Eyeballs" situation in HTTP, where the client has two (or more) unordered addresses for the server, one IPv4 and one IPv6. The client requests connections with both addresses simultaneously, and the first connection that succeeds is used to send the HTTP request [RFC6555].

8.2.5. Three Unordered Targets, One Cached

Suppose there are three unordered targets, only one of which has a CRTT():

      A (cached)
     /          \
    *-----B------Fin
     \          /
      ----C-----

The initial dummy node is deleted from the graph.

      A (cached)
                \
          B------Fin
                /
          C-----

Only target A is initial and has a CRTT(), so the client sends to it.

Since B is initial and has no CRTT(), the client sends a probe to it. Similarly for C.

If the send to A succeeds, the algorithm terminates successfully.

If the send to A fails, it is deleted from the graph.

          B------Fin
                /
          C-----

The client waits until one of the probes (to B and C) either succeeds or fails. Suppose the probe to B succeeds.

          B (cached)------Fin
                         /
          C--------------

Target B is now initial and has a CRTT(), so the client sends to it.

8.2.6. Two Prioritized Targets, Both Cached

Suppose there are two prioritized targets, both of which have CRTT():

    *----A (cached)----B (cached)----Fin

If neither A nor B is slow, processing proceeds according to the baseline specification: the client must send to A first, and then if that fails, send to B. In detail:

The initial dummy node is deleted from the graph.

         A (cached)----B (cached)----Fin

The client sends to A.

If sending to A succeeds, the algorithm terminates successfully. If sending to A fails, A is removed from the graph.

                       B (cached)----Fin

Target B is now initial, and the client sends to B.

If sending to B succeeds, the algorithm terminates successfully. If sending to B fails, B is removed from the graph, then Fin is removed from the graph, and the algorithm terminates unsuccessfully.

However, as processing begins, it is possible that A is slow because CRTT(A) > S = Limit(CRTT(B)). In that case A is moved after Fin.

         B (cached)----Fin----A (slow)

Target B is now initial, and the client sends to B.

If sending to B succeeds, the algorithm terminates successfully. If sending to B fails, B is removed from the graph, then Fin is removed from the graph. Then A is initial, and the client sends to A.

8.2.7. Two Prioritized Targets, the Second Cached

Suppose there are two prioritized targets, but only the second has a CRTT():

    *----A----B (cached)----Fin

The initial dummy node is deleted from the graph.

         A----B (cached)----Fin

The client sends a probe to A. If the probe response (the new CRTT(A)) is less than S = Limit(CRTT(B), the the graph becomes:

         A (cached)----B (cached)----Fin

The client then sends to A.

But if the probe remains outstanding longer than S, A becomes slow and is moved to after Fin.

         B (cached)----Fin----A (slow)

At that point, the client sends to B.

8.2.8. Two Prioritized Targets, the First Cached

Suppose there are two prioritized targets, but only the first has a CRTT():

    *----A (cached)----B ----Fin

The initial dummy node is deleted from the graph.

         A (cached)----B----Fin

The client can send to A immediately. If CRTT(A) <= Limit(0), the client must send to A immediately since A is quick, and cannot be revealed to be slow by probing other targets.

But if RTT(A) is large enough that there is a reasonable chance that RTT(B) is smaller than Limit(RTT(A)) (which would make A slow), the client may choose to send a probe to B first. If the probe response returns quickly enough, CRTT(A) > Limit(CRTT(B)), which makes A slow. Then, A would be moved to after Fin, and the client would send to B.

At any point before the probe response arrives, the client can decide to send to A. But if the probe does not return within Limit(CRTT(B)), the client should immediately send to A, as a will not be shown to be slow.

8.2.9. Two Prioritized Targets, Neither Cached

Suppose there are two prioritized targets, neither of which have CRTT():

    *----A----B----Fin

The initial dummy node is deleted from the graph.

         A----B----Fin

The client must now send a probe to A before it can send the message to A. But if A and B use different network paths (as they likely will), it is probably advantageous to send probes to both A and B, in case B is sufficiently more responsive than A to make A slow and override the prioritization. If the client chooses to send both probes:

If the response to the probe of A returns first, the client immediately sends to A.

If the response to the probe of B returns, and the response to the probe of A returns before Limit(CRTT(B)), the client immediately sends to A.

If the response to the probe of B returns, and the response to the probe of A does not returns before Limit(CRTT(B)), the client moves A to after Fin, and then immediately sends to B.

8.2.10. Three Targets

A more complex case is when the client must choose between three source addresses for a destination address. One, the address selected by the usual rules rules, is prioritized and the other two are unordered. Consider the example where the destination address is 192.0.2.1 and the client has interfaces with addresses 192.0.2.65, 192.0.2.129, and 192.0.2.197, with the first of those being the interface on the same subnet as the default outbound gateway. Then the graph is:

                               B
                          192.0.2.129/
                          192.0.2.1
              A          /            \
    *----192.0.2.65/----*              Fin
         192.0.2.1       \     C      /
                          192.0.2.197/
                          192.0.2.1

Assuming that there are no CRTT() values, the client starts by probing A. Unless it is being very conservative with probes, the client also sends probes via B and C.

As the probe responses arrive, CRTT() values are recorded, which can cause targets to be declared slow and relocated after Fin. Eventually, some target will be both initial and have a CRTT(), and the client sends to it.

If the probe response from A arrives before Limit(RTT(B)) and Limit(RTT(C)), the graph is updated to:

                         B
                        / \
         A (cached)----*   Fin
                        \ /
                         C

At that point, the client sends to A.

On the other hand, if a probe response arrives from B but one does not arrive from A before Limit(CRTT(B)), then A becomes slow, and it is moved to after Fin. Then the dummy target after A and before B and C is deleted.

      B (cached)
                \
                 Fin----A (slow)
                /
      C---------

At this point, the client sends to B.

9. Using a Simplified Data Structure

Instead of maintaining a directed acyclic graph to control the client's operation, the client can replace the graph with a sequence of sets of targets based on their "rank". The rank of a target is defined as:

The rank of an initial target is 0.
The rank of a non-initial target is 1 more than the highest rank of any target that it is prioritized after.

In this scheme, a target is prioritized after all targets with lower ranks. As processing progresses, all targets in the lowest non-empty rank are initial, and all targets in higher ranks are non-initial.

This is compatible with the generic algorithm because it only adds to the prioritization relationships, it does not delete any of them. Thus, the behavior it causes is compatible with the Happy Earballs requirements.

For example, consider a server with two addresses, IPv6 and IPv4, with IPv6 prioritized via SRV records. Both addresses accept both TCP and UDP traffic:

    _sip._udp.example.com.    SRV    1 1 5060 sip1.example.com
    _sip._udp.example.com.    SRV    2 1 5060 sip2.example.com
    _sip._tcp.example.com.    SRV    1 1 5060 sip1.example.com
    _sip._tcp.example.com.    SRV    2 1 5060 sip2.example.com
    sip1.example.com.         AAAA   2001:DB8::1
    sip2.example.com.         A      192.0.2.1

Suppose a request can be transported using either TCP or UDP, and the situation does not specify a priority between them. The tree of targets is then:

                                    |
                   -------------unordered-----------
                   |                               |
           -----priority-----              -----priority-----
           |                |              |                |
    TCP 2001:DB8::1  TCP 192.0.2.1  UDP 2001:DB8::1  UDP 192.0.2.1

The graph, annotating each target with its rank, is:

      (0) TCP 2001:DB8::1----(1) TCP 192.0.2.1
     /                                        \
    *                                          *
     \                                        /
      (0) UDP 2001:DB8::1----(1) UDP 192.0.2.1

Which can be turned into a list of lists as:

    rank 0: TCP 2001:DB8::1, UDP 2001:DB8::1

    rank 1: TCP 192.0.2.1, UDP 192.0.2.1

The rank representation is functionally equivalent to the following graph, which is the original graph with additional lines, showing that the rank representation constrains the client's behavior more than the original graph does:

      (0) TCP 2001:DB8::1   (1) TCP 192.0.2.1
     /                   \ /                 \
    *                     *                   *
     \                   / \                 /
      (0) UDP 2001:DB8::1   (1) UDP 192.0.2.1

The rank lists can be built (without first constructing the graph) by walking the target tree from top to bottom and left to right (highest priority to lowest priority), with each node passing downward MRdown, the minimum rank that any descendant target is allowed, and each node passing upward MRup, the minimum rank allowed for any target prioritized after that node.

The root node's MRdown is 0.
For an unordered node:
- Each child's MRdown is the node's MRdown.
- The node's MRup is the maximum of the node's MRdown and all of its children's MRup's.
For a priority node (with the children ordered by priority):
- The first child's MRdown is the node's MRdown.
- A later child's MRdown is the preceding child's MRup.
- The node's MRup is the final child's MRup, or if there are no children, the node's MRdown.
For a load-balancing node, the children are first prioritized randomly ([RFC2782] and Appendix A), then processed as for a prioritized node.
For a target node:
- The target has rank MRdown.
- The node's MRup is MRdown + 1.

Here is the preceding example's tree, with each node annotated with its MRdown on the left of the node label and its MRup on the right of the node label:

                                    |
                   ---------(0) unordered (2)-------
                   |                               |
           -(0) priority (2)-              -(0) priority (2)-
           |                |              |                |
(0) TCP 2001:DB8::1 (1)     |   (0) UDP 2001:DB8::1 (1)     |
                            |                               |
                 (1) TCP 192.0.2.1 (2)           (1) UDP 192.0.2.1 (2)

Note that the target tree does not have to be explicitly constructed; it can be implicitly walked by a series of recursive function calls, with the functions passing MRdown and MRup values between themselves, and each target being inserted into the rank list-of-lists as it is generated.

The address family preference rule Section 6.1 can be implemented within the rank representation by first constructing the ranks based on the baseline specifications, and then splitting rank 0 into two ordered sub-ranks, 0.0 and 0.1, with 0.0 containing all rank 0 targets of the preferred address family and rank 0.1 containing all other rank 0 targets.

An example of address family preference processing is the ordinary case of two prioritized servers each with an IPv6 and IPv4 address:

    _sip._udp.example.com.    SRV    1 1 5060 sip1.example.com
    _sip._udp.example.com.    SRV    2 1 5060 sip2.example.com
    sip1.example.com.         AAAA   2001:DB8::1
    sip1.example.com.         A      192.0.2.1
    sip2.example.com.         AAAA   2001:DB8::2
    sip2.example.com.         A      192.0.2.2

                                       |
                      ----------(0) priority (2)--------
                      |                                |
             -(0) unordered (1)-              -(1) unordered (2)-
             |                 |              |                 |
    (0) 2001:DB8::1 (1)        |   (1) 2001:DB8::2 (2)          |
                               |                                |
                      (0) 192.0.2.1 (1)                (1) 192.0.2.2 (2)

After splitting rank 0 based on the address family preference, the the ranks are:

    rank 0.0: 2001:DB8::1

    rank 0.1: 192.0.2.1

    rank 1:   2001:DB8::2, 192.0.2.2

10. Security Considerations

This document changes the order in which a client will send to targets but does not change the set of targets that it will send to. There are no known SIP systems whose security depends on the order in which a client sends to targets. Given that network connectivity is unreliable, it is unlikely that the security of any SIP system depends on the reliable ordering of targets.

The specific security vulnerabilities, attacks and threat models of the various protocols mentioned in this document (SIP, DNS, SRV records, etc.) are well-documented in their respective specifications, and their effect on the security of SIP systems is unchanged.

11. IANA Considerations

This document does not require any actions by IANA.

12. History

Note to RFC Editor: Upon publication, please remove this section.

12.1. Changes from draft-worley-sip-happy-earballs-01 to draft-worley-sipcore-happy-earballs-00

Change name to draft-worley-sipcore-happy-earballs.

Minor fixes to the references.

12.2. Changes from draft-worley-sip-happy-earballs-00 to draft-worley-sip-happy-earballs-01

Overhaul of the exposition.

12.3. Changes from draft-worley-sip-he-connection-01 to draft-worley-sip-happy-earballs-00

Complete overhaul.

Changed "EarBalls" to "Earballs".

12.4. Changes from draft-worley-sip-he-connection-00 to draft-worley-sip-he-connection-01

Minor changes.

Add note that WebSocket is out of scope, because there is only one possible transport in WebSocket.

12.5. Changes from draft-johansson-sip-he-connection-01 to draft-worley-sip-he-connection-00

This version has a different name for technical reasons. It is, in reality, the successor to draft-johansson-sip-he-connection-01.

Move Acknowledgments after References, as that is the style the Editor prefers.

Updated Security Considerations: This increment of the H.E. work does not make normative changes in existing SIP.

Copy a lot of text from RFC 6555, as this I-D is parallel to RFC 6555.

Changed "hostname" to "host name", as the latter form is more common in RFCs by a moderate margin.

Revised some of the introduction text to parallel the introduction of RFC 7984.

Changed name of algorithm to "Happy EarBalls", added reference to Urban Dictionary.

Many expansions of the discussion and revisions of the wording.

13. References

13.1. Normative References

[RFC2119]	Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC2782]	Gulbrandsen, A., Vixie, P. and L. Esibov, "A DNS RR for specifying the location of services (DNS SRV)", RFC 2782, DOI 10.17487/RFC2782, February 2000.
[RFC3261]	Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M. and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, DOI 10.17487/RFC3261, June 2002.
[RFC3263]	Rosenberg, J. and H. Schulzrinne, "Session Initiation Protocol (SIP): Locating SIP Servers", RFC 3263, DOI 10.17487/RFC3263, June 2002.
[RFC3581]	Rosenberg, J. and H. Schulzrinne, "An Extension to the Session Initiation Protocol (SIP) for Symmetric Response Routing", RFC 3581, DOI 10.17487/RFC3581, August 2003.
[RFC6555]	Wing, D. and A. Yourtchenko, "Happy Eyeballs: Success with Dual-Stack Hosts", RFC 6555, DOI 10.17487/RFC6555, April 2012.
[RFC6724]	Thaler, D., Draves, R., Matsumoto, A. and T. Chown, "Default Address Selection for Internet Protocol Version 6 (IPv6)", RFC 6724, DOI 10.17487/RFC6724, September 2012.
[RFC7984]	Johansson, O., Salgueiro, G., Gurbani, V. and D. Worley, "Locating Session Initiation Protocol (SIP) Servers in a Dual-Stack IP Network", RFC 7984, DOI 10.17487/RFC7984, September 2016.

13.2. Informative References

[I-D.johansson-sip-he-connection]	Johansson, O., Salgueiro, G. and D. Worley, "Setting up a SIP (Session Initiation Protocol) connection in a dual stack network using connection oriented transports", Internet-Draft draft-johansson-sip-he-connection-01, October 2016.
[RFC3319]	Schulzrinne, H. and B. Volz, "Dynamic Host Configuration Protocol (DHCPv6) Options for Session Initiation Protocol (SIP) Servers", RFC 3319, DOI 10.17487/RFC3319, July 2003.
[RFC3361]	Schulzrinne, H., "Dynamic Host Configuration Protocol (DHCP-for-IPv4) Option for Session Initiation Protocol (SIP) Servers", RFC 3361, DOI 10.17487/RFC3361, August 2002.
[RFC4213]	Nordmark, E. and R. Gilligan, "Basic Transition Mechanisms for IPv6 Hosts and Routers", RFC 4213, DOI 10.17487/RFC4213, October 2005.
[RFC4436]	Aboba, B., Carlson, J. and S. Cheshire, "Detecting Network Attachment in IPv4 (DNAv4)", RFC 4436, DOI 10.17487/RFC4436, March 2006.
[RFC5245]	Rosenberg, J., "Interactive Connectivity Establishment (ICE): A Protocol for Network Address Translator (NAT) Traversal for Offer/Answer Protocols", RFC 5245, DOI 10.17487/RFC5245, April 2010.
[RFC5626]	Jennings, C., Mahy, R. and F. Audet, "Managing Client-Initiated Connections in the Session Initiation Protocol (SIP)", RFC 5626, DOI 10.17487/RFC5626, October 2009.
[RFC6059]	Krishnan, S. and G. Daley, "Simple Procedures for Detecting Network Attachment in IPv6", RFC 6059, DOI 10.17487/RFC6059, November 2010.
[RFC6947]	Boucadair, M., Kaplan, H., Gilman, R. and S. Veikkolainen, "The Session Description Protocol (SDP) Alternate Connectivity (ALTC) Attribute", RFC 6947, DOI 10.17487/RFC6947, May 2013.
[UD]	"The Jews Who Stole Christmas", , "Urban Dictionary, entry 'Earballs'", December 2011.

Appendix A. Implementing Load Balancing

Load-balancing is specified by the "weight" field of DNS SRV records. The defining algorithm is specified in [RFC2782]. The same result can be obtained with a simpler algorithm: For each server, calculate a "score": If its weight is 0, its score is "infinity" (in practice, 100 suffices). If its weight is non-zero, its score is calculated by choosing a random number between 0 and 1, taking the negative of the logarithm of that number, and dividing the result by the weight. (Thus, the score is always a positive number.) (The resulting score has an exponential distribution whose parameter is the weight.) Then, sort the servers into order of increasing scores, so that the servers with the smallest scores are used first.

This alternative algorithm is analyzed and sample implementations are provided in the files in the directory sipXrouter/sipXtackLib/doc/developer/scores in the GitHub Sipfoundry project (https://github.com/sipfoundry/sipXrouter), among other repositories.

Acknowledgments

The authors would like to acknowledge the support and contribution of the SIP Forum IPv6 Working Group. This document is based on a lot of tests and discussions at SIPit events, organized by the SIP Forum.

The foundation of this document is the work done by Olle Johansson and Gonzalo Salgueiro in earlier documents, including [I-D.johansson-sip-he-connection]. In turn, the foundation of that work is [RFC6555], whose authors are Dan Wing and Andrew Yourtchenko.

Scott O. Bradner suggested that the formula for determining responsiveness should contain a constant term.

Roman Shpount described the need for configuration to override the source address selection mechanism.

Tolga Asveren suggested requiring "rport".

Author's Address

Dale R. Worley Ariadne Internet Services 738 Main St. Waltham, MA 02451 US EMail: worley@ariadne.com