Network Working Group                                       J. Rosenberg
Internet-Draft                                                     Five9
Expires: August 25, 2021                                     C. Jennings
                                                                   Cisco
                                                              T. Asveren
                                                   Ribbon Communications
                                                       February 21, 2021

SIP Extensions for High Availability and Load Balancing for Public Cloud
                  draft-rosenberg-dispatch-cloudsip-00

Abstract

   Software making use of the Session Initiation Protocol (SIP) faces challenges in achieving high availability, especially for call-stateful applications like softswitches, Session Border Controllers (SBCs), and IP-based call center applications. The state maintained in the SIP, SDP and SRTP layers changes frequently, and is difficult to replicate. For this reason, commercial systems have often relied on complex active-standby configurations making use of IP address takeover. These solutions are also ill-suited for usage in modern public cloud environments.

   This document defines a SIP extension facilitating HA, including keeping calls active, which is optimized for server-to-server communication where one or both sides are in public cloud.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 25, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the document authors. All rights reserved.

Rosenberg, et al.
Expires August 25, 2021 [Page 1]
Internet-Draft Cloud SIP February 2021

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Applicability
   3.  Requirements
   4.  Relationship to RIPT
   5.  Reference Architecture
   6.  Solution Applicability
   7.  Overview of Solution
   8.  Configuration
   9.  SIP Behavioral Requirements
     9.1.  Calling Server
       9.1.1.  Health Probing
       9.1.2.  Utilization Measurement
       9.1.3.  New Call Initiation
       9.1.4.  Instance Failure
       9.1.5.  Instance to Inactive
       9.1.6.  Receiving a REFER
     9.2.  Cluster Instances
       9.2.1.  Sending Utilization Values
       9.2.2.  Receiving INVITE w. Replaces
       9.2.3.  Graceful Shutdown with Migration
       9.2.4.  Graceful Shutdown without Migration
     9.3.
     Moving a Dialog
   10. Cloud SIP Trunk Configuration File
   11. Webhook Registration Object
   12. Instance-Utilization Header Field
   13. Why not DNS
   14. TODO
   15. Informative References
   Authors' Addresses

1.  Introduction

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119, BCP 14 [RFC2119] and indicate requirement levels for compliant implementations of this specification.

   Software making use of the Session Initiation Protocol (SIP) [RFC3261] faces challenges in achieving high availability, especially for call-stateful applications like softswitches, Session Border Controllers (SBCs), and IP-based call center applications. The state maintained in the SIP, Session Description Protocol (SDP) Offer/Answer [RFC3264] and Secure Real Time Transport Protocol (SRTP) [RFC3711] layers changes frequently, and is difficult to replicate. For this reason, commercial systems have often relied on complex active-standby configurations making use of IP address takeover. These solutions are also ill-suited for usage in modern public cloud environments.

   SIP assumed server-side components would not maintain call state, and thus it never had built-in mechanisms to facilitate server-side HA. In practice, the vast majority of server deployments are B2BUAs and maintain call state.

   Besides the challenges in replicating call state, SIP also struggles in achieving HA in modern cloud deployments making use of elastic compute.
   In these environments, the underlying cloud platform (such as Kubernetes) can automatically add and remove instances to a cluster based on usage. Similarly, it will remove elements from the cluster which fail health checks. This information needs to propagate quickly to upstream elements, in order to avoid sending calls to failed or overloaded instances. SIP envisioned that a DNS-based solution using SRV records [RFC3263] would be sufficient. However, DNS changes are slow to propagate and unpredictable. Commercial implementations have made use of SIP OPTIONS probing to assess liveness, without standardized behavior. There is also no standardized way to communicate or update the IP addresses used in a cluster of servers.

   This specification seeks to remedy these gaps. It defines a simple SIP extension, which is largely a definition of mandatory behaviors for SIP elements, that enables rapid detection of and recovery from a failed instance while ensuring that calls do not drop. It also defines a small protocol for retrieving and pushing the set of instances in a cluster to support elastic expansion and contraction of a cluster in a fully automated fashion.

2.  Applicability

   This extension is focused on server-to-server use cases, where one or both sides are a cluster of servers deployed in a public cloud environment. Examples of these situations include SIP trunks between a PSTN carrier and an enterprise, a PSTN carrier and a VOIP provider (such as a cloud contact center), or between VOIP providers providing peering.

   The extension also assumes usage in bilateral peering arrangements, and as such, provides no mechanism for discovery. Rather, it assumes both sides have agreed to use this extension as part of configuration provided through techniques outside the scope of this specification.

3.  Requirements

   o  The solution must enable a call to be recovered in less than 2 seconds.
      This time represents the amount of time before which a user would hang up because they cannot hear the other party.

   o  A recovered call means that media continues to flow, and that future signaling for features or call hangup can be performed.

   o  The HA technique must not require servers in the cluster to replicate any SIP/SDP/RTP state beyond the dialog identifiers for calls.

   o  The solution should minimize the changes required to the SIP and RTP protocols and their respective implementations.

   o  The solution must support the case where the telco is using traditional SBCs and is not deploying Kubernetes or using public cloud.

   o  The solution must enable fully automated elastic expansion and contraction of clusters.

   o  The solution must support availability, so that when an instance in a cluster fails, new calls are distributed across the remaining N instances.

   o  The solution must support availability, so that when an instance of a cluster fails, all of the active calls that were being handled by that instance are spread across the remaining nodes in the cluster, within 2 seconds.

   o  The solution must support clusters wherein each instance of a cluster has a differing amount of capacity for call handling.

   o  The solution must support the ability for instances of a cluster to gracefully shut down without dropping calls.

4.  Relationship to RIPT

   This protocol is similar in goals to RIPT - enabling SIP servers to run in public cloud environments, and achieve HA through techniques employed by web applications. RIPT attempted to solve this problem by utilizing HTTP/3 and fully redefining SIP, repairing many of its problems in the process. This specification is less ambitious, focusing on the minimum changes to SIP required to facilitate HA. As such, this specification does not diminish the value of a full-fledged replacement for SIP.

5.
Reference Architecture

   Cloud SIP uses an asymmetric relationship between peers. One side acts as the caller, and the other as the call recipient. New SIP calls can only be placed by the caller, not by the call recipient. If a deployment requires calls to flow in both directions, each side acts as both caller and call recipient.

   Caller                    Call Recipient

                           SIP Server Cluster
                         ......................
                         .    +------------+  .
                  +------.----| Instance 1 |  .
                  |      .    +------------+  .
                  |      .                    .         +------------+
   +---------+    |      .    +------------+  .         | Downstream |
   | Calling |----+------.----| Instance 2 |--.---------| SIP        |
   | Server  |    |      .    +------------+  .         | UA         |
   +---------+    |      .                    .         +------------+
                  |      .    +------------+  .
                  +------.----| Instance 3 |  .
                         .    +------------+  .
                         ......................
        |                                    |
        |                                    |
   +---------+               +---------+  +---------+
   | Config  |<------------->| Config  |  | Shared  |
   | Sync    |               | Source  |  | DB      |
   +---------+               +---------+  +---------+

   The calling server wishes to send calls to a cluster, which has a set of instances. The calling server is a B2BUA, and is capable of initiating calls, typically in response to an upstream INVITE it receives. The calling server itself may be a member of a cluster.

   When the calling server generates INVITEs for new calls, it load balances them amongst the instances in the cluster. Consider a specific call that was sent to instance 2, and was then forwarded through zero or more SIP proxies (not shown) before landing at a UA, referred to here as the downstream SIP UA. The downstream SIP UA may itself be another B2BUA that is a member of a cluster, or even be an end user client. This specification requires the downstream UA to implement the SIP Replaces header field [RFC3891].

   When instance 2 fails, we wish to have the call taken over by one of the other instances in the cluster, which can then re-establish media with the downstream SIP UA using INVITE/Replaces.
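   As an illustration of the INVITE/Replaces mechanism mentioned above, the following minimal Python sketch formats a Replaces header field from the three parts of a dialog identifier, using the Call-ID, to-tag and from-tag syntax of [RFC3891]. The DialogId type, its field names, and the sample values in the usage note are hypothetical and are not defined by this specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DialogId:
    """Hypothetical container for a SIP dialog identifier.

    Tags are named from the point of view of the UA that will receive
    the INVITE: RFC 3891 compares to-tag with that UA's local tag and
    from-tag with its remote tag.
    """
    call_id: str
    to_tag: str
    from_tag: str


def replaces_header(dialog: DialogId) -> str:
    """Format the Replaces header field naming the dialog to take over."""
    return (
        f"Replaces: {dialog.call_id}"
        f";to-tag={dialog.to_tag}"
        f";from-tag={dialog.from_tag}"
    )
```

   For example, a dialog with Call-ID 98732@sip.example.com, to-tag ff87ff and from-tag r33th4x0r would be rendered as "Replaces: 98732@sip.example.com;to-tag=ff87ff;from-tag=r33th4x0r".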
   There is a logical function associated with the cluster, called the config source, which is aware of the configuration of the cluster. Specifically, it knows the IP/port of each instance, and whether that instance is healthy. This config source learns this information through non-standardized means, unique to the cloud environment in which the cluster resides. The config source communicates that information to a config sync associated with the upstream calling server. This communication is bidirectional, using HTTP requests. The config sync distributes this information to the calling server (and any other calling servers should they themselves be a cluster).

   There is a shared database of some sort, accessible by all instances in the cluster. This is used to store the dialog state needed for operation of this extension.

6.  Solution Applicability

   This specification is applicable in two scenarios:

   1.  The calling server (and other members in its cluster) and the config sync service are controlled by one entity, and the cluster and the downstream UA are controlled by a second. A common example of this is where the calling server and config sync are part of a telecom carrier, and the cluster and downstream UA are part of an enterprise or SaaS provider that has purchased SIP trunking services from the carrier.

   2.  The calling server, cluster, and downstream UA are all controlled by a single administrative entity.

   In both scenarios, the instances which make up the cluster are assumed to be provided by the same vendor. This allows for vendor-specific solutions to replicate state and messaging as required by this specification.

   This specification also requires that the calling server be a UA (including B2BUAs), that the instances in the cluster be B2BUAs, and that the downstream UA be under the administrative control of the same entity that operates the cluster.

7.
Overview of Solution

   The solution is straightforward. The calling server will maintain, through the HTTP-based protocol described below, a list of instances in the cluster. These instances are identified by both IP and port. The inclusion of a port allows the instances to share a common IP but vary by port. Such a configuration is useful inside of public cloud environments which can be fronted by a network load balancer which allows each instance to actually have the same IP, but utilize different ports.

   The calling server continuously validates that each instance in the cluster is alive, every 250ms. New calls are delivered only to instances which are healthy based on the algorithm defined here. It can ascertain health via reverse RTP traffic, rapid RTCP receiver reports, or via SIP OPTIONS. If SIP OPTIONS are used, these are performed at a rate of a new transaction every 250ms. This is very fast, but it is critical for rapid detection of failures. If the calling server is itself a member of a cluster, the work of ascertaining the health of each instance can be distributed across the calling servers, in order to avoid a full mesh of OPTIONS probing, and then the resulting state distributed through means outside of this specification.

   When a call is initially established - to instance 2 in this case - instance 2 will place an entry into the database which contains three pieces of information: (1) the dialogID of the SIP leg from the caller to itself, (2) the dialogID of the downstream SIP leg from itself to the downstream SIP UA, and (3) the IP address and port of the downstream SIP UA.

   If an instance (instance 2 in this example) transitions from healthy to unhealthy, the calling server 'moves' the existing instance 2 calls uniformly across the remaining healthy instances in the cluster. To avoid a flood of instant traffic, it moves these calls over a window of at least 200ms and at most one second. To move the calls, the calling server sends an INVITE with a
   Replaces header field for each such call. Because this is a fresh SIP dialog, a new SDP offer/answer and SRTP is established. This is what avoids the need for replication of RTP state, SDP state and other lower-layer states across the instances in the cluster.

   When the INVITE/Replaces arrives at one of the instances in the cluster (say, instance 3), the instance takes the dialogID in the Replaces header field, and looks it up in the shared DB. It will find that there is a matching dialog, and it will retrieve the outbound dialogID and downstream SIP UA. Instance 3 sends an INVITE/Replaces to the downstream UA, using the dialogID it retrieved from the database.

   Establishment of a new SIP dialog between the calling server and instance 3 can take place in parallel with the establishment of the new dialog between instance 3 and the downstream UA. Thus the time required to fail over the live call is equal to the time to detect instance failure, plus the time to establish a new SIP call.

8.  Configuration

   A cloud SIP "trunk" is configured in the config sync service through means outside of the scope of this specification. Each such trunk is defined by an HTTPS URI - the trunk config URI - which points to the config source service representing that cluster. This is the only configuration required to establish a cloud SIP trunk.

   Once configured with this URI, the config sync MUST perform a GET against this URI. The config source MUST return a JSON document conformant to the schema defined in this specification. This document MUST provide the config sync with a list of instances, each with IP address and port. The JSON document MUST also contain a cluster name, formatted as a hostname, and a webhook registration URI.

   Once retrieved, the config sync MUST perform a POST against the webhook registration URI.
   The POST MUST contain a JSON document conformant to the schema defined in this specification. That document MUST contain an HTTPS webhook URI used by the config sync to receive webhook callbacks that push an updated cluster configuration from the config source.

   The config sync MUST refresh its webhook registration at least once a day to ensure that an up-to-date value for the webhook URI exists.

   The config source MUST perform a POST against the webhook URI whenever the cluster configuration changes, including when it detects, on its own, that an instance is unhealthy, removing it from the list.

9.  SIP Behavioral Requirements

9.1.  Calling Server

9.1.1.  Health Probing

   The calling server MUST be capable of detecting the failure of any instance of the cluster within 1.5 + RTT seconds. The specific means for doing this detection can vary by implementation. It is also expected that some implementations may have failure detection computed from one instance of a calling server, and the resulting state shared with other instances of the calling server through some means outside the scope of this specification.

   One suggested technique for detecting failure is to utilize a SIP OPTIONS probe. The OPTIONS request can be sent every 250ms, directed to each instance of the cluster. To facilitate high scale and determination of RTT, a single OPTIONS request can be sent for each transaction (since retransmits are largely useless due to the short timeout defined for this use case). With such an interval, the calling server can consider an instance unhealthy at time T if, at time T, zero OPTIONS responses have been received for a time equal to the RTT to the instance plus 6 * 250ms = 1.5s. The calling server can maintain the RTT in any fashion it desires.
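   The unhealthy test described above can be sketched as follows. This is a minimal Python illustration, assuming the calling server records the arrival time of the most recent OPTIONS response from each instance (or the time probing began, if none has arrived yet). The function name, parameter names and constants are hypothetical.

```python
PROBE_INTERVAL_S = 0.250  # suggested OPTIONS probe interval (250ms)
MISSED_PROBES = 6         # 6 * 250ms = 1.5s of silence


def is_unhealthy(now_s: float, last_response_s: float, rtt_s: float) -> bool:
    """Return True if no OPTIONS response arrived for RTT + 1.5s.

    now_s and last_response_s are timestamps in seconds taken from the
    same monotonic clock; rtt_s is the current RTT estimate to the
    instance, maintained however the calling server prefers.
    """
    silence_s = now_s - last_response_s
    return silence_s >= rtt_s + MISSED_PROBES * PROBE_INTERVAL_S
```

   With an RTT estimate of 50ms, an instance whose last response arrived 1.6 seconds ago is treated as unhealthy, while one that responded 1.0 second ago is still healthy.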
   If the OPTIONS requests for a specific transaction are not retransmitted, the time between transmission of the request and receipt of the response can be used to measure RTT.

   A MUST strength for 1.5s + RTT is specified to ensure that the cluster can count on consistent and predictable behavior from the upstream calling server.

   An instance is considered healthy if it is not unhealthy.

   OPEN ISSUE: Should we put a Require header field in the OPTIONS? Should we specify any other behaviors in the OPTIONS?

9.1.2.  Utilization Measurement

   The instances can place a SIP header field, Instance-Utilization, into all responses sent to the calling server. These values indicate the utilization of that instance, as an integral value from 0 to 100. They are used by the calling server to weight new traffic in proportion to each instance's remaining capacity (100 minus its utilization). The calling server MUST remove this header field before propagating any response upstream, as it only has significance on the link between the calling server and cluster.

   If this header field is present in a response, the calling server MUST remember the most recent value received from that instance (ordered by the wall clock time at which the response is received). The calling server MUST NOT utilize the source IP of the response to identify the instance. Instead, it MUST correlate the response to a request, and remember the instance to which the request was sent. If no value has been received for 5 seconds, or no value was ever received, the default value of 50 MUST be used as the utilization.

9.1.3.  New Call Initiation

   The calling server MUST NOT place a new SIP call to an instance in the cluster which is unhealthy at the time the call is to be placed. The calling server MUST select an instance for the call using a random function across the instances which are healthy.
   The calling server MUST weight the probability of selecting that instance in proportion to (100 - the utilization of that instance). It MUST then direct the call to this instance, by sending the SIP INVITE to the IP address and port of this instance.

   As an example, if a cluster has three instances with utilizations at 50, 75 and 100, and all three instances are healthy, the weights are 50, 25 and 0. No INVITEs are sent to instance 3, two thirds (about 67%) are sent to instance 1, and one third (about 33%) are sent to instance 2. Note that, in this case, since instance 3 is not handling new calls, further utilization values can only be learned via responses to OPTIONS pings, which the calling server MUST send for instances with over 90% utilization.

9.1.4.  Instance Failure

   When the calling server detects the failure of an instance, it MUST identify all calls which are still active, which were sent to that instance. For each such call, it MUST select a new instance for that call, by choosing one using a uniformly distributed random function amongst the healthy instances. The calling server MUST generate a new INVITE (not a re-INVITE), establishing a new SIP dialog. This INVITE MUST contain a Replaces header field. The Replaces header field MUST contain the dialogID of the call which is being failed over. The INVITE requests MUST be sent uniformly across a 500ms window of time.

9.1.5.  Instance to Inactive

   If the config sync receives an updated configuration file, and one of the instances from the cluster has been marked as inactive, the calling server MUST NOT send new calls to that instance. However, it MUST keep existing calls up, and MUST continue to send OPTIONS probes to that instance.

9.1.6.
Receiving a REFER

   If the calling server receives a REFER request, and the Refer-To URI has a domain portion equal to the IP address of a cluster instance or the FQDN of the cluster, and the Refer-To URI contains an embedded Replaces header field containing a dialogID of a call managed by the calling server, then this REFER is meant to trigger a movement of the call. The calling server MUST authenticate that this request came from an instance in the cluster. The request is authorized if the domain portion of the Refer-To URI contains an IP address of a cluster instance or the FQDN of the cluster, and the dialogID in the embedded Replaces header field matches a dialog that is in progress to that cluster.

   If the domain portion of the URI contains an IP address, the calling server MUST perform the requested INVITE/Replaces to that cluster instance. If the domain portion contains the FQDN of the cluster, the calling server MUST send the INVITE/Replaces to one of the other cluster instances, besides the one to which the dialog is currently connected. It MUST select amongst the other instances as if the currently connected instance were inactive, and then round robin using the utilization measures for the remaining instances.

   TODO: better explanation, more details

9.2.  Cluster Instances

9.2.1.  Sending Utilization Values

   It is RECOMMENDED that if any one instance of a cluster sends values for Instance-Utilization, all instances do. If none send it, calls will be uniformly balanced across the cluster. Thus, the usage of this header field is only meant for cases where uniform load balancing will not produce uniform utilization.

   If an instance is configured to send utilization, it MUST place an Instance-Utilization header field in all responses it sends to all transactions, and include its current measure of utilization. The utilization measure MUST be an integer between 0 and 100 inclusive.
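   To illustrate how these values feed the selection rule of Section 9.1.3, the following minimal Python sketch weights each healthy instance by (100 - utilization) and falls back to the default of 50 when no value has been received within 5 seconds. The function names and the dictionary layout are hypothetical. The behavior when every healthy instance reports 100 is not defined by this specification, so the sketch simply returns None in that case.

```python
import random
from typing import Dict, Optional, Tuple

DEFAULT_UTILIZATION = 50  # used when no fresh value is known
STALE_AFTER_S = 5.0       # reports older than this are ignored

# Hypothetical layout: instance name -> (utilization, received_at_s),
# or None if that instance never sent Instance-Utilization.
Reports = Dict[str, Optional[Tuple[int, float]]]


def effective_utilization(report: Optional[Tuple[int, float]],
                          now_s: float) -> int:
    """Return the reported value, or the default if missing or stale."""
    if report is None:
        return DEFAULT_UTILIZATION
    value, received_at_s = report
    if now_s - received_at_s > STALE_AFTER_S:
        return DEFAULT_UTILIZATION
    return value


def selection_weights(healthy: Reports, now_s: float) -> Dict[str, int]:
    """Weight each healthy instance by (100 - utilization)."""
    return {name: 100 - effective_utilization(report, now_s)
            for name, report in healthy.items()}


def pick_instance(healthy: Reports, now_s: float) -> Optional[str]:
    """Randomly pick a healthy instance in proportion to its weight."""
    weights = selection_weights(healthy, now_s)
    if sum(weights.values()) == 0:
        return None  # assumption: no instance has spare capacity
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names])[0]
```

   With utilizations of 50, 75 and 100, selection_weights yields 50, 25 and 0, matching the two-thirds, one-third and zero split in the example of Section 9.1.3. If no instance sends the header field, every weight is 50 and selection is uniform.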
   Since absolute ordering of responses cannot be guaranteed, the measure SHOULD NOT change more frequently than once a second.

9.2.2.  Receiving INVITE w. Replaces

   If an instance in the cluster receives an INVITE for a call, and that call has a Replaces header field containing a dialogID for a call that the instance knows is in progress within the cluster, it will know that this is a failover call. It may happen that the failover call is one being handled by the instance receiving the INVITE with Replaces. This is a race condition, but in this case the instance MUST still follow the procedures defined here.

   If this is a failover call, the instance MUST authenticate that the INVITE came from the upstream calling server.

   There may be cases where the cluster instance receives an INVITE with a Replaces header field, but the dialogID does not match a dialog known to the cluster. In such a case, the INVITE MUST be treated as a normal INVITE with a Replaces header field as defined by [RFC3891]. In many cases this may be propagated downstream, or challenged for credentials, neither of which is done if the dialogID is a match for a dialog known to the cluster.

   Any downstream SIP dialogs associated with the call MUST be sent an INVITE with Replaces, moving the call to this instance. This will necessarily require the cluster to store the dialogIDs for all dialogs in and out of the cluster, along with any application state needed to reconstruct the dialogs at a new instance.

9.2.3.  Graceful Shutdown with Migration

   In cases where an instance in the cluster wishes to shut down quickly (perhaps to facilitate a rolling upgrade across the cluster), it can do so by ceasing to respond to OPTIONS requests targeted to itself. The upstream caller will see this as a failure, and move all of the calls off of the instance, onto the remaining instances in the cluster.
   When the instance reboots, it will begin responding to the OPTIONS probes, enabling it to begin to receive new calls.

9.2.4.  Graceful Shutdown without Migration

   Another common use case for graceful restart is to cease accepting new calls, but to allow the calls in progress to complete. Once all of the calls have completed, the instance can shut down and restart if desired.

   To accomplish this, the cluster config service will mark the instance as inactive in the config file, and pass the updated file to the config sync via webhook. This will cause the calling server to stop sending new calls to the instance. However, calls in progress will not be dropped.

9.3.  Moving a Dialog

   Another common case is that an instance is overloaded and wishes to shed a few calls. To facilitate this, a cluster instance MAY send a REFER to the calling server, requesting it to send an INVITE with a Replaces header field. The Replaces header field embedded in the Refer-To URI MUST contain the dialogID of the call from the calling server to that instance, which is to be moved. To move the call to a specific other instance in the cluster, the domain portion of the URI is set to be equal to the IP address of that instance. Note that the calling server will validate that this IP address is another member of the cluster before authorizing the REFER.

   Alternatively, the REFER can request the calling server to send the call to any one of the other instances in the cluster, not including itself. To do that, it sets the domain portion of the SIP URI equal to the cluster FQDN.

   TODO: Probably need examples and some more details on in or out of dialog REFER

10.
Cloud SIP Trunk Configuration File

   Something like:

   {
     "cloud-sip-trunk-name" : "trunk32.acme.com",
     "uri" : "https://configs.sip.acme.com/trunk32",
     "version": 23,
     "webhook-registration" : "https://webhooks.sip.acme.com/trunk32",
     "instances" : [
       {
         "IP" : "1.2.3.4",
         "port" : "5061",
         "status" : "active"
       },
       {
         "IP" : "1.2.3.7",
         "port" : "5061",
         "status" : "inactive"
       }
     ]
   }

11.  Webhook Registration Object

   Something like:

   {
     "webhook" : "https://webhook-receipt.sip.acme.com"
   }

12.  Instance-Utilization Header Field

   Something like:

   {
     Instance-Utilization: 34
   }

   IANA registration and formal syntax TBD.

13.  Why not DNS

   The usage of DNS - and specifically [RFC3263] - might appear to be an alternative to the mechanism in this specification for communicating the IP addresses for the instances of the cluster. However, DNS does not meet the requirements outlined above.

   Firstly, DNS is not fast enough to be responsive to the need to add or remove an instance from the cluster. Changes in DNS can take time to propagate. At the time [RFC3263] was conceived, the notion of elastic (and automated) expansion and contraction of clusters did not exist. Cluster instance IPs were extremely static and therefore DNS was sufficient. This is no longer the case.

   Secondly, DNS cannot convey state - in particular, information about whether the cluster instances are active or inactive. This is needed to facilitate graceful shutdown of instances. [RFC3263] did not have to concern itself with this problem, because at the time it was believed SIP servers would not contain call state, and therefore, we would not need to worry about this problem.

   In addition, because we need to failover extremely quickly - in under two seconds - the calling server needs to perform rapid health probing against all instances in the cluster.
   This requires the calling server to know all of the IP addresses of all the instances in the cluster. Typically, DNS queries for an FQDN return one or perhaps a handful of A records, and not every single A record. We expect this specification to be used with clusters that have instance counts in the hundreds, which is wholly inappropriate to convey via DNS.

14.  TODO

   Reconcile this with draft-kinamdar-dispatch-sip-auto-peer.

15.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, DOI 10.17487/RFC3261, June 2002.

   [RFC3263]  Rosenberg, J. and H. Schulzrinne, "Session Initiation Protocol (SIP): Locating SIP Servers", RFC 3263, DOI 10.17487/RFC3263, June 2002.

   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with Session Description Protocol (SDP)", RFC 3264, DOI 10.17487/RFC3264, June 2002.

   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K. Norrman, "The Secure Real-time Transport Protocol (SRTP)", RFC 3711, DOI 10.17487/RFC3711, March 2004.

   [RFC3891]  Mahy, R., Biggs, B., and R. Dean, "The Session Initiation Protocol (SIP) "Replaces" Header", RFC 3891, DOI 10.17487/RFC3891, September 2004.

Authors' Addresses

   Jonathan Rosenberg
   Five9

   Email: jdrosen@jdrosen.net

   Cullen Jennings
   Cisco

   Email: fluffy@cisco.com

   Tolga Asveren
   Ribbon Communications

   Email: tasveren@rbbn.com