QUIC                                                         B. Trammell
Internet-Draft                                                ETH Zurich
Intended status: Informational                           August 20, 2018
Expires: February 21, 2019

         Why do we need passive measurement of round trip time?
                    draft-trammell-why-measure-rtt-00

Abstract

This document describes the utility of passive two-way latency measurement, both for the generation of latency metrics and for other measurement tasks, when passive latency measurement is the only facility available for measurement. It additionally discusses other metrics derivable from the transport-independent latency spin signal defined in [TSVWG-SPIN].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on February 21, 2019.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Trammell                Expires February 21, 2019               [Page 1]
Internet-Draft              Why measure RTT?                 August 2018

Table of Contents

1.  Introduction
  1.1.  About This Document
2.  Direct Utility of Passive RTT Measurement
  2.1.  Inter-domain Troubleshooting
  2.2.  Bufferbloat Mitigation in Cellular Networks
  2.3.  Locating WiFi Problems in Home Networks
  2.4.  Internet Measurement Research
3.  Indirect Utility of RTT Measurements
4.  Additional Metrics Derivable from the Spin Bit
  4.1.  Derived Loss and Reordering
  4.2.  Two-Point Intradomain Measurement
5.  Contributors
6.  Acknowledgments
7.  References
  7.1.  Normative References
  7.2.  Informative References
Author's Address

1.  Introduction

Latency is a key metric for understanding network operation and performance. Passive measurement allows inspection of latency on production traffic, avoiding problems with different treatment of production and measurement traffic, and enables opportunistic measurement of latency without active measurement overhead.

Passive measurement of RTT, in particular, has both direct utility (see Section 2), in generating RTT samples for various use cases, and indirect utility (see Section 3), since RTT is correlated with other useful metrics.
In addition, the passive latency signal proposed in [TSVWG-SPIN] provides other opportunities for metric generation which are a consequence of its design (see Section 4).

This document describes these use cases in order to motivate why passive measurability of RTT on a per-flow basis is an interesting and useful feature for a transport protocol to have. This is especially true in the absence of other directly observable metrics such as loss and retransmission, as is the case with protocols with mostly-encrypted wire images [WIRE-IMAGE] such as QUIC [QUIC].

1.1.  About This Document

This document is maintained in the GitHub repository https://github.com/britram/draft-trammell-tsvwg-spin, and the editor's copy is available online at https://britram.github.io/draft-trammell-tsvwg-spin. Current open issues on the document can be seen at https://github.com/britram/draft-trammell-tsvwg-spin/issues. Comments and suggestions on this document can be made by filing an issue there, or by contacting the editor.

This document is based in part on [QUIC-SPIN]; however, aside from Section 4, it is not specific to the spin bit proposal.

2.  Direct Utility of Passive RTT Measurement

RTT measurement generates two-way latency metric samples; these samples are useful in many measurement tasks which directly require latency data. The measurement methodologies using two-way latency measurement samples follow one of a few basic variants:

o  The RTT evolution of a flow or a set of flows can be compared to baseline or expected RTT measurements for flows with the same characteristics in order to detect or localize latency issues in a specific network.
o  The RTT evolution of a single flow can also be examined in detail to diagnose performance issues with that flow.
o  Samples of RTT for a flow aggregate (e.g., all flows between two given networks) can be used without regard to temporal evolution of the RTT, in order to examine the distribution of RTTs for a group of flows that should have similar RTT (e.g., because they should share the same path(s)).

2.1.  Inter-domain Troubleshooting

Network access providers are often the first point of contact for their customers when network problems impact the performance of bandwidth-intensive and latency-sensitive applications such as video, regardless of whether the root cause lies within the access provider's network, the service provider's network, on the Internet paths between them, or within the customer's own network.

Points on the path can extract spatial delay metric samples [RFC6049] from fields of the transport layer (e.g., TCP) or application layer (e.g., RTP). The information is captured in the upper layer because neither the IP header nor the UDP layer includes fields allowing the measurement of upstream and downstream delay.

Local network performance problems are detected with monitoring tools which observe the variation of upstream latency and downstream latency.

Inter-domain troubleshooting relies on the same metrics but is not a proactive task; instead, it is a recursive process which homes in on the domain and link responsible for the failure. In practice, inter-domain troubleshooting is a communication process between the Network Operations Center (NOC) teams of the networks on the path, because the root cause of a problem is rarely confined to a single network, so locating it requires cooperation and exchange of data between the NOCs.

One example is troubleshooting performance degradation resulting from a change of routing policy on one side of the path which increases queueing on the other side of the path.

2.2.  Bufferbloat Mitigation in Cellular Networks

Cellular networks consist of multiple Radio Access Networks (RANs) where mobile devices are attached to base stations. It is common that base stations from different vendors and different generations are deployed in the same cellular network.

Due to the dynamic nature of RANs, base stations have typically been provisioned with large buffers to maximize throughput despite rapid changes in capacity. As a side effect, bufferbloat has become a common issue in such networks [WWMM-BLOAT].

An effective way of mitigating bufferbloat without sacrificing too much throughput is to deploy Active Queue Management (AQM) in bottleneck routers and base stations. However, due to the variation in deployed base stations, it is not always possible to enable AQM at the bottlenecks without massive infrastructure investments.

An alternative approach is to deploy AQM as a network function in a more centralized location than the traditional bottleneck nodes. Such an AQM monitors the RTT progression of flows and drops or marks packets when the measured latency is indicative of congestion. Such a function can also detect misbehaving flows and reduce the negative impact they have on the network.

2.3.  Locating WiFi Problems in Home Networks

Many residential networks use WiFi (802.11) on the last segment, and WiFi signal strength degradation manifests as high first-hop delay, because the MAC layer retransmits packets lost at that layer. Measuring the RTT between endpoints on the customer network and parts of the service provider's own infrastructure (which have predictable delay characteristics) can be used to isolate this cause of performance problems.

The network provider can measure the RTT at the home gateway, or at an upstream point if there is no access to the home gateway.
A problem in the WiFi network is indicated by high delay combined with low packet loss.

These measurements are particularly useful for traffic which is latency sensitive, such as interactive video applications. However, since high latency is often correlated with other network-layer issues such as chronic interconnect congestion [IMC-CONGESTION], RTT measurement is also useful for general troubleshooting of network-layer issues in an interdomain setting.

In this case, multiple RTT samples per flow are useful less for observing intraflow behavior, and more for generating sufficient samples for a given aggregate to make a high-quality measurement.

2.4.  Internet Measurement Research

As a large, distributed, engineered system with no centralized control, the Internet has emergent properties of interest to the research community not just for purely scientific curiosity, but also to provide applicable guidance to Internet engineering, Internet protocol design and development, network operations, and policy development.

Latency measurements in particular are both an active area of research and an important tool for certain measurement studies (see, e.g., [IMC-TCPSIG], from the 2017 Internet Measurement Conference). While much of this work is currently done with active measurements, the ability to generate latency samples passively or using a hybrid measurement approach (i.e., through passive observation of purpose-generated active measurement traffic; see [RFC7799]) can drastically increase the efficiency and scalability of these studies.

3.  Indirect Utility of RTT Measurements

In addition to the direct generation of RTT metric samples, RTT measurement can also be used for indirect generation of other metrics when more direct means are not available.
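
As an illustration of one such indirect metric, the following is a minimal sketch of the inflight window estimate discussed later in this section: for each RTT sample r observed at time t, it sums the bytes an observer saw in the sender-to-receiver direction during the preceding RTT. The function name inflight_estimates, its parameter names, and the choice of the half-open interval (t - r, t] are assumptions made for this example only; they are not taken from [TSVWG-SPIN] or any existing tool. The sketch also assumes that the observer sees every packet of the flow in that direction and that timestamps and RTTs use the same time unit.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


def inflight_estimates(
    packets: Sequence[Tuple[float, int]],
    rtt_samples: Sequence[Tuple[float, float]],
) -> List[Tuple[float, int]]:
    """Estimate the inflight window at each RTT sample (illustrative only).

    packets: (timestamp, size_in_bytes) for each packet observed in the
        sender-to-receiver direction, sorted by timestamp.
    rtt_samples: (t, r) pairs, meaning an RTT sample r was observed at time t.

    Returns (t, estimated_bytes) pairs, where estimated_bytes is the sum of
    the sizes of packets observed in the half-open interval (t - r, t].
    """
    times = [ts for ts, _ in packets]
    # prefix[i] holds the total size of the first i packets, so any
    # interval sum is a difference of two prefix entries.
    prefix = [0] + list(accumulate(size for _, size in packets))

    estimates = []
    for t, r in rtt_samples:
        lo = bisect_right(times, t - r)  # index of first packet after t - r
        hi = bisect_right(times, t)      # one past the last packet at or before t
        estimates.append((t, prefix[hi] - prefix[lo]))
    return estimates


if __name__ == "__main__":
    # Hypothetical observations, with timestamps and RTTs in milliseconds.
    observed = [(0.0, 1200), (10.0, 1200), (20.0, 1200), (50.0, 1200), (60.0, 600)]
    samples = [(40.0, 40.0), (60.0, 30.0)]
    for t, nbytes in inflight_estimates(observed, samples):
        print(f"t={t} ms: about {nbytes} bytes inflight")
```

A drop in the estimate between consecutive RTT samples is the kind of inflight window reduction that the discussion below relates to sender reactions to congestion; how reliably that inference holds for a given flow is not something this sketch addresses.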
A variety of tools are used for detailed troubleshooting of the performance of single flows, both for debugging transport- and application-layer protocol implementations, and to determine whether a particular end-to-end performance issue is related to particular network conditions. One common type of visualization used for TCP (implemented, for example, in the TCP Stream Graphs feature of Wireshark, https://www.wireshark.org/) shows the development over time of the sequence and acknowledgment numbers, including retransmissions, and the evolution of the inflight and receiver flow control windows over time. By analyzing the relationship among loss, latency, and throughput, the precise cause of an observed performance issue on a given flow can be determined.

While RTT measurements on their own are not enough to drive such a visualization, many similar techniques can be built on high-resolution time series RTT data. Here we exploit two properties of transport protocols:

o  The size of the inflight window is equal to the number of bytes/packets sent per RTT, so inflight window evolution can be generated by taking each RTT sample r observed at time t and summing the number of bytes/packets sent between t - r and t.
o  Changes in the inflight window can be related to sender reactions to congestion. For common loss- and ECN-based congestion control algorithms such as NewReno [RFC6582] and Cubic [RFC8312], inflight window reductions are correlated with sender-experienced congestion or loss.

Inflight window evolution over time, together with heuristic assumptions about server behavior, can go a long way toward replacing direct visibility of transport protocol dynamics (sequence and acknowledgment numbers over time) for encrypted transports; the exact details of this are a subject of present and future research.

4.  Additional Metrics Derivable from the Spin Bit

The latency spin signal mechanism itself [TSVWG-SPIN] has additional measurement utility; these observations do not apply to other methodologies for measuring RTT.

4.1.  Derived Loss and Reordering

When the latency spin bit is used alone (as a one-bit signal), measurement systems must use heuristics to reject samples which are potentially lost, potentially reordered, or potentially delayed. When these heuristics are instrumented to note their sample rejection rate, this rate itself is a potentially useful proxy metric for "difficulty" (vaguely defined) experienced by a flow.

When the latency signal is used with the Valid Edge Counter (VEC), additional information is available in the wire image to reject samples due to loss, delay, or reordering. Analysis of the VEC together with the series of spin bit values can be used to recognize single loss and reordering events, which can be used to generate loss and reordering metrics at the resolution of the flow's round trip time. Optimal use of the VEC signal to generate loss and reordering metric signals is a subject of ongoing research.

4.2.  Two-Point Intradomain Measurement

The spin bit is also useful as a basic signal for instantaneous measurement of the treatment of traffic carrying the latency spin signal within a single network. Though the primary design goal of the spin bit signal is to enable single-observer on-path measurement of end-to-end RTT, the spin bit can also be used by two cooperating observers with access to traffic flowing in the same direction as an alternate marking signal, as described in [ALT-MARK]. The only difference from alternate marking with a generated signal is that the size of the alternation will change with the flight size each RTT.
However, these changes do not affect the applicability of the method, which works on each marking batch separately between two measurement points in the same direction. This two-point measurement is an additional feature enabled "for free" by the spin bit signal.

So, with more than one observer in the same direction, it can be useful to segment the RTT and deduce the contribution to the RTT of the portion of the network between two on-path observers. This can be easily performed by calculating the delay between two or more measurement points in a single direction by applying [ALT-MARK]. In this way, packet loss, delay, and delay variation can be measured for each segment of the network, depending on the number and distribution of the available on-path observation points. When these observation points are applied at network borders, the alternate-marking signal can be used to measure the performance of QUIC traffic within a network operator's own domain of responsibility.

5.  Contributors

This document contains text from [QUIC-SPIN], which is the work of the following authors in addition to the editor of this document:

o  Piet De Vaere, ETH Zurich
o  Roni Even, Huawei
o  Giuseppe Fioccola, Telecom Italia
o  Thomas Fossati, Nokia
o  Marcus Ihlar, Ericsson
o  Al Morton, AT&T Labs
o  Emile Stephan, Orange

6.  Acknowledgments

Thanks to Mark Nottingham for suggesting that this document should exist.

This work is partially supported by the European Commission under Horizon 2020 grant agreement no. 688421 Measurement and Architecture for a Middleboxed Internet (MAMI), and by the Swiss State Secretariat for Education, Research, and Innovation under contract no. 15.0268. This support does not imply endorsement.

7.  References

7.1.  Normative References

[TSVWG-SPIN]  Trammell, B., "A Transport-Independent Explicit Signal for Hybrid RTT Measurement", draft-trammell-tsvwg-spin-00 (work in progress), July 2018.

7.2.  Informative References

[ALT-MARK]  Fioccola, G., Capello, A., Cociglio, M., Castaldelli, L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi, "Alternate Marking method for passive and hybrid performance monitoring", draft-ietf-ippm-alt-mark-14 (work in progress), December 2017.

[IMC-CONGESTION]  Luckie, M., Dhamdhere, A., Clark, D., Huffaker, B., and k. claffy, "Challenges in Inferring Internet Interdomain Congestion (in Proc. ACM IMC 2014)", November 2014.

[IMC-TCPSIG]  Sundaresan, S., Dhamdhere, A., Allman, M., and k. claffy, "TCP Congestion Signatures (in Proc. ACM IMC 2017)", n.d.

[QUIC]  Iyengar, J. and M. Thomson, "QUIC: A UDP-Based Multiplexed and Secure Transport", draft-ietf-quic-transport-14 (work in progress), August 2018.

[QUIC-SPIN]  Trammell, B., Vaere, P., Even, R., Fioccola, G., Fossati, T., Ihlar, M., Morton, A., and S. Emile, "Adding Explicit Passive Measurability of Two-Way Latency to the QUIC Transport Protocol", draft-trammell-quic-spin-03 (work in progress), May 2018.

[RFC6049]  Morton, A. and E. Stephan, "Spatial Composition of Metrics", RFC 6049, DOI 10.17487/RFC6049, January 2011.

[RFC6582]  Henderson, T., Floyd, S., Gurtov, A., and Y. Nishida, "The NewReno Modification to TCP's Fast Recovery Algorithm", RFC 6582, DOI 10.17487/RFC6582, April 2012.

[RFC7799]  Morton, A., "Active and Passive Metrics and Methods (with Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799, May 2016.

[RFC8312]  Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", RFC 8312, DOI 10.17487/RFC8312, February 2018.

[WIRE-IMAGE]  Trammell, B. and M. Kuehlewind, "The Wire Image of a Network Protocol", draft-trammell-wire-image-04 (work in progress), April 2018.

[WWMM-BLOAT]  Alfredsson, S., Giudice, G., Garcia, J., Brunstrom, A., Cicco, L., and S. Mascolo, "Impact of TCP Congestion Control on Bufferbloat in Cellular Networks (in Proc. IEEE WoWMoM 2013)", June 2013.

Author's Address

Brian Trammell
ETH Zurich

Email: ietf@trammell.ch