ALTO WG Q. Xiang Internet-Draft Y. Yang Intended status: Standards Track Tongji/Yale University Expires: June 20, 2018 December 17, 2017 Unicorn: Resource Orchestration for Large-Scale, Multi-Domain Data Analytics draft-xiang-alto-multidomain-analytics-00.txt Abstract This document presents the design of Unicorn, a multi-domain, geographically-distributed, data-intensive analytics system. The setting of such a system includes edge science networks, which provide storage and computation resources for collecting, sharing and analyzing extremely large amounts of data, and transit networks, which provide networking resources to connects edge science networks for transmitting large science datasets. The key design challenge is to accurately discover and represent resource information from different domains. Unicorn leverages multiple ALTO services, including ALTO-Path Vector, ALTO-Routing State Abstraction, ALTO-Server-Side Event and ALTO-Flow Cost Service to address this challenge. In particular, Unicorn decomposes the resource discovery into three phases. The first phase is to identify endpoint resource, e.g., dataset storage location, computation resource location and output storage resource location. The second phase is to identify the reachability information between the locations of storage and computation resources. The third phase is to identify the available networking resource connecting different storage and computation resources. All information collected through these three phases can be used by a logically centralized scheduling system to orchestrate the resources usage. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any Xiang & Yang Expires June 20, 2018 [Page 1] Internet-Draft Unicorn Design December 2017 time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on June 20, 2018. Copyright Notice Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1. Settings . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 4 3. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4. Storage and Computation Resource Discovery . . . . . . . . . 6 5. Path Discovery . . . . . . . . . . . . . . . . . . . . . . . 6 5.1. Using SDN to get flow-based site-path . . . . . . . . . . 7 5.2. Path Discovery Example . . . . . . . . . . . . . . . . . 7 6. Networking Resource Discovery . . . . . . . . . . . . . . . . 8 6.1. Networking Resource Discovery Example . . . . . . . . . . 8 6.2. A Secure Multiparty Computation Protocol to Compute Minimal, Cross-Domain RSA . . . . . . . . . . . . . . . . 8 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 9 7.1. Normative References . . . . . . . . . . . . . . . . . . 9 7.2. Informative References . . . . . . . . . . . . . . . . . 9 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10 1. Introduction As the data volume increases exponentially over time, data intensive analytics is transiting from single-domain computing to multi- organizational, geographically-distributed, collaborative computing, where different organizations contribute various resources, e.g., computation, storage and networking resources, to collaboratively collect, share and analyze extremely large amounts of data. One leading example is the Large Hadron Collider (LHC) high energy Xiang & Yang Expires June 20, 2018 [Page 2] Internet-Draft Unicorn Design December 2017 physics (HEP) program, which aims to find new particles and interactions in a previously inaccessible range of energies. The scientific collaborations that have built and operate large HEP experimental facilities at the LHC, such as the Compact Muon Solenoid (CMS) and A Toroidal LHC ApparatuS (ATLAS), currently have more than 300 petabytes of data under management at hundreds of sites around the world, and this volume is expected to grow to one exabyte by approximately 2018. This document presents Unicorn, a generic design for resource orchestration for large-scale, multi-domain data analytics. The key design challenge for such a resource orchestration system is to accurately discover and represent the resource information from different domains. Our design resorts to the Application-Layer Traffic Optimization Protocol (ALTO) [RFC7285] to address this challenge. In particular, several ALTO extension services, including ALTO-Path Vector, ALTO-Routing State Abstraction, ALTO-Server-Side Event and ALTO-Flow Cost Service, are integrated in the proposed design. This document focuses on the design details of Unicorn. We present the implementation and deployment experience of Unicorn in another document [DRAFT-UNICORN-INFO]. 1.1. Settings The targeting scenario is as follows. There are two types of networks in the whole system. The first type is the edge science network. An edge science networks is usually a cluster residing in a campus network. It provides storage resources to store large scientific datasets and computation resources to analyze these datasets. The second type is the transit network. A transit network does not provide any storage or computation resources. It only provides networking resources to inter-connect different edge science networks so that datasets can be moved and shared between different edge science networks. Edge science networks do not directly connect to each other, but are connected through transit networks. Without loss of generality, a data analytics task is defined as a 3-tuple: (input dataset, program, output site). A task can be further decomposed into a set of jobs, who have a precedence relation defined by a directed acyclic graph (DAG). And each job can also be defined as a 3-tuple: (input dataset, program, output site). Xiang & Yang Expires June 20, 2018 [Page 3] Internet-Draft Unicorn Design December 2017 2. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. 3. Overview The key design challenge for multi-domain data analytics system is to accurately discover the resource information from different sites while preserving the autonomy and privacy of each site. In order to address this challenge, the design needs to strike a balance between the information accuracy, the efficiency of resource discovery and the privacy of each site. In particular, we propose the following architecture in the Figure 1. .---------. | Users | '---------' | Tasks .- - - - - - - - - - - - - - -|- - - - - - - - - - - - - - - - - - . | | | | .-----------------------. 1 .------------------------.| | | Resource Orchestrator | -----|Storage/Computation Pool|| | '-----------------------' \ '------------------------'| | / | | 4 \ \ | | 2 / 3 | | 3\ \ 2 | | .-------------. .-----------. .-------------. | | | ALTO Server | | Execution | | ALTO Server | | | '-------------' | Agents | '-------------' | | | '-----------' | | | | / \ | | | .----------------./ \ .----------------. | | | Site 1 | . . . | Site N | | | '----------------' '----------------' | '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ' Figure 1: Resource Orchestration for Large-Scale, Multi-Domain Data Analytics: Architecture. In the proposed design, each site deploys an ALTO server. Within the site, the ALTO server collects resource information, such as dataset locations, storage resources, computation resources, networking resources and so on, and announces its capability, i.e., the type of information it is willing to share with other ALTO servers or the data analytics resource orchestrator. Xiang & Yang Expires June 20, 2018 [Page 4] Internet-Draft Unicorn Design December 2017 In this system, users submit analytics tasks to the resource orchestrator. For a data analytics task, a user needs to at least provide the analytics program. If the input dataset is not specified by the user, it means this task does not require a dataset as input. Similarly, if the user does not specify the computation resource or the output dataset storage site, the system will try to allocate default computation resources to this task, and will return the output dataset directly to the user. After getting a (set of) task(s) from the user(s), the orchestrator discover the available resources for executing the submitted task(s) in three steps, labeled in the figure. The first step is called storage/computation resource discovery. In this step, the orchestrator sends requests to a centralized storage and computation resource pool to find the location of candidate input storage resources, the computation resources and the output storage resources. The second step is called path discovery, in which the resource orchestrator sends endpoint or flow cost service queries to the ALTO servers at the site holding the candidate input storage resources and the site holding the candidate computation resources to ask about the connectivity from input dataset site to the computation site and that from the computation site to the output site. The cost type of such queries is path vector defined in [DRAFT-PV]. The response sent back from the ALTO server to the orchestrator is a vector. Each element in this vector is the IP address of the ingress gateway switch/router that the candidate flow will pass along the AS-path. This vector is called the site-path in this document. After collecting the site-path of all the candidate (storage, computation) flows, for each site X, the orchestrator derives F_X, the set of candidate flows that will consume networking resources in site X. Then the orchestrator will send endpoint/flow cost service queries to the ALTO server at each site X to ask about the networking resource sharing of the flow set F_X in site X. The returned response is a set of linear inequalities called resource state abstraction. Using the resource information collected from the three-phase resource discovery process, the resource orchestrator can run an scheduling algorithm to make the resource allocation decisions to execute the submitted tasks. The decisions include job decomposition (DAG construction), task concatenation, job placement, network resource allocation for input dataset movement and output movement. These decisions will be sent to the corresponding execution agents at different sites, which will practice these decisions and send feedback to the orchestrator. Xiang & Yang Expires June 20, 2018 [Page 5] Internet-Draft Unicorn Design December 2017 When resource state changes, e.g., a network link is broken, the ALTO server at the scene will check whether the results of existing path discovery and networking resource discovery are affected by this event, and sends updated resource information using the ALTO-SSE service. In the next few sections, we present the detailed design of the three-phase resource discovery. 4. Storage and Computation Resource Discovery In order to allocation resources for a (set of) data analytic tasks, the scheduling system must first know the availability of the resources explicitly specified in the task, i.e., the storage resource storing the input dataset, the computation resources to run the analytics program and the storage resource that will be used to store the output dataset. Such resources are only provided by the edge science networks. Therefore, a strawman design is for the scheduling system to send requests to the resource information servers of all the edge science networks and to get such information. However, this solution is inefficient in that the scheduling system needs to query all the edge science networks to get the complete information. This document adopts an alternative design, in which all the resource information servers proactively send all their information about the storage and computation resources to a centralized resource pool. This resource pool can be a DNS server or a traditional database. Different techniques are under investigation to improve the scalability of this design, including sharding and distributed hashing table (DHT). 5. Path Discovery Having identified the locations of input dataset storage nodes, the locations of candidate computation nodes and the locations of candidate output dataset storage nodes, the scheduling system next needs to find out the connectivity information between storage nodes and computation nodes. The first connectivity information is the reachability between storage nodes and the computation nodes. A input storage node, a computation node and a output storage node can be allocated to execute a job only if data movement is allowed between the input storage node and the computation node, and between the computation node and the output storage node. Because edge science networks are connected through transit networks, the data movement between candidate storage nodes and computation nodes need to consume networking resources of multiple networks if Xiang & Yang Expires June 20, 2018 [Page 6] Internet-Draft Unicorn Design December 2017 these nodes are located at different edge science networks. In order to find the networking resource sharing between different (storage, computation) pair, the scheduling system also needs to know which networks are involved in the data movement of each (storage, computation) node pair. To retrieve both the types of information, the scheduling system issues endpoint cost service queries to the ALTO servers at edge science networks. For the ALTO server at an edge science network X, the scheduling system issues endpoint cost service defined in [RFC7285] or the extension flow cost service defined in [DRAFT-FCS] queries for all the (input storage node, computation node) pairs where the input storage node is located in X, and all the (computation node, output storage node) pairs where the computation node is located in X. The cost type of such queries is the new path vector cost type introduced in [DRAFT-PV]. For each (storage, computation) pair, the response sent by the ALTO servers at edge science networks is a path vector providing the information about the AS-level path for the data movement of this pair. Different from the traditional path vector where each element is an AS name/number, each element in the path vector sent by the ALTO servers also includes the ingress IP address of the gateway switch/router of the corresponding network. We call this path vector the "site-path", to differentiate it from the traditional AS-path. 5.1. Using SDN to get flow-based site-path ALTO servers can compute the site-path for a given (storage, computation) pair using the information provided by BGP and traceroute. However, BGP only supports destination-IP based routing and limits each network's ability to make fine-grained flow-based routing decisions. We are investigating the usage of SDN technique to allow different networks in the multi-domain data analytics system to exchange and make fine-grained flow-based inter-domain routing decisions. To avoid the route advertisement explosion brought by flow-based routing, we design use a sub/pub system that allows an ALTO server to send routing information queries of a set of flows, instead of the whole flow space, to other ALTO servers at other domains. 5.2. Path Discovery Example The following is an example of path discovery query made by the orchestrator. Xiang & Yang Expires June 20, 2018 [Page 7] Internet-Draft Unicorn Design December 2017 { "cost-type": { "cost-mode": "array", "cost-metric": "ane-path" }, "endpoint-flows": { "srcs": [ "ipv4:172.0.0.1", "ipv4:172.0.1.1"], "dsts": [ "ipv4:172.0.2.1", "ipv4:172.0.3.1"]} } And the following is the response sent from the ALTO server. {"endpoint-cost-map": "ipv4: 172.0.0.1 ": { "ipv4: 172.0.2.1 ": ["ane:172.1.0.1", "ane:172.0.2.0"], "ipv4: 172.0.3.1 ": ["ane:172.1.0.1", "ane:172.0.3.0"]}, "ipv4: 172.0.1.1 ": { "ipv4: 172.0.2.1 ": ["ane:172.1.0.1", "ane:172.0.2.0"], "ipv4: 172.0.3.1 ": ["ane:172.1.0.1", "ane:172.0.3.0"]} } 6. Networking Resource Discovery The responses from ALTO servers during the path discovery provides the connectivity information for every pair of candidate input dataset storage node and computation node, and that of every pair of candidate computation node to output storage node in the form of site-path. With such information, the scheduling system can further discover the networking resource sharing between candidate (storage, computation) data movement flows. In particular, for each network X, both edge science networks and transit network, we can easily derive the whole set of candidate data movement flows F_X that will enter network X from the site-path information of all candidate (storage, computation) data movement flows. After deriving F_X for each network X, the scheduling system will send endpoint cost services or flow cost services to retrieve the resource state abstraction [DRAFT-RSA] for the flow set F_X. 6.1. Networking Resource Discovery Example TBA. 6.2. A Secure Multiparty Computation Protocol to Compute Minimal, Cross-Domain RSA The current design of ALTO-RSA can only compute the minimal resource state abstraction for a single network. In Unicorn, we design a secure multiparty computation protocol to support the computation of minimal, cross-domain routing state abstraction. This protocol contains each network's exposure of its redundant linear inequalities Xiang & Yang Expires June 20, 2018 [Page 8] Internet-Draft Unicorn Design December 2017 to a small number of other networks, and ensures that the orchestrator only gets the minimal, cross-domain resource state abstraction. The overhead of this SMPC process is reasonable due to the adoption of state-of-the-art secure scalar product protocol. 7. References 7.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, . 7.2. Informative References [DRAFT-CC] Randriamasy, S., Yang, R., Wu, Q., Deng, L., and N. Schwan, "ALTO Cost Calendar", 2017, . [DRAFT-DC] Lee, Y., Bernstein, G., Dhody, D., and T. Choi, "ALTO Extensions for Collecting Data Center Resource Information", 2014, . [DRAFT-FCS] Zhang, J., Gao, K., Wang, J., Xiang, Q., and Y. Yang, "ALTO Extension: Flow-based Cost Query", 2017, . [DRAFT-MC] Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost ALTO", 2017, . [DRAFT-NETGRAPH] Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y. Yang, "ALTO Topology Extensions: Node-Link Graphs", 2015, . [DRAFT-PM] Roome, W. and Y. Yang, "Extensible Property Maps for the ALTO Protocol", 2015, . Xiang & Yang Expires June 20, 2018 [Page 9] Internet-Draft Unicorn Design December 2017 [DRAFT-PV] Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y. Yang, "ALTO Extension: Abstract Path Vector as a Cost Mode", 2015, . [DRAFT-RSA] Gao, K., Wang, X., Xiang, Q., Gu, C., Yang, Y., and G. Chen, "A Recommendation for Compressing ALTO Path Vectors", 2017, . [DRAFT-SSE] Roome, W. and Y. Yang, "ALTO Incremental Updates Using Server-Sent Events (SSE)", 2015, . [DRAFT-UNICORN-INFO] Xiang, Q., Newman, H., Bernstein, G., Du, H., Gao, K., Mughal, A., Balcas, J., Zhang, J., and Y. Yang, "Implementation and Deployment of A Resource Orchestration System for Multi-Domain Data Analytics", 2017, . [HTCondor] Thain, D., Tannenbaum, T., and M. Livny, "Distributed computing in practice: the Condor experience", 2005, . [RFC7285] Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S., Previdi, S., Roome, W., Shalunov, S., and R. Woundy, "Application-Layer Traffic Optimization (ALTO) Protocol", RFC 7285, DOI 10.17487/RFC7285, September 2014, . Authors' Addresses Qiao Xiang Tongji/Yale University 51 Prospect Street New Haven, CT USA Email: qiao.xiang@cs.yale.edu Xiang & Yang Expires June 20, 2018 [Page 10] Internet-Draft Unicorn Design December 2017 Y. Richard Yang Tongji/Yale University 51 Prospect Street New Haven, CT USA Email: yry@cs.yale.edu Xiang & Yang Expires June 20, 2018 [Page 11]