ALTO WG                                                         Q. Xiang
Internet-Draft                                                   Y. Yang
Intended status: Standards Track                  Tongji/Yale University
Expires: June 20, 2018                                 December 17, 2017


   Unicorn: Resource Orchestration for Large-Scale, Multi-Domain Data
                               Analytics
             draft-xiang-alto-multidomain-analytics-00.txt

Abstract

   This document presents the design of Unicorn, a multi-domain,
   geographically-distributed, data-intensive analytics system.  The
   setting of such a system includes edge science networks, which
   provide storage and computation resources for collecting, sharing and
   analyzing extremely large amounts of data, and transit networks,
   which provide networking resources to connects edge science networks
   for transmitting large science datasets.

   The key design challenge is to accurately discover and represent
   resource information from different domains.  Unicorn leverages
   multiple ALTO services, including ALTO-Path Vector, ALTO-Routing
   State Abstraction, ALTO-Server-Side Event and ALTO-Flow Cost Service
   to address this challenge.  In particular, Unicorn decomposes the
   resource discovery into three phases.  The first phase is to identify
   endpoint resource, e.g., dataset storage location, computation
   resource location and output storage resource location.  The second
   phase is to identify the reachability information between the
   locations of storage and computation resources.  The third phase is
   to identify the available networking resource connecting different
   storage and computation resources.  All information collected through
   these three phases can be used by a logically centralized scheduling
   system to orchestrate the resources usage.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any


Xiang & Yang              Expires June 20, 2018                 [Page 1]

Internet-Draft               Unicorn Design                December 2017


   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 20, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Settings  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Requirements Language . . . . . . . . . . . . . . . . . . . .   4
   3.  Overview  . . . . . . . . . . . . . . . . . . . . . . . . . .   4
   4.  Storage and Computation Resource Discovery  . . . . . . . . .   6
   5.  Path Discovery  . . . . . . . . . . . . . . . . . . . . . . .   6
     5.1.  Using SDN to get flow-based site-path . . . . . . . . . .   7
     5.2.  Path Discovery Example  . . . . . . . . . . . . . . . . .   7
   6.  Networking Resource Discovery . . . . . . . . . . . . . . . .   8
     6.1.  Networking Resource Discovery Example . . . . . . . . . .   8
     6.2.  A Secure Multiparty Computation Protocol to Compute
           Minimal, Cross-Domain RSA . . . . . . . . . . . . . . . .   8
   7.  References  . . . . . . . . . . . . . . . . . . . . . . . . .   9
     7.1.  Normative References  . . . . . . . . . . . . . . . . . .   9
     7.2.  Informative References  . . . . . . . . . . . . . . . . .   9
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  10

1.  Introduction

   As the data volume increases exponentially over time, data intensive
   analytics is transiting from single-domain computing to multi-
   organizational, geographically-distributed, collaborative computing,
   where different organizations contribute various resources, e.g.,
   computation, storage and networking resources, to collaboratively
   collect, share and analyze extremely large amounts of data.  One
   leading example is the Large Hadron Collider (LHC) high energy


Xiang & Yang              Expires June 20, 2018                 [Page 2]

Internet-Draft               Unicorn Design                December 2017


   physics (HEP) program, which aims to find new particles and
   interactions in a previously inaccessible range of energies.  The
   scientific collaborations that have built and operate large HEP
   experimental facilities at the LHC, such as the Compact Muon Solenoid
   (CMS) and A Toroidal LHC ApparatuS (ATLAS), currently have more than
   300 petabytes of data under management at hundreds of sites around
   the world, and this volume is expected to grow to one exabyte by
   approximately 2018.

   This document presents Unicorn, a generic design for resource
   orchestration for large-scale, multi-domain data analytics.  The key
   design challenge for such a resource orchestration system is to
   accurately discover and represent the resource information from
   different domains.  Our design resorts to the Application-Layer
   Traffic Optimization Protocol (ALTO) [RFC7285] to address this
   challenge.  In particular, several ALTO extension services, including
   ALTO-Path Vector, ALTO-Routing State Abstraction, ALTO-Server-Side
   Event and ALTO-Flow Cost Service, are integrated in the proposed
   design.

   This document focuses on the design details of Unicorn.  We present
   the implementation and deployment experience of Unicorn in another
   document [DRAFT-UNICORN-INFO].

1.1.  Settings

   The targeting scenario is as follows.  There are two types of
   networks in the whole system.  The first type is the edge science
   network.  An edge science networks is usually a cluster residing in a
   campus network.  It provides storage resources to store large
   scientific datasets and computation resources to analyze these
   datasets.  The second type is the transit network.  A transit network
   does not provide any storage or computation resources.  It only
   provides networking resources to inter-connect different edge science
   networks so that datasets can be moved and shared between different
   edge science networks.  Edge science networks do not directly connect
   to each other, but are connected through transit networks.

   Without loss of generality, a data analytics task is defined as a
   3-tuple: (input dataset, program, output site).  A task can be
   further decomposed into a set of jobs, who have a precedence relation
   defined by a directed acyclic graph (DAG).  And each job can also be
   defined as a 3-tuple: (input dataset, program, output site).


Xiang & Yang              Expires June 20, 2018                 [Page 3]

Internet-Draft               Unicorn Design                December 2017


2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Overview

   The key design challenge for multi-domain data analytics system is to
   accurately discover the resource information from different sites
   while preserving the autonomy and privacy of each site.  In order to
   address this challenge, the design needs to strike a balance between
   the information accuracy, the efficiency of resource discovery and
   the privacy of each site.  In particular, we propose the following
   architecture in the Figure 1.

                             .---------.
                             |  Users  |
                             '---------'
                                  | Tasks
    .- - - - - - - - - - - - - - -|- - - - - - - - - - - - - - - - - - .
    |                             |                                    |
    |         .-----------------------.   1  .------------------------.|
    |         | Resource Orchestrator | -----|Storage/Computation Pool||
    |         '-----------------------' \    '------------------------'|
    |             /   |        | 4     \ \                             |
    |          2 /  3 |        |       3\ \ 2                          |
    |    .-------------.  .-----------. .-------------.                |
    |    | ALTO Server |  | Execution | | ALTO Server |                |
    |    '-------------'  |  Agents   | '-------------'                |
    |          |          '-----------'        |                       |
    |          |          /             \      |                       |
    |  .----------------./               \ .----------------.          |
    |  |     Site 1     |       . . .      |     Site N     |          |
    |  '----------------'                  '----------------'          |
    '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - '


    Figure 1: Resource Orchestration for Large-Scale, Multi-Domain Data
                         Analytics: Architecture.

   In the proposed design, each site deploys an ALTO server.  Within the
   site, the ALTO server collects resource information, such as dataset
   locations, storage resources, computation resources, networking
   resources and so on, and announces its capability, i.e., the type of
   information it is willing to share with other ALTO servers or the
   data analytics resource orchestrator.


Xiang & Yang              Expires June 20, 2018                 [Page 4]

Internet-Draft               Unicorn Design                December 2017


   In this system, users submit analytics tasks to the resource
   orchestrator.  For a data analytics task, a user needs to at least
   provide the analytics program.  If the input dataset is not specified
   by the user, it means this task does not require a dataset as input.
   Similarly, if the user does not specify the computation resource or
   the output dataset storage site, the system will try to allocate
   default computation resources to this task, and will return the
   output dataset directly to the user.

   After getting a (set of) task(s) from the user(s), the orchestrator
   discover the available resources for executing the submitted task(s)
   in three steps, labeled in the figure.  The first step is called
   storage/computation resource discovery.  In this step, the
   orchestrator sends requests to a centralized storage and computation
   resource pool to find the location of candidate input storage
   resources, the computation resources and the output storage
   resources.

   The second step is called path discovery, in which the resource
   orchestrator sends endpoint or flow cost service queries to the ALTO
   servers at the site holding the candidate input storage resources and
   the site holding the candidate computation resources to ask about the
   connectivity from input dataset site to the computation site and that
   from the computation site to the output site.  The cost type of such
   queries is path vector defined in [DRAFT-PV].  The response sent back
   from the ALTO server to the orchestrator is a vector.  Each element
   in this vector is the IP address of the ingress gateway switch/router
   that the candidate flow will pass along the AS-path.  This vector is
   called the site-path in this document.

   After collecting the site-path of all the candidate (storage,
   computation) flows, for each site X, the orchestrator derives F_X,
   the set of candidate flows that will consume networking resources in
   site X.  Then the orchestrator will send endpoint/flow cost service
   queries to the ALTO server at each site X to ask about the networking
   resource sharing of the flow set F_X in site X.  The returned
   response is a set of linear inequalities called resource state
   abstraction.

   Using the resource information collected from the three-phase
   resource discovery process, the resource orchestrator can run an
   scheduling algorithm to make the resource allocation decisions to
   execute the submitted tasks.  The decisions include job decomposition
   (DAG construction), task concatenation, job placement, network
   resource allocation for input dataset movement and output movement.
   These decisions will be sent to the corresponding execution agents at
   different sites, which will practice these decisions and send
   feedback to the orchestrator.


Xiang & Yang              Expires June 20, 2018                 [Page 5]

Internet-Draft               Unicorn Design                December 2017


   When resource state changes, e.g., a network link is broken, the ALTO
   server at the scene will check whether the results of existing path
   discovery and networking resource discovery are affected by this
   event, and sends updated resource information using the ALTO-SSE
   service.

   In the next few sections, we present the detailed design of the
   three-phase resource discovery.

4.  Storage and Computation Resource Discovery

   In order to allocation resources for a (set of) data analytic tasks,
   the scheduling system must first know the availability of the
   resources explicitly specified in the task, i.e., the storage
   resource storing the input dataset, the computation resources to run
   the analytics program and the storage resource that will be used to
   store the output dataset.  Such resources are only provided by the
   edge science networks.  Therefore, a strawman design is for the
   scheduling system to send requests to the resource information
   servers of all the edge science networks and to get such information.
   However, this solution is inefficient in that the scheduling system
   needs to query all the edge science networks to get the complete
   information.

   This document adopts an alternative design, in which all the resource
   information servers proactively send all their information about the
   storage and computation resources to a centralized resource pool.
   This resource pool can be a DNS server or a traditional database.
   Different techniques are under investigation to improve the
   scalability of this design, including sharding and distributed
   hashing table (DHT).

5.  Path Discovery

   Having identified the locations of input dataset storage nodes, the
   locations of candidate computation nodes and the locations of
   candidate output dataset storage nodes, the scheduling system next
   needs to find out the connectivity information between storage nodes
   and computation nodes.  The first connectivity information is the
   reachability between storage nodes and the computation nodes.  A
   input storage node, a computation node and a output storage node can
   be allocated to execute a job only if data movement is allowed
   between the input storage node and the computation node, and between
   the computation node and the output storage node.

   Because edge science networks are connected through transit networks,
   the data movement between candidate storage nodes and computation
   nodes need to consume networking resources of multiple networks if


Xiang & Yang              Expires June 20, 2018                 [Page 6]

Internet-Draft               Unicorn Design                December 2017


   these nodes are located at different edge science networks.  In order
   to find the networking resource sharing between different (storage,
   computation) pair, the scheduling system also needs to know which
   networks are involved in the data movement of each (storage,
   computation) node pair.

   To retrieve both the types of information, the scheduling system
   issues endpoint cost service queries to the ALTO servers at edge
   science networks.  For the ALTO server at an edge science network X,
   the scheduling system issues endpoint cost service defined in
   [RFC7285] or the extension flow cost service defined in [DRAFT-FCS]
   queries for all the (input storage node, computation node) pairs
   where the input storage node is located in X, and all the
   (computation node, output storage node) pairs where the computation
   node is located in X.  The cost type of such queries is the new path
   vector cost type introduced in [DRAFT-PV].

   For each (storage, computation) pair, the response sent by the ALTO
   servers at edge science networks is a path vector providing the
   information about the AS-level path for the data movement of this
   pair.  Different from the traditional path vector where each element
   is an AS name/number, each element in the path vector sent by the
   ALTO servers also includes the ingress IP address of the gateway
   switch/router of the corresponding network.  We call this path vector
   the "site-path", to differentiate it from the traditional AS-path.

5.1.  Using SDN to get flow-based site-path

   ALTO servers can compute the site-path for a given (storage,
   computation) pair using the information provided by BGP and
   traceroute.  However, BGP only supports destination-IP based routing
   and limits each network's ability to make fine-grained flow-based
   routing decisions.  We are investigating the usage of SDN technique
   to allow different networks in the multi-domain data analytics system
   to exchange and make fine-grained flow-based inter-domain routing
   decisions.  To avoid the route advertisement explosion brought by
   flow-based routing, we design use a sub/pub system that allows an
   ALTO server to send routing information queries of a set of flows,
   instead of the whole flow space, to other ALTO servers at other
   domains.

5.2.  Path Discovery Example

   The following is an example of path discovery query made by the
   orchestrator.


Xiang & Yang              Expires June 20, 2018                 [Page 7]

Internet-Draft               Unicorn Design                December 2017


   { "cost-type":
       { "cost-mode": "array",
         "cost-metric": "ane-path" },
     "endpoint-flows":
        { "srcs": [ "ipv4:172.0.0.1", "ipv4:172.0.1.1"],
          "dsts": [ "ipv4:172.0.2.1", "ipv4:172.0.3.1"]}
   }

   And the following is the response sent from the ALTO server.

   {"endpoint-cost-map":
        "ipv4: 172.0.0.1 ": {
           "ipv4: 172.0.2.1 ": ["ane:172.1.0.1", "ane:172.0.2.0"],
           "ipv4: 172.0.3.1 ": ["ane:172.1.0.1", "ane:172.0.3.0"]},
        "ipv4: 172.0.1.1 ": {
           "ipv4: 172.0.2.1 ": ["ane:172.1.0.1", "ane:172.0.2.0"],
           "ipv4: 172.0.3.1 ": ["ane:172.1.0.1", "ane:172.0.3.0"]}
   }

6.  Networking Resource Discovery

   The responses from ALTO servers during the path discovery provides
   the connectivity information for every pair of candidate input
   dataset storage node and computation node, and that of every pair of
   candidate computation node to output storage node in the form of
   site-path.  With such information, the scheduling system can further
   discover the networking resource sharing between candidate (storage,
   computation) data movement flows.  In particular, for each network X,
   both edge science networks and transit network, we can easily derive
   the whole set of candidate data movement flows F_X that will enter
   network X from the site-path information of all candidate (storage,
   computation) data movement flows.  After deriving F_X for each
   network X, the scheduling system will send endpoint cost services or
   flow cost services to retrieve the resource state abstraction
   [DRAFT-RSA] for the flow set F_X.

6.1.  Networking Resource Discovery Example

   TBA.

6.2.  A Secure Multiparty Computation Protocol to Compute Minimal,
      Cross-Domain RSA

   The current design of ALTO-RSA can only compute the minimal resource
   state abstraction for a single network.  In Unicorn, we design a
   secure multiparty computation protocol to support the computation of
   minimal, cross-domain routing state abstraction.  This protocol
   contains each network's exposure of its redundant linear inequalities


Xiang & Yang              Expires June 20, 2018                 [Page 8]

Internet-Draft               Unicorn Design                December 2017


   to a small number of other networks, and ensures that the
   orchestrator only gets the minimal, cross-domain resource state
   abstraction.  The overhead of this SMPC process is reasonable due to
   the adoption of state-of-the-art secure scalar product protocol.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

7.2.  Informative References

   [DRAFT-CC]
              Randriamasy, S., Yang, R., Wu, Q., Deng, L., and N.
              Schwan, "ALTO Cost Calendar", 2017,
              <https://datatracker.ietf.org/doc/
              draft-ietf-alto-cost-calendar>.

   [DRAFT-DC]
              Lee, Y., Bernstein, G., Dhody, D., and T. Choi, "ALTO
              Extensions for Collecting Data Center Resource
              Information", 2014, <https://datatracker.ietf.org/doc/
              draft-lee-alto-ext-dc-resource/>.

   [DRAFT-FCS]
              Zhang, J., Gao, K., Wang, J., Xiang, Q., and Y. Yang,
              "ALTO Extension: Flow-based Cost Query", 2017,
              <https://datatracker.ietf.org/doc/draft-gao-alto-fcs/>.

   [DRAFT-MC]
              Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost
              ALTO", 2017, <https://datatracker.ietf.org/doc/
              draft-ietf-alto-multi-cost/>.

   [DRAFT-NETGRAPH]
              Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
              Yang, "ALTO Topology Extensions: Node-Link Graphs", 2015,
              <https://tools.ietf.org/html/draft-yang-alto-topology-06>.

   [DRAFT-PM]
              Roome, W. and Y. Yang, "Extensible Property Maps for the
              ALTO Protocol", 2015, <https://datatracker.ietf.org/doc/
              draft-roome-alto-unified-props-new/>.


Xiang & Yang              Expires June 20, 2018                 [Page 9]

Internet-Draft               Unicorn Design                December 2017


   [DRAFT-PV]
              Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
              Yang, "ALTO Extension: Abstract Path Vector as a Cost
              Mode", 2015, <https://tools.ietf.org/html/
              draft-yang-alto-path-vector-01>.

   [DRAFT-RSA]
              Gao, K., Wang, X., Xiang, Q., Gu, C., Yang, Y., and G.
              Chen, "A Recommendation for Compressing ALTO Path
              Vectors", 2017, <https://datatracker.ietf.org/doc/
              draft-gao-alto-routing-state-abstraction/>.

   [DRAFT-SSE]
              Roome, W. and Y. Yang, "ALTO Incremental Updates Using
              Server-Sent Events (SSE)", 2015,
              <https://datatracker.ietf.org/doc/
              draft-ietf-alto-incr-update-sse/>.

   [DRAFT-UNICORN-INFO]
              Xiang, Q., Newman, H., Bernstein, G., Du, H., Gao, K.,
              Mughal, A., Balcas, J., Zhang, J., and Y. Yang,
              "Implementation and Deployment of A Resource Orchestration
              System for Multi-Domain Data Analytics", 2017,
              <https://datatracker.ietf.org/doc/
              draft-xiang-alto-exascale-network-optimization/>.

   [HTCondor]
              Thain, D., Tannenbaum, T., and M. Livny, "Distributed
              computing in practice: the Condor experience", 2005,
              <http://dl.acm.org/citation.cfm?id=1064336>.

   [RFC7285]  Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S.,
              Previdi, S., Roome, W., Shalunov, S., and R. Woundy,
              "Application-Layer Traffic Optimization (ALTO) Protocol",
              RFC 7285, DOI 10.17487/RFC7285, September 2014,
              <https://www.rfc-editor.org/info/rfc7285>.

Authors' Addresses

   Qiao Xiang
   Tongji/Yale University
   51 Prospect Street
   New Haven, CT
   USA

   Email: qiao.xiang@cs.yale.edu


Xiang & Yang              Expires June 20, 2018                [Page 10]

Internet-Draft               Unicorn Design                December 2017


   Y. Richard Yang
   Tongji/Yale University
   51 Prospect Street
   New Haven, CT
   USA

   Email: yry@cs.yale.edu


Xiang & Yang              Expires June 20, 2018                [Page 11]