Internet Engineering Task Force C. Yang, Ed.
Internet-Draft Y. Liu, Ed.
Intended status: Standards Track South China University of Technology
Expires: December 14, 2018 C. Chen
Inspur
G. Chen
GSTA
Y. Wei
Huawei
June 12, 2018

A Massive Data Migration Framework
draft-yangcan-ietf-data-migration-standards-00

Abstract

This document describes a standardized framework for implementing massive data migration between traditional databases and big-data platforms on the cloud over the Internet, with a Hadoop-based data architecture as the reference instance. The main goal of the framework is to provide concise and friendly interfaces so that users can more easily and quickly migrate massive data from a relational database to a distributed platform, driven by a variety of requirements, in order to make full use of distributed storage resources and distributed computing capabilities and thereby remove the storage and computing-performance bottlenecks of traditional enterprise-level applications. This document covers the fundamental architecture, data element specifications, operations, and interfaces related to massive data migration.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 14, 2018.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


1. Introduction

With the widespread adoption of cloud computing and big data technology, the scale of data is increasing rapidly, and the demand for distributed computing is more significant than before. For a long time, most companies have used relational databases to store and manage their data, and a great amount of structured data still resides in legacy systems and keeps accumulating as the business develops. With the daily growth of data size, the storage bottleneck and the performance degradation in data analysis and processing have become serious problems that need to be solved in global enterprise-level applications.

A distributed platform, as used in this document, refers to a software platform that builds data storage, data analysis, and computation on a cluster of multiple hosts. Its core architecture involves distributed storage and distributed computing. In terms of storage, capacity can in theory be expanded indefinitely, and storage can be scaled out dynamically as data grows. In terms of computing, frameworks such as MapReduce can perform parallel computation on large-scale datasets to improve the efficiency of massive data processing. Therefore, when the data size exceeds the storage capacity of a single system, or the computation exceeds the capacity of a stand-alone system, massive data can be migrated to a distributed platform. The resource sharing and collaborative computing provided by a distributed platform can solve large-scale data processing problems well.

This document focuses on putting forward a standard for implementing a big data migration framework accessed through the web, and considers how to help users more easily and quickly migrate massive data from a traditional relational database to a cloud platform under a variety of requirements. By using the distributed storage and distributed computing technologies of the cloud platform, the framework solves the storage bottleneck and the low data analysis and processing performance of relational databases. Because it is accessed through the web, the framework operates in an open manner and promotes the global application of data migration.

Note: It is also permissible to implement this framework in non-web environments.

2. Definitions and Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

The following definitions are for terms used in the context of this document.

DMOW: the data migration component embedded in the WebServer (see Figure 1), through which users submit and manage migration tasks from a browser.

Migration Engine: the component on the cloud platform that performs the actual data transfer between the data source and the platform's storage containers (see Figure 1).

3. Specific Framework Implementation Standards

The main goal of this data migration framework is to help companies migrate their massive data stored in relational databases to cloud platforms through web access. We propose a series of rules and constraints on the implementation of the framework, by which users can conduct massive data migration from a multi-requirement perspective.

Note: The cloud platforms mentioned in this document refer to the Hadoop platform [hadoop] by default. All statements about the operations and the environment of the framework assume the web context by default.

3.1. System Architecture Diagram

Figure 1 shows the working diagram of the framework.

        
    +---------+         +----------------+
    |         |   (1)   |    WebServer   |
    | Browser |-------->|                |---------------------
    |         |         |  +-----------+ |                    |
    +---------+         |  |   DMOW    | |                    |
                        |  +-----------+ |                    |
                        +----------------+                    |
                                                              |(2)
                                                              |
                                                              |
        +-------------+          +-----------------------+    |
        |             |   (3)    |                       |    |
        | Data Source |--------> |     Cloud Platform    |    |
        |             |          |  +-----------------+  |<----
        +-------------+          |  | Migration Engine|  |
                                 |  +-----------------+  |
                                 +-----------------------+

        
        

Figure 1: Reference Architecture

The workflow of the framework is as follows:

   (1) The user accesses the WebServer through a browser and submits a
       data migration task to the DMOW component.

   (2) The WebServer forwards the migration task to the Migration
       Engine on the cloud platform.

   (3) The Migration Engine reads the data from the data source and
       migrates it into the cloud platform.

3.2. Source and Target of Migration

3.2.1. The Data Sources of Migration

This framework MUST support data migration between relational databases and cloud platforms over the web, and MUST meet the following requirements:

  1. The framework MUST support connecting to data sources in relational databases. The relational database MUST be at least one of the following:

  2. This framework MUST support dynamic discovery of the data information in a relational database over an established connection; in other words:

3.2.2. The Connection Testing of Relational Data Sources

Before conducting data migration, the framework MUST support testing the connection to the data sources to be migrated, so that the user can then decide whether to migrate.
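
A minimal Python sketch of such a pre-migration connection test, assuming a generic DB-API 2.0 driver (the names test_connection and connect_fn are illustrative assumptions, not part of this specification):

    # Hypothetical pre-migration connection test (illustrative only).
    # connect_fn is any DB-API 2.0 connect callable, e.g. pymysql.connect.
    def test_connection(connect_fn, **params):
        """Return True if the data source accepts a connection."""
        try:
            conn = connect_fn(**params)
        except Exception as exc:
            print("connection test failed:", exc)
            return False
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")  # cheap liveness probe
            return cur.fetchone() is not None
        finally:
            conn.close()

For example, test_connection(pymysql.connect, host="db.example.org", user="u", password="p") would probe a MySQL source, assuming the pymysql driver is available.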

3.2.3. The Target Storage Container of Data Migration

This framework MUST allow users to migrate large amounts of data from a relational database to at least two of the following types of target storage containers:

3.2.4. Specifying Target Cloud Platform

This framework MUST allow an authorized user to specify the target cloud platform to which the data will be migrated.

3.2.5. Data Migration to Third-Party Web Applications

This framework SHALL support the migration of large amounts of data from relational databases to one or more data containers of third-party Web applications. The target storage containers of the third-party Web application systems can be:

3.3. Type of Migrated Database

This framework needs to meet the following requirements:

3.4. Scale of Migrated Table

3.4.1. Full Table Migration

This framework MUST support the migration of all tables in a relational database to at least two types of target storage containers:

3.4.2. Single Table Migration

This framework MUST allow users to specify a single table in a relational database and migrate it to at least two types of target storage containers:

3.4.3. Multi-Table Migration

This framework MUST allow users to specify multiple tables in a relational database and migrate them to at least two types of target storage containers:

3.5. Split-by

This framework needs to meet the following requirements on split-by.

3.5.1. Single Column

  1. The framework MUST allow the user to specify a single column of the data table (usually the table's primary key), then slice the data in the table into multiple parallel tasks based on this column, and migrate the sliced data to one or more of the following target data containers respectively:

  2. The framework SHALL allow the user to query the boundaries of the specified split-by column, then slice the data into multiple parallel tasks and migrate the data to one or more of the following target data containers (a sketch of this boundary-based slicing appears after this list):
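
A minimal Python sketch of boundary-based slicing on a single numeric column (the name compute_splits and the half-open range convention are illustrative assumptions):

    # Hypothetical split-by computation (illustrative only).
    # Given the [lo, hi] boundaries of the split column and the desired
    # number of parallel tasks, produce one half-open range per task.
    def compute_splits(lo, hi, num_tasks):
        step = (hi - lo + 1) / num_tasks
        splits = []
        for i in range(num_tasks):
            start = int(lo + i * step)
            end = int(lo + (i + 1) * step) if i < num_tasks - 1 else hi + 1
            splits.append((start, end))  # rows with start <= key < end
        return splits

    # Each (start, end) pair becomes one parallel migration task, e.g.
    #   SELECT * FROM t WHERE id >= start AND id < end
    print(compute_splits(1, 1000, 4))
    # [(1, 251), (251, 501), (501, 751), (751, 1001)]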

3.5.2. Multiple Columns

This framework MAY allow the user to specify multiple columns in the data table to slice the data linearly into multiple parallel tasks and then migrate the data to one or more of the following target data containers:

3.5.3. Non-linear Segmentation

It is OPTIONAL for this framework to support non-linear intelligent segmentation of the data on one or more columns before migrating the data to one or more of the following target data containers:

3.6. Conditional Query Migration

This framework SHALL allow users to specify query conditions, query out the matching data records, and migrate them.
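
A minimal Python sketch of such a conditional migration, assuming a DB-API 2.0 cursor and a hypothetical sink object for the target container (all names here are illustrative):

    # Hypothetical conditional-query migration (illustrative only).
    # The user-supplied condition becomes a parameterized WHERE clause,
    # so only matching records are read and migrated.  The placeholder
    # style (%s) depends on the database driver in use.
    def migrate_where(conn, table, condition, params, sink):
        cur = conn.cursor()
        cur.execute(f"SELECT * FROM {table} WHERE {condition}", params)
        for row in cur:
            sink.write(row)  # sink: any target-container writer

    # e.g. migrate_where(conn, "orders", "created > %s",
    #                    ("2018-01-01",), sink)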

3.7. Dynamic Detection of Data Redundancy

It is OPTIONAL for the framework to allow users to add data redundancy labels and label communication mechanisms, so that redundant data can be detected dynamically during data migration to achieve non-redundant migration.

The detection of data redundancy can be based on the following methods:
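
For illustration, a per-record digest is one common building block for such detection. A minimal Python sketch follows (the function names and the record encoding are hypothetical, not mandated by this framework):

    # Hypothetical digest-based redundancy check (illustrative only).
    import hashlib

    def record_digest(record):
        """Stable digest of one record; acts as its redundancy label."""
        data = "\x1f".join(map(str, record)).encode("utf-8")
        return hashlib.sha256(data).hexdigest()

    def deduplicate(records, seen=None):
        """Yield only records whose digest has not been seen before."""
        seen = set() if seen is None else seen
        for rec in records:
            d = record_digest(rec)
            if d not in seen:  # new data: migrate it
                seen.add(d)
                yield rec      # redundant rows are silently skipped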

3.8. Data Migration with Compression

During the data migration process, the data is not compressed by default. This framework MUST support at least one of the following data compression encoding formats, allowing the user to compress and migrate the data:
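
Whichever encoding format is chosen, the compress-before-transfer step looks much the same. A minimal Python sketch using gzip (gzip is used here purely as an example format, an assumption rather than a requirement):

    # Hypothetical compress-then-migrate step (illustrative only).
    import gzip

    def compress_chunk(raw_bytes):
        """Compress one chunk of migrated data before transfer."""
        return gzip.compress(raw_bytes)

    def decompress_chunk(gz_bytes):
        """Restore the chunk on the target side."""
        return gzip.decompress(gz_bytes)

    chunk = b"id,name\n1,alice\n2,bob\n"
    assert decompress_chunk(compress_chunk(chunk)) == chunk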

3.9. Updating Mode of Data Migration

3.9.1. Appending Migration

This framework SHALL support the migration of appending data to existing datasets in HDFS.

3.9.2. Overwriting the Import

When importing data into Hive, the framework SHALL support overwriting the original dataset and saving it.

3.10. The Encryption and Decryption of Data Migration

This framework needs to meet the following requirements:

3.11. Incremental Migration

The framework SHOULD support incremental migration of table records in a relational database, and it MUST allow the user to designate a field in the table whose value serves as the "last_value" in order to characterize row-record increments. The framework SHOULD then migrate those records whose field value is greater than the specified "last_value", and afterwards update the "last_value".
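
A minimal Python sketch of this high-water-mark scheme, assuming a DB-API 2.0 connection and a hypothetical sink writer (names and placeholder style are illustrative):

    # Hypothetical incremental-migration step (illustrative only).
    def incremental_migrate(conn, table, check_column, last_value, sink):
        """Migrate rows newer than last_value; return the new mark."""
        cur = conn.cursor()
        # 1. Determine the new high-water mark of the check column.
        cur.execute(f"SELECT MAX({check_column}) FROM {table}")
        new_last = cur.fetchone()[0]
        if new_last is None or new_last <= last_value:
            return last_value  # nothing new to migrate
        # 2. Migrate only the rows between the old and new marks.
        cur.execute(
            f"SELECT * FROM {table} WHERE {check_column} > %s"
            f" AND {check_column} <= %s",
            (last_value, new_last))
        for row in cur:
            sink.write(row)
        return new_last  # caller persists this as the next last_value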

3.12. Real-Time Synchronization Migration

The framework SHALL support real-time synchronous migration of updated data and incremental data from a relational database to one or more of the following target data containers:

3.13. The Direct Mode of Data Migration

This framework MUST support data migration in direct mode, which can increase the data migration rate.

Note: This mode is supported only for MySQL and PostgreSQL.

3.14. The Storage Format of Data files

This framework MUST support saving the migrated data in at least one of the following data file formats:

3.15. The Number of Map Tasks

This framework MUST allow the user to specify the number of map tasks, so that a corresponding number of map tasks are started to migrate large amounts of data in parallel.
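
A minimal Python sketch of running one worker per requested map task (ThreadPoolExecutor stands in for the platform's real task launcher; run_migration, migrate_one, and splits are illustrative names):

    # Hypothetical parallel execution of migration tasks (illustrative
    # only): one worker per user-requested "map task".
    from concurrent.futures import ThreadPoolExecutor

    def run_migration(splits, migrate_one, num_map_tasks):
        """Run one migrate_one(split) call per split, in parallel."""
        with ThreadPoolExecutor(max_workers=num_map_tasks) as pool:
            return list(pool.map(migrate_one, splits))

    # The splits could come from compute_splits() in Section 3.5.1.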

3.16. The Selection of Table Columns to Be Migrated

3.17. Visualization of Migration

3.17.1. Dataset Visualization

After the framework has migrated the data in the relational database, it MUST support the visualization of the resulting dataset on the cloud platform.

3.17.2. Visualization of Data Migration Progress

The framework SHOULD support dynamically showing the migration progress to users in a graphical mode.

3.18. Smart Analysis of Migration

The framework MAY provide automated migration proposals to facilitate the user's estimation of migration workload and costs.

3.19. Task Scheduling

The framework SHALL support the user in setting various migration parameters (such as the number of map tasks, the storage format of data files, the type of data compression, and so on) together with a task execution time, and then schedule the resulting offline/online migration tasks.
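
A minimal Python sketch of such a task description and a one-shot scheduler (the parameter names mirror Sections 3.8, 3.14, and 3.15; the task layout and the submit helper are illustrative assumptions):

    # Hypothetical migration-task description and scheduler
    # (illustrative only).
    import sched, time

    task = {
        "map_tasks": 4,                # Section 3.15
        "file_format": "text",         # Section 3.14
        "compression": "gzip",         # Section 3.8
        "run_at": time.time() + 3600,  # execute one hour from now
    }

    def submit(task, execute):
        """Schedule execute(task) at the requested absolute time."""
        s = sched.scheduler(time.time, time.sleep)
        s.enterabs(task["run_at"], 1, execute, argument=(task,))
        s.run()  # blocks until the task has run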

3.20. The Alarm of Task Error

When a task fails, the framework MUST at least support notifying stakeholders in a predefined way.

3.21. Data Export From Cloud to RDBMS

3.21.1. Data Export Diagram

Figure 2 shows the framework's working diagram of exporting data.

        
    +---------+         +----------------+
    |         |   (1)   |    WebServer   |
    | Browser |-------->|                |---------------------
    |         |         |  +-----------+ |                    |
    +---------+         |  |   DMOW    | |                    |
                        |  +-----------+ |                    |
                        +----------------+                    |
                                                              |(2)
                                                              |
                                                              |
        +-------------+          +-----------------------+    |
        |             |   (3)    |                       |    |
        | Data Source |<-------- |     Cloud Platform    |    |
        |             |          |  +-----------------+  |<----
        +-------------+          |  | Migration Engine|  |
                                 |  +-----------------+  |
                                 +-----------------------+

        
        

Figure 2: Reference Diagram

The workflow of exporting data through the framework is as follows:

   (1) The user accesses the WebServer through a browser and submits a
       data export task to the DMOW component.

   (2) The WebServer forwards the export task to the Migration Engine
       on the cloud platform.

   (3) The Migration Engine reads the data from the cloud platform and
       writes it back into the data source.

3.21.2. Full Export

The framework MUST at least support exporting data from HDFS to one of the following relational databases:

The framework SHALL support exporting data from HBase to one of the following relational databases:

The framework SHALL support exporting data from Hive to one of the following relational databases:

3.21.3. Partial Export

The framework SHALL allow the user to specify a range of keys on the cloud platform and export the elements in the specified range to a relational database. The framework SHALL also support exporting only a subset of columns.
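
A minimal Python sketch covering both behaviors, a key-range filter plus an optional column subset (the dict-shaped rows, key_fn, and db_writer are hypothetical):

    # Hypothetical partial export (illustrative only).
    def export_range(rows, key_fn, lo, hi, db_writer, columns=None):
        """Export rows whose key lies in [lo, hi), optionally keeping
        only a subset of columns, to a relational-database writer."""
        for row in rows:
            if lo <= key_fn(row) < hi:
                out = row if columns is None else {c: row[c] for c in columns}
                db_writer.insert(out)  # db_writer: hypothetical RDBMS sink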

3.22. The Merger of Data

The framework SHALL support merging data from different directories in HDFS and storing the result in a specified directory.
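
A minimal local-filesystem sketch of such a merge (a real implementation would use the HDFS client API instead of os/shutil; merge_dirs is an illustrative name):

    # Hypothetical directory merge (illustrative only); the local
    # filesystem stands in for HDFS here.
    import os, shutil

    def merge_dirs(src_dirs, dst_dir):
        """Copy every file from the source directories into dst_dir."""
        os.makedirs(dst_dir, exist_ok=True)
        for d in src_dirs:
            for name in os.listdir(d):
                shutil.copy(os.path.join(d, name),
                            os.path.join(dst_dir, name))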

3.23. Column Separator

The framework MUST allow the user to specify the separator between fields in the migration process.

3.24. Record Line Separator

The framework MUST allow the user to specify the separator between record lines in the migrated output.
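
A minimal Python sketch applying user-chosen field and line separators (covering Sections 3.23 and 3.24; format_records is an illustrative name):

    # Hypothetical record formatting with user-specified separators
    # (illustrative only).
    def format_records(records, field_sep=",", line_sep="\n"):
        """Join fields and records with the user-chosen separators."""
        return line_sep.join(
            field_sep.join(str(f) for f in rec) for rec in records)

    print(format_records([(1, "alice"), (2, "bob")], field_sep="\t"))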

3.25. The Mode of Payment

  1. One-way payment mode

  2. Two-way payment mode

3.26. Web Shell for Migration

The framework provides the following character-interface shells, operated through web access.

3.26.1. Linux Web Shell

The framework SHALL support a Linux shell through web access, which allows users to execute basic Linux commands for configuration management of the migrated data over the web.

3.26.2. HBase Shell

The framework SHALL support an HBase [hbase] shell through web access, which allows users to perform basic operations such as adding, deleting, and querying the data migrated to HBase through the web shell.

3.26.3. Hive Shell

The framework SHALL support a Hive [hive] shell through web access, which allows users to perform basic operations such as adding, deleting, and querying the data migrated to Hive through the web shell.

3.26.4. Hadoop Shell

The framework SHALL support the Hadoop shell through web access so that users can perform basic Hadoop command operations through the web shell.

3.26.5. Spark Shell

The framework SHALL support the Spark [spark] shell through web access and provide an interactive way to analyze and process the data on the cloud platform.

3.26.6. Spark Shell Programming Language

In the Spark web shell, the framework SHALL support at least one of the following programming languages:

4. Security Considerations

The framework SHOULD support securing the data migration process. During data migration, it SHOULD support encrypting the data before transmission and decrypting it for storage at the target after the transfer is complete. At the same time, it MUST support authentication when reading the source data for migration, and it SHALL support verification of identity and permissions when accessing the target platform.
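
A minimal Python sketch of the encrypt-before-transfer / decrypt-after-transfer step, using the third-party cryptography package's Fernet recipe as one possible symmetric cipher (an assumption; the framework does not mandate any particular cipher or library):

    # Hypothetical encrypt/decrypt step around the transfer
    # (illustrative only; requires the "cryptography" package).
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()  # shared with the target out of band
    cipher = Fernet(key)

    chunk = b"id,name\n1,alice\n"
    ciphertext = cipher.encrypt(chunk)           # before transmission
    assert cipher.decrypt(ciphertext) == chunk   # at the target side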

5. IANA Considerations

This memo includes no request to IANA.

6. References

6.1. Normative References

[RFC2026] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, DOI 10.17487/RFC2026, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC2578] McCloghrie, K., Perkins, D. and J. Schoenwaelder, "Structure of Management Information Version 2 (SMIv2)", STD 58, RFC 2578, DOI 10.17487/RFC2578, April 1999.

6.2. URL References

[hadoop] The Apache Software Foundation, "http://hadoop.apache.org/"
[hbase] The Apache Software Foundation, "http://hbase.apache.org/"
[hive] The Apache Software Foundation, "http://hive.apache.org/"
[spark] The Apache Software Foundation, "http://spark.apache.org/"
[sqoop] The Apache Software Foundation, "http://sqoop.apache.org/"

Authors' Addresses

Can Yang (editor)
South China University of Technology
382 Zhonghuan Road East, Guangzhou Higher Education Mega Centre
Panyu District, Guangzhou
P.R. China

Phone: +86 18602029601
EMail: cscyang@scut.edu.cn

Yu Liu (editor)
South China University of Technology
382 Zhonghuan Road East, Guangzhou Higher Education Mega Centre
Panyu District, Guangzhou
P.R. China

EMail: 201621032214@scut.edu.cn

Cong Chen
Inspur
163 Pingyun Road
Tianhe District, Guangzhou
P.R. China

EMail: chen_cong@insour.com

Ge Chen
GSTA
Guangdong Telecom Technology Building, No. 109 Zhongshan Road West
Tianhe District, Guangzhou
P.R. China

EMail: cheng@gsta.com

Yukai Wei
Huawei
Putian Huawei Base
Longgang District, Shenzhen
P.R. China

EMail: weiyukai@huawei.com