Internet Engineering Task Force                                 Y. Zhang
Internet Draft                                                     S. He
Intended status: Informational                                   Y. Chen
Expires: February 18, 2022                                       Z. Wang
                                                                  H. Xia
                          Xi'an University of Posts & Telecommunications
                                                         August 18, 2021


          Unified representation method of heterogeneous data
                         in industrial Internet
       draft-zhang-ietf-heterogeneous-data-representation-00.txt

Abstract

   With the advent of the 5G era, sensing devices and mobile Internet
   devices are everywhere in smart factories, and a variety of
   industrial data from spatially distributed devices becomes widely
   available and interwoven. These data are usually generated as
   streams. They differ greatly in source and structure, are massive
   in scale, and are strongly correlated through complicated
   relationships. This richness makes it harder than ever to mine the
   hidden value of the data quickly, accurately and deeply. The data
   generated in different fields are distributed across a variety of
   business systems and have different structures and forms, so it is
   difficult to analyze them efficiently in a unified form. Based on
   the characteristics of heterogeneous data, this document studies a
   tensor-based method for fusing multi-source heterogeneous data.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 18, 2022.

Zhang, et al.          Expires February 18, 2022                [Page 1]

Internet-Draft  representation method of heterogeneous data  August 2021

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Unified data representation method
      2.1. Data Tensor Representation in industrial Internet
         2.1.1. Unstructured Data Tensor Representation
         2.1.2. Semi-structured Data Tensor Representation
         2.1.3. Structured Data Tensor Representation
         2.1.4. Subtensor Fusion Method
         2.1.5. Unified Tensor Fusion Model
   3. Security Considerations
   4. IANA Considerations
   5. Conclusions
   6. References
   7. Acknowledgments
   Authors' Addresses

1. Introduction

   Driven by emerging technologies such as big data, cloud computing
   and the Internet of Things, intelligent manufacturing has become
   the focus of today's manufacturing industry, with the intelligent
   factory and intelligent production at its core.
   Smart factories that rely on big data and Internet of Things
   technologies have to serve large volumes of industrial big data.
   Industrial big data has the 4V characteristics of big data in
   general, namely volume, variety, value and velocity. It also has
   some characteristics of its own [1]:

   1. The data sources are extensive, and the proportion of
      semi-structured and unstructured data is increasing.

   2. The data are highly correlated.

   3. Data analysis has to take temporal and spatial characteristics
      into account.

   4. The data are specific to industrial scenarios.

   Big data mining aims to gain knowledge from large amounts of complex
   data and create new value. In the industrial production process,
   data come from sensors, intelligent devices and workstations, and
   include production process data from production control systems,
   operation monitoring data, and log records of production workshops
   distributed across different geographical locations. Part of this
   data is structured. At the same time, there is a large amount of
   unstructured and semi-structured data, such as text, log files,
   sound, images and video. Because the data differ in structure,
   attributes and standards, data warehouse, online analytical
   processing and data mining techniques are needed to transform the
   data into knowledge. Traditional data storage and management
   methods are oriented to relational structured data and can hardly
   meet such a large and diversified demand for unstructured data
   analysis. In addition, the data systems are independent of each
   other. Differences in data sources, production equipment and
   software, together with the diversity of component providers, lead
   to different data formats, which makes information integration
   difficult to achieve [2]. To address these challenges, a uniform
   data format specification is required.

2. Unified data representation method

2.1. Data Tensor Representation in industrial Internet

   During big data acquisition, a variety of sensing devices collect
   unstructured, semi-structured and structured data in different
   fields, form a data stream and submit it to the edge computing layer
   for tensor representation. The source data format is not changed
   during submission. The edge computing layer consists of edge
   computing nodes with computing capability, for example the camera
   network, sensor network, smart air conditioners and smart TVs in a
   smart factory. These edge nodes collect data from Internet of
   Things terminals and provide a certain amount of computing power and
   data storage. Based on the work in [3], the different types of data
   from different spaces at the edge nodes can be constructed into
   corresponding sub-tensors by a tensor model. These sub-tensors
   usually differ in dimensions and characteristics and are independent
   of each other. To enable association analysis and deep mining of
   the overall data, the low-order sub-tensors are combined using a
   tensor extension operator, and the different data features are
   arranged as different orders of the tensor space. In this way, a
   unified high-order representation model of big data is established.

   The following sections use a vision-based solder joint defect
   detection platform for surface mount technology (SMT) as an example
   to introduce the data tensor representation method in edge
   computing. The platform records detection video (unstructured
   data), related detection logs (semi-structured data) and a relevant
   index table (structured data).

2.1.1. Unstructured Data Tensor Representation

   The sub-tensor representation of unstructured video data is
   illustrated with the SMT solder joint defect detection video
   recorded on an intelligent terminal platform. The main features of
   video data are the time frame, the frame image width, the frame
   image height and the frame image color. Video data in MP4 format
   can therefore be represented as a fourth-order sub-tensor in
   low-order space. The element values of the tensor are the encoded
   video data. The video frame, the width of each frame, the height of
   each frame and the color of the image each correspond to one order
   of the tensor.

2.1.2. Semi-structured Data Tensor Representation

   The sub-tensor representation of semi-structured log data is
   illustrated with the solder joint detection log produced by
   vision-based SMT solder joint defect detection on an intelligent
   terminal platform. A database of semi-structured data is a set of
   nodes, each of which is either a leaf node or an internal node.
   Each semi-structured data set has a hierarchy that can be decomposed
   into a tree structure. The solder joint detection log can be
   expressed as a third-order sub-tensor in low-order space. The row
   of the identification matrix, the column of the identification
   matrix, and the encoding of the element each correspond to one
   order of the third-order tensor.

2.1.3. Structured Data Tensor Representation

   The sub-tensor representation of a structured database table is
   illustrated with the attribute detection record form produced by
   vision-based SMT solder joint defect detection on an intelligent
   terminal platform. Structured data is data that is logically
   expressed and implemented as a two-dimensional table structure and
   is mainly stored and managed in a relational database. In a simple
   database table, a field is often represented by a number or a
   character, so the table can be represented as a matrix.
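   As an informal illustration of the three representations described
   so far, the following Python sketch builds one small sub-tensor per
   data type using nested lists. It is not part of the method defined
   in this document. The dimension sizes, node labels and field
   values, the helper names zeros() and shape(), and the choice to
   encode log labels as padded character codes are all assumptions made
   for the example.

```python
# Illustrative only: nested lists stand in for tensors, and every size,
# label and value below is hypothetical.

def zeros(*dims):
    """Return a nested-list tensor with the given dimensions, filled with 0."""
    if len(dims) == 1:
        return [0] * dims[0]
    return [zeros(*dims[1:]) for _ in range(dims[0])]


def shape(tensor):
    """Return the dimensions of a rectangular nested-list tensor."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0] if tensor else None
    return tuple(dims)


# Unstructured data: a video clip as a fourth-order sub-tensor with
# orders (frame, height, width, color channel).
FRAMES, HEIGHT, WIDTH, CHANNELS = 2, 4, 6, 3
video_tensor = zeros(FRAMES, HEIGHT, WIDTH, CHANNELS)
video_tensor[0][1][2] = [255, 128, 0]  # one RGB pixel in frame 0

# Semi-structured data: a log record viewed as a tree. Entry [i][j]
# holds the encoded label of node j when node i is its parent
# (an assumed encoding: character codes padded with zeros).
labels = ["log", "board", "joint", "result"]
parent_child = [(0, 1), (1, 2), (2, 3)]
CODE_LEN = max(len(label) for label in labels)
log_tensor = zeros(len(labels), len(labels), CODE_LEN)
for parent, child in parent_child:
    codes = [ord(ch) for ch in labels[child]]
    log_tensor[parent][child] = codes + [0] * (CODE_LEN - len(codes))

# Structured data: a table with simple numeric fields as a matrix
# (second-order tensor); the columns are id, num, state, errornum.
table_tensor = [
    [1, 120, 1, 0],
    [2, 118, 0, 3],
    [3, 121, 1, 0],
]

for name, tensor in (
    ("video", video_tensor),
    ("log", log_tensor),
    ("table", table_tensor),
):
    dims = shape(tensor)
    print(f"{name}: order {len(dims)}, dimensions {dims}")
```

   Running the sketch reports order 4 for the video, order 3 for the
   log and order 2 for the simple table. Nested lists keep the example
   dependency-free. An array library would be the more usual choice
   for real data.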
   More complex field types can be represented as a tensor by adding
   new orders. The attribute detection record form can be expressed as
   a fifth-order sub-tensor in low-order space. ID, date, record, num,
   state, and errornum represent different orders of the fifth-order
   tensor, respectively.

2.1.4. Subtensor Fusion Method

   The tensor fusion extension operator is defined first. The orders
   of a tensor can be appended, in different directions, to the orders
   of an existing tensor space. When new heterogeneous data is added,
   its features are added to the original tensor space as new orders.
   If a feature order already exists, it is extended along its
   dimension instead. In practice, the heterogeneous data are first
   expressed as low-order sub-tensors and then integrated into a
   higher-order tensor space by the extension operator, which yields a
   uniform representation of heterogeneous data.

2.1.5. Unified Tensor Fusion Model

   To reduce data redundancy and duplication, the sub-tensors are
   converted into a unified tensor using a unified data tensor
   function. When two tensors have an order for the same property, the
   order with the finer granularity is retained, while the orders for
   different properties are all kept. Structured, semi-structured and
   unstructured data are first represented in low-order sub-tensor
   space and are then fused into a unified higher-order tensor by the
   tensor extension operator. The result corresponds to a unified
   variable data structure in a computer system.

3. Security Considerations

   In unattended or harsh environments, edge computing nodes may
   produce counterfeit data that change the overall fusion result and
   affect the accuracy and reliability of the final result. The
   sensors in edge computing nodes therefore play an important role in
   the fusion result and need to be protected from attacks throughout
   the whole process.

4. IANA Considerations

   This document has no actions for IANA.

5. Conclusions

   In industrial production, the terminal data collected by edge
   computing nodes vary in structure and form and include structured,
   semi-structured and unstructured data. To mine the value hidden
   behind these data, corresponding sub-tensors are constructed for
   the different types of data through a tensor model, and the tensor
   extension operator is then used to combine these sub-tensors into a
   unified high-order tensor model of big data.

6. References

   [1]  W. Jianmin, "Survey on industrial big data", Big Data Research,
        vol. 3, no. 6, pp. 3-14, 2017.

   [2]  Wentao, H. E., and C. Shao, "The development and challenges of
        industrial big data analysis technology", Information and
        Control, vol. 47, no. 4, pp. 398-410, 2018.

   [3]  Kuang, L., Hao, ., Yang, L. T., Lin, M., Luo, C., and G. Min,
        "A tensor-based approach for big data representation and
        dimensionality reduction", IEEE Transactions on Emerging
        Topics in Computing, vol. 2, no. 3, pp. 280-291, 2017.

7. Acknowledgments

   TBD.

Authors' Addresses

   Yaqian Zhang
   Xi'an University of Posts & Telecommunications
   Shaanxi
   China

   Email: zhangyaqian0701@126.com


   Shengsheng He
   Xi'an University of Posts & Telecommunications
   Shaanxi
   China

   Email: 513286954@qq.com


   Yanping Chen
   Xi'an University of Posts & Telecommunications
   Shaanxi
   China

   Email: chenyp@xupt.edu.cn


   Zhongmin Wang
   Xi'an University of Posts & Telecommunications
   Shaanxi
   China

   Email: zmwang@xupt.edu.cn


   Hong Xia
   Xi'an University of Posts & Telecommunications
   Shaanxi
   China

   Email: xiahong@xupt.edu.cn