INTERNET-DRAFT                                               F. Lundberg
Expires: 30 November 2004                                         Linova
Category: Standards Track                                     March 2004


               BaseStream - A Simple Typed Stream Format
                   draft-flundberg-basestream-02.txt

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements. Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2004). All Rights Reserved.

Abstract

   BaseStream is a simple binary format that serves as a common base
   for binary formats, much as XML (Extensible Markup Language)
   provides a common base for text formats. The format is used to
   define binary stream formats consisting of typed and possibly named
   data elements. An instance of the BaseStream format has a
   corresponding XML representation that makes it possible to use
   existing XML tools to view, edit, validate, and transform BaseStream
   data.

Lundberg                    Standards Track                     [Page 1]

RFC nnnn       BaseStream - A Simple Typed Stream Format      March 2004

Table of Contents

   1.   Introduction
   2.   BaseStream Definition
   2.1  Terminology
   2.2  BaseStream Syntax
   2.3  Future Versions
   2.4  Application Name
   3.   XML Representation
   4.   Discussion
   4.1  The use of the e-Element
   4.2  Byte Order
   4.3  Character Encoding
   4.4  Element Names
   4.5  BXML B-element Representation
   4.6  Optional Header
   5.   Security Considerations
        Normative References
        Informative References
        Author's Address
   A.   XML Schema for BXML types
        Intellectual Property and Copyright Statements

1. Introduction

   Binary data is difficult for humans to handle. Text-based file
   formats and network protocols are therefore popular. Software for
   such formats is easier to debug because the data uses a relatively
   human-friendly representation. However, text-based formats have the
   following drawbacks compared to binary formats.

   o  Text-based formats need more bytes to store numerical data than
      binary formats do.

   o  Parsing text-based data is less efficient than parsing binary
      data, since text-based data is usually not stored in a way
      similar to how it is stored in computer memory.
   o  It is seldom possible to read only part of a text-based file,
      since the exact byte offset of a data item is usually not known.

   BaseStream is a simple, lightweight, binary stream format consisting
   of a sequence of typed and possibly named data elements. The format
   is a data serialization format suitable as the base for file
   formats, TCP protocols, or any other format that handles sequences
   of bytes. Together with appropriate tools a BaseStream instance can
   easily be viewed and edited. BaseStream data is therefore
   human-friendly, but without the drawbacks of text-based formats.
   BaseStream serves as a common base for binary formats, much as XML
   provides a common base for text-based formats.

   This document specifies an XML representation of a BaseStream. XML
   stands for Extensible Markup Language and is defined in [1]. The
   BaseStream XML application is called BXML. By using software to
   convert data between binary BaseStream and BXML, any text or XML
   editor can be used to edit BaseStream data. Furthermore, this
   transformation opens up BaseStream data to other XML technologies,
   such as data validation with XML Schemas and data transformation
   with XSLT (Extensible Stylesheet Language Transformations).

   This document specifies two things.

   1.  The rules for determining whether a stream is a BaseStream or
       not.

   2.  How a BaseStream is represented in XML format.

   The goal of the BaseStream format is to make it easy to share binary
   data and develop applications that use binary stream formats.

2. BaseStream Definition

2.1 Terminology

   This subsection defines the specific meaning of some words in the
   context of this document.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [8].

   A "byte" is an 8-bit data entity.

   A "stream" is a finite sequence of bytes. A stream is created by a
   writer.

   A "writer" is an entity capable of producing a sequence of bytes and
   a "BaseStreamWriter" is a writer that produces a BaseStream.

   A "reader" reads a stream created by a writer. A "BaseStreamReader"
   is a reader that can read a BaseStream and interpret the bytes as
   elements described in this document.

   A "format" or "stream format" is a set of rules for a stream that
   determines whether or not the stream adheres to the format.

   "BaseStream" is used to refer either to the stream format defined in
   this document or to a stream that adheres to the BaseStream format.
   The terms "BaseStream format" and "BaseStream instance" can be used
   to distinguish the two cases.

   "BaseStream1" is version 1 of the BaseStream format, that is, the
   version described in this document. The version number is normally
   left out in this document since version 1 is the only version
   described here.

   A "format instance" is a concrete instance of a stream that follows
   a specific format. A "BaseStream instance" is a stream that follows
   the rules of the BaseStream format.

   A "BaseStream application" is a specific format based on BaseStream.
   A BaseStream application may put any further restrictions on the
   stream as long as every instance of the application is a BaseStream
   instance.

   Big-endian byte order specifies that multi-byte integers and
   floating point numbers are written to a stream with the most
   significant byte first.

   The following abbreviations are used for basic data types.

   o  "INT1", an 8-bit signed integer.

   o  "INT2", a 16-bit signed integer.

   o  "INT4", a 32-bit signed integer.

   o  "INT8", a 64-bit signed integer.

   o  "FLOAT4", a 4-byte floating point number.

   o  "FLOAT8", an 8-byte floating point number.

   The integer types are 8-bit, 16-bit, 32-bit and 64-bit signed two's
   complement integers. "FLOAT4" and "FLOAT8" are 4- and 8-byte data
   entities used to store floating point numbers. These follow the
   single-precision 32-bit and double-precision 64-bit formats in the
   IEEE754 standard [2]. Big-endian byte order is used for all the
   numerical types.

   A byte value may be specified with the corresponding Unicode [5]
   character in the interval from 0 to 127. For example, 'a' is the
   value 97. These Unicode character values coincide with the ASCII [9]
   character values.

2.2 BaseStream Syntax

   A BaseStream consists of a sequence of data elements called
   Element0, Element1, Element2, and so on. The elements can be of a
   simple type, an array type or the string type. The element type is
   specified by the type byte (an ASCII character) which is one of the
   following: b, s, i, l, f, d, B, S, I, L, F, D, or U. The type byte
   is also called the type character since the byte corresponds to a
   character. The type byte is the first byte in an unnamed element.
   For a named element it is the first byte after the name of the
   element.

   There are six simple element types. The simple integer types are the
   b, s, i and l-element types. The letters are abbreviations for
   "byte", "short", "int", and "long", which are type names in widely
   used programming languages. These elements store INT1, INT2, INT4,
   and INT8 values respectively. The two simple types for floating
   point numbers are the f-element and the d-element, which store
   FLOAT4 and FLOAT8 values. f and d are abbreviations for "float" and
   "double".

   The array types are the B, S, I, L, F, and D-element types. These
   may also be referred to as the B array element type, S array element
   type, and so on. These types are used to store an array of INT1's,
   INT2's, INT4's, INT8's, FLOAT4's or FLOAT8's respectively. Arrays
   with zero elements are allowed.

   Elements may optionally be named. The element name uses a subset of
   the ASCII [9] charset.
   The main reason for the restrictions on the element names is that a
   name should be possible to type and print on as many computer
   systems as possible.

   Below is the syntax specification of the BaseStream format,
   version 1. Augmented Backus-Naur Form (ABNF) as defined in RFC2234
   [3] is used.

   BaseStream1    = Element0 *element e

   Element0       = i %d0 %d3 %d232 %d1
                    ; byte 'i' followed by an INT4 with value 256001

   element        = [elementName] (simple / array / string)

   simple         = b INT1 /        ; b-element
                    s INT2 /        ; s-element
                    i INT4 /        ; i-element
                    l INT8 /        ; l-element
                    f FLOAT4 /      ; f-element
                    d FLOAT8        ; d-element

   array          = B size *INT1 /    ; B-element
                    S size *INT2 /    ; S-element
                    I size *INT4 /    ; I-element
                    L size *INT8 /    ; L-element
                    F size *FLOAT4 /  ; F-element
                    D size *FLOAT8    ; D-element

   string         = U size utf8
   utf8           =

   size           = shortSize / longSize
   shortSize      =
   longSize       = minus8 longSizeNumber
   minus8         =
   longSizeNumber =

   elementName    = N nameSize nameString
   nameSize       =
   nameString     = ALPHA 0*126digitLetterOrUnderscore
   digitLetterOrUnderscore = ALPHA / DIGIT / UNDERSCORE

   ALPHA          = %x41-5A / %x61-7A  ; A-Z / a-z
   UNDERSCORE     = %x5F               ; '_'
   DIGIT          = "0" / "1" / "2" / "3" / "4" /
                    "5" / "6" / "7" / "8" / "9"

   e = %x65
   N = %x4E
   b = %x62
   s = %x73
   i = %x69
   l = %x6C
   f = %x66
   d = %x64
   B = %x42
   S = %x53
   I = %x49
   L = %x4C
   F = %x46
   D = %x44
   U = %x55

   INT1           =
   INT2           =
   INT4           =
   INT8           =
   FLOAT4         =
   FLOAT8         =

   We can now list the rules that completely define the BaseStream
   format. A stream is a BaseStream of version 1 (a BaseStream1) if and
   only if all the following rules are fulfilled.

   1.  The stream MUST follow the "BaseStream1" grammar rule.

   2.  The nameSize integer MUST equal the number of bytes (characters)
       in the following nameString.

   3.  The size integer (grammar rule "size") MUST equal the number of
       integers or floating point numbers that follow for the B, S, I,
       L, F, and D elements. For the U-element the size MUST equal the
       number of bytes needed to encode the UTF-8 string.

   4.  If a U-element is named "bs_tag" the string value of the element
       MUST follow the "nameString" grammar rule. Such an element is
       called a tag-element.

   5.  If a U-element is named "bs_end" the string value MUST be the
       empty string (a string with zero characters). Such an element is
       called an end-element.

   6.  At any point in the stream the number of end-elements written to
       the stream MUST NOT exceed the number of tag-elements.

   7.  The total number of end-elements in a BaseStream MUST equal the
       number of tag-elements.

   8.  If Element1 is a U-element named "bs_app" the string value of
       the element MUST be the name of the BaseStream application.

   The tag-element and the end-element described above affect how the
   BaseStream is represented as XML. This is treated later in this
   document.

   A BaseStream always starts with a byte of value 105, corresponding
   to 'i' in the ASCII character set [9]. It is followed by four bytes
   which form an INT4 with the value 256001. Thus a BaseStream1 always
   starts with the following five bytes: 105, 0, 3, 232, 1. These bytes
   form the first element, called Element0. The purpose of Element0 is
   to identify a stream as a BaseStream.

   After Element0 there are zero or more user data elements called
   Element1, Element2, and so on. After the elements in the stream
   there is one byte indicating the end of the stream. This is 'e'
   (101).

   The array types and the string type have a size. For the array types
   the size is the number of integers or floating point numbers in the
   array. For the string type the size is the number of bytes needed to
   encode the string in UTF-8. The size is stored in one of two ways,
   depending on its value.
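
   As an illustration of the layout described so far, the following
   Python sketch writes Element0, a few unnamed simple elements, and
   the end byte. It is a minimal sketch and not part of this
   specification: the names "SIMPLE_FORMATS", "simple_element",
   "HEADER" and "build_stream" are hypothetical, and named elements,
   arrays and strings are left out.

```python
import struct

# struct format codes for the six simple element types.
# ">" selects big-endian byte order, as the format requires.
SIMPLE_FORMATS = {
    "b": ">b",  # INT1
    "s": ">h",  # INT2
    "i": ">i",  # INT4
    "l": ">q",  # INT8
    "f": ">f",  # FLOAT4
    "d": ">d",  # FLOAT8
}


def simple_element(type_char, value):
    """Encode one unnamed simple element: the type byte, then the value."""
    fmt = SIMPLE_FORMATS[type_char]
    return type_char.encode("ascii") + struct.pack(fmt, value)


# Element0 is an unnamed i-element holding the value 256001.
HEADER = simple_element("i", 256001)


def build_stream(elements):
    """Concatenate Element0, the given (type_char, value) pairs, and 'e'."""
    body = b"".join(simple_element(t, v) for t, v in elements)
    return HEADER + body + b"e"


stream = build_stream([("s", -2), ("d", 1.5)])
print(list(HEADER))  # [105, 0, 3, 232, 1]
print(stream.hex())
```

   Building Element0 through the same helper as every other i-element
   shows that the five identifying bytes are an ordinary element rather
   than a special case. An empty stream is then just those five bytes
   followed by 'e'.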
   The two ways of storing a size are given by the grammar rules
   "shortSize" and "longSize". If the size is between 0 and 127 it is
   stored as an INT1. If the size is larger than or equal to 128 it is
   stored as an INT8, preceded by the "minus8" item required by the
   "longSize" rule.

   Strings are stored using the Unicode [5] character set, which
   handles all widely used characters. After the string element type
   byte ('U') and the size there are bytes that represent a string in
   the UTF-8 format. See RFC 2279 [4] for information about the UTF-8
   charset encoding.

2.3 Future Versions

   Future versions of the BaseStream format are not currently
   anticipated, but are still prepared for. BaseStreamX is used to
   denote version X of BaseStream where X is 1, 2, 3, and so on.
   Version X of the format MUST start with the byte 'i' and then an
   INT4 with the value 256000 + X. Any version of BaseStream will thus
   always start with the four bytes: 105, 0, 3, 232. The fifth byte is
   the BaseStream version. A version number higher than 127 will never
   be used.

2.4 Application Name

   BaseStream provides a standard way of storing the name of the
   BaseStream application at the beginning of the stream. If Element1
   of a BaseStream is a U-element named "bs_app" the value of this
   element is the name of the BaseStream application. If a BaseStream
   application contains the name of the application in the stream it
   SHOULD be stored in Element1 as described above. It is RECOMMENDED
   that the charset for the application name is ASCII [9].

   To provide unique application names and some information about the
   origin of the BaseStream application, a URL may be used as the
   application name. For example, the company X could call their 2D
   plot file format "http://www.x.com/plot2d/1" where "1" stands for
   version 1 of the format.

3. XML Representation

   This section specifies an XML representation of a BaseStream called
   BaseStream XML or BXML.
   The XML Schema standard ([6] and [7]) is used to define the exact
   syntax of the BXML elements. Any BaseStream instance can be
   converted to BXML and any BXML document can be converted to binary
   BaseStream format. A BaseStream instance converted to BXML and then
   back to a binary BaseStream instance is identical to the original
   BaseStream instance.

   Below is an example of a BXML document that stores plot data.

      256001
      http://www.x.com/plot2d/1
      Position vs time
      time (s)
      pos (m)
      1.0 2.0 3.0 4.0
      0.4 1.5 2.0 1.8

   A BXML document always has a root XML element called "BaseStream".
   The first child of the root element holds the value "256001", which
   corresponds to Element0 in the BaseStream. After this the BaseStream
   elements are converted to XML by the following rules. The letter X
   is used to denote the type character of the element.

   1.  Any unnamed element is stored as an XML element with the same
       name as the type character.

          <X>content</X>

       The syntax of the X-element content is defined by the XML Schema
       type called "X-type". See Appendix A. No XML attributes are
       allowed.

   2.  A named element which is not a tag- or end-element is stored as
       an XML element with a name identical to the element name. The
       attribute "type" is required and the value of the attribute must
       be the type character of the element.

          <name type="X">content</name>

       No attributes other than the type attribute are allowed. The
       exact syntax of the X-element content is defined by the XML
       Schema rule called "named-X-type". See Appendix A.

   3.  If the element is a tag-element, an unclosed XML start tag is
       added to the XML document. The name of the XML element is the
       string value of the tag-element. No XML attributes are allowed.

   4.  If the element is an end-element, the last XML start tag
       (created because of a tag-element) is closed by a corresponding
       end tag.

   For simple types the corresponding XML character content is
   represented as a decimal number.
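
   Rules 1 and 2 above can be sketched in Python with the standard
   "xml.etree.ElementTree" module. This is an illustration only, under
   stated assumptions: the function name "to_bxml" and the element name
   "count" are hypothetical, only simple types are handled, tag- and
   end-elements (rules 3 and 4) are omitted, and "str(value)" is used
   as a plain decimal form, not necessarily the canonical lexical
   representation.

```python
import xml.etree.ElementTree as ET


def to_bxml(elements):
    """Build a BXML tree from (name, type_char, value) triples.

    name is None for an unnamed element. Only simple types are handled.
    """
    root = ET.Element("BaseStream")
    for name, type_char, value in elements:
        if name is None:
            # Rule 1: the XML element is named after the type character.
            child = ET.SubElement(root, type_char)
        else:
            # Rule 2: the XML element is named after the element name
            # and carries the type character in the "type" attribute.
            child = ET.SubElement(root, name, {"type": type_char})
        child.text = str(value)
    return root


doc = to_bxml([(None, "i", 256001), ("count", "s", 3)])
print(ET.tostring(doc, encoding="unicode"))
```

   Because element names are restricted to letters, digits and
   underscores and must start with a letter, every element name is
   also usable as an XML element name, which is what lets rule 2 reuse
   the name directly.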
   BaseStream to BXML converters are encouraged to use the canonical
   lexical representation of simple-type values as defined in the
   Schema standard [7]. Array types are written as a sequence of
   whitespace-separated representations of the corresponding simple
   type. There is one exception: the B-element is written as a
   whitespace-separated sequence of pairs of characters, each pair
   representing a hexadecimal number between 0 and 255. The bytes in
   the B-element are considered unsigned in this context.

   The figure below shows an XML Schema that defines the plot2d
   application for the example given above. The BaseStream Schema types
   are left out for clarity. They are defined in Appendix A.

4. Discussion

   This section is not normative and is provided as a basis for further
   discussion. Much or all of the section will be removed before the
   document is published as an RFC. Send comments to frans@linova.com.

4.1 The use of the e-Element

   The end of stream indication 'e' may seem redundant at first, since
   streams generally provide other ways to indicate the end of stream
   condition. However, the 'e' element is important when a BaseStream
   is embedded within another stream. The normal end of stream
   condition will be raised at the end of the enclosing stream, not at
   the end of the BaseStream, and therefore the 'e' end indicator is
   needed.

4.2 Byte Order

   The question of whether to use big-endian, little-endian or both
   byte orders is not trivial. Both byte orders are heavily used and
   neither has a large technical advantage over the other. BaseStream
   could have supported both orders, but since BaseStream is intended
   to be as simple as possible, this option was ruled out. Big-endian
   byte order was finally chosen since it is the traditional network
   byte order.

   The time it takes for a CPU to change the byte order of data is
   often much shorter than the time needed to read or write the data
   from the network or the file system.
   Thus, in most cases the choice of byte order is not very important.

4.3 Character Encoding

   When storing strings the Unicode character set should be used, to be
   able to represent all the widely used characters in the world. How
   to store the Unicode characters as bytes in a stream is however a
   less obvious decision. RFC 2277 (IETF Policy on Character Sets and
   Languages) states that "protocols MUST be able to use the UTF-8
   charset"; UTF-8 itself is specified in RFC 2279 [4]. BaseStream
   supports storing characters in this format. No other character
   encoding is supported, to keep the format as simple as possible.

4.4 Element Names

   Element names are heavily restricted. The reason for this is that
   these names are intended to be used as part of the format syntax,
   not as actual application data. It is essential that these names can
   be printed and typed on as many systems as possible. The length of
   the element names is also restricted so that they can easily be
   handled in memory.

   Another reason for the restrictions on element names is that the
   names are intended to be usable as variable names in source code.
   This makes it easier to automatically generate source code for
   readers and writers for a specific BaseStream application.

4.5 BXML B-element Representation

   How should an array of bytes be written as a string? The choice is
   to write it as a sequence of pairs of hexadecimal characters. Each
   pair represents an unsigned byte value. The bytes could also have
   been represented by a sequence of signed one-byte integers in
   decimal notation, to follow the representation for the other integer
   array types. The author of this document believes that the use of
   unsigned hexadecimal notation increases readability and also makes
   the representation more compact.

4.6 Optional Header

   Element0 is a 5-byte header that makes it possible to recognize that
   a stream is a BaseStream of a specific version.
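
   The recognition step that the header makes possible can be sketched
   in Python as follows. The function name "basestream_version" is
   hypothetical, and the sketch relies only on the rule from Section
   2.3 that version X starts with 'i' followed by an INT4 with the
   value 256000 + X, with X no higher than 127. Treating a value below
   1 as "not a BaseStream" is an assumption of the sketch.

```python
import struct


def basestream_version(prefix):
    """Return the BaseStream version for a 5-byte prefix, or None."""
    if len(prefix) < 5 or prefix[0:1] != b"i":
        return None
    # The four bytes after 'i' form a big-endian INT4.
    (magic,) = struct.unpack(">i", prefix[1:5])
    version = magic - 256000
    # Section 2.3: version numbers higher than 127 are never used.
    return version if 1 <= version <= 127 else None


print(basestream_version(bytes([105, 0, 3, 232, 1])))  # 1
print(basestream_version(b"hello"))                    # None
```

   A reader that performs this check before parsing any elements can
   reject foreign data after reading only five bytes.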
   For some applications this header may be superfluous. So should the
   header be optional? The author believes the header should be
   mandatory. Since it is only 5 bytes long it is not likely to cause a
   significant increase in the total length of the stream.

5. Security Considerations

   The BaseStream format itself raises no security considerations, but
   a badly implemented BaseStreamReader may. A BaseStreamReader should
   be able to handle array and string sizes up to 2^63 gracefully. A
   simple implementation of a BaseStreamReader may always allocate new
   memory for the next element directly after the type and size
   information is read. If the size of an element is too large, this
   can result in an out-of-memory error that possibly crashes the
   reader process. In a network setting an attacker could of course
   write any value for the size without necessarily transmitting the
   data that should follow.

Normative References

   [1]  World Wide Web Consortium, "Extensible Markup Language (XML)
        1.0 (Second Edition)", W3C XML, October 2000.

   [2]  IEEE, "Standard for Binary Floating-Point Arithmetic, Standard
        No.: 754-1985", 1985.

   [3]  Crocker, D., "Augmented BNF for Syntax Specifications: ABNF",
        RFC 2234, November 1997.

   [4]  Yergeau, F., "UTF-8, a transformation format of ISO 10646",
        RFC 2279, January 1998.

   [5]  The Unicode Consortium, "The Unicode Standard, Version 4.0.0,
        defined by: The Unicode Standard, Version 4.0 (Reading, MA,
        Addison-Wesley, 2003. ISBN 0-321-18578-1)", 2003.

   [6]  World Wide Web Consortium, "XML Schema Part 1: Structures, W3C
        Recommendation 2 May 2001", May 2001.

   [7]  World Wide Web Consortium, "XML Schema Part 2: Datatypes, W3C
        Recommendation 02 May 2001", May 2001.

   [8]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", RFC 2119, BCP 14, March 1997.

Informative References

   [9]  ANSI, "Coded Character Set--7-Bit American Standard Code for
        Information Interchange, ANSI X3.4-1986.", 1986.

Author's Address

   Frans Lundberg
   Linova

   Phone: +46 70 7601861
   EMail: frans@linova.com
   URI:   http://www.linova.com

Appendix A. XML Schema for BXML types

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the IETF's procedures with respect to rights in IETF
   Documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
   INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2004). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.