INTERNET-DRAFT                                               F. Lundberg
Expires: 27 September 2005                                        Linova
Category: Standards Track                                  27 March 2005

              BaseStream - A Simple Typed Stream Format
                  draft-flundberg-basestream-04.txt

Status of this Memo

   By submitting this Internet-Draft, I certify that any applicable
   patent or other IPR claims of which I am aware have been disclosed,
   or will be disclosed, and any of which I become aware will be
   disclosed, in accordance with RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   BaseStream is a simple binary stream format that serves as a common
   base for binary formats, much as XML provides a base for text
   formats. The binary format consists of typed and possibly named data
   elements. An instance of the BaseStream format has a corresponding
   XML representation that makes it possible to use existing XML tools
   to view, edit, validate, and transform BaseStream data.

Lundberg                    Standards Track                    [Page 1]
RFC nnnn      BaseStream - A Simple Typed Stream Format      March 2005

Table of Contents

   1. Introduction
   2. BaseStream Definition
      2.1 Terminology
      2.2 BaseStream Syntax
      2.3 Future Versions
      2.4 Application Name
   3. BaseStream XML
   4. Security Considerations
   5. References
      5.1 Normative References
      5.2 Informative References
   Author's Address
   A. XML Schema for BXML types
   Intellectual Property and Copyright Statements

1. Introduction

   Binary data is difficult for humans to handle. Text-based file
   formats and network protocols are therefore popular. Software for
   such formats is easier to debug since the formats use a relatively
   human-friendly representation of the data. However, text-based
   formats have the following drawbacks compared to binary formats.

   o  Text-based formats need more bytes to store numerical data than
      binary formats do.

   o  Parsing numerical data represented as text is less efficient than
      parsing binary data, since the text representation is not stored
      in a way similar to how the data is stored in computer memory.

   o  It is seldom possible to read only part of a text-based file
      since the exact byte offset to a data item is usually not known.

   BaseStream is a simple, light-weight, binary stream format
   consisting of a sequence of typed and possibly named data elements.
   The format is a data serialization format suitable as the base for
   file formats, TCP protocols, or any other format that handles
   sequences of bytes. Together with appropriate tools, a BaseStream
   can easily be viewed and edited. BaseStream data is therefore
   human-friendly, but without the drawbacks of text-based formats.
   BaseStream serves as a common base for binary formats, much as XML
   [1] provides a base for text-based formats.

   This document specifies an XML application called BXML or BaseStream
   XML. Every BaseStream has a corresponding BXML document. By using
   software to convert data between binary BaseStream and BXML, any
   text or XML editor can be used to view and edit BaseStream data.
   Furthermore, BXML opens up BaseStream data to other XML technologies
   such as data validation with XML Schemas and data transformation
   with XSLT (Extensible Stylesheet Language Transformations).

   This document specifies two things.

   1. The rules for determining whether a stream is a valid BaseStream
      or not, and if it is, how to interpret the bytes of the stream as
      numbers and strings.

   2. How a BaseStream is represented in XML.

   The goal of BaseStream is to make it easy to share binary data and
   develop applications that use binary formats.

2. BaseStream Definition

2.1 Terminology

   This subsection defines the specific meaning of some words in this
   document.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [8].

   A "byte" is an 8-bit data entity. A "stream" is a sequence of bytes.

   A stream is created by a "writer". A writer is an entity capable of
   producing a sequence of bytes and a "BaseStreamWriter" is a writer
   that produces a BaseStream.

   A "reader" reads a stream created by a writer. A "BaseStreamReader"
   is a reader that can read a BaseStream and interpret the bytes as
   BaseStream elements.

   A "format" or "stream format" is a set of rules for a stream that
   determines whether or not the stream adheres to the format and how
   to interpret the stream bytes as higher level data entities.

   "BaseStream" is used to refer either to the stream format defined in
   this document or to a stream that adheres to the BaseStream format.
   The terms "BaseStream format" and "BaseStream instance" can be used
   to distinguish the two cases.

   "BaseStream1" is version 1 of the BaseStream format, that is, the
   version described in this document. The version number is normally
   left out in this document since the only version described here is
   version 1.

   A "BaseStream application" is a specific format based on BaseStream.
   A BaseStream application may put any further restrictions on the
   stream as long as every instance of the application is a BaseStream
   instance.

   Big-endian byte order specifies that multi-byte integers and
   floating point numbers are written to a stream with the most
   significant byte first.

   The following abbreviations are used for basic numerical data types.

   o  "INT1", an 8-bit signed integer.

   o  "INT2", a 16-bit signed integer.

   o  "INT4", a 32-bit signed integer.

   o  "INT8", a 64-bit signed integer.

   o  "FLOAT4", a 4-byte floating point number.

   o  "FLOAT8", an 8-byte floating point number.

   The integer types are 8-bit, 16-bit, 32-bit and 64-bit signed two's
   complement integers. "FLOAT4" and "FLOAT8" are 4- and 8-byte data
   entities used to store floating point numbers. These follow the
   single-precision 32-bit and double-precision 64-bit formats in the
   IEEE 754 standard [2]. Big-endian byte order is used for all the
   numerical types.

   In this document a byte value may be specified with the
   corresponding Unicode [5] character in the interval from 0 to 127.
   For example, 'a' is the value 97. These Unicode character values
   coincide with the ASCII [9] character values.

2.2 BaseStream Syntax

   A BaseStream consists of a sequence of data elements: Element0,
   Element1, Element2, and so on. The elements can be of a simple type,
   an array type or the string type.
   The type is specified by a type byte. The possible values of the
   type byte are the Unicode values for the characters: b, s, i, l, f,
   d, B, S, I, L, F, D, and U.

   There are six simple element types. The simple integer types are the
   b, s, i and l-element types. The letters are abbreviations for
   "byte", "short", "int", and "long", which are type names in widely
   used programming languages. These elements store INT1, INT2, INT4,
   and INT8 values respectively. The two simple types for floating
   point numbers are the f-element and the d-element, which store
   FLOAT4 and FLOAT8 values. f and d are abbreviations for "float" and
   "double".

   The array types are the B, S, I, L, F, and D-element types. These
   may also be referred to as the B array element type, S array element
   type, and so on. These types are used to store an array of INT1s,
   INT2s, INT4s, INT8s, FLOAT4s or FLOAT8s respectively. Arrays with
   zero elements are allowed.

   The U element type is the string type. Strings are stored using the
   Unicode [5] character set, which handles all widely used characters.
   UTF-8 encoding is used to store the characters as bytes. See RFC
   2279 [4] for information on the UTF-8 charset encoding.

   Each element may optionally be named. The element name uses a subset
   of the ASCII [9] charset. The main reason for this restriction on
   the element names is that the name should be possible to type and
   print on as many systems as possible. The names are part of the
   format/protocol. They are not part of the stream data.

   Element0 in a BaseStream has a special meaning. A BaseStream starts
   with a byte of value 105, corresponding to 'i' in the ASCII
   character set [9]. This is the type byte of Element0. Four bytes
   follow, which form an INT4 with the value 256001. Thus a BaseStream
   of version 1 always starts with the following five bytes: 105, 0, 3,
   232, 1, which constitute Element0.
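   As an illustration of the layout described above, the following
   Python sketch encodes an unnamed simple element as a type byte
   followed by its big-endian value, and reproduces the five bytes of
   Element0. The helper name encode_simple and the table
   _SIMPLE_FORMATS are hypothetical names chosen for this example;
   they are not part of the BaseStream specification.

```python
import struct

# struct format codes for the six simple element types. The ">" prefix
# selects big-endian byte order, which the format uses for all numbers.
_SIMPLE_FORMATS = {
    "b": ">b",  # INT1
    "s": ">h",  # INT2
    "i": ">i",  # INT4
    "l": ">q",  # INT8
    "f": ">f",  # FLOAT4
    "d": ">d",  # FLOAT8
}


def encode_simple(type_char, value):
    """Return the bytes of an unnamed simple element (hypothetical helper).

    The result is the type byte followed by the big-endian value.
    """
    payload = struct.pack(_SIMPLE_FORMATS[type_char], value)
    return type_char.encode("ascii") + payload


# Element0 is an unnamed i-element holding the INT4 value 256001.
ELEMENT0 = encode_simple("i", 256001)
print(list(ELEMENT0))  # [105, 0, 3, 232, 1]
```

   The value 256001 equals 3 * 65536 + 232 * 256 + 1, which is why the
   four value bytes are 0, 3, 232 and 1.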
   The purpose of Element0 is to identify a stream as a BaseStream of a
   specific version.

   After Element0 there are zero or more user data elements. The last
   element is followed by a single byte ('e') used to indicate the end
   of the BaseStream. The end byte is important to indicate the end of
   a BaseStream when it is a part of another stream.

   The array types (B, S, I, L, F, D) and the string type (U) have a
   size. For the array types the size is the number of integers or
   floating point numbers in the array. For the string type the size is
   the number of bytes needed to encode the string in UTF-8. The size
   is stored in one of two ways. See the grammar rules "shortSize" and
   "longSize" below. If the size is between 0 and 127 it is stored as
   an INT1. If the size is larger than or equal to 128 it is stored as
   an INT8.

   Below is the syntax specification of the BaseStream format, version
   1. Augmented Backus-Naur Form (ABNF) as defined in RFC 2234 [3] is
   used.

      BaseStream1    = Element0 *element e
      Element0       = i %d0 %d3 %d232 %d1
                       ; byte 'i' followed by an INT4 with value 256001
      element        = [elementName] (simple / array / string)

      simple         = b INT1 /            ; b-element
                       s INT2 /            ; s-element
                       i INT4 /            ; i-element
                       l INT8 /            ; l-element
                       f FLOAT4 /          ; f-element
                       d FLOAT8            ; d-element

      array          = B size *INT1 /      ; B-element
                       S size *INT2 /      ; S-element
                       I size *INT4 /      ; I-element
                       L size *INT8 /      ; L-element
                       F size *FLOAT4 /    ; F-element
                       D size *FLOAT8      ; D-element

      string         = U size utf8
      utf8           = <the bytes of the UTF-8 encoded string>

      size           = shortSize / longSize
      shortSize      = <INT1 with a value from 0 to 127>
      longSize       = minus8 longSizeNumber
      minus8         =
      longSizeNumber = <INT8 with a value of 128 or larger>

      elementName    = N nameSize nameString
      nameSize       =
      nameString     = ALPHA 0*126digitLetterOrUnderscore
      digitLetterOrUnderscore = ALPHA / DIGIT / UNDERSCORE

      ALPHA          = %x41-5A / %x61-7A   ; A-Z / a-z
      UNDERSCORE     = %x5F                ; '_'
      DIGIT          = "0" / "1" / "2" / "3" / "4" /
                       "5" / "6" / "7" / "8" / "9"

      e = %x65
      N = %x4E
      b = %x62
      s = %x73
      i = %x69
      l = %x6C
      f = %x66
      d = %x64
      B = %x42
      S = %x53
      I = %x49
      L = %x4C
      F = %x46
      D = %x44
      U = %x55

      INT1           = <as defined in Section 2.1>
      INT2           = <as defined in Section 2.1>
      INT4           = <as defined in Section 2.1>
      INT8           = <as defined in Section 2.1>
      FLOAT4         = <as defined in Section 2.1>
      FLOAT8         = <as defined in Section 2.1>

   We can now list the rules that define the BaseStream format. A
   stream is a BaseStream of version 1 if and only if all the following
   rules are fulfilled.

   1. The stream MUST follow the "BaseStream1" grammar rule.

   2. The nameSize integer MUST equal the number of bytes (characters)
      in the following nameString.

   3. The size integer (grammar rule "size") MUST equal the number of
      integers or floating point numbers that follow for the B, S, I,
      L, F, and D elements. For the U-element the size MUST equal the
      number of bytes needed to encode the UTF-8 string.

   4. If a U-element is named "bs_tag" the string value of the element
      MUST follow the "nameString" grammar rule. Such an element is
      called a tag-element.

   5. If a U-element is named "bs_end" the string value MUST be the
      empty string (a string with zero characters). Such an element is
      called an end-element.

   6. At any point in the stream the number of end-elements written to
      the stream MUST NOT exceed the number of tag-elements.

   7. The total number of end-elements in a BaseStream MUST equal the
      number of tag-elements.

   8. If Element1 is a U-element named "bs_app" the string value of the
      element MUST be the name of the BaseStream application.

   The tag-element and the end-element affect how the BaseStream is
   represented as XML. This is treated later in this document.

2.3 Future Versions

   Future versions of the BaseStream format are not currently
   anticipated, but the format is prepared for them. BaseStreamX is
   used to denote version X of BaseStream where X is 1, 2, 3, and so
   on. Version X of the format MUST start with the byte 'i' and then an
   INT4 with the value 256000 + X. Any version of BaseStream will thus
   always start with the four bytes: 105, 0, 3, 232. The fifth byte is
   the BaseStream version.
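   The version rule can be sketched in Python as follows. This is a
   minimal illustration, not a complete reader: the function name
   basestream_version and the constant BASE_VALUE are hypothetical, and
   the function inspects only the first five bytes of the input. It
   relies on the statement in this section that the version number
   never exceeds 127.

```python
import struct
from typing import Optional

BASE_VALUE = 256000  # Element0 holds the INT4 value 256000 + X


def basestream_version(data: bytes) -> Optional[int]:
    """Return the BaseStream version announced by Element0, or None.

    Hypothetical helper: only the first five bytes are examined, so a
    non-None result does not mean the rest of the stream is valid.
    """
    if len(data) < 5 or data[0] != ord("i"):
        return None
    # Bytes 1..4 form a big-endian signed 32-bit integer (INT4).
    (value,) = struct.unpack(">i", data[1:5])
    version = value - BASE_VALUE
    # Versions start at 1 and never go above 127.
    if 1 <= version <= 127:
        return version
    return None


print(basestream_version(bytes([105, 0, 3, 232, 1])))  # 1
```

   Because the accepted range is 1 to 127, every accepted stream starts
   with the bytes 105, 0, 3, 232, and the returned value equals the
   fifth byte.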
   A version number higher than 127 will never be used.

2.4 Application Name

   BaseStream provides a standard way of storing the name of the
   BaseStream application at the beginning of the stream. If Element1
   of a BaseStream is a U-element named "bs_app", the value of this
   element is the name of the BaseStream application.

   If a BaseStream application contains the name of the application in
   the stream, it is RECOMMENDED that it is stored in Element1 as
   described above. It is RECOMMENDED that the charset for the
   application name is limited to ASCII [9].

   To provide unique application names, information about the
   application origin, and possibly a reference to a DTD or XML Schema,
   a URL may be used as the application name. For example, the company
   X could call the first version of its 2D plot file format
   "http://www.x.com/plot2d/1.xsd". This URL should point to an XML
   Schema that can be used to validate the BXML representation of the
   BaseStream.

   If the application name starts with "http:" and ends with ".xsd", it
   is RECOMMENDED that it is considered a URL to an XML Schema file
   that can be used to validate the BXML representation of the
   BaseStream. The Schema file at the URL location should never change.

   If the application name starts with "http:" and ends with ".dtd", it
   is RECOMMENDED that it is considered a URL to a Document Type
   Definition (DTD) file that can be used to validate the BXML
   representation of the BaseStream. The format of a DTD is defined in
   the XML standard [1]. The DTD file at the URL location should never
   change.

   Note that if a URL is used it does not have to point to anything
   unless it starts with "http:" and ends with ".xsd" or ".dtd". It can
   be used just to provide a unique name for the BaseStream
   application.

3. BaseStream XML

   This section specifies an XML representation of a BaseStream called
   BaseStream XML or BXML. The XML Schema standard ([6] and [7]) is
   used to define the exact syntax of the BXML elements.

   Any BaseStream instance can be converted to BXML and any BXML
   document can be converted to binary BaseStream format. A BaseStream
   instance converted to BXML and then back to a binary BaseStream
   instance is identical to the original BaseStream instance.

   Below is an example of a BXML document that stores plot data.

      256001 http://www.x.com/plot2d/1.xsd Position vs time time (s)
      pos (m) 1.0 2.0 3.0 4.0 0.4 1.5 2.0 1.8

   A BXML document always has a root XML element called "BaseStream".
   The first child of the root element is "<i>256001</i>", which
   corresponds to Element0 in the BaseStream. After this the BaseStream
   elements are converted to XML by the following rules. The letter X
   is used to denote the type character of the element.

   1. Any unnamed element is stored as an XML element with the same
      name as the type character.

         <X>content</X>

      The syntax of the X-element content is defined by the XML Schema
      type called "X-type". See Appendix A. No XML attributes are
      allowed.

   2. A named element which is not a tag- or end-element is stored as
      an XML element with a name identical to the element name. The
      attribute "type" is required and the value of the attribute must
      be the type character of the element.

         <name type="X">content</name>

      No attributes other than the type attribute are allowed. The
      exact syntax of the X-element content is defined by the XML
      Schema rule called "named-X-type". See Appendix A.

   3. If the element is a tag-element, an unclosed XML start tag is
      added to the XML document. The name of the XML element is the
      string value of the tag-element. No XML attributes are allowed.

   4. If the element is an end-element, the last XML start tag (created
      because of a tag-element) is closed by a corresponding end tag.
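   The four conversion rules above can be sketched in Python as
   follows. This is a simplified illustration under stated assumptions,
   not a conforming converter: the function name to_bxml is
   hypothetical, each user data element is assumed to be given as a
   (name, type character, content text) tuple whose content is already
   in its textual form, and no whitespace is added between XML
   elements. The element names "plot" and "title" in the usage example
   are invented for the example.

```python
from xml.sax.saxutils import escape


def to_bxml(elements):
    """Convert user data elements to a BXML string (hypothetical helper).

    `elements` is a list of (name, type_char, content) tuples, where
    name is None for an unnamed element. Element0 is emitted first.
    """
    parts = ["<BaseStream>", "<i>256001</i>"]
    open_tags = []  # names of start tags that are not yet closed
    for name, type_char, content in elements:
        if type_char == "U" and name == "bs_tag":
            # Rule 3: a tag-element opens an XML element.
            open_tags.append(content)
            parts.append("<%s>" % content)
        elif type_char == "U" and name == "bs_end":
            # Rule 4: an end-element closes the last open start tag.
            if not open_tags:
                raise ValueError("end-element without a tag-element")
            parts.append("</%s>" % open_tags.pop())
        elif name is None:
            # Rule 1: unnamed element, XML name is the type character.
            parts.append("<%s>%s</%s>"
                         % (type_char, escape(content), type_char))
        else:
            # Rule 2: named element with a required "type" attribute.
            parts.append('<%s type="%s">%s</%s>'
                         % (name, type_char, escape(content), name))
    if open_tags:
        raise ValueError("tag-element without an end-element")
    parts.append("</BaseStream>")
    return "".join(parts)


print(to_bxml([
    ("bs_tag", "U", "plot"),
    ("title", "U", "Position vs time"),
    (None, "F", "1.0 2.0 3.0 4.0"),
    ("bs_end", "U", ""),
]))
```

   The stack of open tags also enforces the two counting rules of
   Section 2.2: an end-element may never outnumber the tag-elements
   seen so far, and every tag-element must eventually be closed.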

   For simple types the corresponding XML character contents are
   represented as a decimal number. BaseStream to BXML converters are
   encouraged to use the canonical lexical representation of these
   values as defined in the Schema standard [7].

   Array types are written as a sequence of whitespace-separated
   representations of the corresponding simple type. There is one
   exception: the B-element is written as a whitespace-separated
   sequence of pairs of characters, each pair representing a
   hexadecimal number between 0 and 255. The bytes in the B-element are
   considered unsigned in this context.

   The figure below shows an XML Schema that defines the "plot2d"
   BaseStream application for the example given above. The BaseStream
   Schema types are left out for clarity. They are defined in Appendix
   A.

4. Security Considerations

   The BaseStream format itself raises no security considerations, but
   a badly implemented BaseStreamReader may. A BaseStreamReader should
   be able to handle array and string sizes up to 2^63 - 1, the largest
   INT8 value, gracefully. A simple implementation of a
   BaseStreamReader may always allocate new memory for the next element
   directly after the type and size information is read. If the size of
   an element is too large, this can result in an out-of-memory error
   that possibly crashes the reader process. In a network setting an
   attacker could write any value for the size without necessarily
   transmitting the data that should follow.

5. References

5.1 Normative References

   [1]  World Wide Web Consortium, "Extensible Markup Language (XML)
        1.0 (Third Edition)", W3C XML, February 2004.

   [2]  IEEE, "Standard for Binary Floating-Point Arithmetic, Standard
        No.: 754-1985", 1985.

   [3]  Crocker, D., "Augmented BNF for Syntax Specifications: ABNF",
        RFC 2234, November 1997.

   [4]  Yergeau, F., "UTF-8, a transformation format of ISO 10646",
        RFC 2279, January 1998.

   [5]  The Unicode Consortium, "The Unicode Standard, Version 4.0.0,
        defined by: The Unicode Standard, Version 4.0 (Reading, MA,
        Addison-Wesley, 2003. ISBN 0-321-18578-1)", 2003.

   [6]  World Wide Web Consortium, "XML Schema Part 1: Structures, W3C
        Recommendation 2 May 2001", May 2001.

   [7]  World Wide Web Consortium, "XML Schema Part 2: Datatypes, W3C
        Recommendation 02 May 2001", May 2001.

   [8]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", RFC 2119, BCP 14, March 1997.

5.2 Informative References

   [9]  ANSI, "Coded Character Set--7-Bit American Standard Code for
        Information Interchange, ANSI X3.4-1986.", 1986.

Author's Address

   Frans Lundberg
   Linova

   Phone: +46 70 7601861
   Email: frans at linova.com
   URI:   http://www.linova.com

Appendix A. XML Schema for BXML types

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
   INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.