Binary Uniform Language Kit 1.0
draft-thierry-bulk-01

Abstract

This specification describes a uniform, decentrally extensible and efficient format for data serialization.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on February 02, 2014.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction
1.1. Rationale
1.1.1. Definitions
1.1.2. State of the art
1.2. Format overview
1.3. Conventions and Terminology
2. BULK syntax
2.1. Parsing algorithm
2.2. Forms
2.2.1. starting marker byte
2.2.2. ending marker byte
2.3. Atoms
2.3.1. nil
2.3.2. Array
2.3.3. 32 bits word
2.3.4. 64 bits word
2.3.5. 128 bits word
2.3.6. Unsigned 8 bits integer
2.3.7. Unsigned 32, 64 and 128 bits integer
2.3.8. Signed 32, 64 and 128 bits integer
2.3.9. Reserved marker bytes
2.3.10. Reference
3. Standard namespaces
3.1. BULK core namespace
3.1.1. Version
3.1.2. Required namespace
3.1.3. Optional namespace
3.1.4. Strings encoding
3.1.5. IANA registered character set
3.1.6. Windows code page
3.1.7. true
3.1.8. false
3.2. arithmetic namespace
3.2.1. fraction
3.2.2. Arbitrary precision signed integer
3.2.3. Binary floating-point number
3.2.4. Decimal floating-point number
4. Extension namespaces
4.1. Official namespaces
4.2. User-defined namespaces
5. Profiles
5.1. Profile redundancy
5.2. Standard profile
6. Security Considerations
6.1. Parsing
6.2. Forwarding
7. IANA Considerations
8. Acknowledgements
9. References
9.1. Normative References
9.2. Informative references
Author's Address

1. Introduction

1.1. Rationale

This specification aims at finding an original trade-off between uniformity, generality, extensibility, decentralization, compactness and processing speed for a data format. It is our opinion that every widely used existing format occupy a different position in the solution space for formats, hence this new design. It is also our opinion that most of those existing formats constitute an optimal solution for their specific use case, either in a absolute sense, or at least at the time of their design. But the ever-changing field of IT now faces new challenges that call for a new approach.

In particular, whereas the previous trend for Internet and Web standards and programming tools has been to create human-readable syntaxes for data and protocols, the advent of technologies like protocol buffers [protobuf], Thrift [Thrift], the various binary serializations for JSON like Avro [Avro] or Smile [Smile], or the binary HTTP/2.0 [HTTP2] currently in development seem to indicate that the time is ripe for a generalized use of binary, reserved until now for the low-level protocols. The lessons about flexibility learnt in the previous switch from binary to plain text can now be applied to efficient binary syntaxes.

1.1.1. Definitions

By uniformity, we mean the property of a syntax that can be parsed even by an application that doesn't understand the semantics of every part of the processed data. Of course, almost all syntaxes that feature uniformity contain a limited number of non uniform elements. Also, uniformity really only has value in the face of extension, as a fixed syntax doesn't need uniformity (it only makes the implementation simpler).

Almost all extensible syntaxes have their extensible part uniform to a great degree. In this specification, uniformity is hence evaluated on two criteria: first, the number of non uniform elements (and, incidentally, their diversity), second, the fact that the uniformity of the extensible part is not a limitation to the users (i.e. that the temptation to extend the language in a non-uniform way is as absent as possible).

A good counter-example is found in most programming languages. Adding a new branching construct cannot be done in a terse way without modifying the underlying implementation. Such a construct either cannot be defined by user code (because of evaluation rules) or can in a terribly verbose and inconvenient way (with lot of boilerplate code). Notable exceptions to this limitation of programming languages are Lisp, Haskell and stack programming languages.

On the other hand, a stack programming language is the canonical example of a non-uniform language. Each operator takes a number of operands from the stack. Not knowing the arity of an operator makes it impossible to continue parsing, even when its evaluation was optional to the final processing. In the design space, stack programming languages completely sacrifice uniformity to achieve one the highest combination of extensibility, compactness and speed of processing.

By generality, we mean the ability of a syntax to lend itself to describe any kind of data with a reasonable (or better yet, high) level of compactness and simplicity. For example, although both arrays and linked lists could be considered very general as they are both able to store any kind of data, they actually are at the respective cost of complexity (arrays need the embedding of data structure in the data or in the processing logic) and size (in-memory linked lists can waste as much as half or two third of the space for the overhead of the data structure).

By decentralization, we mean the ability to extend the syntax in a way that avoid naming collisions without the use of a central registry. Note that the DNS is NOT decentralized in this sense, it is distributed (it cannot work without its root servers and not even without prior knowledge of their location).

1.1.2. State of the art

Uniformity, generality and extensibility are usually highly-valued traits in formats design. Programming languages obviouly feature them foremost, although their generality usually stops at what they are supposed to express: procedures. Most of them are ill-suited to represent arbitrary data, but notable exceptions include Lisp (where "code is data") and Javascript, from which a subset has been extracted to exchange data, JSON, which has seen a tremendous success for this purpose. JSON may lack in generality and compactness, but its design makes its parsing really straightforward and fast. All of them, though, lack decentralization. Some of them make it possible to extend them in a relatively decentralized way if some discipline is followed (for example, by naming modules after domaine names), but the discipline is not mandatory.

The SGML/XML family of formats also feature these traits and actually fare much better than programming languages on the three fronts. XML namespaces also make it fairly decentralized and there have been attempts at making it compact (Efficient XML Interchange).

But all the previously cited formats clearly lack compactness, although just applying standard compression techniques would sacrifice only very little processing time to gain huge size reductions on most of their intended use cases.

So-called binary formats pretty much exhibit the opposite trade-offs. Most of them are not uniform to achieve better compactness. Some are specifically designed for a great generality, but many lack extensibility. When they are extensible, it's never in a decentralized way, again for reasons that have to do with compactness. They are usually extremely fast to parse.

Actually, many binary formats are not so much formats but formats frameworks, and exclude extensibility by design. For each use case, an IDL compiler creates a brand new format that is essentially incompatible with all other formats created by the same compiler. If the IDL compiler and framework are correctly designed, such a format usually represent an optimum in compactness and speed of processing, as the compiler can also automatically generate an ad-hoc optimized parser.

1.2. Format overview

A BULK stream is a stream of 8-bit bytes, in big-endian order. Parsing a BULK stream yields a sequence of expressions, which can be either atoms or forms, composed of atoms. The syntax of forms is entirely uniform, without a single exception: a starting byte marker, a sequence of atoms and an ending byte marker. Among atoms, only nil (the null byte), arrays, words and fixed-sized integers have a special syntax, for efficiency purposes. Even booleans and floating-point numbers follow the uniform syntax that every other expression follow.

Non uniform atoms start with a marker byte, followed by a static or dynamic number of bytes, depending on the type (none for nil, static for words and fixed-size integers, dynamic for arrays).

Any other atom is a reference, which consists of a namespace marker (in most of the cases, a single byte) followed by a identifier within this namespace (a single byte). All in all, a very little sacrifice is made in compactness for the benefit of a very simple syntax: apart from nil, nothing is smaller than 2 bytes, and as most forms involve a reference followed by some content, a form is 4 bytes + its content.

A namespace marker in a BULK stream is associated by either of two simple forms to a namespace identified by a UUID, thus ensuring decentralized extension. One of the two forms declares that the stream can be processed even if the application doesn't reckognize the namespace. Parsing remains possible thanks to the uniform syntax.

1.3. Conventions and Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Literal numerical values are provided in decimal or hexadecimal as appropriate. Hexadecimal literals are prefixed with 0x to distinguish them from decimal literals.

In the grammar, types are named by a capitalized word, like "Foo". The text representation of the BULK stream uses mnemonics for some bytes sequences. Mnemonics are series of characters, excluding all capital letters and white space, like "this-is-one-mnemonic" or "what-the-%§!?#-is-that?". They are always separated by white space.

In the grammar, a shape is a pattern of bytes, following the rules of the text representation for a BULK stream. Apart from mnemonics and types, a shape can contain:

a fixed byte, represented by its value as an unsigned integer (i.e. 0x3F)
an arbitrary sequence of a fixed number of bytes, represented by the number of bytes in decimal immediately followed by a B uppercase letter (i.e. 4B)

An arbitrary byte or sequence of bytes can be preceded by a name to make its description clearer. The name is a series of character between { and } immediately followed by a semi-colon (i.e. {quux}:4B).

2. BULK syntax

A BULK stream is a sequence of 8-bit bytes. Bits and bytes are in big-endian order. The result of parsing a BULK stream is a sequence of abstract data, called the abstract yield. BULK parsing is injective: a BULK stream has only one abstract yield, but different BULK streams can have the same abstract yield.

A processing application is not expected to actually produce the abstract yield, but an adaptation of the abstract yield to its own implementation, called the concrete yield. Also, some expressions in a BULK stream may have the semantics of a transformation of the abstract yield. A processing application may thus not produce or retain the concrete yield but the result of its transformation. This specification deals mainly with the byte sequence and the abstract yield and occasionnally provide guidelines about the concrete yield.

The abstract yield is a sequence of expressions. Expressions can be atoms or forms. Forms are sequences of expressions. If a byte sequence is parsed as an expression, this byte sequence is said to denote this expression.

2.1. Parsing algorithm

The parser operates with a context, which is a sequence of expressions. Each time an expression is parsed, it is appended at the end of the context. The initial context is the abstract yield.

At the beginning of a BULK stream and after having consumed the byte sequence denoting a complete expression, the parser is at the dispatch stage. At this stage, the next byte is a marker byte, which tells the parser what kind of expression comes next (the marker byte is the first byte of the sequence that denotes an expression).

The 0x1 and 0x2 marker bytes are special cases. When the parser reads 0x1, it immediately appends an empty sequence to the current context. This sequence becomes the new context. This new context has the previous context as parent. Then the parser returns to its dispatch stage. When the parser reads 0x2, it appends nothing to the context, but instead the parent of the current context becomes the new context and the parser returns to the dispatch stage. Thus it is a parsing error to read 0x2 when the context is the abstract yield.

Whenever a parsing error is encountered, parsing of the BULK stream MUST stop.

2.2. Forms

2.2.1. starting marker byte

value: 0x1
mnemonic: (

2.2.2. ending marker byte

value: 0x2
mnemonic: )

2.3. Atoms

2.3.1. nil

marker: 0x0 (mnemonic: nil)
shape: nil

Apart from being a possible short marker value, the fact that the 0x0 byte represents a valid atom means that a sequence of null bytes is a valid part of a BULK stream, thus making the format less fragile. In a network communication, nil atoms can also be sent to keep the channel open.

2.3.2. Array

marker: 0x3 (mnemonic: #)
shape: # Int …

Arrays have a special parsing rule. After consuming the marker byte, the parser returns to the dispatch stage. It is a parser error if the parsed expression is not of type Int or if its value cannot be reckognized. This integer is not added to any context, but the parser consumes as many bytes as this integer and they constitute the content of this array.

Type: Array

2.3.3. 32 bits word

marker: 0x4 (mnemonic: word32)
shape: word32 4B

Types: Word, Word32

2.3.4. 64 bits word

marker: 0x5 (mnemonic: word64)
shape: word64 8B

Types: Word, Word64

2.3.5. 128 bits word

marker: 0x6 (mnemonic: word128)
shape: word128 16B

Types: Word, Word128

2.3.6. Unsigned 8 bits integer

marker: 0x7 (mnemonic: int8)
shape: int8 1B

The value of this integer is the value of its contained byte.

Type: Number, Int

2.3.7. Unsigned 32, 64 and 128 bits integer

marker: 0x8 (mnemonic: uint)
shape: uint Word

The value of this integer is the value of its contained word.

Type: Number, Int

2.3.8. Signed 32, 64 and 128 bits integer

marker: 0x9 (mnemonic: sint)
shape: sint Word

The value of its contained word is the value of this integer in two's-complement notation.

Type: Number, Int

2.3.9. Reserved marker bytes

Marker bytes 0xA−0xF are reserved for future major versions of BULK. It is a parser error if a BULK stream with major version 1 contains such a marker byte.

2.3.10. Reference

marker: 0x10−0xFF
shape: {ns}:1B {name}:1B

The {ns} byte is a value associated with a namespace. Values 0x10−0x1F are reserved for namespaces defined by BULK specifications. Greater values can be associated with namespaces identified by a UUID.

The {name} byte is the name within the namespace. Vocabularies with more than 256 names thus need to be spread accross several namespaces.

The specification of a namespace SHOULD include a mnemonic for the namespace and for each defined name. When descriptions use several namespaces, the mnemonic of a reference SHOULD be the concatenation of the namespace mnemonic, ":" and the name mnemonic. For example, the fp name in namespace math becomes math:fp.

2.3.10.1. Speciale case

References have a special parsing rule. In case a BULK stream needs an important number of namespaces, if the marker byte is 0xFF, the parser continues to read bytes until it finds a byte different than 0xFF. The value of this sequence of bytes is the value associated with a namespace. For example, the reference denoted by the bytes 0xFF 0xFF 0x8C 0x1A is the name 26 in the namespace associated with 16777100.

3. Standard namespaces

Standard namespaces have a fixed namespace value and are not identified by a UUID.

3.1. BULK core namespace

marker: 0x10 (mnemonic: bulk)

3.1.1. Version

name: 0x1 (mnemonic: version)
shape: ( version {major}:Int {minor}:Int )

This form MUST be the first in a BULK stream. It is a parsing error otherwise. This specification defines BULK 1.0. When writing a BULK stream, a processing application MUST denote {major} and {minor} by the smallest byte sequence possible.

Two BULK versions with the same major version MUST share the same parsing rules and the same definitions of marker bytes. Changing the syntax or semantics of existing marker bytes and using marker bytes in the reserved interval warrants a new major version. Changing the syntax or semantics of existing names in standard namespaces also.

Adding standard namespaces or adding names in existing standard namespaces warrants a new minor version.

3.1.2. Required namespace

name: 0x2 (mnemonic: ns)
shape: ( ns {mark}:Int {uuid}:Word128 )

This associates the namespace identified by {uuid} to the value {mark}. It is a parsing error if a processing application doesn't reckognise the designated namespace.

3.1.3. Optional namespace

name: 0x3 (mnemonic: ns*)
shape: ( ns* {mark}:Int {uuid}:Word128 )

This associates the namespace identified by {uuid} to the value {mark}.

3.1.4. Strings encoding

name: 0x4 (mnemonic: stringenc)
shape: ( stringenc {enc}:Encoding )

This tells the processing application that all following expressions in the current context that are understood by the application as character strings will be encoded with the encoding designated by {enc}.

As the abstract yield doesn't contains strings but expressions that will be used as strings by the application, it is not a parsing error if the application doesn't reckognize {enc}. In this situation, it is a parsing error when the application actually needs to decode a byte sequence as a string. It is not a parsing error when a processing application only transmits a byte sequence encoding a string, if it can accurately convey the encoding to the receiving application.

3.1.5. IANA registered character set

name: 0x5 (mnemonic: iana-charset)
shape: ( iana-charset {id}:Int )

This designates the string encoding registered among the IANA Character Sets [IANA-Charsets] whose MIBenum is {id}.

Type: Encoding.

3.1.6. Windows code page

name: 0x6 (mnemonic: code-page)
shape: ( code-page {id}:Int )

This designates the string encoding among Windows code pages whose identifier is {id}.

Type: Encoding.

3.1.7. true

name: 0x7 (mnemonic: true)
shape: true

Type: Boolean.

3.1.8. false

name: 0x8 (mnemonic: false)
shape: false

Type: Boolean.

3.2. arithmetic namespace

marker: 0x11 (mnemonic: math)

A processing application must reckognize the type of all expressions defined in this specification that have the type Number, but an application MAY consider a number as having an unknown value if it has no adequate data type to store it.

3.2.1. fraction

name: 0x1 (mnemonic: frac)
shape: ( frac {num}:Int {div}:Int )

This is the number {num}/{div}.

Type: Number.

3.2.2. Arbitrary precision signed integer

name: 0x2 (mnemonic: bigint)
shape: ( bigint {bits}:Array )

The bits contained in {bits} is the value of this integer in two's-complement notation.

Type: Number, Int.

3.2.3. Binary floating-point number

name: 0x3 (mnemonic: binary)
shape: ( binary {bits}:Word )
shape: ( binary {bits}:Array )

This is a floating-point number expressed in IEEE 754-2008 binary interchange format. If {bits} is an Array, the size of its contents must be a multiple of 32 bits, as per IEEE 754-2008 rules.

Type: Number.

3.2.4. Decimal floating-point number

name: 0x4 (mnemonic: decimal)
shape: ( decimal {bits}:Word )
shape: ( decimal {bits}:Array )

This is a floating-point number expressed in IEEE 754-2008 decimal interchange format. If {bits} is an Array, the size of its contents must be a multiple of 32 bits, as per IEEE 754-2008 rules.

Type: Number.

4. Extension namespaces

Extension namespaces are defined with an identifier UUID, to be associated by a ns or ns* form.

4.1. Official namespaces

Extension namespaces defined as part of the official BULK suite MUST be identified by a v5 UUID. The namespace UUID used to generate it MUST be urn:uuid:abaddeed-face-11e2-9605-74de2b4102f1. The name used to generate it SHOULD be a URI designating the described vocabulary when one exists.

4.2. User-defined namespaces

User-defined namespaces are actually no different than official namespaces, apart from the choice of UUID.

5. Profiles

A profile is a sequence of expressions used as the initial abstract yield by a processing application. With respect to the BULK stream, the profile is an out-of-band information, usually implicit.

A processing application doesn't need to include the profile in the concrete yield, as long as the semantics of the abstract yield are maintained.

The same BULK stream might be processed with different profiles.

A processing application MUST NOT deduce the profile from the content of a BULK stream.

5.1. Profile redundancy

A processing application should only rely on the use of a profile when it is a safe assumption that the profile is known, for example within a communication where the protocol dictates the profile.

In particular, long-term storage of a BULK stream should preserve profile information, for example with a media type that dictates the profile.

Otherwise, an application creating a BULK stream SHOULD include the profile after the version form. For this reason, the expressions in a profile SHOULD have idempotent semantics.

5.2. Standard profile

This specification defines the default profile that a processing application MUST use when it is not using a specific profile:

( bulk:stringenc 106 )

This means that the default string encoding in a BULK stream is UTF-8.

6. Security Considerations

6.1. Parsing

Parsing a BULK stream is designed to be free of side-effects for the processing application, apart from storing the parsed results.

Arrays in BULK carry their size, so as for the application to know in advance the size of the data to read and store, thus making it easier to build robust code. A malicious software, however, may announce an array with a size choosen to get an application to exhaust its available memory. When a BULK stream has been completely received, an array bigger than the remaining data SHOULD trigger an error. When a BULK stream's size is not known in advance, the application SHOULD use a growable data structure.

6.2. Forwarding

When a processing application forwards all or part of the data in a BULK stream to another application, care must be taken if part of the forwarded data was not entirely reckognized, as it could be used by an attacker to benefit from the authority the forwarding application has on the recipient of the data.

7. IANA Considerations

This specification defines a new media type, application/bulk. Here are the informations for its registration to IANA:

Type name

application

Subtype name

bulk

Required parameters

none

Optional parameters

none

Encoding considerations

none, content is self-describing

Security considerations

cf. Section 6

Interoperability considerations

the constraint to start any BULK stream with a version form has the side-effect that classes of BULK streams can be identified by a sequence of bytes acting as "magic number":

0x011001: any BULK stream
0x01100107: a BULK stream of any major version beneath 256
0x0110010701: a BULK stream of major version 1
0x0110010701070202: a BULK stream of version 1.2

Published specification

this document

Applications that use this media type

none so far

Fragment identifier considerations

this specification defines no semantics for addressing the data with a fragment identifier; a future specification could define fragment identifier syntaxes to address the content by byte offset or the parsed results by their number in the yielded sequence

Additional information

a future specification may define a naming convention for media types based on bulk with a +bulk suffix, as for XML with +xml

8. Acknowledgements

The original author of this specification read Erik Naggum's famous rant about XML several years before, and it may well have unconsciouly influenced this design. He happened to stumble upon it again while writing the earliest draft of this specification and it struck him how much it embodies Erik's ideas. In any case, this format is dedicated to Erik.

9. References

9.1. Normative References

, "

[RFC2119]	Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels ", BCP 14, RFC 2119, March 1997.
[IANA-Charsets]	IANA Charset Registry (archived at): ", .

9.2. Informative references

, "

[HTTP2]	Belshe, M., Peon, R., Thomson, M. and A. Melnikov, "Hypertext Transfer Protocol version 2.0", Internet-Draft draft-ietf-httpbis-http2-04, July 2013.
[Avro]	Cutting, D., "Apache Avro™ 1.7.4 Specification", February 2013.
[protobuf]	Protocol Buffers", July 2008.
[Smile]	Saloranta, T., "Smile Data Format", September 2010.
[Thrift]	Slee, M., Agarwal, A. and M. Kwiatkowski, "Thrift: Scalable Cross-Language Services Implementation", April 2007.

Author's Address

Pierre Thierry Thierry Technologies EMail: pierre@nothos.net

Abstract

Status of This Memo

Copyright Notice

Table of Contents

1. Introduction

1.1. Rationale

1.1.1. Definitions

1.1.2. State of the art

1.2. Format overview

1.3. Conventions and Terminology

2. BULK syntax

2.1. Parsing algorithm

2.2. Forms

2.2.1. starting marker byte

2.2.2. ending marker byte

2.3. Atoms

2.3.1. nil

2.3.2. Array

2.3.3. 32 bits word

2.3.4. 64 bits word

2.3.5. 128 bits word

2.3.6. Unsigned 8 bits integer

2.3.7. Unsigned 32, 64 and 128 bits integer

2.3.8. Signed 32, 64 and 128 bits integer

2.3.9. Reserved marker bytes

2.3.10. Reference

2.3.10.1. Speciale case

3. Standard namespaces

3.1. BULK core namespace

3.1.1. Version

3.1.2. Required namespace

3.1.3. Optional namespace

3.1.4. Strings encoding

3.1.5. IANA registered character set

3.1.6. Windows code page

3.1.7. true

3.1.8. false

3.2. arithmetic namespace

3.2.1. fraction

3.2.2. Arbitrary precision signed integer

3.2.3. Binary floating-point number

3.2.4. Decimal floating-point number

4. Extension namespaces

4.1. Official namespaces

4.2. User-defined namespaces

5. Profiles

5.1. Profile redundancy

5.2. Standard profile

6. Security Considerations

6.1. Parsing

6.2. Forwarding

7. IANA Considerations

8. Acknowledgements

9. References

9.1. Normative References

9.2. Informative references

Author's Address