Binary Uniform Language Kit 1.0
draft-thierry-bulk-02

Abstract

This specification describes a uniform, decentrally extensible and efficient format for data serialization.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on February 07, 2014.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction
1.1. Rationale
1.1.1. Definitions
1.1.2. State of the art
1.2. Format overview
1.3. Conventions and Terminology
2. BULK syntax
2.1. Parsing algorithm
2.1.1. Evaluation
2.2. Forms
2.2.1. starting marker byte
2.2.2. ending marker byte
2.2.3. Difference between sequence and form
2.3. Atoms
2.3.1. nil
2.3.2. Array
2.3.3. Binary words
2.3.4. Signed fixed-size integer
2.3.5. Reserved marker bytes
2.3.6. Reference
3. Standard namespaces
3.1. BULK core namespace
3.1.1. Version
3.1.2. true
3.1.3. false
3.1.4. Strings encoding
3.1.5. IANA registered character set
3.1.6. Windows code page
3.1.7. Namespaces
3.1.8. Definitions
3.1.9. Substitution
3.1.10. arithmetic
4. Extension namespaces
4.1. Official namespaces
4.2. User-defined namespaces
5. Profiles
5.1. Profile redundancy
5.2. Standard profile
6. Security Considerations
6.1. Parsing
6.2. Forwarding
6.3. Definitions
7. IANA Considerations
8. Acknowledgements
9. References
9.1. Normative References
9.2. Informative references
Appendix A. Robust namespace definition
A.1. Selective authority
A.2. Open authority
Appendix B. Semantic leeway
Author's Address

1. Introduction

1.1. Rationale

This specification aims at finding an original trade-off between uniformity, generality, extensibility, decentralization, compactness and processing speed for a data format. It is our opinion that every widely used existing format occupies a different position in the solution space for formats, hence this new design. It is also our opinion that most of those existing formats constitute an optimal solution for their specific use case, either in an absolute sense, or at least at the time of their design. But the ever-changing field of IT now faces new challenges that call for a new approach.

In particular, whereas the previous trend for Internet and Web standards and programming tools has been to create human-readable syntaxes for data and protocols, the advent of technologies like protocol buffers [protobuf], Thrift [Thrift], the various binary serializations for JSON like Avro [Avro] or Smile [Smile], or the binary HTTP/2.0 [HTTP2] currently in development seems to indicate that the time is ripe for a generalized use of binary formats, reserved until now for low-level protocols. The lessons about flexibility learnt in the previous switch from binary to plain text can now be applied to efficient binary syntaxes.

1.1.1. Definitions

By uniformity, we mean the property of a syntax that can be parsed even by an application that doesn't understand the semantics of every part of the processed data. Of course, almost all syntaxes that feature uniformity contain a limited number of non-uniform elements. Also, uniformity really only has value in the face of extension, as a fixed syntax doesn't need uniformity (it only makes the implementation simpler).

Almost all extensible syntaxes have their extensible part uniform to a great degree. In this specification, uniformity is hence evaluated on two criteria: first, the number of non-uniform elements (and, incidentally, their diversity); second, the fact that the uniformity of the extensible part is not a limitation to the users (i.e. that the temptation to extend the language in a non-uniform way is as absent as possible).

A good counter-example is found in most programming languages. Adding a new branching construct cannot be done in a terse way without modifying the underlying implementation. Such a construct either cannot be defined by user code (because of evaluation rules) or can be defined only in a terribly verbose and inconvenient way (with lots of boilerplate code). Notable exceptions to this limitation of programming languages are Lisp, Haskell and stack programming languages.

On the other hand, a stack programming language is the canonical example of a non-uniform language. Each operator takes a number of operands from the stack. Not knowing the arity of an operator makes it impossible to continue parsing, even when its evaluation was optional to the final processing. In the design space, stack programming languages completely sacrifice uniformity to achieve one of the highest combinations of extensibility, compactness and speed of processing.

By generality, we mean the ability of a syntax to lend itself to describe any kind of data with a reasonable (or better yet, high) level of compactness and simplicity. For example, although both arrays and linked lists could be considered very general as they are both able to store any kind of data, they actually do so at the respective cost of complexity (arrays need the data structure to be embedded in the data or in the processing logic) and size (in-memory linked lists can waste as much as half or two thirds of the space on the overhead of the data structure).

By decentralization, we mean the ability to extend the syntax in a way that avoids naming collisions without the use of a central registry. Note that the DNS, as we use it, is NOT decentralized in this sense, but distributed, as it cannot work without its root servers and not even without prior knowledge of their location.

1.1.2. State of the art

Uniformity, generality and extensibility are usually highly valued traits in format design. Programming languages obviously feature them foremost, although their generality usually stops at what they are supposed to express: procedures. Most of them are ill-suited to represent arbitrary data, but notable exceptions include Lisp (where "code is data") and JavaScript, from which a subset has been extracted to exchange data, JSON, which has seen a tremendous success for this purpose. JSON may lack in generality and compactness, but its design makes its parsing really straightforward and fast. All of them, though, lack decentralization. Some of them make it possible to extend them in a relatively decentralized way if some discipline is followed (for example, by naming modules after domain names), but the discipline is not mandatory.

The SGML/XML family of formats also features these traits and actually fares much better than programming languages on the three fronts. XML namespaces also make it relatively decentralized and there have been attempts at making it compact (e.g. EXI from W3C, Fast Infoset from ISO/ITU or EBML).

All the previously cited formats clearly lack compactness, although just applying standard compression techniques would sacrifice only very little processing time to gain huge size reductions on most of their intended use cases.

So-called binary formats pretty much exhibit the opposite trade-offs. Most of them are not uniform to achieve better compactness. Some are specifically designed for a great generality, but many lack extensibility. When they are extensible, it's never in a decentralized way, again for reasons that have to do with compactness. They are usually extremely fast to parse.

Actually, many binary formats are not so much formats as format frameworks, and exclude extensibility by design. For each use case, an IDL compiler creates a brand new format that is essentially incompatible with all other formats created by the same compiler (EBML specifically cites this property among its own disadvantages). If the IDL compiler and framework are correctly designed, such a format usually represents an optimum in compactness and speed of processing, as the compiler can also automatically generate an ad-hoc optimized parser.

1.2. Format overview

A BULK stream is a stream of 8-bit bytes, in big-endian order. Parsing a BULK stream yields a sequence of expressions, which can be either atoms or forms, which are sequences of expressions. The syntax of forms is entirely uniform, without a single exception: a starting marker byte, a sequence of expressions and an ending marker byte. Among atoms, only nil (the null byte), arrays and fixed-size binary words have a special syntax, for efficiency purposes. Even booleans and floating-point numbers follow the uniform syntax that every other expression follows.

Non-uniform atoms start with a marker byte, followed by a static or dynamic number of bytes, depending on the type (none for nil, static for fixed-size binary words, dynamic for arrays).

Any other atom is a reference, which consists of a namespace marker (in most cases, a single byte) followed by an identifier within this namespace (a single byte). All in all, very little compactness is sacrificed for the benefit of a very simple syntax: apart from nil, nothing is smaller than 2 bytes, and as most forms involve a reference followed by some content, a form is 4 bytes + its content.

A namespace marker in a BULK stream is associated by either one of two simple forms to a namespace identified by a UUID, thus ensuring decentralized extensibility. One of the two forms declares that the stream can be processed even if the application doesn't recognize the namespace. Parsing remains possible thanks to the uniform syntax.

Combination of BULK namespaces, BULK streams and even other formats doesn't need any content transformation to work. Here are some examples:

The content of a BULK stream, enclosed in 0x1 and 0x2 marker bytes, constitutes a valid BULK expression. Thus BULK streams can be packed or annotated within a BULK stream without modification. Annotation use cases include adding metadata and cryptographic signatures.
A BULK format could specify in its syntax the place for an expression holding metadata. Whether the specification provides its own metadata forms or not, an application could use a BULK serialization for MARC, TEI Header, XML or RDF for this metadata expression. The vocabulary selected would be univocally expressed by the namespace and every vocabulary would be parsed by the same mechanisms.
Whether some content must be stored as-is instead of serialized, or a highly-optimized ad hoc serialization exists for some data, anything can always be stored within an array. Arrays can contain arbitrary bytes and there is no limit to their size.

Furthermore, BULK expressions can be evaluated. Most expressions evaluate to themselves, but some evaluate by default to the result of a function call, making it possible to serialize data in an even more compact form, by eliminating boilerplate data and repeated patterns.

1.3. Conventions and Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Literal numerical values are provided in decimal or hexadecimal as appropriate. Hexadecimal literals are prefixed with 0x to distinguish them from decimal literals.

The text notation of the BULK stream uses mnemonics for some byte sequences. Mnemonics are series of characters, excluding all capital letters and white space, like this-is-one-mnemonic or what-the-%§!?#-is-that?. They are always separated by white space. Outside the use of mnemonics, a sequence of bytes (of one or more bytes) can be represented by its hexadecimal value as an unsigned integer (e.g. 0x3F or 0x3A0B770F). Some types in this specification define a special syntax for their representation in the text notation.

In the grammar, a shape is a pattern of bytes, following the rules of the text notation for a BULK stream. Apart from mnemonics and fixed sequences of bytes, a shape can contain:

an arbitrary sequence of a fixed number of bytes, represented by its size, i.e. a number of bytes in decimal immediately followed by a B uppercase letter (e.g. 4B)
a typed sequence of bytes, represented by the name of its type, a capitalized word (e.g. Foo); this means a sequence of bytes whose specific yield (cf. Section 2.1) has this type
a named sequence of bytes (of zero or more bytes), represented by a series of any character excluding '{}' between '{' and '}' (e.g. {quux}); a named sequence can be typed or sized, in which case it is immediately followed by ':' and a type or size (e.g. {quux}:Bar or {quux}:12B)

When an entire shape describes the byte sequence of an atom, it is the normative specification for parsing it, but shapes of forms are only normative with respect to their evaluation. A reference defined with a form shape can be used in different shapes, albeit with different semantics and value.

2. BULK syntax

A BULK stream is a sequence of 8-bit bytes. Bits and bytes are in big-endian order. The result of parsing a BULK stream is a sequence of abstract data, called the abstract yield. BULK parsing is deterministic but not injective: a BULK stream has exactly one abstract yield, but different BULK streams can have the same abstract yield.

A processing application is not expected to actually produce the abstract yield, but an adaptation of the abstract yield to its own implementation, called the concrete yield. Also, some expressions in a BULK stream may have the semantics of a transformation of the abstract yield. A processing application may thus not produce or retain the concrete yield but the result of its transformation. This specification deals mainly with the byte sequence and the abstract yield and occasionally provides guidelines about the concrete yield.

The abstract yield is a sequence of expressions. Expressions can be atoms or forms. Forms are sequences of expressions. If a byte sequence is parsed as an expression, this byte sequence is said to denote this expression.

2.1. Parsing algorithm

The parser operates with a context, which is a sequence of expressions. Each time an expression is parsed, it is appended at the end of the context. The initial context is the abstract yield.

At the beginning of a BULK stream and after having consumed the byte sequence denoting a complete expression, the parser is at the dispatch stage. At this stage, the next byte is a marker byte, which tells the parser what kind of expression comes next (the marker byte is the first byte of the sequence that denotes an expression). The expression appended to the context after reading a byte sequence is called the specific yield of the byte sequence.

The 0x1 and 0x2 marker bytes are special cases. When the parser reads 0x1, it immediately appends an empty sequence to the current context. This sequence becomes the new context. This new context has the previous context as parent. Then the parser returns to its dispatch stage. When the parser reads 0x2, it appends nothing to the context, but instead the parent of the current context becomes the new context and the parser returns to the dispatch stage. Thus it is a parsing error to read 0x2 when the context is the abstract yield.

The scope of an expression is the part of its context that follows the expression.

Whenever a parsing error is encountered, parsing of the BULK stream MUST stop.
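
The context handling described in this section can be illustrated with a minimal sketch in Python. It only handles nil and the two form markers. The names parse_forms and ParseError, the use of None for nil and of lists for sequences are assumptions of this sketch, not part of this specification. The sketch also treats a stream that ends inside an open form as an error, which this section does not state explicitly.

```python
class ParseError(Exception):
    """Raised on a parsing error, after which parsing stops (hypothetical name)."""


def parse_forms(stream: bytes) -> list:
    """Parse a stream made only of nil (0x0), '(' (0x1) and ')' (0x2)."""
    abstract_yield = []              # the initial context
    context = abstract_yield
    parents = []                     # stack of parent contexts
    for byte in stream:              # each iteration is one dispatch stage
        if byte == 0x00:
            context.append(None)             # nil atom, represented as None
        elif byte == 0x01:
            new_context = []
            context.append(new_context)      # the empty sequence is appended
            parents.append(context)
            context = new_context            # and becomes the new context
        elif byte == 0x02:
            if not parents:
                raise ParseError("0x2 read while the context is the abstract yield")
            context = parents.pop()          # the parent becomes the context
        else:
            raise ParseError(f"marker byte {byte:#x} not handled in this sketch")
    if parents:
        # Assumption of this sketch: an unterminated form is an error.
        raise ParseError("stream ended inside a form")
    return abstract_yield
```

With this sketch, the bytes 0x1 0x0 0x1 0x0 0x0 0x2 0x2 0x0, i.e. ( nil ( nil nil ) ) nil, yield [[None, [None, None]], None], and a stream starting with 0x2 raises ParseError.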

2.1.1. Evaluation

A processing application MAY implement evaluation of BULK expressions and streams. When evaluating a BULK stream, when the parser gets to the dispatch stage and the context is the abstract yield, the last expression in the context is replaced by what it evaluates to.

The default evaluation rule is that an expression evaluates to itself. A name within a namespace can have a value, which is what a reference associated to this name evaluates to. A reference whose marker value is associated to no namespace or whose name has no value evaluates to itself. How self-evaluating BULK expressions are represented in the concrete yield is application-dependent, but future specifications may define a standard API to access it, similar to the Document Object Model for XML.

The evaluation of a sequence obeys a special rule, though: if the first expression of the sequence has type Function, that function is called with an argument list and the sequence evaluates to the return value. If the function has type LazyFunction, the argument list is the rest of the sequence. If the function has type EagerFunction, the argument list is the rest of the sequence, where each expression is replaced by what it evaluates to. Any expression that has type LazyFunction or EagerFunction also has type Function.
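
The special rule for sequences can be sketched as follows in Python, assuming that sequences are represented as lists and that functions are already present as objects in the first position (resolving a reference to its value is left out). The class layout, the attribute name call and the two example functions add and quote are hypothetical and only serve the illustration.

```python
class Function:
    """An expression of type Function, wrapping a Python callable."""

    def __init__(self, call):
        self.call = call  # takes the argument list, returns the value


class LazyFunction(Function):
    """Receives the rest of the sequence unevaluated."""


class EagerFunction(Function):
    """Receives the rest of the sequence, each expression evaluated."""


def evaluate(expression):
    is_call = (isinstance(expression, list) and expression
               and isinstance(expression[0], Function))
    if is_call:
        function, arguments = expression[0], expression[1:]
        if isinstance(function, EagerFunction):
            arguments = [evaluate(argument) for argument in arguments]
        return function.call(arguments)
    return expression  # default rule: an expression evaluates to itself


add = EagerFunction(sum)                            # hypothetical function
quote = LazyFunction(lambda arguments: arguments)   # hypothetical function
```

Here evaluate([add, 1, [add, 2, 3]]) returns 6 because the inner sequence is evaluated before add is called, whereas evaluate([quote, [add, 2, 3]]) returns the inner sequence untouched.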

2.2. Forms

2.2.1. starting marker byte

marker: 0x1
mnemonic: (

2.2.2. ending marker byte

marker: 0x2
mnemonic: )

2.2.3. Difference between sequence and form

Beware that although the specific yield of a form is a sequence of expressions, it is not the same thing as a byte sequence described as a sequence of expressions. Let's examine several forms of the shape ( foo {seq} ).

In the first form, {seq} is a sequence of 3 expressions: ( foo nil nil nil ). In the second form, {seq} is a single expression, and that expression is an atom: ( foo nil ). But in the third form, {seq} is also a single expression, and that expression is a form: ( foo ( nil nil nil ) ).

2.3. Atoms

2.3.1. nil

marker: 0x0 (mnemonic: nil)
shape: nil

Apart from being a possible short marker value, the fact that the 0x0 byte represents a valid atom means that a sequence of null bytes is a valid part of a BULK stream, thus making the format less fragile. In a network communication, nil atoms can also be sent to keep the channel open. They can also be used as padding at the end of a form.

2.3.2. Array

marker: 0x3 (mnemonic: #)
shape: # Int {content}

Arrays have a special parsing rule. After consuming the marker byte, the parser returns to the dispatch stage. It is a parsing error if the parsed expression is not of type Int or if its value cannot be recognized. This integer is not added to any context; instead, the parser consumes as many bytes as the value of this integer, and these bytes constitute the content of the array.

If two arrays have the shapes # {s1} {c1} and # {s2} {c2} and if {s1+s2} denotes the sum of {s1} and {s2}, then their concatenation is # {s1+s2} {c1} {c2}.

In the text notation, a quoted string represents an array containing the encoding of that string in the current encoding.

Type: Array
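
As an illustration of the array syntax and of the concatenation rule, here is a minimal sketch in Python. It assumes that the size is always written as an unsigned binary word (Section 2.3.3), although any expression of type Int is allowed, and it chooses the smallest word that fits, which this section does not require. The names encode_array and decode_array are hypothetical.

```python
WORD_MARKERS = {1: 0x04, 2: 0x05, 4: 0x06, 8: 0x07, 16: 0x08}  # size -> marker
WORD_SIZES = {marker: size for size, marker in WORD_MARKERS.items()}


def encode_array(content: bytes) -> bytes:
    """Encode content as '# Int {content}', the size in the smallest word."""
    for size, marker in WORD_MARKERS.items():
        if len(content) < 1 << (8 * size):
            header = bytes([0x03, marker]) + len(content).to_bytes(size, "big")
            return header + content
    raise ValueError("content too large for a 128-bit size")


def decode_array(data: bytes, position: int = 0):
    """Return (content, position just after the array)."""
    if data[position:position + 1] != b"\x03":
        raise ValueError("not an array marker")
    has_size = position + 1 < len(data)
    size = WORD_SIZES.get(data[position + 1]) if has_size else None
    if size is None:
        raise ValueError("array size is not a binary word")
    start = position + 2 + size
    length = int.from_bytes(data[position + 2:start], "big")
    if start + length > len(data):
        raise ValueError("array content is truncated")
    return data[start:start + length], start + length
```

For example, encode_array(b"abc") gives the bytes 0x3 0x4 0x3 followed by the three content bytes, and the content of encode_array(a + b) is the concatenation of the contents of encode_array(a) and encode_array(b).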

2.3.3. Binary words

A word can be interpreted either as a bit sequence or as an unsigned integer in binary notation. The choice depends on the context and the application. Actually, many processing applications may not need to make any choice, as most programming language implementations also conflate unsigned integers and bit sequences to some extent.

2.3.3.1. 8 bits word

marker: 0x4 (mnemonic: w8)
shape: w8 1B

Types: Int, Word, Word8

2.3.3.2. 16 bits word

marker: 0x5 (mnemonic: w16)
shape: w16 2B

Types: Int, Word, Word16

2.3.3.3. 32 bits word

marker: 0x6 (mnemonic: w32)
shape: w32 4B

Types: Int, Word, Word32

2.3.3.4. 64 bits word

marker: 0x7 (mnemonic: w64)
shape: w64 8B

Types: Int, Word, Word64

2.3.3.5. 128 bits word

marker: 0x8 (mnemonic: w128)
shape: w128 16B

Types: Int, Word, Word128

2.3.4. Signed fixed-size integer

marker: 0x9 (mnemonic: sint)
shape: sint Word

The value of its contained word is the value of this integer in two's-complement notation.

Type: Number, Int
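
A minimal sketch in Python of how a signed fixed-size integer could be decoded, relying on the word markers of Section 2.3.3. The function name decode_sint is hypothetical.

```python
WORD_SIZES = {0x04: 1, 0x05: 2, 0x06: 4, 0x07: 8, 0x08: 16}  # marker -> size


def decode_sint(data: bytes) -> int:
    """Decode 'sint Word' found at the start of data."""
    if data[:1] != b"\x09":
        raise ValueError("not a sint marker")
    size = WORD_SIZES.get(data[1]) if len(data) > 1 else None
    if size is None:
        raise ValueError("sint must contain a binary word")
    payload = data[2:2 + size]
    if len(payload) != size:
        raise ValueError("word is truncated")
    # The word is read as a two's-complement integer, big-endian.
    return int.from_bytes(payload, "big", signed=True)
```

For example, the bytes 0x9 0x4 0xFF denote -1, and the bytes 0x9 0x5 0xFF 0x0 denote -256.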

2.3.5. Reserved marker bytes

Marker bytes 0xA−0xF are reserved for future major versions of BULK. It is a parser error if a BULK stream with major version 1 contains such a marker byte.

2.3.6. Reference

marker: 0x10−0xFF
shape: {ns}:1B {name}:1B

The {ns} byte is a value associated with a namespace. Values 0x10−0x1F are reserved for namespaces defined by BULK specifications. Greater values can be associated with namespaces identified by a UUID.

The {name} byte is the name within the namespace. Vocabularies with more than 256 names thus need to be spread across several namespaces.

The specification of a namespace SHOULD include a mnemonic for the namespace and for each defined name. When descriptions use several namespaces, the mnemonic of a reference SHOULD be the concatenation of the namespace mnemonic, ":" and the name mnemonic if there can be an ambiguity. For example, the fp name in namespace math becomes math:fp.

Type: Ref

2.3.6.1. Special case

References have a special parsing rule, in case a BULK stream needs a large number of namespaces: if the marker byte is 0xFF, the parser continues to read bytes until it finds a byte different from 0xFF. The value of this sequence of bytes (including the final byte that differs from 0xFF), read as an unsigned integer, is the value associated with a namespace. For example, the reference denoted by the bytes 0xFF 0xFF 0x8C 0x1A is the name 26 in the namespace associated with 16777100.
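
The special case can be sketched as follows in Python. The function name decode_reference is hypothetical, and the sketch lets a truncated input raise IndexError rather than defining its own error.

```python
def decode_reference(data: bytes, position: int = 0):
    """Return (namespace value, name, position just after the reference)."""
    start = position
    if data[position] < 0x10:
        raise ValueError("not a reference marker")
    while data[position] == 0xFF:        # extended namespace value
        position += 1
    # All bytes read so far, including the first one that is not 0xFF,
    # form the namespace value as a big-endian unsigned integer.
    namespace = int.from_bytes(data[start:position + 1], "big")
    name = data[position + 1]
    return namespace, name, position + 2
```

Applied to the bytes 0xFF 0xFF 0x8C 0x1A, this returns the namespace value 16777100 and the name 26; applied to 0x10 0x1, it returns the namespace value 16 and the name 1.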

3. Standard namespaces

Standard namespaces have a fixed namespace value and are not identified by a UUID.

3.1. BULK core namespace

marker: 0x10 (mnemonic: bulk)

3.1.1. Version

name: 0x1 (mnemonic: version)
shape: ( version {major}:Int {minor}:Int )

When parsing a BULK stream, a processing application MUST determine explicitly the major and minor version of the BULK specification that the stream obeys. This information MAY be exchanged out-of-band, if BULK is used to exchange a number of very small messages, where repeated headers of 8 bytes might become too big an overhead. A processing application MUST NOT assume a default version.

If the version is expressed within a BULK stream, this form MUST be the first in the stream. In any other place, this form has no semantics attached to it. This specification defines BULK 1.0. When writing a BULK stream, an application MUST denote {major} and {minor} by the smallest byte sequence possible.

An application writing a BULK stream to long-term storage (e.g. in a file or a database record) SHOULD include a version form.

Two BULK versions with the same major version MUST share the same parsing rules and the same definitions of marker bytes. Changing the syntax or semantics of existing marker bytes, using marker bytes in the reserved interval, or changing the syntax or semantics of existing names in standard namespaces warrants a new major version.

Adding standard namespaces or adding names in existing standard namespaces warrants a new minor version.
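
As an illustration, the version form for version numbers below 256 can be written and recognized with the following Python sketch. Below 256, the 8 bits word is the smallest byte sequence for {major} and {minor}, which gives the 8-byte header mentioned above. The names version_form and read_version are hypothetical, and larger version numbers are deliberately not handled.

```python
def version_form(major: int, minor: int) -> bytes:
    """Build ( version {major} {minor} ) for version numbers below 256."""
    if not (0 <= major <= 0xFF and 0 <= minor <= 0xFF):
        raise ValueError("this sketch only handles version numbers below 256")
    #             (     bulk  version w8   major  w8    minor  )
    return bytes([0x01, 0x10, 0x01, 0x04, major, 0x04, minor, 0x02])


def read_version(stream: bytes):
    """Return (major, minor), or None if there is no leading version form."""
    if not stream.startswith(b"\x01\x10\x01"):
        return None
    well_formed = (len(stream) >= 8 and stream[3] == 0x04
                   and stream[5] == 0x04 and stream[7] == 0x02)
    if not well_formed:
        raise ValueError("version form not handled by this sketch")
    return stream[4], stream[6]
```

With this sketch, version_form(1, 0) is the byte sequence 0x1 0x10 0x1 0x4 0x1 0x4 0x0 0x2, and read_version returns None for a stream that does not start with ( bulk:version, in which case the version has to be known out-of-band.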

3.1.2. true

name: 0x2 (mnemonic: true)
shape: true

Type: Boolean.

3.1.3. false

name: 0x3 (mnemonic: false)
shape: false

Type: Boolean.

3.1.4. Strings encoding

name: 0x4 (mnemonic: stringenc)
shape: ( stringenc {enc}:Encoding )

This tells the processing application that, in the scope of this expression, all expressions that are understood by the application as character strings will be encoded with the encoding designated by {enc}.

As the abstract yield doesn't contain strings but expressions that will be used as strings by the application, it is not a parsing error if the application doesn't recognize {enc}. In this situation, it is a parsing error when the application actually needs to decode a byte sequence as a string. It is not a parsing error when a processing application only transmits a byte sequence encoding a string, if it can accurately convey the encoding to the receiving application.

3.1.5. IANA registered character set

name: 0x5 (mnemonic: iana-charset)
shape: ( iana-charset {id}:Int )

This designates the string encoding registered among the IANA Character Sets [IANA-Charsets] whose MIBenum is {id}.

Type: Encoding.

3.1.6. Windows code page

name: 0x6 (mnemonic: code-page)
shape: ( code-page {id}:Int )

This designates the string encoding among Windows code pages whose identifier is {id}.

Type: Encoding.

3.1.7. Namespaces

The semantics of some expressions is to make a namespace required. It is a parsing error if a processing application doesn't recognize this namespace.

3.1.7.1. Required namespace

name: 0x7 (mnemonic: ns)
shape: ( ns {mark}:Int {uuid}:Word128 )

This associates the namespace identified by {uuid} to the value {mark}. It makes this namespace required.

3.1.7.2. Optional namespace

name: 0x8 (mnemonic: ns*)
shape: ( ns* {mark}:Int {uuid}:Word128 )

This associates the namespace identified by {uuid} to the value {mark}.
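
A minimal Python sketch of the byte sequences of the ns and ns* forms, assuming a marker value that fits in a single byte (0x20 to 0xFE, since 0x10−0x1F are reserved and 0xFF starts the special case of Section 2.3.6.1) written as an 8 bits word. The function name namespace_form and the UUID used in the example are hypothetical.

```python
import uuid


def namespace_form(mark: int, namespace: uuid.UUID, required: bool = True) -> bytes:
    """Build ( ns {mark} {uuid} ), or ( ns* {mark} {uuid} ) if not required."""
    if not 0x20 <= mark <= 0xFE:
        raise ValueError("this sketch only handles single-byte marker values")
    name = 0x07 if required else 0x08          # bulk:ns or bulk:ns*
    head = bytes([0x01, 0x10, name, 0x04, mark, 0x08])   # ( ref w8 mark w128
    return head + namespace.bytes + b"\x02"    # 16 UUID bytes, then )


# Hypothetical namespace UUID, for illustration only.
EXAMPLE_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")
required_form = namespace_form(0x20, EXAMPLE_NAMESPACE)
optional_form = namespace_form(0x21, EXAMPLE_NAMESPACE, required=False)
```

Both forms are 23 bytes long: 6 bytes up to and including the w128 marker, the 16 bytes of the UUID in big-endian order, and the ending marker byte.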

3.1.7.3. Package

name: 0x9 (mnemonic: package)
shape: ( package {uuid}:Word128 {namespaces} )

This creates a package identified by {uuid}. Packages are immutable: {uuid} MUST be a v5 UUID generated with the BULK UUID and the byte sequence {namespaces}. {namespaces} must be a sequence of expressions, each of which either has type Word128 or has the shape ( ns* Word128 ), which means that the designated namespace is optional to this package.

3.1.7.4. Import

name: 0xA (mnemonic: import)
shape: ( import {base}:Int {uuid}:Word128 )

This associates the namespaces in the package identified by {uuid} with a contiguous range of values starting at {base}. If a namespace is not optional to this package, this form makes that namespace required. It is a parsing error if a processing application doesn't recognize the package.

3.1.8. Definitions

To define a reference is to change the value of its name in its namespace (as identified by its UUID, not the marker value) within a certain scope.

If a BULK stream is not evaluated, the semantics of a definition are entirely application-dependent.

When a BULK stream containing definitions for a namespace comes from a trusted source (e.g. in configuration files of the application, or in the communication with an agent that has been granted the relevant authority), an application MAY give those definitions long-lasting semantics (i.e. keep the values of the names at the end of parsing). This is the preferred mechanism for BULK namespace definition when the semantics of the defined expressions can be expressed completely by BULK forms.

3.1.8.1. Simple definition

name: 0xB (mnemonic: define)
shape: ( define {ref}:Ref Expr )

This defines the reference {ref} in the scope of this expression.

3.1.9. Substitution

3.1.9.1. Substitution function

name: 0xC (mnemonic: subst)
shape: ( subst {code} )

Name's type: LazyFunction
Form's type: EagerFunction
Form's value: A substitution function whose return value is the value of {code}. Within {code}'s specific yield, the names arg and rest are defined:

3.1.9.1.1. Argument

name: 0xD (mnemonic: arg)
shape: ( arg {n}:Int )

Name's type: EagerFunction
Form's type: Expr
Form's value: the element number {n} (starting at zero) of the substitution function's arguments list

3.1.9.1.2. Rest of arguments list

name: 0xE (mnemonic: rest)
shape: ( rest {n}:Int )

Name's type: EagerFunction
Form's type: Expr
Form's value: the substitution function's arguments list without its first {n} elements.

3.1.9.2. Named expression

name: 0xF (mnemonic: named)
shape: ( named {ref}:Ref {expr}:Expr )

This form defines {ref} as {expr} in its scope. This form evaluates to {expr}.

3.1.10. arithmetic

A processing application must recognize the type of all expressions defined in this specification that have the type Number, but an application MAY consider a number as having an unknown value if it has no adequate data type to store it.

In the text notation of a BULK stream, a decimal integer represents the shortest binary word (cf. Section 2.3.3) that can hold this integer. For example, ( 31 256 ) is a notation for the bytes 0x1 0x4 0x1F 0x5 0x1 0x0 0x2.
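
The mapping from a decimal integer of the text notation to bytes can be sketched in Python as follows, using the word markers of Section 2.3.3 (0x4 for 8 bits up to 0x8 for 128 bits). The names smallest_word and form are hypothetical.

```python
WORDS = ((0x04, 1), (0x05, 2), (0x06, 4), (0x07, 8), (0x08, 16))  # (marker, size)


def smallest_word(value: int) -> bytes:
    """Return the shortest binary word holding the unsigned integer value."""
    if value < 0:
        raise ValueError("binary words hold unsigned integers")
    for marker, size in WORDS:
        if value < 1 << (8 * size):
            return bytes([marker]) + value.to_bytes(size, "big")
    raise ValueError("value does not fit in a 128 bits word")


def form(*expressions: bytes) -> bytes:
    """Enclose already encoded expressions between 0x1 and 0x2."""
    return b"\x01" + b"".join(expressions) + b"\x02"


print(form(smallest_word(31), smallest_word(256)).hex(" "))
# prints: 01 04 1f 05 01 00 02
```

The integer 31 fits in an 8 bits word, whereas 256 needs a 16 bits word, hence the two different markers in the output.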

3.1.10.1. fraction

name: 0x20 (mnemonic: frac)
shape: ( frac {num}:Int {div}:Int )

This is the number {num}/{div}.

Type: Number.

3.1.10.2. Arbitrary precision signed integer

name: 0x21 (mnemonic: bigint)
shape: ( bigint {bits}:Array )

The bits contained in {bits} are the value of this integer in two's-complement notation.

Type: Number, Int.
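
A possible Python sketch for writing and reading a bigint form. It assumes that the array size fits in an 8 bits word and does not try to produce the shortest two's-complement representation in every case. The names encode_bigint and decode_bigint are hypothetical.

```python
def encode_bigint(value: int) -> bytes:
    """Build ( bigint {bits} ) with {bits} a two's-complement array."""
    size = (value.bit_length() + 8) // 8       # leaves room for the sign bit
    if size > 0xFF:
        raise ValueError("this sketch only handles array sizes below 256")
    bits = value.to_bytes(size, "big", signed=True)
    #             (     bulk  bigint #     w8    size
    head = bytes([0x01, 0x10, 0x21, 0x03, 0x04, size])
    return head + bits + b"\x02"


def decode_bigint(form: bytes) -> int:
    """Read back a form produced by encode_bigint."""
    if form[:5] != b"\x01\x10\x21\x03\x04" or form[-1:] != b"\x02":
        raise ValueError("form not handled by this sketch")
    bits = form[6:-1]
    if len(form) < 7 or len(bits) != form[5]:
        raise ValueError("array size does not match its content")
    return int.from_bytes(bits, "big", signed=True)
```

For example, encode_bigint(-1) is the byte sequence 0x1 0x10 0x21 0x3 0x4 0x1 0xFF 0x2, and 2 to the power 64, which no longer fits in a signed 64 bits word, is stored in an array of 9 bytes.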

3.1.10.3. Binary floating-point number

name: 0x22 (mnemonic: binary)
shape: ( binary {bits}:Word )
shape: ( binary {bits}:Array )

This is a floating-point number expressed in IEEE 754-2008 binary interchange format. If {bits} is an Array, the size of its contents must be a multiple of 32 bits, as per IEEE 754-2008 rules. {bits} MUST NOT have type Word8.

Types: Number, Float.
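
As an illustration, a binary64 number can be written and read back with the following Python sketch, which relies on the struct module of the standard library (its "e", "f" and "d" codes use the IEEE 754 binary16, binary32 and binary64 formats). Only the Word shape with 16, 32 and 64 bits words is covered; the names encode_binary64 and decode_binary are hypothetical.

```python
import struct

FLOAT_FORMATS = {0x05: ">e", 0x06: ">f", 0x07: ">d"}  # w16, w32, w64 markers


def encode_binary64(value: float) -> bytes:
    """Build ( binary {bits}:Word64 ) for a Python float."""
    #        (   bulk binary w64
    return b"\x01\x10\x22\x07" + struct.pack(">d", value) + b"\x02"


def decode_binary(form: bytes) -> float:
    """Read a ( binary {bits}:Word ) form with a 16, 32 or 64 bits word."""
    if form[:3] != b"\x01\x10\x22" or form[-1:] != b"\x02":
        raise ValueError("not a ( binary ... ) form")
    fmt = FLOAT_FORMATS.get(form[3])
    if fmt is None:
        raise ValueError("word size not handled by this sketch")
    payload = form[4:-1]
    if len(payload) != struct.calcsize(fmt):
        raise ValueError("word has the wrong length")
    return struct.unpack(fmt, payload)[0]
```

For example, encode_binary64(1.5) is 0x1 0x10 0x22 0x7, followed by the 8 bytes 0x3FF8000000000000, followed by 0x2.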

3.1.10.4. Decimal floating-point number

name: 0x23 (mnemonic: decimal)
shape: ( decimal {bits}:Word )
shape: ( decimal {bits}:Array )

This is a floating-point number expressed in IEEE 754-2008 decimal interchange format. If {bits} is an Array, the size of its contents must be a multiple of 32 bits, as per IEEE 754-2008 rules. {bits} MUST NOT have type Word8.

Types: Number, Float.

4. Extension namespaces

Extension namespaces are defined with an identifier UUID, to be associated to a marker value.

4.1. Official namespaces

Extension namespaces defined as part of the official BULK suite MUST be identified by a v5 UUID. The namespace UUID used to generate it MUST be urn:uuid:abaddeed-face-11e2-9605-74de2b4102f1. The name used to generate it SHOULD be a URI designating the described vocabulary when one exists.
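
For instance, such an identifier can be computed with the uuid module of the Python standard library, as in the following sketch. The function name official_namespace_uuid and the example URI are hypothetical; only the namespace UUID comes from this section.

```python
import uuid

# Namespace UUID mandated by this section for official namespaces.
BULK_NAMESPACE = uuid.UUID("abaddeed-face-11e2-9605-74de2b4102f1")


def official_namespace_uuid(vocabulary_uri: str) -> uuid.UUID:
    """Return the v5 UUID derived from the URI of the described vocabulary."""
    return uuid.uuid5(BULK_NAMESPACE, vocabulary_uri)


# Hypothetical vocabulary URI, for illustration only.
example = official_namespace_uuid("http://example.org/vocabulary")
```

Since v5 UUIDs are name-based, the same URI always gives the same identifier, so two parties can derive the UUID of an official namespace independently.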

4.2. User-defined namespaces

User-defined namespaces are actually no different than official namespaces, apart from the choice of UUID.

5. Profiles

A profile is a byte sequence parsed by a processing application just after the version form or before the first expression if there is no version form. Thus a parser SHOULD look ahead at the beginning of a stream to see if the first three bytes are ( bulk:version. With respect to the BULK stream, the profile is out-of-band information, usually implicit.

A processing application doesn't need to include the profile in the concrete yield, as long as the semantics of the abstract yield are maintained.

The same BULK stream might be processed with different profiles.

A processing application MUST NOT deduce the profile from the content of a BULK stream.

5.1. Profile redundancy

A processing application should only rely on the use of a profile when it is a safe assumption that the profile is known, for example within a communication where the protocol dictates the profile.

In particular, long-term storage of a BULK stream should preserve profile information, for example with a media type that dictates the profile.

Otherwise, an application writing a BULK stream in a long-term storage SHOULD include the profile after the version form. For this reason, the expressions in a profile SHOULD have idempotent semantics.

5.2. Standard profile

This specification defines the default profile that a processing application MUST use when it is not using a specific profile:

( bulk:stringenc 106 )

This means that the default string encoding in a BULK stream is UTF-8.

6. Security Considerations

6.1. Parsing

Parsing a BULK stream is designed to be free of side-effects for the processing application, apart from storing the parsed results.

Arrays in BULK carry their size, so that the application knows in advance the size of the data to read and store, thus making it easier to build robust code. Malicious software, however, may announce an array with a size chosen to get an application to exhaust its available memory. When a BULK stream has been completely received, an array bigger than the remaining data SHOULD trigger an error. When a BULK stream's size is not known in advance, the application SHOULD use a growable data structure.
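
Both recommendations can be sketched in Python as follows. The function names, the chunk size and the use of a file-like object with a read method are assumptions of this sketch.

```python
import io


def check_announced_size(announced: int, remaining: int) -> None:
    """For a completely received stream: reject an oversized array."""
    if announced > remaining:
        raise ValueError("array is bigger than the remaining data")


def read_array_content(source, announced: int, chunk_size: int = 65536) -> bytes:
    """For a stream of unknown size: grow the buffer as data arrives."""
    content = bytearray()            # growable, nothing is preallocated
    while len(content) < announced:
        chunk = source.read(min(chunk_size, announced - len(content)))
        if not chunk:
            raise ValueError("stream ended before the announced array size")
        content += chunk
    return bytes(content)


source = io.BytesIO(b"abcdef")
first_four = read_array_content(source, 4)   # b"abcd"
```

In the second function, memory use follows the data actually received, so an array announcing 2 to the power 40 bytes on a short stream fails with an error instead of triggering a huge allocation.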

6.2. Forwarding

When a processing application forwards all or part of the data in a BULK stream to another application, care must be taken if part of the forwarded data was not entirely recognized, as it could be used by an attacker to benefit from the authority the forwarding application has on the recipient of the data.

6.3. Definitions

The architecture of a processing application SHOULD ensure that a malicious agent cannot abuse authority given to it to define a namespace in order to modify associations in other namespaces. Depending on the use of data structures storing BULK expressions, this could amount to giving an attacker a way to manipulate the application's state. See Appendix A for an example of an architecture that is resistant to that kind of attack.

7. IANA Considerations

This specification defines a new media type, application/bulk. Here is the information for its registration with IANA:

Type name

application

Subtype name

bulk

Required parameters

none

Optional parameters

none

Encoding considerations

none, content is self-describing

Security considerations

cf. Section 6

Interoperability considerations

the constraint to start any BULK stream with a version form has the side-effect that classes of BULK streams can be identified by a sequence of bytes acting as a "magic number":

0x011001: any BULK stream
0x01100104: a BULK stream of any major version below 256
0x0110010401: a BULK stream of major version 1
0x0110010401040202: a BULK stream of version 1.2

Published specification

this document

Applications that use this media type

none so far

Fragment identifier considerations

this specification defines no semantics for addressing the data with a fragment identifier; a future specification could define fragment identifier syntaxes to address the content by byte offset or the parsed results by their number in the yielded sequence

Additional information

a future specification may define a naming convention for media types based on bulk with a +bulk suffix, as for XML with +xml
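The magic numbers listed under "Interoperability considerations" lend themselves to a simple prefix test. The following Python sketch tries the most specific prefix first; the names MAGIC and classify, and the wording of the descriptions, are hypothetical.

```python
# Prefixes taken from the "Interoperability considerations" entry,
# ordered from most to least specific.
MAGIC = [
    (bytes.fromhex("0110010401040202"), "BULK stream of version 1.2"),
    (bytes.fromhex("0110010401"), "BULK stream of major version 1"),
    (bytes.fromhex("01100104"), "BULK stream of major version below 256"),
    (bytes.fromhex("011001"), "BULK stream"),
]


def classify(head: bytes):
    """Return the most specific description matching the start of the data, or None."""
    for prefix, description in MAGIC:
        if head.startswith(prefix):
            return description
    return None
```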

8. Acknowledgements

The original author of this specification read Erik Naggum's famous rant about XML several years before, and it may well have unconsciously influenced this design. He happened to stumble upon it again while writing the earliest draft of this specification and it struck him how much it embodies Erik's ideas. In any case, this format is dedicated to Erik.

9. References

9.1. Normative References

[RFC2119]	Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[IANA-Charsets]	"IANA Charset Registry".

9.2. Informative references

[HTTP2]	Belshe, M., Peon, R., Thomson, M. and A. Melnikov, "Hypertext Transfer Protocol version 2.0", Internet-Draft draft-ietf-httpbis-http2-04, July 2013.
[Avro]	Cutting, D., "Apache Avro™ 1.7.4 Specification", February 2013.
[protobuf]	"Protocol Buffers", July 2008.
[Smile]	Saloranta, T., "Smile Data Format", September 2010.
[Thrift]	Slee, M., Agarwal, A. and M. Kwiatkowski, "Thrift: Scalable Cross-Language Services Implementation", April 2007.

Appendix A. Robust namespace definition

This constitutes a suggested architecture for a BULK processing application. It has the advantage that an agent cannot modify the values of names over which it has not specifically been given authority. This architecture doesn't ensure this property by checking the validity of definitions but by adhering to the Principle Of Least Authority, thus ensuring no false positives or TOCTOU race conditions.

For each new context (including the abstract yield when parsing starts), the parser creates a new copy of each known namespace. These copies are available in this context to retrieve and define values. This implements the lexical scoping of definitions, in addition to providing the robustness properties discussed here.

By default, all namespaces created in a context are discarded at the end of this context.

Of course, an implementation of the architecture presented here can be optimized compared to the abstract algorithm, for example by using copy-on-demand.

Any namespace that is not a copy made for its context, but the object retained by the application afterwards, gives authority to make long-lasting definitions. Such a namespace is called lasting here.
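The abstract algorithm can be sketched in Python as follows. This is an unoptimized illustration, not a prescribed implementation: the class Context, its methods and the use of plain dictionaries for namespaces are all assumptions made for the example.

```python
import copy


class Context:
    """Namespaces visible in one parsing context (hypothetical structure)."""

    def __init__(self, namespaces, lasting=()):
        # namespaces maps a namespace identifier to a dict of name -> value.
        self.namespaces = namespaces
        # Identifiers of lasting namespaces: shared with the application, never copied.
        self.lasting = frozenset(lasting)

    def child(self):
        """Create the context of a nested form."""
        spaces = {
            ns: table if ns in self.lasting else copy.deepcopy(table)
            for ns, table in self.namespaces.items()
        }
        return Context(spaces, self.lasting)

    def define(self, ns, name, value):
        self.namespaces[ns][name] = value

    def lookup(self, ns, name):
        return self.namespaces[ns][name]


root = Context({"a": {"x": 1}, "b": {"y": 2}}, lasting={"b"})
inner = root.child()
inner.define("a", "x", 10)  # scoped: discarded together with inner
inner.define("b", "y", 20)  # lasting: still visible once inner is gone
```

Because the nested context only ever holds copies of the namespaces it has no authority over, no validity check is needed when a definition is made.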

A.1. Selective authority

A number of lasting namespaces are included for the abstract yield. Their UUIDs are agreed out-of-band. The disadvantage of this solution is that it needs prior agreement on the definable namespaces.

A.2. Open authority

Any ns* form for a UUID unknown to the processing application triggers the creation of a lasting namespace.

The disadvantage of this solution is that it opens a denial of service vulnerability. If Bob is a processing application and Carol and Dave are agents communicating with Bob with an open authority, Dave can prevent Carol from defining a namespace if he manages to learn its UUID and to start a communication with Bob before Carol does.

If an agent uses a secure way to create UUIDs and protects their secrecy, this solution is both flexible and safe.
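The two policies can be contrasted with a small Python sketch. Everything in it is hypothetical: the class NamespaceRegistry, the method on_ns_form, and the choice to hand out a throwaway copy when an agent names a UUID over which it has no authority.

```python
import copy
import uuid


class NamespaceRegistry:
    """Hands out namespace tables according to an authority policy (hypothetical)."""

    def __init__(self, agreed=(), open_authority=False):
        # Selective authority: UUIDs agreed out-of-band are lasting from the start.
        self.agreed = frozenset(agreed)
        self.tables = {u: {} for u in self.agreed}
        self.open_authority = open_authority

    def on_ns_form(self, ns_uuid):
        """Return (table, lasting) for the namespace named by an ns* form."""
        if ns_uuid in self.agreed:
            return self.tables[ns_uuid], True
        if self.open_authority and ns_uuid not in self.tables:
            # Open authority: an unknown UUID creates a lasting namespace.
            table = self.tables[ns_uuid] = {}
            return table, True
        # Anything else gets a copy that is discarded with its context.
        return copy.deepcopy(self.tables.get(ns_uuid, {})), False


registry = NamespaceRegistry(open_authority=True)
shared = uuid.uuid4()
_, first_is_lasting = registry.on_ns_form(shared)   # first agent to name the UUID
_, second_is_lasting = registry.on_ns_form(shared)  # a later agent only gets a copy
```

In this sketch the second caller plays the role of Carol in the scenario above: the UUID is already known, so her definitions do not last.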

Appendix B. Semantic leeway

Atoms and forms have semantics, and sometimes evaluation rules, defined by this specification. Provided that parsing rules are maintained, a processing application MAY use any form and atom with different semantics and evaluation rules, as long as it does so within the confines of a form that makes this change of semantics explicit.

For example, if an application needs to store a large number of signed integers and the overhead of the two marker bytes is too big, the application SHOULD bypass the default semantics of words and define a form in which they are read in two's-complement notation. The following form could be interpreted as the sequence of integers (-1358102763 -30069 -100237):

( foo:sints w32 0xAF0CFF15 w16 0x8A8B w32 0xFFFE7873 )

If an overhead of one byte per integer is still too big and the integers all have the same size, an application MAY pack them into an array. The sequence of integers (-14376 -235 4567 -31299 42) could be serialized as:

( foo:array16 # w8 0xA 0xC7D8FF1511D785BD002A )

BULK's syntax also enables hybrid solutions, for example by packing runs of same-size integers into arrays within a sequence of words. So the sequence of integers (-1358102763 -30069 -100237 -14376 -235 4567 -31299 42 -8449839282860227965) could be serialized as:

( foo:flatten w32 0xAF0CFF15 w16 0x8A8B w32 0xFFFE7873 ( foo:array16 # w8 0xA 0xC7D8FF1511D785BD002A ) w64 0x8ABC23F0FFF98283 )
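The two's-complement readings used in this appendix can be checked mechanically. The Python sketch below assumes that words and array payloads are read most significant byte first; the function names are hypothetical and the hexadecimal strings stand for payloads with their marker bytes already removed.

```python
def signed_word(hex_digits: str) -> int:
    """Interpret a word, given as hex digits, as a two's-complement integer."""
    return int.from_bytes(bytes.fromhex(hex_digits), "big", signed=True)


def signed_array(hex_digits: str, width: int) -> list:
    """Split an array payload into width-byte two's-complement integers."""
    raw = bytes.fromhex(hex_digits)
    if len(raw) % width:
        raise ValueError("payload length is not a multiple of the integer width")
    return [
        int.from_bytes(raw[i : i + width], "big", signed=True)
        for i in range(0, len(raw), width)
    ]
```

For instance, signed_word("AF0CFF15") gives -1358102763, and signed_array("C7D8FF1511D785BD002A", 2) gives the five 16-bit integers of the foo:array16 example.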

Author's Address

Pierre Thierry
Thierry Technologies
EMail: pierre@nothos.net