Internet Engineering Task Force                              MMUSIC WG
Internet Draft                                              P. Cordell
draft-cordell-mmusic-umf-00.txt                         Tech-Know-Ware
June 1, 2001
Expires: December 2001

UMF - The Universal Message Format

STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as work in progress.

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

A number of methods and tools are available for defining the format of messages used for signalling protocols. However, many of these methods and tools have been designed for purposes other than message definition, and have been adopted on the basis that they are readily available rather than being ideally suited to the task. This often means that the methods make it difficult to get definitions correct, or result in unnecessary verbosity both in the definition and on the wire.

UMF - the Universal Message Format - has been custom designed for the purpose of message definition. It is thus easy to specify messages in a compact, extensible format that is readily machine manipulated to produce a compact encoding on the wire.

Cordell                                                       [Page 1]
Internet Draft     UMF - The Universal Message Format        June 2001

1. Introduction

This document defines the UMF message definition language, and the default text encoding method for messages defined in this way.

2. Requirements for Message Definition and Encoding

A good message definition method will have the following properties. It is these properties that UMF has been designed to have.

Precise Definitions

It is important to accurately capture type information in a message definition. Some message definition methods simply capture the name of a parameter without specifying the type of the parameter (e.g. integer, boolean etc). Additionally, types like integers need to be constrained to appropriate values. UMF provides this precision of definition.

Compact Definitions

The message definition should be as compact as possible, but no more compact. While helpful to the inexperienced developer, excessive keywords and other formatting can actually be detrimental to the understanding of the experienced developer. UMF adopts a compact C-like definition that contains minimal clutter and thus allows the true message structure to be readily seen at a glance.

Readily Extensible

The message definition and the resultant on-the-wire encoding need to support extensibility. As part of this, code should be able to pass over parameters that it does not understand without becoming confused. The UMF message definition and encoding allows this.

Extensible by Third Parties

It often occurs that a protocol is defined by one body and then adopted and modified by another body. In other cases a base protocol may be defined that is then augmented by external profiles. An effective method of allowing a third party to accurately specify a message definition as deltas to an existing message definition is important in this respect. UMF allows third parties to specify protocol additions that should not clash with additions made by other third parties.
Machine Parsable

It is desirable that the message definition be machine readable so that as much as possible of the slog involved in turning a message definition into running code is automated. This improves time to market and significantly reduces the potential for adding bugs into the code.

A UMF definition is in many respects a generalised form of C data structure definition. Therefore it is relatively simple to convert a machine independent UMF definition into a machine dependent C definition and provide all the code to convert from one data representation to another. This process can remove a vast amount of slog. Additionally, the various compilers involved in the process can do a large amount of validating to ensure that the implementation is correct.

Simplicity

While accurate message definition is important, it is perhaps even more important that the message definition method be intelligible to people that do not have a great deal of time to become gurus in yet another language. Therefore the definition method should be quick and easy to learn. This means that the message definition language must have minimal complexity. As complexity of definition and expressiveness are often interrelated, in some cases it is necessary to restrict expressiveness in the interests of simplicity. Additionally, consideration should also be given to the complexity of the required parser, which may favour simplicity of format over absolute message compactness.

UMF is based on the 80-20 principle. It is a small language that can accommodate the majority of situations extremely well. There will be times where a UMF representation is sub-optimal in terms of on-the-wire compactness. However, it is felt that on the whole, the gains in simplicity that this enables outweigh these sub-optimalities.
Compact On-the-Wire Encoding

As a general principle, it is desirable that encoded messages be as compact as possible. This minimises transmission bandwidth, can make processing the messages more efficient, and prevents premature fragmentation of datagrams.

Compact messages are also important in the area of mobile devices that have limited memory and possibly transmission bandwidth. This is particularly the case if the information is stored as persistent configuration data rather than being immediately discarded.

Also, in many cases, compact messages are easier for developers experienced in the protocol to read than some more verbose types, and it is these developers that should be the primary target for any measure aimed at easing debugging.

Given that there are limits to how compactly the actual data in a message can be represented, the compactness of a message is determined largely by the tagging. Existing protocols often use no tagging of data to minimise message size. They also allow for comma separated lists of parameters that have the same meaning rather than requiring each parameter to be separately tagged. Additionally, descriptive parameter names are essential to a clear message definition, but tags used in messages are often shorter than is descriptively useful (e.g.

instead of , instead of ). Therefore, it is desirable to be able to define a descriptive name that can be used in code and a tag name that can be used on the wire. UMF accommodates all of these requirements.

Flexible Implementation

While turnkey solutions are desirable, they are potentially complex to develop, and thus may incur some cost to use, thus making them inaccessible to some. Therefore a range of implementation routes is desirable, from minimal tools / maximum leg work, to maximal tools / minimum leg work.

UMF has a number of implementation routes in addition to the compilation route. A UMF definition can be converted into an ABNF definition and implemented via that route, or a DOM-like tree based parsing method can be used. (Downloadable software for these implementation routes is - or soon will be - available from [1].)

Support Easy Application Debugging

Ideally the messages on the wire should be in a form that aids the debugging process. By default UMF uses a text based line format, and is thus readily readable by human developers. Additionally, it is easy to manually generate test messages.

With the aid of cb-like tools, it is possible to format messages so that they are more readable than the most compact line representation. Additional tools make it possible to automatically generate test messages and use them as test vectors to test a parser, or validate that manually generated test messages actually conform to the message definition.

Nesting of Protocols

In some systems messages from one protocol are carried within messages from another protocol (TCP in IP is a simple example, as is HTML in HTTP). The definition and line encoding should allow this. UMF allows this.

Flexible On-the-Wire Encoding

It is not always possible to anticipate the direction of development so flexibility in the actual wire representation of the messages is desirable.
The principal UMF on-the-wire representation is text based. However, a UMF message definition can also be represented using alternate text formats such as XML, and can also be represented in binary.

2.1 That's UMF

UMF has been specifically designed to meet all of the above requirements.

3. UMF Message Definition

This section describes how UMF specifies the content of messages. As the syntax is C-like it is felt that many will immediately understand the message definition. For this reason a short example of a message definition is presented before describing the format in detail. The example is also used to give a rough indication of what the formal definition describes, and will thus hopefully help with the understanding of the latter.

3.1 Basic Principles of the Message Definition

Before presenting an example, and a more formal definition, it may be helpful to describe the basic principles of the message definition format.

Following the C language format, the basic format of a parameter definition is:

    type name

Type specifies things like integers, booleans, ASCII strings, Unicode strings and so on. The name is the name of the parameter. Thus a parameter definition might be:

    int rfc-number ;

In addition, a parameter definition can express constraints on the basic type, cardinality (how many instances of the type are valid in a message), and the tag to be used for the value on the wire. For example, an integer may be limited to the values 0 to 255, and an ASCII string may be limited to a maximum size. The fuller format of a parameter will have the form:

    type name [cardinality] tagging

For example:

    int <1..30000> referenced-rfcs [0..255] as refers ;

This defines an integer that can have values between 1 and 30000. The name of the parameter is referenced-rfcs, but it is tagged on-the-wire by 'refers'.
The parameter can consist of between 0 and 255 instances of the integer in a valid encoding.

Two types of compound parameter are also possible, these being 'struct' and 'union'. Having much the same meaning as they have in C, a struct specifies a group of parameters, all of which may be used in a particular instance of the struct. A union similarly specifies a group of parameters, but in this case only one of the parameters can be used in any one instance of the union.

An example of a struct is:

    struct my-rfc
    {
        int rfc-number;
        int <1..32000> referenced-rfcs[0..255] as refers ;
    };

3.2 An Example Message Definition

The following is an example message definition:

    module my-example.ietf.org

    struct my-example
    {
        int <0..255> participant-id as ?;
        Action action as ?;
        struct my-addition[0..1] as tech-know-ware.com plugin
        {
            bool tkw-app-capable as ?;
        };
    };

    union Action
    {
        Join join;
        Message message as msg;
        null leave;
    };

    struct Join
    {
        ascii<0..63> name;
    };

    struct Message
    {
        int <0..255> to-delegates[1..127] as to;
        ascii<0..255> message as msg;
        [ // Version 2 additions
            int <0..5> priority;
            bool acknowledge as ack;
        ]
        [ // Version 5 additions
            ascii<0..16> font-name[0..1] as name;
            null bold[0..1];
            null italic[0..1];
            null underlined[0..1] as ul;
        ]
    };

The above definition is intended to represent a very crude meeting controller. The first construct (my-example) is the root of all messages for the protocol. Each message identifies a participant using an integer in the range 0 to 255, called participant-id. When encoded on the wire, this parameter will be untagged due to the 'as ?' specification. Each message then has an action, which is also untagged. The type of the action parameter is not immediately specified, and instead references the 'Action' definition.
The Action definition is a union in which only one of the specified parameters may appear in an instance of the Action construct. This effectively represents a fork in the semantics of any given message. The options within Action can indicate that somebody has joined the meeting, left the meeting, or is sending a message to other delegates. There is no explicit tag for the 'join' and 'leave' options, so these will be tagged on-the-wire by the parameters' names, 'join' and 'leave' respectively. Conversely, an explicit tag for the 'message' parameter is specified, and hence the message option will be tagged by 'msg' on-the-wire.

The join parameter also has a referenced definition. Conceptually, when a person joins a meeting, all the other delegates are informed of their name. The name is an ASCII string that has a minimum length of 0 characters and a maximum length of 63 characters.

The message option is also a referenced definition. Conceptually, to send a message, the participant-id is used to identify the sender, and the to-delegates field contains the participant ids of all the people to whom the message is being sent. On-the-wire, the to-delegates parameter will be tagged with 'to'. Between one and 127 instances of the to-delegates parameter may appear in a message. Also, the message itself is included. The message will consist of ASCII characters and can be between 0 and 255 characters long. On-the-wire, the message field will have the tag 'msg'.

The priority and acknowledge fields within the message struct have been added in a later version of the protocol. This is indicated by the square brackets in which the parameters are wrapped. Similarly, font-name and its associated parameters have been added in version 5 of the protocol (according to the comment). The reader should already understand enough of the definition language to understand the meaning of these fields.
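The range constraints and cardinalities walked through above lend themselves to mechanical checking. The following Python sketch is illustrative only: the helper names (parse_range, in_range, check_message) are hypothetical and are not part of UMF or of any published UMF tool. It assumes ranges are always written in the full 'min..max' form used in the example, and it checks a candidate Message against the limits stated in the definition (1 to 127 delegates, each in 0..255, and an ASCII message of 0 to 255 characters).

```python
from typing import List, Optional, Tuple

# (minimum, maximum); a maximum of None stands for '*' (unbounded).
Range = Tuple[int, Optional[int]]


def parse_range(spec: str) -> Range:
    """Parse a range written as 'min..max', e.g. '0..255' or '1..*'.

    Assumption: only the full 'min..max' form is handled here.
    """
    low, sep, high = spec.partition("..")
    if not sep:
        raise ValueError(f"expected 'min..max', got {spec!r}")
    return int(low), (None if high.strip() == "*" else int(high))


def in_range(value: int, rng: Range) -> bool:
    """Return True if value lies within the inclusive range."""
    low, high = rng
    return value >= low and (high is None or value <= high)


def check_message(to_delegates: List[int], text: str) -> List[str]:
    """Return the constraint violations for the example 'Message' struct."""
    problems = []
    if not in_range(len(to_delegates), parse_range("1..127")):
        problems.append("to-delegates: cardinality outside 1..127")
    if any(not in_range(d, parse_range("0..255")) for d in to_delegates):
        problems.append("to-delegates: value outside 0..255")
    if not in_range(len(text), parse_range("0..255")):
        problems.append("message: length outside 0..255")
    if any(ord(c) > 127 for c in text):
        problems.append("message: non-ASCII character")
    return problems
```

A generated parser would presumably derive such checks from the definition itself rather than hard-coding them; the literals are repeated here only to keep the sketch short.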
Returning to the 'my-example' root, a third party has added an extension to the protocol in the form of the 'my-addition' parameter. It is identified as not being part of the base specification by the keyword 'plugin'. On-the-wire, the additional parameter will be identified by the tag 'tech-know-ware.com' to differentiate it from additions that may be made by other third parties.

On-the-wire encoded examples of this message definition are shown in section 4.2.

3.3 Formal Message Definition Syntax

There are two types of parameter in UMF, simple types and compound types. The ABNF definition of these is:

    UMF-parameter = simple-param / compound-param

Simple types represent parameters such as integers, booleans etc. The ABNF definition of a simple param is:

    simple-param = simple-type WS name [ cardinality ]
                   [ WS "as" WS explicit-tag ] [ WS "plugin" ] ";"

where WS represents white space.

The 'simple-type' represents the type of the parameter. It can have the following forms:

    simple-type = "null" / "bool" / "ipv4addr" / "ipv6addr" /
                  "embedded" / integer-type / string-type /
                  const-type / bytes-type / reference

where:

    integer-type = "int" [ "<" range-constraint ">" ]
    string-type = ( "ascii" / "unquoted-ascii" / "unicode" )
                  [ "<" length-constraint ">" ]
    const-type = "const" "<" 1*( safe-chars ) ">"
    bytes-type = "byte-array" [ "<" length-constraint ">" ]
    reference = name ; Refers to a type defined elsewhere

    range-constraint = constraint
    length-constraint = constraint
    constraint = [ min-constraint ".." ] max-constraint
    min-constraint = ["-"] 1*DIGIT
    max-constraint = ( ["-"] 1*DIGIT / "*" )

In the case of integer-type, the optional constraint specifies the minimum and maximum permissible values that the integer can take.
In the case of string-type, the optional constraint specifies the minimum and maximum number of characters that are allowed to appear in a valid encoding.

In the case of bytes-type, the optional constraint specifies the minimum and maximum number of bytes that are allowed to appear in a valid encoding.

In the constraint syntax, a maximum value '*' means infinite or unbounded.

The various types have the following meaning:

null
    A parameter that has no value. This is most useful in unions, and can also be used to represent boolean events wherein the absence of the parameter indicates false, and the presence of the parameter indicates true. It is more useful than you might at first think!

bool
    Can be true or false.

int
    An integer value.

ipv4addr
    Represents an IPv4 address, but not the port.

ipv6addr
    Represents an IPv6 address, but not the port.

ascii
    A string made up of ASCII characters, limited at most to values 0 to 127.

unquoted-ascii
    An ascii string usually has quote marks around it. This type does not have quotes around it. Consequently it cannot have any white space, or include any special characters (such as "=", "{", and "}") that would confuse the parser.

unicode
    A string made up of Unicode characters.

const
    This type allows a constant value to be inserted into the encoded message. It will typically be untagged. One thing it might be used for is identifying the protocol of the message definition. For example:

        const protocol as ?;

byte-array
    An array of bytes. Also useful for carriage of opaque data.

embedded
    The value is an embedded UMF message. This allows layering of message definitions.

The name is the name of the parameter. If there is no explicitly defined tag, then this is also used as the parameter's tag on-the-wire.
It has the format:

    name = ALPHA *( ALPHA / DIGIT / "-" / "_" )

The cardinality of a parameter specifies how many times a particular parameter can appear in a message. The format mirrors a C-like array specification, but uses UML style ranges rather than the singular values that are required in C. If the cardinality field is absent, then one and only one instance of the parameter must occur in a valid message. The format of the cardinality specification is:

    cardinality = "[" [ min-occurrences ".." ] max-occurrences "]"
    min-occurrences = ["-"] 1*DIGIT
    max-occurrences = ( ["-"] 1*DIGIT / "*" )

Once again, the '*' in max-occurrences represents infinite or unbounded. Example cardinalities are as follows:

    [0..1] ; Zero or one time
    [0..*] ; Zero or more times
    [*]    ; Same as above, zero or more times
    [1..*] ; One or more times
    [5]    ; Exactly five times

An explicit tag can be any sequence of characters that do not have special significance to the parser.

    explicit-tag = 1*( safe-chars )
    safe-chars = 1*( %x21 /      ; Not "
                     %x23-26 /   ; Not ' ( )
                     %x2A-2B /   ; Not ,
                     %x2D-3C /   ; Not =
                     %x3E /      ; Not ?
                     %x40-7A /   ; Not {
                     %x7C /      ; Not }
                     %x7E-7F )
                 ; Visible characters except = , " ' { } ( ) ?

Marking an item as plugin indicates to the developer and the tools that this parameter is (probably) not part of the original message definition. For example, it might be a proprietary extension. It also indicates that the parameter may not be present in all received messages, and it affects the way the binary encoding operates.

The compound types are struct and union. For a struct, subject to the various parameters' cardinality specifications, any, all or none of the parameters that a struct groups together may appear in a valid encoding of the construct. In the case of a union, only one of the parameters may be encoded in a valid instance of the construct.

The format of the compound types is similar to the simple types.
They have the form:

    compound-param = struct-param / union-param

    struct-param = "struct" WS name [ cardinality ]
                   [ WS "as" WS explicit-tag ] [ WS "plugin" ]
                   "{" struct-body "}" ";"

    union-param = "union" WS name [ cardinality ]
                  [ WS "as" WS explicit-tag ] [ WS "plugin" ]
                  "{" union-body "}" ";"

The format of the struct body is:

    struct-body = *( untagged-UMF-parameter )
                  [ last-untagged-UMF-parameter ]
                  *( UMF-parameter )
                  *( struct-extension )

The struct body starts with all the untagged parameters that have a cardinality of one and only one (untagged-UMF-parameter). These may be followed by a single untagged parameter that has a cardinality other than one. Following this, the tagged parameters are included. When the message definition is subsequently extended, another instance of the extension parameters construct is added for each version in which the construct is extended. (Note that all new parameters must always be added onto the end of an existing construct, and the order of parameters must never be rearranged from one version to the next.)

All of these have a similar format to the types already defined, except that in some cases they may be untagged, or only allow a unary cardinality. To make the ABNF definition accurate it is therefore necessary to repeat the above basic definitions with the appropriate tagging and cardinality specifications.

As mentioned, the struct body may start with untagged-UMF-parameters. These are untagged, and must have a cardinality of 1. Their definition is:

    untagged-UMF-parameter = untagged-simple-param /
                             untagged-compound-param

    untagged-simple-param = simple-type WS name WS "as" WS "?" ";"

    untagged-compound-param = untagged-struct-param /
                              untagged-union-param

    untagged-struct-param = "struct" WS name WS "as" WS "?"
                            "{" struct-body "}" ";"

    untagged-union-param = "union" WS name WS "as" WS "?"
                           "{" union-body "}" ";"

The next item in a struct body may be a single untagged item that has a cardinality other than one. It has the definition:

    last-untagged-UMF-parameter = last-untagged-simple-param /
                                  last-untagged-compound-param

    last-untagged-simple-param = simple-type WS name cardinality
                                 WS "as" WS "?" ";"

    last-untagged-compound-param = last-untagged-struct-param /
                                   last-untagged-union-param

    last-untagged-struct-param = "struct" WS name cardinality
                                 WS "as" WS "?"
                                 "{" struct-body "}" ";"

    last-untagged-union-param = "union" WS name cardinality
                                WS "as" WS "?"
                                "{" union-body "}" ";"

The third part of a struct definition is the items that are tagged. These can have any desired cardinality. These have the basic parameter definition that was initially presented, i.e. UMF-parameter.

The fourth and final part of a struct body is the extension fields. These are parameters that are added in subsequent versions of the protocol specification. They are marked out separately because a parser must always consider absence of these parameters to be a valid encoding so that it can receive messages from entities that are working with an earlier version of the protocol. Expressing this through cardinality alone would dictate that all extension parameters have a cardinality specification that included zero. This is tedious, potentially error prone, and loses some expressiveness. Instead, extension parameters are wrapped inside square brackets to indicate that they are extensions. It is then clear to any tools and developers that these parameters may be absent if a message is received from a host running an earlier version of the message definition. The format of the struct extension is:

    struct-extension = "[" 1*( UMF-parameter ) "]"

The definition of a union-body is as follows:

    union-body = [ integer-type WS name WS "as" WS "?" ";" ]
                 *( singular-UMF-parameter )
                 *( union-extension )

A union-body may have a single untagged integer parameter. All other parameters must be tagged and have a cardinality of one and only one. A union is extended in much the same way as a struct.

The untagged integer parameter allows integers to be defined that have wild-carding options. For example, a union might be defined as:

    union select
    {
        int<0..65535> numbered as ?;
        null any as *;
    };

Examples of the encoded form might be:

    select = 12
    select = *

The parameters within a union are only allowed unary cardinality to avoid ambiguity in the line encoding. If multiple instances of a parameter must be included as an option in a union, it is necessary to wrap the parameters within a struct, using something similar to:

    struct X
    {
        X x[1..*] as ?;
    };

As mentioned, most of the parameters within a union are tagged and have a cardinality of one. Their definition is:

    singular-UMF-parameter = singular-simple-param /
                             singular-compound-param

    singular-simple-param = simple-type WS name
                            [ WS "as" WS explicit-tag ]
                            [ WS "plugin" ] ";"

    singular-compound-param = singular-struct-param /
                              singular-union-param

    singular-struct-param = "struct" WS name
                            [ WS "as" WS explicit-tag ]
                            [ WS "plugin" ]
                            "{" struct-body "}" ";"

    singular-union-param = "union" WS name
                           [ WS "as" WS explicit-tag ]
                           [ WS "plugin" ]
                           "{" union-body "}" ";"

The union extension operates in a similar fashion to that of the struct, but references singular-UMF-parameters. Its definition is:

    union-extension = "[" 1*( singular-UMF-parameter ) "]"

It was mentioned previously that unions and structs could reference types that are defined elsewhere. The format of a referenced type can now be defined. Referenced types have a cardinality of one, and are untagged.
This is because the cardinality and tagging of the type are defined in the item that does the referencing, rather than where the referenced type is defined. (If a referenced type needs a cardinality other than one, it is recommended that the trick for giving a parameter within a union a non-unary cardinality be used.) The definition of the referenced types is:

    referenced-UMF-parameter = referenced-simple-param /
                               referenced-compound-param

    referenced-simple-param = simple-type WS name ";"

    referenced-compound-param = referenced-struct-param /
                                referenced-union-param

    referenced-struct-param = "struct" WS name
                              "{" struct-body "}" ";"

    referenced-union-param = "union" WS name
                             "{" union-body "}" ";"

A protocol may be extended by a third party without modifying the original definition. This may be due to a proprietary extension, or an externally defined profile of the base protocol. The specification for this type of extension is:

    third-party-extension = "plug" WS
                            ( tp-struct-extension / tp-union-extension )
                            "into" WS name *( "::" name )
                            *( "," name *( "::" name ) ) ";"

    tp-struct-extension = UMF-parameter
    tp-union-extension = singular-UMF-parameter

This specifies a parameter that is to be plugged into an existing construct. For example, if the following were defined:

    plug ascii cookie as cookie.tkw.com
        into my-example::my-addition;

The resultant definition would be treated as if it were:

    struct my-example
    {
        int <0..255> participant-id as ?;
        Action action as ?;
        struct my-addition[0..1] as tech-know-ware.com plugin
        {
            bool tkw-app-capable as ?;
            ascii cookie as cookie.tkw.com plugin;
        };
    };

The name field indicates the name of the construct that the item is to be plugged into.

A single protocol may be defined in a number of message definition files.
This might be for the purpose of accessing predefined libraries, or specifying the definition that the current definition extends. A message definition therefore begins with a set of optional directives expressing this information. They have the form:

    UMF-directives = [ "module" WS module-name WS ]
                     [ "extends" WS module-name ";" ]
                     *( "imports" WS module-name ";" )

    module-name = name *( "." name )

Module specifies the name of the module. Extends is used for a definition that contains a third party extension. The module-name in the extends specification indicates the message definition that is being extended. The imports statement indicates a library message definition that contains referenced types that are referenced within the message definition.

The module-name follows the hierarchical format used in Java. It is based on a domain name that is created from the name of the protocol, combined with the domain name of the entity that defined it. For example, if a protocol called the Simple Conference Protocol (SCP) were defined by the MMUSIC working group within the IETF, the module name might be:

    scp.mmusic.ietf.org

Finally, we are in a position to describe a complete UMF message definition. This is:

    UMF-definition = UMF-directives
                     1*( referenced-UMF-parameter /
                         third-party-extension )

The first parameter defined within the message definition is the root of the message definition tree, and is thus the outer-most construct of an encoded message.

3.4 Complete ABNF

This section presents the complete ABNF of a message definition without narrative. This definition indicates where white space (WS) must occur. However, white space may also occur between any tokens.

    UMF-definition = UMF-directives
                     1*( referenced-UMF-parameter /
                         third-party-extension )

    UMF-directives = [ "module" WS module-name WS ]
                     [ "extends" WS module-name ";" ]
                     *( "imports" WS module-name ";" )

    module-name = name *( "."
name )
   referenced-UMF-parameter = referenced-simple-param /
         referenced-compound-param
   referenced-simple-param = simple-type WS name ";"
   simple-type = "null" / "bool" / "ipv4addr" / "ipv6addr" /
         "embedded" / integer-type / string-type / bytes-type /
         const-type / reference
   integer-type = "int" [ "<" constraint ">" ]
   string-type = ( "ascii" / "unquoted-ascii" / "unicode" )
         [ "<" constraint ">" ]
   bytes-type = "byte-array" [ "<" constraint ">" ]
   reference = name      ; Refers to a type defined elsewhere
   constraint = [ min-constraint ".." ] max-constraint
   min-constraint = ["-"] 1*DIGIT
   max-constraint = ( ["-"] 1*DIGIT / "*" )
   name = ALPHA *( ALPHANUM / "-" / "_" )
   referenced-compound-param = referenced-struct-param /
         referenced-union-param
   referenced-struct-param = "struct" WS name
         "{" struct-body "}" ";"
   struct-body = *( untagged-UMF-parameter )
         [ last-untagged-UMF-parameter ]
         *( UMF-parameter )
         *( struct-extension )
   referenced-union-param = "union" WS name
         "{" union-body "}" ";"
   union-body = [ integer-type WS name WS "as" WS "?" ";" ]
         *( singular-UMF-parameter )
         *( union-extension )
   untagged-UMF-parameter = untagged-simple-param /
         untagged-compound-param
   untagged-simple-param = simple-type WS name WS "as" WS "?" ";"
   untagged-compound-param = untagged-struct-param /
         untagged-union-param
   untagged-struct-param = "struct" WS name WS "as" WS "?"
         "{" struct-body "}" ";"
   untagged-union-param = "union" WS name WS "as" WS "?"
         "{" union-body "}" ";"
   last-untagged-UMF-parameter = last-untagged-simple-param /
         last-untagged-compound-param
   last-untagged-simple-param = simple-type WS name cardinality
         WS "as" WS "?" ";"
   last-untagged-compound-param = last-untagged-struct-param /
         last-untagged-union-param
   last-untagged-struct-param = "struct" WS name cardinality
         WS "as" WS "?" "{" struct-body "}" ";"
   last-untagged-union-param = "union" WS name cardinality
         WS "as" WS "?" "{" union-body "}" ";"
   UMF-parameter = simple-param / compound-param
   simple-param = simple-type WS name [ cardinality ]
         [ WS "as" WS explicit-tag ] [ WS plugin ] ";"
   cardinality = "[" [ min-occurrences ".." ] max-occurrences "]"
   min-occurrences = ["-"] 1*DIGIT
   max-occurrences = ( ["-"] 1*DIGIT / "*" )
   explicit-tag = 1*( safe-char )
   safe-char = %x21 /         ; Not "
         %x23-26 /            ; Not ' ( )
         %x2A-2B /            ; Not ,
         %x2D-3C /            ; Not =
         %x3E /               ; Not ?
         %x40-7A /            ; Not {
         %x7C /               ; Not }
         %x7E-7F
         ; Visible characters except = , " ' { } ( ) ?
   compound-param = struct-param / union-param
   struct-param = "struct" WS name [ cardinality ]
         [ WS "as" WS explicit-tag ] [ WS plugin ]
         "{" struct-body "}" ";"
   union-param = "union" WS name [ cardinality ]
         [ WS "as" WS explicit-tag ] [ WS plugin ]
         "{" union-body "}" ";"
   struct-extension = "[" 1*( UMF-parameter ) "]"
   singular-UMF-parameter = singular-simple-param /
         singular-compound-param
   singular-simple-param = simple-type WS name
         [ WS "as" WS explicit-tag ] [ WS plugin ] ";"
   singular-compound-param = singular-struct-param /
         singular-union-param
   singular-struct-param = "struct" WS name
         [ WS "as" WS explicit-tag ] [ WS plugin ]
         "{" struct-body "}" ";"
   singular-union-param = "union" WS name
         [ WS "as" WS explicit-tag ] [ WS plugin ]
         "{" union-body "}" ";"
   third-party-extension = "plug" WS
         ( tp-struct-extension / tp-union-extension )
         "into" WS name *( "::" name )
         *( "," name *( "::" name ) ) ";"
   tp-struct-extension = UMF-parameter
   tp-union-extension = singular-UMF-parameter
   WS = comment / " " / HTAB / CR / LF
         ; HTAB, CR, LF defined in RFC 2234
         ; White space may appear between any
         ; token and is not limited to where
         ; it is explicitly specified
   comment = c-comment / cpp-comment
   c-comment = "/*" "*/"
   cpp-comment = "//" *( HTAB / %x20-7F ) ( CR / LF )
         ; A comment is treated as a single space for the
         ; purposes of parsing

4. On-the-Wire Representation

4.1 Principles of On-the-Wire Encoding

The basic format of the text-based on-the-wire encoding is:

   tag = value

If there are multiple instances of a parameter, then the values can be
conveyed in a comma-separated list, for example:

   tag = value, value, value

If a tag is explicitly specified in the message definition, then this
is used on the wire. If no tag is explicitly specified, then the name
of the parameter is used as the tag. It is also possible to specify
that no tag should be used on the wire by giving the explicit tag as
'?'. All untagged parameters within a struct except the last one must
have a cardinality of exactly one. All untagged items must appear in a
struct in the same order that they are defined in the message
definition, and must appear before any tagged items within the struct.
In these cases, the format on the wire becomes:

   value

or:

   value, value, value

Thus, for the examples quoted earlier, that is:

   int rfc-number;
   int <1..30000> referenced-rfcs [0..255] as refers;

the format on the wire would be something like (depending on the
actual values in question):

   rfc-number = 3024
   refers = 822, 791, 2543

4.2 Example On-the-Wire Representation

The following are example on-the-wire representations of the example
message.

   1 join = { 'Alice' }
     tech-know-ware.com = { True }

   1 msg = { to = 2, 5, 8, 58
             msg = 'Where are we going for dinner' }

   1 leave

4.3 Formal On-the-Wire Representation

The principal representation of an UMF-defined message on the wire is
text-based. Singular parameters may be untagged as long as they appear
before any tagged parameters. Parameters that have non-singular
cardinality must be tagged, with the exception of the last untagged
parameter of a struct (see Section 4.1).

The top-level construct of an UMF definition is a referenced type,
which essentially has no tag associated with it. (Indeed, the presence
of such a tag would not convey any information.)
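
As a rough illustration of the tagging rules in Section 4.1 (explicit
tag, parameter name as the default tag, and '?' for no tag), the
following Python sketch renders one parameter as text. It is not part
of the UMF definition: the function name encode_param and its
arguments are assumptions made for this example, and the values are
assumed to have already been rendered as text for their type.

```python
def encode_param(name, values, explicit_tag=None):
    """Render one parameter in the text form described in Section 4.1.

    name         -- parameter name from the message definition
    values       -- non-empty list of values (ints or pre-encoded strings)
    explicit_tag -- tag given with "as" in the definition, or None;
                    "?" means the parameter is untagged on the wire
    """
    if not values:
        # This sketch does not cover null or absent parameters.
        raise ValueError("at least one value is required")

    # Multiple instances are conveyed as a comma-separated list.
    encoded = ", ".join(str(v) for v in values)

    if explicit_tag == "?":
        # Untagged parameter: only the value list appears on the wire.
        return encoded

    # An explicit tag is used if given; otherwise the name is the tag.
    tag = explicit_tag if explicit_tag is not None else name
    return f"{tag} = {encoded}"
```

For the two definitions quoted in Section 4.1, encode_param("rfc-number",
[3024]) and encode_param("referenced-rfcs", [822, 791, 2543], "refers")
give the two on-the-wire lines shown there.
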
The top-level construct is therefore either a struct body, a union
body, or a simple value, as in:

   UMF-text-message = struct-body / union-body / simple-value

A struct body can contain untagged and tagged parameters. All untagged
parameters except the last one must have a cardinality of exactly one.
All untagged parameters must appear before any tagged parameters. The
definition of a struct-body is therefore:

   struct-body = *( value WS )
         [ value *( "," value ) WS ]
         *( ( tag WS ) /          ; For a null parameter
            ( tag "=" value *( "," value ) WS ) )
   tag = 1*( safe-char )

All items of a union body must be tagged, except for a single integer
parameter that may be untagged. Also, each parameter must have a
cardinality of one in the encoding to avoid ambiguities in the encoded
message. Therefore a union body has the form:

   union-body = integer-value /
         tag /                    ; For a null parameter
         ( tag "=" value )

where:

   value = simple-value / compound-value
   simple-value = bool-value / integer-value / ipv4addr-value /
         ipv6addr-value / ascii-value / unquoted-ascii-value /
         unicode-value / const-value / embedded-value / bytes-value
   bool-value = "True" / "False"
   integer-value = [ "-" ] 1*DIGIT
   ipv4addr-value = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT
   ipv6addr-value = ( 1*3DIGIT *( ":" 1*3DIGIT )
         [ ":" *( ":" 1*3DIGIT ) ] )
   ascii-value = "'" *( %x00-26 / %x28-5B / %x5D-7F /
         "\\" / "\'" ) "'"
   unquoted-ascii-value = 1*( safe-char )
   safe-char = %x21 /         ; Not "
         %x23-26 /            ; Not ' ( )
         %x2A-2B /            ; Not ,
         %x2D-3C /            ; Not =
         %x3E /               ; Not ?
         %x40-7A /            ; Not {
         %x7C /               ; Not }
         %x7E-7F
         ; Visible characters except = , " ' { } ( ) ?
   unicode-value = DQUOTE *( %x00-21 / %x23-5B / %x5D-FF /
         "\\" / "\" DQUOTE ) DQUOTE
         ; DQUOTE defined in RFC 2234
   bytes-value = 1*( HEXDIG HEXDIG )
         ; HEXDIG defined in RFC 2234
   const-value = 1*( safe-char )
   embedded-value = "(" *( %x00-28 / %x2A-5B / %x5D-FF /
         "\)" / "\\" ) ")"
         ; \ and ) are escaped

Illustrating the recursiveness of the message format, we have:

   compound-value = struct-value / union-value
   struct-value = "{" struct-body "}"
   union-value = union-body WS
   WS = 1*( comment / SP / HTAB / CR / LF )
         ; SP HTAB CR LF defined in RFC 2234
         ; WS may appear between any token and is not
         ; limited to those places where it is
         ; explicitly specified
   comment = c-comment / cpp-comment
   c-comment = "/*" "*/"
   cpp-comment = "//" *( HTAB / %x20-7F ) ( CR / LF )
         ; A comment is treated as a single space for the
         ; purposes of parsing

4.4 Marking Message Boundaries

Before a message is parsed it is necessary to know the boundaries of
the message. There are many ways in which this can be done, and the
method adopted should be specified in the protocol specification.
However, in the absence of any other way, UMF parsers should take the
presence of an unmatched closing brace to be the end-of-message
marker. Hence, the definition of a message delimited in this way
becomes:

   delimited-UMF-text-message = UMF-text-message "}"

4.5 Illustration of Encoded Types

This section illustrates how the types look once they have been
encoded according to the syntax above. The tag of each item has the
format 'my-XXXX'. Except in the case of the 'null' example, the XXXX
part indicates the type that is encoded to the right of the equals
sign.

   my-null                  // Tag only for a null parameter
   my-bool = True
   my-int = 5643
   my-ipv4addr = 10.0.0.1
   my-ipv6addr = 201:123::0
   my-ascii = 'UMF'
   my-unquoted-ascii = UMF
   my-unicode = "UMF"
   my-const = UMF
   my-bytes = 01AF3C
   my-embedded = ( my-other-int=5 single-closing-bracket-text= '\)' )
   my-struct = { 5434 All time=98787654654 }
   my-union = 5434
   my-union1 = Switch
   my-union2 = Volume = 11

5. Why UMF

The name UMF is pronounced in the same way as 'oomph'. The Collins
Paperback English Dictionary (1986) defines oomph as:

   oomph - (umf) n. Inf. 1. enthusiasm, vigour, or energy.
   2. sex appeal.

So who wants their code to have UMF?

6. References

To be added

[1] http://www.tech-know-ware.com/umf

7. Author's Address

Pete Cordell
Tech-Know-Ware Ltd
P.O. Box 30
Ipswich, IP5 2WY
UK

pete@tech-know-ware.com

Expires: December 2001