Internet Engineering Task Force                              P. Cordell
Internet Draft                                       Tech-Know-Ware Ltd
draft-cordell-lumas-01.txt                                July 24, 2003
Expires: January 2004

Lumas - Language for Universal Message Abstraction and Specification

STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as work in progress.

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

A number of methods and tools are available for defining the format of messages used for application protocols. However, many of these methods and tools have been designed for purposes other than message definition, and have been adopted on the basis that they are available rather than being ideally suited to the task. This often means that the methods make it difficult to get definitions correct, or result in unnecessary complexity and verbosity both in the definition and on the wire.

Lumas - Language for Universal Message Abstraction and Specification - has been custom designed for the purpose of message definition. It is thus easy to specify messages in a compact, extensible format that is readily machine manipulated to produce a compact encoding on the wire.

Table of Contents

1. Introduction
2. About Lumas
3. Lumas and Other Message Definition Languages
4. Terminology
5. Lumas Message Definition
   5.1 Basic Principles of the Message Definition
   5.2 An Example Message Definition
   5.3 Formal Message Definition Syntax
      5.3.1. Lumas Parameters
      5.3.2. Simple Parameters
      5.3.3. The Simple Types
      5.3.4. Simple Type Definition
      5.3.5. The Pattern Constraint
      5.3.6. The Name
      5.3.7. Cardinality
      5.3.8. Tagging
      5.3.9. The Plugin Extension Mechanism
      5.3.10. Reference Parameters
      5.3.11. Compound Parameters
      5.3.12. Struct Parameters
      5.3.13. Union Parameters
      5.3.14. Combined Parameters
      5.3.15. Referenced Parameters
      5.3.16. External Extensions - Plug and Pluggable
      5.3.17. Module Definition and Directives
      5.3.18. The Top Level Definition
   5.4 Locating Lumas within a Specification
6. On-the-Wire Representation
   6.1 Principles of On-the-Wire Encoding
   6.2 Example On-the-Wire Representation
   6.3 Formal On-the-Wire Representation
   6.4 Marking Message Boundaries
   6.5 Examples of Encoded Types
7. Common ABNF Definitions
8. Notes on Comments
9. Locating Lumas Modules
10. Mandatory to Understand
11. Security Considerations
12. References
13. Informative References
14. Author's Address

1. Introduction

Lumas is a lightweight, message definition language that is both flexible and highly extensible. This document defines the Lumas message definition language, and the default text encoding method for messages defined in this way.

2. About Lumas

Lumas - Language for Universal Message Abstraction and Specification - is a simple message definition language that can be used to define the messages used by protocols. In this context, a message is defined as a collection of data used to convey information between two or more machines (or processes). Typically Lumas is used to define application layer messages (e.g. at the layer at which the likes of SMTP [SMTP] is defined), but there is no practical reason why Lumas should not be used at other layers.

The design objectives of Lumas are simplicity, ease of use, efficiency, and extensibility. Lumas recognises that message definition is a small part of the overall development process and thus should not warrant a disproportionately large investment in learning the language. Lumas uses the 80/20 principle to keep it simple. Lumas is designed to readily allow the use of Lumas aware software tools to aid in the development process. Lumas messages are text-encoded by default so that they are easy to read, and it is easy to create test messages for debugging. Using Lumas in applications is designed to be simple and efficient. Lumas addresses a number of different types of extensibility, including versioning, external extensions, and component based architectures.

This makes Lumas an ideal definition language to use where simplicity, efficiency, compactness and/or a high degree of extensibility is required, especially where the extensibility involves plugging external modules into the base syntax.

3. Lumas and Other Message Definition Languages

Over the years a number of message definition methods have been developed. These include XDR [XDR], ASN.1 [ASN1], various flavours of IDL (such as OMG IDL [OMGIDL]), 'bit pictures,' various flavours of BNF (e.g. ABNF [ABNF]), and XML [XML]. It is therefore worthwhile considering how Lumas relates to these other message definition languages.

Lumas differs from XDR in that Lumas is primarily a language for defining text-encoded messages. XDR is fixed to defining binary messages of very specific types.

ASN.1 is also primarily a language for defining binary messages, although recently there have been XML encoding rules defined. ASN.1 information object classes are difficult to understand and a deterrent to its use. The complexity of some of the encoding rules, such as BER and PER, makes the method difficult to use without special tools.
ASN.1 has found uses in the IETF, notably in the areas of cryptography (CMS [CMS] etc) and SNMP [SNMP]. However, it is not much loved, and efforts such as SMING have been undertaken to replace its usage (although at the time of writing this effort seems to have stalled).

The IDL languages such as OMG IDL have similarities with message definition languages, but are subtly different. IDLs define a collection of objects, each of which describes a remote procedure call. They also define a return value for the procedure call. A protocol message set is typically a single object that can have a number of variants. A protocol will typically send another message in response to a message rather than sending a return value.

Perhaps for the reasons mentioned, the above methods have not received wide usage within the IETF. The main workhorses for message definition in the IETF have been 'bit pictures,' various types of BNF and more recently XML.

The term 'bit pictures' is used to refer to the pictures of bits and bytes that are used to capture the layout of parameters within a message, such as those used to define IP [IP], UDP [UDP] and TCP [TCP]. This is very low-level and really only suitable for protocols containing a few parameters which ideally have fixed positions. At a level higher than pure 'bit pictures' is the scheme used in TLS [TLS], but this again is specific to defining binary messages. Diameter [DIAMETER] presents another variation on this approach.

A number of types of BNF have been defined over the years, most recently ABNF. Until recently, the BNFs have been the main workhorse of IETF application level protocol definition. ABNF is very low-level, and is much like programming in assembler when high-level languages would be more useful. It is very difficult to get definitions correct, and issues such as ensuring extensibility have to be addressed not only for each message definition, but also for each parameter within the definition.
The implementation route from ABNF can also be long as there is typically not enough high level information in the specification for tools to extract the important elements.

This leaves XML. XML is a comprehensive and powerful way of defining messages. It would be a long and unproductive exercise to list all the things that XML gets right. Instead, the focus here is on the areas that a developer may wish to consider when choosing between Lumas and XML. The main differences between Lumas and XML are in the areas of simplicity and efficiency. Whether these differences are significant will depend on the application.

There are two parts of the XML route: XML itself, and the method used to define the XML messages.

Some of the less significant issues to consider are to do with XML itself. For example, it has long been recognised that the format of XML messages, with its start and end tags, is inefficient. (It is the author's belief that the extra tagging also makes the messages harder to read, because the message is dominated by tags rather than the important part, which is the values. Hence, what works well when there is a high ratio of PCDATA to tags, is detrimental when that ratio is significantly reduced.) The separation of parameters into attributes and elements adds complexity, but adds no real value in a protocol, and is an artefact of markup use. The provision for multiple character encodings (such as UTF-8, UTF-16BE, UTF-16LE, ISO-8859-1 etc) places demands on a parser as does the implementation of namespaces (where in a start tag the namespace is defined after the first use of the namespace), which requires double parsing or significant intermediate storage. The task of converting a namespace prefix to a namespace is potentially an area involving significant lookup effort.
Once expanded, the effective tag is a long sequence of characters on which comparison operations are performed, the size of which potentially reduces efficiency. User definable general entities and parameter entities are additional burdens that have little value for message definition, as is the white space handling, which is a hangover from XML as a markup language. While these are surmountable problems, the consideration for a developer has to be 'why pay for it if I don't need it?'

The second issue is how to define the XML messages. Arguably the current favourite is W3C XML Schema, although there are other methods including RELAX NG [RELAX] and Schematron [STRON]. First of all, it has to be admitted that this is currently a controversial area and the existence of the latter two is largely due to concerns about the former. The main concern with XML Schema is again complexity. Maybe in the future one of the other methods will prevail.

Keeping with XML Schema for now, firstly the language can be very difficult to learn. The specification is some 350 pages long (ignoring XML itself, and XML namespaces etc), and uses a formal language that is very confusing to interpret. In a number of areas there is even debate among the experts about what is intended. The constructs can be confusing and apparently contradictory in a number of areas, such as the notion of simpleType with ComplexContent and so on.

While XML Schema is touted as being extensible, in practice there are a number of traps for the unwary to fall into. For example, incorporated attribute and element groups, especially those from different schemas, can easily result in name clashes when they are extended independently. Enumerated strings are not extensible without careful consideration.

There is no support for capturing what has changed from one version of a schema to the next, other than doing a diff operation on two files. This again makes it difficult for tools.
Other features also make it difficult for tools, such as the ability to use patterns to restrict the format of basic types such as floating point numbers.

XML Schema has no concise way of specifying short tag names while at the same time specifying descriptive formal names. For example, the most common XML-like syntax, HTML [HTML], has an abundance of short tags. This makes it easy for the expert to type, and it must be assumed that the approach has some merit otherwise it wouldn't have been done that way. But XML Schema does not readily support this.

Verbosity is even more of an issue when it comes to XML Schema, in a number of cases requiring five or more lines of text when only one would do. This means extra scrolling or page turning when editing and viewing, which makes a schema harder to write, harder to check, and harder for a third party to understand.

Many of these problems are subjective. Some can also be avoided by defining style guides and best practices for using XML Schema (for example [XMLBCP]). Compression can be used to reduce the size of messages. However, this really just addresses the complexity by adding more complexity. Not only does this make it harder to learn, it is important to remember that where there is complexity, there is the potential for bugs. And bugs not only affect the integrity of the code, but can also affect the security of the system on which the code runs. Complexity is also a barrier to implementation.

It could be argued that the Internet has been successful because of its use of simple protocols. Using XML Schema would seem to be at odds with that principle. By being designed to be simple, Lumas avoids these problems.

In summary, currently the main tools used for message definition in the IETF are ABNF and XML Schema. In many respects these represent two extremes, one simple and very low-level, and the other complex and high-level. Lumas is a data point between these two extremes, giving much of the flexibility of XML with the ease of understanding and compactness of ABNF. As such it is a useful extra tool that allows protocol developers to better tailor protocols to their needs.

On another level, although message definition languages have been around for many years now, the relative paucity of options available, and the fact that XML is being trumpeted as a breakthrough in inter-platform communication, suggest that in terms of evolution, the field is in its infancy. It's easy to see why this might be. Message definition has not been seen as a core activity, and developers simply make do by borrowing what is already available in other fields, even if it is not an ideal fit to their requirements. This would suggest that there is scope for much development, and it may transpire that XML turns out to be the FORTRAN or COBOL of the message definition world, and there is much more exciting stuff to come. It is hoped that Lumas can play a part in that story.

4. Terminology

The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" in this document are to be interpreted as described in [KWORDS].

5. Lumas Message Definition

This section describes how Lumas specifies the content of messages. As the syntax is C-like it is felt that many will immediately understand the majority of a message definition. For this reason the basic principles of the definition language and a short example are presented before describing the format in detail.

5.1 Basic Principles of the Message Definition

The basic principles of the message definition format are described in this section.

Following the C language format, the basic format of a parameter definition is:

    type name ;

'Type' specifies things like integers, booleans, ASCII strings, Unicode strings and so on. The 'name' is the name of the parameter. Thus a parameter definition might be:

    ascii rfc-name ;

This says that 'rfc-name' is an ASCII string.

In addition, a parameter definition can express constraints on the type, constraints on the cardinality (how many instances of the type are valid in a message), and the tag to be used for the value on the wire. (A tag is a fixed sequence of characters that is used to identify the value that it is associated with.) For example, an integer may be limited to the values 0 to 255, and an ASCII string may be limited to a maximum size.

The fuller format of a parameter definition has the form:

    type name [cardinality] tagging ;

For example:

    int <1..30000> referenced-rfcs [0..255] as refers ;

This defines an integer that can have values between 1 and 30000. The name of the parameter is 'referenced-rfcs', but is tagged on-the-wire using the character sequence 'refers'. The parameter can consist of between 0 and 255 instances of the integer in a valid message.

Two main types of compound parameter are possible, these being 'struct' and 'union'. Having much the same meaning as they have in C, a struct specifies a group of parameters, all of which may be used in a particular instance of the struct. A union similarly specifies a group of parameters, but in this case only one of the parameters can be used in any one instance of the union.

An example of a struct is:

    struct rfc-info
    {
        ascii rfc-name;
        int <1..30000> referenced-rfcs[0..255] as refers ;
    };

A third form of compound type called 'combi' is also available. The name is short for 'combined' and the type allows a number of values to be concatenated together into what looks like a single value. Hence it can be used to define constructs like the character sequence 'HTTP/1.0', in which the '1' and the '0' are the major and minor version numbers.
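
As an illustration of the fuller parameter form described in this section, the following Python sketch splits a simple parameter definition into its parts. It is not part of the Lumas specification: the function name parse_simple_param and the regular expression are assumptions made for this example, and the sketch ignores compound parameters, comments, and the 'plugin' keyword.

```python
import re

# Hypothetical helper (not defined by Lumas): recognises one simple parameter
# of the form   type [<constraint>] name [cardinality] [as tag] ;
PARAM_RE = re.compile(
    r"""^\s*
        (?P<type>[a-z][a-z0-9-]*)                   # type keyword, e.g. int, ascii
        (?:\s*<\s*(?P<constraint>[^>]*?)\s*>)?      # optional type constraint
        \s+(?P<name>[A-Za-z][A-Za-z0-9_-]*)         # parameter name
        (?:\s*\[\s*(?P<cardinality>[^\]]*?)\s*\])?  # optional cardinality
        (?:\s+as\s+(?P<tag>\S+?))?                  # optional explicit tag
        \s*;\s*$""",
    re.VERBOSE,
)


def parse_simple_param(text):
    """Split a simple parameter definition into a dict of its parts."""
    m = PARAM_RE.match(text)
    if m is None:
        raise ValueError(f"not a simple parameter definition: {text!r}")
    fields = m.groupdict()
    # With no explicit tag, the parameter name is used as the on-the-wire tag.
    if fields["tag"] is None:
        fields["tag"] = fields["name"]
    return fields


if __name__ == "__main__":
    print(parse_simple_param("int <1..30000> referenced-rfcs [0..255] as refers ;"))
    print(parse_simple_param("ascii rfc-name ;"))
```

A tag of '?' is returned unchanged by this sketch; a fuller tool would treat it as a request for an untagged value.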

5.2 An Example Message Definition

The following is an example message definition that is intended to represent a very crude meeting controller:

    lumas module com.tech-know-ware.my-example;

    /* An example Lumas definition */

    import com.tech-know-ware.general as tkwg;

    struct my-example
    {
        int <0..255> participant-id as ?;
        Action action as ?;
        struct my-addition[0..1] as new.tech-know-ware.com plugin
        {
            bool tkw-app-capable as ?;
        };
    };

    union Action
    {
        Join join;
        Message message as msg;
        void leave;
    };

    struct Join
    {
        unicode<0..63> name;
    };

    struct Message
    {
        int <0..255> to-participants[1..127] as to;
        unicode<1..255> message as msg;
        [   // Version 2 additions
            tkwg::Priority priority;
        ]
        [   // Version 5 additions
            ascii<0..16> font-name[0..1] as font;
            void bold[0..1];
            void italic[0..1];
            void underlined[0..1] as ul;
        ]
    };

The first construct (in this case the struct my-example) is the root of all messages for the protocol. Each message identifies a participant using an integer in the range 0 to 255, called 'participant-id'. When encoded on the wire, this parameter will be untagged due to the 'as ?' specification.

Each message then has an action, which is also untagged. The type of the action parameter is not immediately specified, and instead references the 'Action' definition.

The Action definition is a union in which only one of the specified parameters may appear in an instance of the Action construct. This effectively represents a fork in the semantics of any given message. The options within Action can indicate that somebody has joined the meeting, left the meeting, or is sending a message to other participants. There is no explicit tag for the 'join' and 'leave' options, so these will be tagged on-the-wire by the parameters' names, 'join' and 'leave' respectively.
Conversely, an explicit tag for the 'message' parameter is specified, and hence the message option will be tagged by 'msg' on-the-wire.

The join parameter also has a referenced definition. For the purposes of this example, when a person joins a meeting, all the other participants are informed of their name. The name is a Unicode string that has a minimum length of 0 characters and a maximum length of 63 characters.

The message option is also a referenced definition. Conceptually, to send a message, the 'participant-id' is used to identify the sender, and the 'to-participants' field contains the participant ids of all the people to whom the message is being sent. On-the-wire, the to-participants parameter will be tagged with 'to'. Between 1 and 127 (inclusive) instances of the to-participants parameter may appear in a message. Also, the message itself is included. The message will consist of Unicode characters and can be between 1 and 255 Unicode characters long. On-the-wire, the message parameter will have the tag 'msg'.

The priority field within the message struct has been added in a later version of the protocol. This is indicated by the square brackets in which the parameter is wrapped. Similarly, font-name, and the associated parameters have, according to the comment, been added in version 5 of the protocol.

The type of the 'priority' parameter is defined in an external module that has the alias 'tkwg'. The 'import' directive at the beginning of the example indicates that the 'tkwg' alias corresponds to the module 'com.tech-know-ware.general', and it is in this module that the definition of 'Priority' is located.

The definition indicates that 'font-name' is an ASCII string. The reader should already understand enough of the definition language to understand the meaning of the other fields.

Returning to the 'my-example' root, a third party has added an extension to the protocol in the form of the 'my-addition' parameter.
It is identified as not being part of the base specification by the keyword 'plugin'. On-the-wire, the additional parameter will be identified by the tag 'new.tech-know-ware.com' to differentiate it from additions that may be made by other third parties.

On-the-wire encoded examples of this message definition are shown in section 6.2.

5.3 Formal Message Definition Syntax

The Lumas syntax is defined using ABNF [ABNF]. Note, however, that contrary to ABNF, in this specification all literal strings such as "as" are case sensitive. Therefore "AS" cannot be used in place of "as".

The sections below describe the Lumas message definition syntax.

5.3.1. Lumas Parameters

There are three classes of parameter in Lumas: simple parameters, compound parameters and reference parameters, which are defined as:

    lumas-parameter = simple-param / compound-param / reference-param

5.3.2. Simple Parameters

The ABNF definition of a simple parameter is:

    simple-param = simple-type WS name [ OWS cardinality ]
                   [ WS "as" WS explicit-tag ] [ WS "plugin" ]
                   OWS ";" OWS

where 'WS' represents white space, and 'OWS' represents optional white space. ('WS' and 'OWS' are defined in Section 7 - 'Common ABNF Definitions'. Generally, comments can be included wherever white space is allowed.)

5.3.3. The Simple Types

Simple parameters have simple types such as integers, booleans etc. The following simple types are defined in Lumas:

void
    A parameter that has no value. This is most useful in unions (wherein it converts a union into an enumerated type), and can also be used in a struct to represent boolean events wherein the absence of the parameter indicates false, and the presence of the parameter indicates true. It is more useful than you might at first think!

bool
    Can be true or false.

int
    An integer value.

float
    A floating point value.
    The constraints of a float specify the float to be either in accordance with a single precision value or a double precision value as specified in IEEE 754 [IEEE754]. The absence of a constraint indicates a single precision value.

ipv4
    Represents an IPv4 address, but not the port.

ipv6
    Represents an IPv6 address, but not the port.

date
    Date according to the Gregorian calendar, with year, month and day of month. Other calendar types may be constructed from primitive types if required.

time
    Represents the time in hours, minutes and seconds using the 24 hour clock notation. By default the time MUST be adjusted to UTC, unless the time can be guaranteed to have only local significance.

oid
    This is an ASN.1 style Object Identifier. This is primarily included to enable identification of security protocols.

ascii
    A string made up of ASCII characters, limited to the values 0 to 127.

unquoted-ascii
    An ascii string usually has quote marks around it. This type does not have quotes around it. Consequently it can not have any white space, or include any special characters (such as "=", ")", and "}") that would confuse the parser.

unicode
    A string representing Unicode characters.

const
    This type allows a constant value to be inserted into the encoded message. It will typically be untagged. One thing it might be used for is identifying the protocol of the message definition. For example:

        const protocol as ?;

bytes
    An array of bytes. Also useful for carriage of opaque data.

embedded
    The value is an embedded Lumas message. This allows layering of message definitions.

5.3.4. Simple Type Definition

The 'simple-type' represents the type of the parameter.
It has the following form:

    simple-type = "void" / "bool" / integer-type / float-type /
                  "ipv4" / "ipv6" / "date" / "time" / "oid" /
                  string-type / const-type / bytes-type / embedded-type

where:

    integer-type = "int" OWS "<" OWS int-constraint OWS ">"

    float-type = "float" OWS [ "<" OWS float-constraint OWS ">" ]

    string-type = ( "ascii" / "unquoted-ascii" / "unicode" )
                  [ OWS "<" OWS string-constraint OWS ">" ]

    const-type = "const" OWS "<" first-safe-char *( safe-char ) ">"
                  ; See the section 'Notes on Comments' below

    bytes-type = "bytes" [ OWS "<" OWS length-constraint OWS ">" ]

    embedded-type = "embedded" [OWS "<" OWS length-constraint OWS ">"]

    int-constraint = min-int-constraint OWS ".." OWS max-int-constraint
                     [ OWS use-leading-zero-marker ]
    min-int-constraint = ["-"] pos-number
    max-int-constraint = ["-"] pos-number
    use-leading-zero-marker = "z"

    float-constraint = "single" / "double"

    string-constraint = length-constraint [ OWS pattern-constraint ]

    length-constraint = [ min-len-constraint OWS ".." OWS ] max-len-constraint
    min-len-constraint = pos-number
    max-len-constraint = pos-number / "*"

    pos-number = 1*DIGIT          ; Decimal number
               / "0x" 1*HEX       ; Hex number
               / 1*DIGIT "b"      ; Specifies number of binary bits

In the case of the integer-type, the mandatory constraint specifies the minimum and maximum permissible values that the integer can take. If the 'use-leading-zero-marker' character ('z') is included in the constraint, then where necessary the integer MUST be represented on the wire with leading zeros to make the value fixed width. (This is primarily applicable to combined types.)

The pos-number construct used to specify the integer value constraint has a form that can specify the number of binary bits. The number of bits specified does not include any sign bits.
Hence an unsigned 32 bit number can be represented as 0..32b, whereas a signed 32 bit number can be represented as -31b..31b (although this will actually exclude the most negative value of a signed 32 bit number).

A float is either a single precision IEEE 754 number or a double precision IEEE 754 number [IEEE754]. The absence of a constraint indicates single precision. (Developers are advised that in a number of cases a binary IEEE 754 number cannot be exactly represented in a text-based base 10 format. Hence the decoder's binary representation of a floating-point number may differ from the encoder's binary representation of the number. If such discrepancies are not acceptable, developers should use an alternative representation for floating-point numbers.)

In the case of string-type, the optional constraint specifies the minimum and maximum number of characters that are allowed to be represented in a valid encoding and optionally a valid pattern of characters. The minimum and maximum character constraint specifies the minimum and maximum number of characters at the application level, not the actual number of characters that are used to represent the application level characters on the wire. The format of the pattern constraint is designed to simplify regular expression evaluation by avoiding the need for the trial and error type processing of general regular expressions. Thus, in accordance with Lumas' 80/20 principle, valid patterns MUST NOT require the regular expression evaluator to do backtracking. The pattern constraint is described further in Section 5.3.5.

In the case of bytes-type, the optional constraint specifies the minimum and maximum number of bytes that are allowed to be represented in a valid encoding. The constraint specifies the minimum and maximum number of bytes at the application level, not the number of characters that are used to encode those bytes on the wire.
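
The integer constraint forms above can be illustrated with a short Python sketch. The function names pos_number_value and int_constraint_bounds are hypothetical and not part of Lumas. The reading of 'Nb' as the largest value that fits in N bits (2^N - 1) is an assumption inferred from the 0..32b and -31b..31b examples, not a rule stated elsewhere in this section.

```python
def pos_number_value(token):
    """Return the integer denoted by a pos-number token (hypothetical helper)."""
    if token.startswith("0x"):
        return int(token[2:], 16)           # hex form, e.g. 0xFF
    if token.endswith("b"):
        # Assumed reading: N binary bits means the largest N-bit value.
        return (1 << int(token[:-1])) - 1
    return int(token)                        # plain decimal form


def int_constraint_bounds(constraint):
    """Split 'min..max' (optionally followed by 'z') into (low, high, leading_zeros)."""
    body = constraint.strip()
    leading_zeros = body.endswith("z")       # use-leading-zero-marker
    if leading_zeros:
        body = body[:-1]
    low, high = (part.strip() for part in body.split(".."))

    def signed(tok):
        # A leading '-' negates the pos-number that follows it.
        return -pos_number_value(tok[1:]) if tok.startswith("-") else pos_number_value(tok)

    return signed(low), signed(high), leading_zeros


if __name__ == "__main__":
    for text in ("0..32b", "-31b..31b", "1..30000", "0x10..0xFF", "0..99 z"):
        print(text, "->", int_constraint_bounds(text))
```

Under that reading, -31b..31b spans -(2^31 - 1) to 2^31 - 1, which matches the remark that the most negative 32 bit value is excluded.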

In the constraint syntax, a maximum value of '*' means infinite or unbounded.

5.3.5. The Pattern Constraint

The pattern-constraint has the following form:

    pattern-constraint = "/" sub-pattern *( "|" sub-pattern ) "/"

    sub-pattern = *pattern-element

    pattern-element = pattern-char [ quantifier ]

    pattern-char = %x20-29 / %x2C-2E / %x30-3E / %x40-5A /
                   %x5D-7A / %x7D-FF     ; not \/|[?*+{
                 / escaped-char
                 / special-char
                 / character-class

    escaped-char = "\\"    ; Matches \
                 / "\/"    ; Matches /
                 / "\|"    ; Matches |
                 / "\["    ; Matches [
                 / "\?"    ; Matches ?
                 / "\*"    ; Matches *
                 / "\+"    ; Matches +
                 / "\{"    ; Matches {
                 / "\."    ; Matches .

    special-char = "\r"    ; Matches the return character
                 / "\n"    ; Matches the new line character
                 / "\t"    ; Matches the tab character
                 / "\f"    ; Matches the form feed character
                 / "\s"    ; Matches white space [ \t\r\n\f]
                 / "\d"    ; Matches any digit [0-9]
                 / "\w"    ; Matches any word character [a-zA-Z_0-9]
                 / "\S"    ; Matches anything not matched by \s
                 / "\D"    ; Matches anything not matched by \d
                 / "\W"    ; Matches anything not matched by \w
                 / "."     ; Matches any character

    character-class = matching-character-class / inverse-character-class

    matching-character-class = "[" *(class-char / class-range) "]"
                 ; For a successful match, the character in the string
                 ; being matched must be one of the characters
                 ; specified in the matching-character-class.

    inverse-character-class = "[^" *(class-char / class-range) "]"
                 ; For a successful match, the character in the string
                 ; being matched must NOT be one of the characters
                 ; specified in the inverse-character-class.

    class-char = class-single-char / class-escaped-char /
                 escaped-char / special-char

    class-single-char = %x20-2C / %x2E-5B / %x5E-FF    ; not - ] \

    class-escaped-char = "\-"    ; Matches -
                       / "\]"    ; Matches ]
                 ; /|[?*+{. need not be escaped within character-class

    class-range = first-range-char "-" last-range-char
                 ; The class-range matches all characters that have
                 ; an ASCII value greater than or equal to that of
                 ; first-range-char and less than or equal to
                 ; last-range-char.

    first-range-char = class-single-char / class-escaped-char / escaped-char
    last-range-char = class-single-char / class-escaped-char / escaped-char

    quantifier = "?" / "*" / "+" /
                 "{" quant-min-occurs [ "," [ quant-max-occurs ] ] "}"
                 ; The absence of a quantifier indicates once and only
                 ; once

    quant-min-occurs = 1*DIGIT
    quant-max-occurs = 1*DIGIT

The 'pattern-constraint' allows a number of 'sub-pattern's to be defined, any one of which may match the string value. In each 'sub-pattern' there are no grouping or alternation constructs. This removes the need for backtracking and is suitable for 80% (or more) of applications. The pattern matching uses a "greedy" match.

Each 'sub-pattern' can be viewed as a concatenation of 'pattern-element's. Each 'pattern-element' is a 'pattern-char' and an optional 'quantifier'. The 'pattern-char' may actually match multiple characters. The 'quantifier' indicates how many times the associated 'pattern-char' may appear in a valid pattern. If the 'quantifier' is '?', the 'pattern-char' may appear 0 or 1 times. If the 'quantifier' is '*', the 'pattern-char' may appear 0 or more times. If the 'quantifier' is '+', the 'pattern-char' may appear 1 or more times. If the quantifier is of the form '{n,m}', the 'pattern-char' may appear a minimum of n times, and a maximum of m times. If the quantifier is of the form '{n}', the 'pattern-char' must appear exactly n times. If the quantifier is of the form '{n,}', the 'pattern-char' may appear n or more times.
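
A minimal Python sketch of greedy matching without backtracking, as called for in this section, is given here for illustration. It assumes the pattern has already been parsed: each pattern-element is represented as a hypothetical (predicate, min, max) tuple, where max is None for an unbounded quantifier. The function names and this representation are assumptions of the sketch, not part of Lumas.

```python
def match_sub_pattern(elements, text):
    """Greedy, non-backtracking match of text against one sub-pattern.

    elements is a list of (predicate, min_occurs, max_occurs) tuples;
    max_occurs of None means unbounded.
    """
    pos = 0
    for predicate, min_occurs, max_occurs in elements:
        count = 0
        # Consume as many characters as this element allows (greedy).
        while (pos < len(text) and predicate(text[pos])
               and (max_occurs is None or count < max_occurs)):
            pos += 1
            count += 1
        if count < min_occurs:
            return False        # too few matches; earlier elements are never revisited
    return pos == len(text)     # the whole string must be consumed


def match_pattern(sub_patterns, text):
    """A string is valid if any one of the sub-patterns matches it."""
    return any(match_sub_pattern(sp, text) for sp in sub_patterns)


def is_digit(c):
    return c in "0123456789"    # \d


def lit(ch):
    return lambda c: c == ch    # a literal or escaped character


# Hand-built equivalent of /-?\d+|-?\d+\.\d+/
INTEGER = [(lit("-"), 0, 1), (is_digit, 1, None)]
DECIMAL = INTEGER + [(lit("."), 1, 1), (is_digit, 1, None)]

if __name__ == "__main__":
    for value in ("-12", "3.14", "3.", "abc"):
        print(value, match_pattern([INTEGER, DECIMAL], value))
```

Because earlier elements are never revisited, a sub-pattern equivalent to /\d*\d/ does not match "12" in this sketch: the first element consumes both digits and leaves none for the second. A backtracking engine would accept that string, which is the difference the restriction is intended to remove.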
To ensure that a string is in a suitable form to represent the value, the application, subject to the quantifier of a pattern-element, MUST, starting with the first character, keep matching successive characters of the string with the first pattern-element until the match fails. The application MUST then try to match the unmatched character of the string along with subsequent characters in the string with the next pattern-element, again taking into account the quantifier for that pattern-element. If a pattern-element has a quantifier that allows zero matches, then if the unmatched character of the previous pattern-element does not match the current pattern-element, the application should attempt to match the unmatched character against the next pattern-element, and so on. The process is repeated until the whole string is matched, or the application is unable to match the current string character with an appropriate pattern-element. If the application is unable to match the current input character with an appropriate pattern-element, the whole sub-pattern match is deemed to have failed. The application MUST NOT backtrack to a previous pattern-element in order to attempt to find a match.

This process is repeated for each of the sub-patterns until one of the sub-patterns matches the string, or all sub-patterns fail to match the string. The message MUST NOT be encoded if none of the patterns matches the string.

Example patterns include /\d{4} \d{4} \d{4} \d{4}/ for a (UK) credit card number, or /\d{4}-\d{2}-\d{2}T\d+:\d+:\d+Z/ for a date & time matching the form 2003-03-03T12:45:32Z. The pattern /-?\d+|-?\d+\.\d+|-?\d+\.\d+[eE][+\-]?\d+/ matches a floating point number that can be represented as either an integer, a decimal without exponent, or full 'scientific' format. This pattern illustrates some of the impact of not allowing pattern groupings.

For more information on regular expressions, see [PERL].

5.3.6.
The Name

Referring back to the simple-param definition, 'name' is the name of the parameter. It has the format:

   name = ALPHA *( ALPHA / DIGIT / "-" / "_" )

If there is no explicitly defined tag, then the name is also used as the parameter's tag on-the-wire. In this case, the name MUST NOT exceed 63 characters in length. See Section 5.3.8 for more on tagging.

5.3.7. Cardinality

The cardinality of a parameter specifies how many times a particular parameter can appear in a message. The format mirrors a C-like array specification, but uses UML style ranges rather than the single values used in C. If the cardinality field is absent, then one and only one instance of the parameter must occur in a valid message. The format of the cardinality specification is:

   cardinality = "[" ( cardinality-range / "?" / "*" / "+" ) "]"
                ; [?] short hand for [0..1]
                ; [*] short hand for [0..*]
                ; [+] short hand for [1..*]

   cardinality-range = [ min-occurrences ".." ] max-occurrences

   min-occurrences = 1*DIGIT

   max-occurrences = 1*DIGIT / "*"

Once again, the '*' in max-occurrences represents infinite or unbounded. If in the 'cardinality-range' only 'max-occurrences' is present and it has a numerical value, the containing struct MUST have exactly 'max-occurrences' instances of the parameter.

Example cardinalities are as follows:

   [0..1]    ; Zero or one time
   [?]       ; Short hand for zero or one time
   [0..*]    ; Zero or more times
   [*]       ; Same as above, zero or more times
   [1..*]    ; One or more times
   [+]       ; Same as above, one or more times
   [2..*]    ; Two or more times
   [5]       ; Exactly five times

5.3.8. Tagging

A parameter can have a tag associated with it. A tag is a fixed sequence of characters used on the wire to enable a parser to identify the value or values that it is associated with.

By default, the name of the parameter is used as the tag. If the name of the parameter is used as the tag the name MUST NOT exceed 63 characters in length.
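The default tagging rule can be sketched as a simple check. The Python below is illustrative only: the function names and the constant are hypothetical, and the regular expression is a direct transcription of the 'name' rule of Section 5.3.6.

```python
import re

# name = ALPHA *( ALPHA / DIGIT / "-" / "_" )
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")
MAX_TAG_LENGTH = 63


def is_valid_name(text):
    """True if text matches the 'name' rule."""
    return NAME_RE.fullmatch(text) is not None


def default_tag(name):
    """Return the tag used on the wire when no explicit tag is given."""
    if not is_valid_name(name):
        raise ValueError("not a valid name: " + repr(name))
    if len(name) > MAX_TAG_LENGTH:
        raise ValueError("a name used as a tag must not exceed 63 characters")
    return name


print(default_tag("rfc-name"))
```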
Alternatively an explicit tag can be specified. It can be any sequence of characters that do not have special significance to the parser. To facilitate buffer management, an explicit tag MUST NOT exceed 63 characters in length. If the tag definition begins with a "?", the "?" is discarded. Thus to specify that "?" should be used as the tag on-the-wire, 'explicit-tag' should be specified as "??".

   explicit-tag = [ "?" ] tag    ; tag defined in common definitions

In certain constructs a parameter may also be untagged. This is discussed in the relevant sections below.

5.3.9. The Plugin Extension Mechanism

Marking a parameter as 'plugin' indicates to the developer and the tools that this parameter is (probably) not part of the original message definition. For example, it might be a proprietary extension. It also indicates that the parameter may not be present in all received messages.

A parameter that is marked as 'plugin' MUST have an explicit-tag defined for it. The explicit-tag MUST be constructed from a domain name [DOMAINS] owned by the entity defining the parameter, plus a sequence of characters that differentiate the explicit-tag from other explicit-tags defined by the defining entity. The component parts of the explicit-tag are presented in the normal domain name order so that the most variable part of the string is at the beginning, thus improving parsing efficiency. An example explicit-tag for tech-know-ware.com might be:

   my-tag.tech-know-ware.com

5.3.10. Reference Parameters

In a struct or union, it is also possible to reference types that are defined elsewhere. The format of a 'reference-param' is:

   reference-param = reference-name WS name [ OWS cardinality ]
                     [ WS "as" WS explicit-tag ]
                     [ WS "plugin" ] OWS ";" OWS

   reference-name = [ module-name "::" ] name

Other forms of reference-parameter are defined in the sections below.

5.3.11. Compound Parameters

The compound types are struct, union and combi.
For a struct, depending on the various parameters' cardinality specifications, any, all or none of the parameters that a struct groups together may appear in a valid encoding. In the case of a union, only one of the parameters may be encoded in a valid instance. The combi form is effectively a compact encoding of a struct, but is subject to a number of additional constraints, which are described below.

The definition format of each of the compound parameters is similar to the simple parameters. The 'compound-param' has the form:

   compound-param = struct-param / union-param / combined-param

5.3.12. Struct Parameters

The definition of a 'struct-param' is:

   struct-param = "struct" WS name [ OWS cardinality ]
                  [ WS "as" WS explicit-tag ]
                  [ WS "pluggable" ] [ WS "plugin" ]
                  WS "{" struct-body "}" OWS ";" OWS

'Cardinality' and 'explicit-tag' have the same meaning as for the simple types. The 'pluggable' keyword is defined in Section 5.3.16.

The format of the 'struct-body' is:

   struct-body = *( untagged-lumas-parameter )
                 *( lumas-parameter )
                 *( struct-extension )

The struct body starts with all the untagged parameters. Untagged parameters may have a cardinality other than one. Note that, if the cardinality of an untagged parameter allows it to be absent, then when encoded on the wire, if the untagged parameter is absent, then all subsequent parameters, including tagged parameters, MUST also be absent. Thus great care is recommended when defining a message syntax that allows for an untagged parameter to be absent.

The tagged parameters follow the untagged parameters.

When the message definition is subsequently extended, an instance of the 'struct-extension' construct MUST be added to the end of the struct definition for each version in which the struct is extended. The 'struct-extension' construct wraps the added parameters within square brackets to indicate that they are added in a new version.
This not only allows a developer to see what has been added in a new version, but also allows a parser to do the same. This is important because a parser must always consider absence of the new parameters to be a valid encoding so that it can receive messages from entities that are using an earlier version of the protocol. (To do this manually would dictate that all extension parameters would have to have a cardinality specification that included zero. This would be tedious, potentially error prone, and loses some expressiveness.)

During the extension process, all new parameters MUST be added onto the end of an existing construct, and the order of parameters MUST NOT be rearranged from one version to the next. Note that 'struct-extension' does not allow the specification of untagged parameters.

All of these have a similar format to the types already defined, except that in some cases they may be untagged. To make the ABNF definition accurate it is therefore necessary to repeat the above basic definitions with the appropriate tagging specifications.

The definition of the untagged struct parameters is:

   untagged-lumas-parameter = untagged-simple-param
                            / untagged-compound-param
                            / untagged-reference-param

   untagged-simple-param = simple-type WS name [ OWS cardinality ]
                           WS "as" WS "?" OWS ";" OWS

   untagged-compound-param = untagged-struct-param
                           / untagged-union-param
                           / untagged-combined-param

   untagged-struct-param = "struct" WS name [ OWS cardinality ]
                           WS "as" WS "?" [ WS "pluggable" ]
                           WS "{" struct-body "}" OWS ";" OWS

   untagged-union-param = "union" WS name [ OWS cardinality ]
                          WS "as" WS "?" [ WS "pluggable" ]
                          WS "{" union-body "}" OWS ";" OWS

   untagged-combined-param = "combi" WS name [ OWS cardinality ]
                             WS "as" WS "?"
                             WS "{" combined-body "}" OWS ";" OWS

   untagged-reference-param = reference-name WS name
                              [ OWS cardinality ] OWS ";" OWS

Note that the 'plugin' keyword is not applicable to untagged parameters.
The tagged parameters have the basic parameter definition that was initially presented, i.e. lumas-parameter.

The struct body extension fields have the format:

   struct-extension = "[" OWS 1*( lumas-parameter ) "]" OWS

5.3.13. Union Parameters

A union parameter has the following definition:

   union-param = "union" WS name [ OWS cardinality ]
                 [ WS "as" WS explicit-tag ]
                 [ WS "pluggable" ] [ WS "plugin" ]
                 WS "{" union-body "}" OWS ";" OWS

'Cardinality' and 'explicit-tag' have the same meaning as for the simple types. The 'pluggable' keyword is defined in Section 5.3.16.

A union-body MAY have a single untagged integer parameter. All other parameters MUST be tagged and have a cardinality of one and only one. Other than the cardinality constraints of a union, a union can be extended in the same way as a struct.

The untagged integer parameter allows integers to be defined that have wild-carding options. For example, a union might be defined as:

   union select
   {
       int<0..65535> numbered as ?;
       void any as *;
   };

Examples of the encoded form might be:

   select = 12
   select = *

The parameters within a union are only allowed unary cardinality to avoid ambiguity in the on-the-wire encoding. If multiple instances of a parameter must be included as an option in a union, it is necessary to wrap the parameters within a struct, using something similar to:

   struct X
   {
       X x[1..*] as ?;
   };

The definition of a union-body is as follows:

   union-body = [ integer-type WS name WS "as" WS "?" OWS ";" OWS ]
                *( singular-lumas-parameter )
                *( union-extension )

As mentioned previously, most of the parameters within a union are tagged and have a cardinality of one.
Their definition is:

   singular-lumas-parameter = singular-simple-param
                            / singular-compound-param
                            / singular-reference-param

   singular-simple-param = simple-type WS name
                           [ WS "as" WS explicit-tag ]
                           [ WS "plugin" ] OWS ";" OWS

   singular-compound-param = singular-struct-param
                           / singular-union-param
                           / singular-combined-param

   singular-struct-param = "struct" WS name
                           [ WS "as" WS explicit-tag ]
                           [ WS "pluggable" ] [ WS "plugin" ]
                           OWS "{" struct-body "}" OWS ";" OWS

   singular-union-param = "union" WS name
                          [ WS "as" WS explicit-tag ]
                          [ WS "pluggable" ] [ WS "plugin" ]
                          OWS "{" union-body "}" OWS ";" OWS

   singular-combined-param = "combi" WS name
                             [ WS "as" WS explicit-tag ]
                             [ WS "plugin" ]
                             OWS "{" combined-body "}" OWS ";" OWS

   singular-reference-param = reference-name WS name
                              [ WS "as" WS explicit-tag ]
                              [ WS "plugin" ] OWS ";" OWS

The union extension operates in a similar fashion to that of a struct, but references singular-lumas-parameters. Its definition is:

   union-extension = "[" OWS 1*( singular-lumas-parameter ) "]" OWS

5.3.14. Combined Parameters

A combined parameter has the following definition:

   combined-param = "combi" WS name [ OWS cardinality ]
                    [ WS "as" WS explicit-tag ]
                    [ WS "plugin" ]
                    WS "{" combined-body "}" OWS ";" OWS

The combined compound type provides a simple mechanism for defining new combined types similar to that used for date and time. All the members of a combined type are encoded on the wire using their untagged form and concatenated together with no intervening white space. The result of the encoding MUST meet all the constraints of an unquoted-ascii value.
In addition, the parameters that make up the combined type are subject to the following constraints:

   - Each unquoted-ascii parameter that is part of a combined body MUST have a fixed number of characters,

   - The first character of unquoted-ascii and const parameters MUST NOT be a digit,

   - integer values MUST NOT be adjacent.

The form of the combined body is:

   combined-body = *( combined-simple-type WS name ";" )

   combined-simple-type = integer-type
                        / const-type
                        / "unquoted-ascii" OWS "<" 1*DIGIT ">"

In many respects the combined type simply makes the encoded form look prettier, and anything that can be encoded with the combined type can also be represented with the struct type. The combined type should also not be used for defining patterns of ASCII or Unicode characters. Note also that a combined type is not pluggable and hence cannot be extended. It is therefore recommended that the combined type be used sparingly.

An example of a combined type is:

   combi protocol as ?
   {
       const <HTTP/> const1;
       int<0..99> major-version;
       const <.> const2;
       int<0..99> minor-version;
   };

Which might be encoded as:

   HTTP/1.1

Combined types also allow you to define numbers that contain decimal points. An example of such is:

   union currency as ?
   {
       void dollars as $;
       void pounds as £;
       void francs as FFr;
   };

   combi amount as ?
   {
       int<-31b..31b> main-denomination;
       const <.> const2;
       int<0..99z> sub-denomination;
   };

Which might be encoded as:

   $ 100.05

5.3.15. Referenced Parameters

It was mentioned previously that structs and unions can reference types that are defined elsewhere. Referenced types do not have a cardinality specification, and do not specify an explicit tag. This is because the cardinality and tagging of the type are defined in the item that does the referencing, rather than where the referenced type is defined.
(If a referenced type needs a cardinality other than one, it is recommended that the technique for giving a parameter within a union a non-unary cardinality be used.)

The definitions of the referenced types are:

   referenced-lumas-parameter = referenced-simple-param
                              / referenced-compound-param
                              / referenced-reference-param

   referenced-simple-param = simple-type WS name OWS ";" OWS

   referenced-compound-param = referenced-struct-param
                             / referenced-union-param
                             / referenced-combined-param

   referenced-struct-param = "struct" WS name [ WS "pluggable" ]
                             OWS "{" struct-body "}" OWS ";" OWS

   referenced-union-param = "union" WS name [ WS "pluggable" ]
                            OWS "{" union-body "}" OWS ";" OWS

   referenced-combined-param = "combi" WS name
                               OWS "{" combined-body "}" OWS ";" OWS

   referenced-reference-param = reference-name WS name OWS ";" OWS

5.3.16. External Extensions - Plug and Pluggable

A protocol may be extended via an external specification without directly modifying the original definition. This may be to define a proprietary extension, or to define an external profile of the base protocol. The specification for this type of extension is:

   external-extension = "plug" WS
                        ( external-struct-extension
                        / external-union-extension )
                        WS "into" WS into-name
                        *( OWS COMMA OWS into-name ) OWS ";" OWS

   into-name = [ module-name "::" ] hierarchical-name

   hierarchical-name = *( name "." ) name

   external-struct-extension = 1*lumas-parameter

   external-union-extension = 1*singular-lumas-parameter

This specifies a parameter that is to be plugged into an existing construct.
For example, if the following is defined:

   plug
       ascii cookie as cookie.tech-know-ware.com;
   into my-example.my-addition;

The resultant definition would be treated as if it were:

   struct my-example
   {
       int <0..255> participant-id as ?;
       Action action as ?;
       struct my-addition[0..1] as new.tech-know-ware.com plugin;
       {
           bool tkw-app-capable as ?;
           ascii cookie as cookie.tech-know-ware.com plugin;
       };
   };

The 'into-name' field indicates the name of the construct that the item is to be plugged into. The optional 'module-name' part of the name specifies the name of the module that contains the parameter into which the extension is to be plugged. The 'hierarchical-name' specifies the name of the parameter within the module that the extensions are to be plugged into. The name is hierarchical because parameters can be locally defined within structs and unions. The hierarchical name is made up of the name of each of the parameter's ancestors' names plus the name of the parameter itself joined together by the '.' character. If the parameter to be extended is contained within another parameter, the first name is the name of the outer most parameter that contains the parameter to be extended (i.e. one that is not contained within any other parameter), and the second name is the name of the next outer most parameter that contains the parameter to be extended (if present), and so on until the parameter itself is named. An illustration of the naming is shown in the example above.

In a struct and union the 'pluggable' keyword is used to indicate that the construct is a location that the message designers have formally declared as extendible using the 'plug' mechanism. Lumas compilers SHOULD emit warnings when extra material is plugged into locations that are not marked as pluggable, but MUST NOT consider it an error. Combined types are not pluggable.
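The way an 'into-name' identifies its target can be sketched as follows. This Python is illustrative only and is not part of the Lumas specification: the function names are hypothetical, and the parsed message definition is assumed to be held as nested dictionaries keyed by parameter name, which is just one possible representation.

```python
def split_into_name(into_name):
    """Split an 'into-name' into (module-name or None, [name, ...])."""
    module, separator, path = into_name.rpartition("::")
    names = path.split(".")  # outermost parameter first
    if not all(names):
        raise ValueError("empty name in hierarchical-name: " + repr(into_name))
    return (module if separator else None), names


def find_target(definitions, names):
    """Walk the nested definitions from the outermost name inwards."""
    node = definitions
    for name in names:
        node = node[name]  # raises KeyError if the path does not exist
    return node


# Hypothetical in-memory form of the example definition above.
definitions = {"my-example": {"my-addition": {"tkw-app-capable": {}}}}

module, names = split_into_name("my-example.my-addition")
target = find_target(definitions, names)
target["cookie"] = {}  # plug the new parameter into the located struct
print(module, names, sorted(target))
```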
If a party other than the original message designers uses the plug mechanism to define an extension, each added parameter MUST have an explicit-tag constructed according to the rules described in Section 5.3.9.

5.3.17. Module Definition and Directives

A single protocol may be defined in a number of message definition files. This might be for the purpose of accessing predefined libraries, or specifying a definition that the current definition extends. A message definition therefore begins with a set of optional directives expressing this information. They have the form:

   lumas-directives =
       [ [ "lumas" WS ] "module" WS module-name OWS ";" OWS ]
       [ "extends" WS module-name [ WS "as" WS alias ] OWS ";" OWS ]
       *( "import" WS module-name [ WS "as" WS alias ] OWS ";" OWS )

   module-name = [ "+" ] name *( "." name )

   alias = name

The 'module' directive specifies the name of the module. The 'extends' directive is used in a definition that contains an external extension. The module-name in the extends specification indicates the message definition that is being extended. The 'import' statement indicates a library message definition that contains referenced types that are referenced within the message definition.

The 'module-name' is a hierarchical namespace that is based on the name of the protocol, combined with a domain name [DOMAINS] owned by the entity defining the protocol. The parts of the module-name are combined together so that it looks like a regular domain name. The order in which the domain levels are written is then reversed, so that the top-level domain becomes the first written domain, and the second level domain becomes the second written domain and so on.
For example, if a protocol called the Simple Conference Protocol (SCP) was defined by Tech-Know-Ware Ltd with a domain name of tech-know-ware.com, the module name might be:

   com.tech-know-ware.scp

It is the responsibility of the entity owning the domain name to ensure that the module names it creates using its domain name are unique.

Lumas defines a number of pseudo top level domains for its own purposes. These are currently as follows:

   +ietf   A pseudo top level domain for the Internet Engineering Task Force.

   +iso    A pseudo top level domain for the International Organization for Standardization. The sub-domains of this domain follow the structure of ISO defined Object Identifiers. All spaces must be removed and numbers in brackets should be ignored when parsing this domain. E.g. iso(1) member-body(2) us(840) rsadsi(113549) digestAlgorithm(2) 5 is represented as +iso(1).member-body(2).us(840).rsadsi(113549).digestAlgorithm(2).5 and looked up as +iso.member-body.us.rsadsi.digestAlgorithm.5 .

   +itu    A pseudo top level domain for the International Telecommunication Union. The sub-domains of this domain follow the structure of ITU defined Object Identifiers. Processing of such identifiers follows that defined for processing of ISO Object Identifiers.

   +lms    A pseudo top level domain for defining Lumas extensions and libraries.

   +uuid   A pseudo top level domain that uses Universally Unique Identifiers for identification. An example is: +uuid.4d36e96c-e325-11ce-bfc1-08002be10318

National standards bodies such as ANSI and BSI are defined under their national top-level domain.
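The two naming conventions above lend themselves to a short sketch. The Python below is illustrative only and is not part of the Lumas specification. The function names are hypothetical, and the second function assumes its input is already in the dotted '+iso(1).member-body(2)...' representation.

```python
import re


def module_name(protocol, domain):
    """Build a module-name by reversing 'protocol.domain' label by label."""
    labels = (protocol + "." + domain).split(".")
    return ".".join(reversed(labels))


def oid_lookup_name(represented):
    """Drop spaces and bracketed numbers to obtain the name used for look up."""
    without_spaces = represented.replace(" ", "")
    return re.sub(r"\(\d+\)", "", without_spaces)


print(module_name("scp", "tech-know-ware.com"))
print(oid_lookup_name(
    "+iso(1).member-body(2).us(840).rsadsi(113549).digestAlgorithm(2).5"))
```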
The 'alias' part of the extends and import directives is used as an alias of the 'module-name', so that items within 'module-name' can be referenced in the abbreviated form of:

   alias::item

For example, if a parameter definition called 'id' is contained in the module 'com.tech-know-ware.scp', and the following import statement is specified:

   import com.tech-know-ware.scp as tkwscp;

Then 'id' can be referenced by:

   tkwscp::id

5.3.18. The Top Level Definition

Finally, we are in a position to describe a complete Lumas message definition. This is:

   lumas-definition = OWS lumas-directives
                      *external-extension
                      *referenced-lumas-parameter

The first parameter defined within the message definition is the root of the message definition tree, and is thus the outer-most construct of an encoded message.

5.4 Locating Lumas within a Specification

It is not sufficient to use Lumas alone to define a protocol. Additional narrative is required to define the semantics of a protocol in addition to the syntax defined by Lumas. Thus Lumas and narrative typically need to be combined in a single document. The issue here is that at some point the Lumas must be extracted from the document to be useful. If the Lumas is intermingled with the narrative, it can be manually removed using cut and paste; however, this is tedious and error-prone. An alternative is to put all the Lumas in a separate section so that it can be easily extracted. However, this distances the Lumas specification from the narrative that explains it, which is undesirable. A third option is to do both: interleave one copy of the Lumas with the narrative and keep a separate copy that can be used for compiling. This approach makes it difficult to keep the two versions in step, and errors can easily creep in.

Lumas compilers MUST implement a fourth option.
Before parsing a file, a compiler MUST first look for a line of text on which the first non-white space text is lumas*/ and which has only white space after it. If such a line is found, compilation starts at the following line. Subsequent narrative is then included in /* */ comment marks. If no such line is found, then compilation begins at the beginning of the file.

For example, if any */ character sequences that follow this example are removed (which have been included to discuss how they are used and hence not properly matched), a Lumas compiler must be able to find and process the following Lumas syntax:

   lumas*/
   // The first 'official' line of Lumas
   struct top
   {
       not-much not-much;
   };
   /*
   This is narrative.
   */
   int <0..1> not-much;
   /*

For a fuller description of Lumas comments, see Section 8.

6. On-the-Wire Representation

6.1 Principles of On-the-Wire Encoding

The basic format of the text based on-the-wire encoding is to use the format:

   tag = value

The tag is a fixed sequence of characters that identifies the parameter with which a particular value (or values) is associated. For example, there may be multiple parameters that have integer values within a struct, that might specify, say, width and height. The tags are used to identify which integer value belongs to which parameter.

If there are multiple instances of a parameter, then they may either be conveyed as multiple instances of the above construct, or as a comma separated list, as in:

   tag = value, value, value

If a tag is explicitly specified in the message definition, then this is used on the wire. If no tag is explicitly specified, then the name of the parameter is used as the tag.

Tagged items may appear in any order within a struct, and do not have to be in the same order as they are defined in the struct definition.

It is also possible to specify that no tag should be used on the wire by specifying 'as ?'.
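The two permitted forms for a tagged parameter with several instances can be sketched as below. This Python is illustrative only and is not part of the Lumas specification: the function names are hypothetical, and the values are assumed to be integers so that no quoting is needed.

```python
def encode_as_list(tag, values):
    """Encode all instances as one 'tag = value, value, ...' construct."""
    if not values:
        raise ValueError("at least one value is required")
    return tag + " = " + ", ".join(str(v) for v in values)


def encode_as_repeats(tag, values):
    """Encode each instance as its own 'tag = value' construct."""
    if not values:
        raise ValueError("at least one value is required")
    return " ".join(tag + " = " + str(v) for v in values)


print(encode_as_list("width", [640]))
print(encode_as_list("size", [1, 2, 3]))
print(encode_as_repeats("size", [1, 2, 3]))
```

Both outputs for "size" carry the same information; a decoder following this section would be expected to accept either form.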
All untagged items MUST appear in a struct in the same order that they are defined in the message definition, and MUST appear before any tagged items within a struct definition. Untagged parameters that have greater than one instance MUST be constructed as a comma separated list. Thus untagged values have the format:

   value

or:

   value, value, value

If an untagged parameter has a cardinality that allows it to be absent from an encoded message, then all subsequent parameters in the enclosing struct, including tagged parameters, MUST also be absent. Consequently, great care should be taken when defining a message definition that allows untagged parameters to be absent.

For the examples quoted earlier, that is:

   ascii rfc-name ;
   int <1..30000> referenced-rfcs [0..255] as refers;

The format on the wire would be something like (depending on the actual values in question):

   rfc-name = 'Lumas'
   refers = 2234, 791, 2045

6.2 Example On-the-Wire Representation

The following are example on-the-wire representations of the example message definition.

   1 join = { "Alice" }
   new.tech-know-ware.com = { True }

   1 msg = {
       to = 2, 5, 8, 58
       msg = "Where are we going for dinner"
       font = 'Arial'
   }

   1 leave

Note that the placing of each parameter on a separate line is not significant. Lumas is free form with respect to white space.

6.3 Formal On-the-Wire Representation

The principal representation of a Lumas defined message on the wire is text based.

The top-level construct of a Lumas definition is a referenced type, which essentially has no tag associated with it. (Indeed, the presence of such a tag would not convey any information.) The top-level construct is therefore either a struct body, or a union body, as in:

   lumas-text-message = struct-body / union-body

A struct body can contain untagged and tagged parameters. All untagged parameters MUST appear before any tagged parameters.
The values of untagged parameters that have non-singular cardinality MUST be comma separated. Tagged parameters that have non-singular cardinality may either have a tag followed by a comma separated list of values, have multiple instances of the "tag = value" form, or some combination of the two. The definition of a struct-body is therefore:

   struct-body = OWS *( value *( COMMA value ) WS )    ; Untagged values
                 *( ( tag WS )                          ; For a void parameter
                  / ( tag EQUAL value *( COMMA value ) WS ) )
                 ; WS not required if it's the last item

Except for a single integer parameter that may be untagged, all items of a union body MUST be tagged. Also, parameters must only have a cardinality of one in the encoding to avoid ambiguities in the encoded message. Therefore a union body has the form:

   union-body = OWS ( integer-value WS
                    / tag WS                 ; For a void parameter
                    / ( tag EQUAL value WS ) )

The definition for 'tag' is defined in the common definitions section, Section 7.

'value' has the following definition:

   value = simple-value / compound-value

   simple-value = bool-value / integer-value / float-value
                / ipv4-value / ipv6-value / date-value
                / time-value / oid-value / ascii-value
                / unquoted-ascii-value / unicode-value
                / const-value / bytes-value / embedded-value

Which in turn are defined as follows:

   bool-value = "True" / "False" / "T" / "F"

   integer-value = [ "-" ] 1*DIGIT

   float-value = float-number
               / "NaN"     ; IEEE 754 Not a Number
               / "INF"     ; Positive infinity
               / "-INF"    ; Negative infinity
               ; Note that "-0" is included in float-number

   float-number = float-mantissa [ ("e"/"E") float-exponent ]

   float-mantissa = ["-"] 1*DIGIT ["." 1*DIGIT]

   float-exponent = ["-"/"+"] 1*DIGIT

The value encoding of a float is the base 10 representation of a base 2 number. There will typically be a degree of error introduced when the conversion is made.
Hence the float type should be looked upon as a convenient way to convey floating point information where bit level accuracy between the encoder's base 2 representation of the number and the decoder's base 2 representation of the number is not required. If this is not acceptable, then implementers should seek other ways of presenting floating point numbers that do not suffer from this loss of accuracy.

The 'float-mantissa' part of the number is NOT restricted to the range 1.0 to 9.9.

An 'oid-value' is represented as:

   oid-value = 1*DIGIT *( "~" 1*DIGIT )

As can be seen, only the oid's numerical values are encoded.

The IP address values are:

   ipv4-value = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT

   ipv6-value = hexseq / hexseq "::" [ hexseq ] / "::" [ hexseq ]

   hexseq = hex4 *( ":" hex4)

   hex4 = 1*4HEXDIG

Note that the IPv4 address within an IPv6 address format is not supported.

Date and time parameters have fixed width to aid parsing. As such the various fields have leading zeros if required. (They adopt one of the ISO-8601 formats.) Dates are according to the Gregorian calendar. Other calendar types may be constructed from other types if required. Unless the time can be guaranteed to have only local significance, the time MUST be converted to UTC prior to including it in a message. The time uses 24-hour clock notation. The absence of the 'time-seconds' field is interpreted as meaning seconds = 0.

   date-value = date-year "-" date-month "-" date-day-of-month

   date-year = 4DIGIT            ; e.g. 2002

   date-month = 2DIGIT           ; With leading zeros, 01 to 12

   date-day-of-month = 2DIGIT    ; With leading zeros, 01 to 31

   time-value = time-hours ":" time-minutes [ ":" time-seconds ]

   time-hours = 2DIGIT           ; With leading zeros, e.g. 00 to 23

   time-minutes = 2DIGIT         ; With leading zeros, e.g. 00 to 59

   time-seconds = 2DIGIT         ; With leading zeros, e.g.
                                 ; 00 to 59

   unquoted-ascii-value = first-safe-char *( safe-char )
                ; See the section 'Notes on Comments' below

The string types have the format:

   ascii-value = "'" *( %x00-26 / %x28-5B / %x5D-7F / "\\" / "\'" ) "'"

   unicode-value = DQUOTE *( %x00-21 / %x23-5B / %x5D-FF
                           / "\\" / "\" DQUOTE ) DQUOTE
                ; DQUOTE defined in [ABNF]

   bytes-value = "[" OWSNC base64-line *( WSNC base64-line ) OWSNC "]"

   base64-line = 0*18( 4BASE64-CHAR )
                 ( ( 4BASE64-CHAR )
                 / ( 3BASE64-CHAR "=" )
                 / ( 2BASE64-CHAR "=" "=" ) )

   BASE64-CHAR = ALPHA / DIGIT / "+" / "/"

The white space between base64-lines should include characters to move to a new line as specified in [BASE64].

   const-value = first-safe-char *( safe-char )
                ; See the section 'Notes on Comments' below

   embedded-value = "(" *(%x00-FF) ")"

Any occurrence of '(' within an embedded message that is not part of a string must be matched by a corresponding ')'.

Illustrating the recursiveness of the message format, we have:

   compound-value = struct-value / union-value / combined-value

   struct-value = "{" struct-body "}"

   union-value = union-body

   combined-value = first-safe-char *( safe-char )

   EQUAL = OWS "=" OWS

   COMMA = OWS "," OWS

6.4 Marking Message Boundaries

Before a message is parsed it is necessary to know the boundary of the message. There are many ways in which this can be done, and the method adopted should be specified in the protocol specification. However, in the absence of any other way, Lumas parsers should take the presence of an unmatched closing brace to be the end of message marker. Hence, the definition of a message delimited in this way becomes:

   delimited-lumas-text-message = lumas-text-message ( "}" / ")" )

6.5 Examples of Encoded Types

This section illustrates how the types look once they have been encoded according to the syntax above. The tag of each item has the format 'my-XXXX'.
Except in the case of the 'void' example, the XXXX part indicates the
type that is encoded to the right of the equals sign.

   my-void       // Tag only for a void parameter
   my-bool = True
   my-int = 5643
   my-float = 102.4519
   my-ipv4 = 10.0.0.1
   my-ipv6 = 201:123::
   my-date = 2002-02-28
   my-time = 12:00:00
   my-oid = 1~2~840~113549~2~5
   my-ascii = 'Lumas'
   my-unquoted-ascii = Lumas
   my-unicode = "Lumas"
   my-const = Lumas
   my-bytes = [ 01AF3C== ]
   my-embedded = ( my-other-int=5 single-closing-bracket-text=')' )
   my-struct = { 5434 All time=98787654654 }
   my-union = 5434
   my-union = Switch
   my-union = Volume = 11

7. Common ABNF Definitions

The following definitions are common to both the message definition
syntax and the on-the-wire representation.

   tag = first-tag-safe-char 0*62( safe-char )
         ; Tag MUST NOT exceed 63 characters in length

   first-tag-safe-char = %x21 /     ; Not "
                         %x23-26 /  ; Not ' ( )
                         %x2A-2B /  ; Not , -
                         %x2E-2F /  ; Not 0 1 2 3 4 5 6 7 8 9
                         %x3A-3C /  ; Not =
                         %x3E-5A /  ; Not [
                         %x5C-7A /  ; Not {
                         %x7C /     ; Not }
                         %x7E-7F
         ; Visible characters except = , " ' { } ( ) [ -
         ; and digits (tags must not get confused with integers)

   first-safe-char = first-tag-safe-char / DIGIT / "-"

   safe-char = first-safe-char / DQUOTE / "'" / "{" / "(" / "["
         ; Not = } ) ,

   WS    = 1*( comment / SP / HTAB / CR / LF )
           ; HTAB, CR, LF defined in [ABNF]
   OWS   = [ WS ]                      ; Optional white space
   WSNC  = 1*( SP / HTAB / CR / LF )   ; Whitespace - no comment
   OWSNC = [ WSNC ]                    ; Optional white space - no comment

   ; See section 'Notes on Comments' below for more on comments
   comment     = c-comment / cpp-comment
   c-comment   = "/*" (nested-end / hard-end )
   nested-end  = "*/"
   hard-end    = "**/"
   cpp-comment = "//" *( HTAB / %x20-7F ) ( CR / LF )
   ; A comment is treated as a single space during parsing

ALPHA, DIGIT, HEXDIG and DQUOTE are defined in [ABNF].

8. Notes on Comments

To aid development, Lumas allows comments to appear both in a message
definition and on the wire.

On the wire, const and unquoted-ascii values MUST NOT begin with
comment start markers ('//' and '/*').  However, if a value contains
comment start marker characters after its first character, those
characters MUST be interpreted as part of the value and do not
indicate the start of a comment.  For example, in the first of the
examples below, the text "This-is-a-comment" MUST be treated as a
comment, whereas in the second example the text
"this-is-part-of-the-value" MUST be treated as part of the value.

   ascii-value = /*This-is-a-comment*/This-is-the-value
   ascii-value = and-//this-is-part-of-the-value

In a message definition (but not on the wire) the c-comment style of
commenting allows nesting of comments.  In a nested comment, each
occurrence of the '/*' character sequence MUST be matched by a
corresponding occurrence of the '*/' character sequence before the
comment ends.  Additionally, if a comment starts with the '/*'
character sequence, the end of the comment can be forced by the hard
end of comment marker, defined as '**/', which overrides the nesting.
(This provision allows the commenting out of headers and footers in
text only message definition documents.)

A comment is treated as a single space for the purposes of parsing.

9. Locating Lumas Modules

It is not intended that applications should find Lumas modules
'on-the-fly'.  It is expected that some human involvement will be
required to locate and interpret a Lumas definition.  A Lumas
definition therefore has no way of specifying the physical location
from which a referenced definition can be acquired.  Instead, the
strategy is to exploit the fact that a module definition can begin
with the text "lumas module" followed by the module name.  By
entering this text (e.g. "lumas module org.lumas.mine") into a web
search engine (either one that covers the whole Internet or one that
is limited to a specific site), a user can locate a particular Lumas
module.  Determining whether a Lumas module so located is authentic
is beyond the scope of this document.

10. Mandatory to Understand

Many protocols require the capability to signal that certain
extension parameters are mandatory to understand, and that the
message should be rejected in some way if they are not understood.
Lumas provides no in-built mechanism for this feature.  Instead,
implementers are advised to use a feature similar to SIP's 'Require'
header [SIP], which presents a list of feature identifiers that must
be understood.  Naturally, provision for this mechanism must be
included in the first version of the protocol, as it is not possible
to define such semantics at a later time.

An example of such a construct might be:

   union require [*] pluggable { };

And could be populated using:

   plug void my-feature; into require;

11. Security Considerations

Lumas itself does not have any security issues related to it, but the
security requirements of a protocol must be borne in mind when
writing a Lumas message definition.  Common advice is that it is
difficult to add security to a protocol once it has been released,
and hence security issues must be considered from the outset.  This
matters to a Lumas message definition because it may affect the
format of messages.  This is particularly the case for integrity
check values that are effectively appended to the end of the message
once it is encoded.  This may mean that it is appropriate to define
both a main message definition and a message definition that is a
wrapper that can provide cryptographic services for the main message
definition.
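
As a rough illustration of why the integrity check value has to sit
outside the main message, the sketch below (in Python, which is not
part of Lumas) computes a check value over the bytes of an
already-encoded message and verifies it on receipt.  HMAC-SHA-256, the
shared key, the function names and the sample message text are all
assumptions made for this sketch; this document does not mandate any
of them.

```python
import hashlib
import hmac


def sign_encoded_message(encoded: bytes, key: bytes) -> bytes:
    """Return an integrity check value computed over the encoded message."""
    # The check value covers the exact bytes that go on the wire, so it
    # can only be computed after the main message has been encoded.
    return hmac.new(key, encoded, hashlib.sha256).digest()


def verify_encoded_message(encoded: bytes, icv: bytes, key: bytes) -> bool:
    """Check a received integrity check value against the encoded bytes."""
    expected = hmac.new(key, encoded, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, icv)


# Hypothetical encoded main message and shared key, for illustration only.
encoded_main = b"my-int = 5643 my-ascii = 'Lumas'"
shared_key = b"example-key"

icv = sign_encoded_message(encoded_main, shared_key)
```

A receiver would recompute the value over the bytes it actually
received and reject the message on a mismatch, which is why the main
message is best carried unmodified inside a wrapper.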

For example, a message definition wrapper might look like:

   struct my-protocol-wrapper
   {
       embedded main-definition as ?;
       bytes<1..64> signature as signed;
       oid signature-algorithm as sig-alg;
   };

12. References

[ABNF]     D. Crocker and P. Overell, "Augmented BNF for Syntax
           Specifications: ABNF," Internet Engineering Task Force,
           RFC 2234, November 1997.

[BASE64]   N. Freed and N. Borenstein, "Multipurpose Internet Mail
           Extensions (MIME) Part One: Format of Internet Message
           Bodies," Internet Engineering Task Force, RFC 2045,
           November 1996.

[DOMAINS]  J. Postel, "Domain Name System Structure and Delegation,"
           Internet Engineering Task Force, RFC 1591, March 1994.

[IEEE754]  "IEEE Standard for Binary Floating-Point Arithmetic,"
           IEEE 754-1985, IEEE, 1985.

[KWORDS]   S. Bradner, "Key words for use in RFCs to Indicate
           Requirement Levels," RFC 2119, March 1997.

[PERL]     L. Wall, T. Christiansen, and J. Orwant, "Programming
           Perl," O'Reilly, ISBN 0-596-00027-8.

13. Informative References

[ASN1]     International Organization for Standardization,
           "Information Processing Systems - Open Systems
           Interconnection - Specification of Abstract Syntax
           Notation One (ASN.1)," ISO Standard 8824, December 1990.

[CMS]      R. Housley, "Cryptographic Message Syntax," RFC 2630,
           June 1999.

[DIAMETER] Pat R. Calhoun, John Loughney, Erik Guttman, Glen Zorn,
           Jari Arkko, "Diameter Base Protocol,"
           draft-ietf-aaa-diameter-xx, Work in Progress.

[IP]       "Internet Protocol," RFC 791, September 1981.

[OMGIDL]   "Common Object Request Broker Architecture: Core
           Specification," Object Management Group, December 2002.
           (Accessible via:
           http://www.omg.org/technology/documents/corba_spec_catalog.htm)

[RELAX]    OASIS Technical Committee: RELAX NG, "RELAX NG
           Specification," December 2001.

[SCHEMA]   Thompson, H., Beech, D., Maloney, M. and N. Mendelsohn,
           "XML Schema Part 1: Structures," W3C REC-xmlschema-1,
           May 2001, and Biron, P. and A. Malhotra, "XML Schema
           Part 2: Datatypes," W3C REC-xmlschema-2, May 2001.

[SIP]      J. Rosenberg et al., "SIP: Session Initiation Protocol,"
           Internet Engineering Task Force, RFC 3261, June 2002.

[SMTP]     Klensin, J. (Ed.), "Simple Mail Transfer Protocol,"
           RFC 2821, April 2001.

[SNMP]     J. Case, M. Fedor, M. Schoffstall, J. Davin, "A Simple
           Network Management Protocol (SNMP)," RFC 1157, May 1990.

[STRON]    Jelliffe, R., "The Schematron," November 2001.

[TCP]      "Transmission Control Protocol," RFC 793, September 1981.

[TLS]      Dierks, T. and C. Allen, "The TLS Protocol Version 1.0,"
           RFC 2246, January 1999.

[UDP]      "User Datagram Protocol," RFC 768, August 1980.

[XDR]      R. Srinivasan, "XDR: External Data Representation
           Standard," RFC 1832, August 1995.

[XML]      "Extensible Markup Language (XML) 1.0 (Second Edition),"
           W3C REC-xml, October 2000.

[XMLBCP]   S. Hollenbeck, M. Rose, and L. Masinter, "Guidelines for
           the Use of Extensible Markup Language (XML) within IETF
           Protocols," RFC 3470, January 2003.

14. Author's Address

Pete Cordell
Tech-Know-Ware Ltd
P.O. Box 30
Ipswich IP5 2WY
UK

pete@tech-know-ware.com
http://www.tech-know-ware.com

Full Copyright Statement

Copyright (C) The Internet Society (2002).  All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works.
However, this document itself may not be modified in any way, such
as by removing the copyright notice or references to the Internet
Society or other Internet organizations, except as needed for the
purpose of developing Internet standards in which case the
procedures for copyrights defined in the Internet Standards process
must be followed, or as required to translate it into languages
other than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

Funding for the RFC Editor function is currently provided by the
Internet Society.