INTERNET DRAFT                                        Pete Cordell
Internet Engineering Task Force                       Tech-Know-Ware
                                                      June 23, 1999
Expires December 25, 1999

      Structured Message Encoding Using an ASCII Line Format

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

At some stage in the development of a protocol it is necessary to consider how the protocol elements will be conveyed on the wire. With groups from both the IETF and the ITU-T collaborating on the same protocol, this has often been a contentious issue. Described here is a mechanism for encoding messages that is intended to combine the benefits that both groups seek when selecting message encoding techniques. The result is a formal method of specifying messages that is encoded on the wire using ASCII and is extensible. Techniques are also discussed that show how such messages can be easily parsed without complicated software tools.

1. Analysis

With groups from both the IETF and the ITU-T collaborating on the same protocol, specifying how messages are encoded on the wire has often been a contentious issue.

The ITU-T preference is for ASN.1. The benefit of ASN.1 is that it describes messages in a powerful, expressive, high-level way. It is similar to writing code in Pascal or C as opposed to Assembler.
As such it is amenable to processing by software tools, which can reduce the amount of labour involved in implementing a protocol and also reduce the likelihood of careless mistakes. The downside is that it does not generally allow proprietary extensions (although some would claim this as an advantage!), and understanding ASN.1 fully can take considerable effort, thus countering any gains made in ease of implementation.

The IETF community has a preference for encoding application messages in ASCII. This is because it allows the data to be generated and read by humans, and it allows the protocol to be easily extended, perhaps for proprietary purposes, while still maintaining parameter visibility.

The benefits of each method make the proponents of each scheme reluctant to adopt the other method wholesale. However, the two approaches are not entirely irreconcilable (at least at the level stated above), and this specification describes a scheme designed to reconcile them by adopting the following principles:

   1. The bits on the wire are in the form of ASCII characters,
   2. Only a subset of ASN.1 syntax is used to define the messages,
   3. Extensions are made to the ASN.1 definition to achieve easy protocol extension.

The benefit of this is a common method of expressing high-level message constructs that, it is hoped, will enhance the collaboration of the IETF and ITU-T communities.

2. Syntax Sub-set

The first requirement is to select a sub-set of the total ASN.1 capabilities. The approach taken is to adopt the 80-20 principle, in which 80% of the expressiveness is achieved with 20% of the syntax. To this end, the following keywords are used. No other keywords are supported.

   INTEGER            OCTET STRING       IA5String
   BMPString          BOOLEAN            NULL
   SEQUENCE           CHOICE             OBJECT IDENTIFIER
   SEQUENCE OF        SIZE               OPTIONAL
   ...

   Table 1. Subset of ASN.1 keywords

To provide compatibility with existing ASN.1, some ASN.1 terms are aliased to the sub-set above.
These include:

   1. SET is aliased to SEQUENCE,
   2. NumericString and GeneralString are aliased to OCTET STRING.

To ease understanding by those not familiar with ASN.1, the following terms are also adopted:

   1. ASCIIString is treated as an alias of IA5String,
   2. UnicodeString is treated as an alias of BMPString.

Additional constraints are that ranges cannot be extensible (e.g. INTEGER( 0..56,... ) is illegal), and that a SEQUENCE OF cannot be a direct option within a CHOICE. The latter simplifies the line encoding. (Instead, put the SEQUENCE OF within a SEQUENCE, which can then be part of the CHOICE.)

These limitations mean that there is no support for ASN.1 macros, and restricted alphabets cannot be used directly (although comments in the syntax can be used to define such restricted alphabets).

3. Support for Extensibility

To allow for extensibility, the syntax includes the terms EMBEDDED, AS and PLUGIN.

EMBEDDED is a type that transports a pre-encoded message fragment using the same encoding rules as the message that contains it. It is the responsibility of the protocol designer to define syntax that indicates what the embedded type is. An example of EMBEDDED is:

   encapsulated-protocol EMBEDDED,

AS and PLUGIN support direct extensibility, either proprietary or defined by a standards body, within the message definition.

PLUGIN indicates that the parameter is an extension and thus must use an ASCII tag. This is principally of significance when the message is encoded in a binary form. However, it is included here because a message definition should not preclude particular message encoding techniques, and it makes the definition more generic.

The AS keyword can optionally be used to specify the value of the ASCII tag to be used. If an AS field is not included with a PLUGIN, then the value name is used as the tag on the wire.
For example, the following can be used to declare a new parameter which has an ASCII tag on the wire called "myparam.mycompany.com":

   myparam AS myparam.mycompany.com INTEGER( 0..32767 ) PLUGIN OPTIONAL,

A special use of the AS notation allows the protocol specification to indicate that no tag should be used for the parameter. This is indicated by specifying "AS ?" (this should become clearer in the description below). This has the form:

   version AS ? INTEGER( 0..100 ),

This notation cannot be used on SEQUENCE OF parameters, OPTIONAL parameters, PLUGINs, members of a CHOICE, or any parameter that follows such a parameter.

4. Examples of the Syntax

Before defining the ASCII line format, an example of the above syntax rules is presented. This introduces the example that will be encoded later, and gives someone who is not familiar with ASN.1 syntax a quick overview.

A definition for a (complicated) ASN.1 message that does not employ sub-definitions may look as follows:

   startup ::= SEQUENCE
   {
       sequence_no   AS ? INTEGER( 1..65535 ),
       host-name     AS ? IA5String( SIZE( 1..128 ) ),
       user-name     BMPString( SIZE ( 1..64 ) ),
       gUID          OCTET STRING ( SIZE( 16 ) ),
       activated     BOOLEAN,
       modes         SEQUENCE
       {
           highmode  BOOLEAN,
           lowmode   BOOLEAN,
           ...
       },
       response      CHOICE
       {
           acknowledge  NULL,   -- NULL indicates no further data
           silent       NULL,
           informGroup  INTEGER( 0..65535 ),   -- Address to send
                                               -- group response to
           ...
       },
       id            INTEGER( 1..256 ) OPTIONAL,
       protocol      OBJECT IDENTIFIER,
                         -- Set to { ietf (3) wg (0) newproj (0) }
       node_alerts   SEQUENCE OF INTEGER( 0..65535 ),
       complex       SEQUENCE SIZE (1..4) OF SEQUENCE
       {
           admin_node  INTEGER( 0..256 ),
           user_id     INTEGER( 0..256 ),
           mode        SEQUENCE
           {
               video  BOOLEAN,
               audio  BOOLEAN,
               data   BOOLEAN,
               ...
           } OPTIONAL,
           ...
       },
       my-extension  AS mine.bigco.com INTEGER( 1..3 ) PLUGIN OPTIONAL,
       ...
   }

From this it can be seen that there are some basic types, including INTEGER, IA5String, OCTET STRING and BOOLEAN. There are also two compound constructs, these being SEQUENCE and CHOICE.
A SEQUENCE is similar to a structure (struct) in C and a CHOICE is similar to a C union (however, the chosen option in the CHOICE is also recorded, which is not the case for a C union). Additionally, you can have a SEQUENCE OF the above types, and elements can be OPTIONAL.

In a SEQUENCE OF construct there can be more than one of the specified component. The number of components may be constrained or unconstrained (see complex and node_alerts respectively).

Elements that are marked OPTIONAL can be absent in a correctly formed message. All other elements must be present for the message to be valid.

The ... tokens are called extension markers. They tell the compiler that anything that follows has been defined in a subsequent version, and so should be treated as optional irrespective of whether it is marked as OPTIONAL. (In the scheme described here, the extension markers act as no more than a shorthand way to say that all the following parameters are optional. It is not necessary to include the extension markers in the first release of the message definition, and it is possible to extend a message even if there were previously no extension markers.)

Finally, it is not necessary to map each parameter directly to a base type (e.g. INTEGER, OCTET STRING). For example, the definition above might be written as:

   startup ::= SEQUENCE
   {
       sequence_no   AS ? Seq_no,
       host-name     AS ? IA5String(SIZE(1..128)),
       user-name     BMPString( SIZE ( 1..64 ) ),
       gUID          Conference_ID,
       activated     BOOLEAN,
       modes         Modes,
       . . .

and elsewhere the following definitions would appear:

   Seq_no ::= INTEGER(1..65535)

   Conference_ID ::= OCTET STRING (SIZE(16))

   Modes ::= SEQUENCE
   {
       highmode  BOOLEAN,
       lowmode   BOOLEAN,
       ...
   }

This is a better way to write the message definition, for all the reasons that you would do the same in any piece of software. It does not affect the coding method, as it is a process of macro expansion to get back to the message definition we started with.

5. ASCII Line Encoding

Now let's consider how the message definition can be represented in a text format. The basic mechanism is to encode all items as:

   tag = value

By doing this consistently, parsers can skip fields they don't understand.

An INTEGER can simply be represented as a printable string of the number. (The range of the number is not so important to the line format when represented in this way. However, the number range is probably important to the application.) The form is:

   id [ "-" ] 1*( "0-9" ) LWS

where:

   id  = tag "=" /   ; Normal case
         "=" /       ; Option when parameter is following member
                     ; in a SEQUENCE OF
         ""          ; When "AS ?" is used to request that a tag
                     ; is not used
                     ; These variations will be discussed further below

   LWS = " " / tab / CRLF

e.g. (ignoring the AS ? specifier):

   sequence_no = 125

An IA5String is represented as a string in quotes including any necessary backslash escapes, as in:

   id "\"" *( Printable characters with standard C escapes \r, \n, \t, \l, \\, and \" ) "\"" LWS

e.g. (ignoring the AS ? specifier):

   host-name = "Zebedee"

A BMPString is similarly encoded using UTF-7 within single quote marks, as in:

   id "'" *( UTF-7 coded characters ) "'" LWS

e.g.:

   user-name = 'Pete Cordell'

The value of an OCTET STRING is represented with a leading character x, and then each OCTET is coded as the ASCII representation of two hexadecimal digits that encode the upper 4 bits followed by the lower 4 bits respectively, as in:

   id "x" 1*( ( "0-9" / "A-F" / "a-f" ) ( "0-9" / "A-F" / "a-f" ) ) LWS

For example:

   gUID = x0f1b6c0dbcad01230f1b6c0dbcad0123

BOOLEANs are coded simply as TRUE and FALSE, as in:

   id ( "TRUE" / "FALSE" ) LWS

e.g.:

   activated = TRUE

The OBJECT IDENTIFIER type (which is included mainly for backwards compatibility with previously defined ASN.1) is encoded as:

   id oid-value-list LWS

where:

   oid-value-list = oid-value-list "-" 1*("0-9") / 1*("0-9")

e.g.:

   protocol = 3-0-0

The SEQUENCE is coded by including the elements of the sequence in brackets (), as:

   id "(" *(encoded-type) ")" LWS

for example:

   modes = ( highmode = TRUE lowmode = FALSE )

Doing this allows the complete sequence to be skipped if the parameter is not understood, or it is of no interest. This is important for backwards compatibility.

A CHOICE is encoded using a similar scheme to the SEQUENCE, as in:

   id "[" (encoded-type) "]" LWS

e.g.:

   response = [ acknowledge = NULL ]

or:

   response = [ informGroup = 137 ]

In practice many CHOICE options map to NULL (an example of which is shown above), which is inefficient in terms of characters sent and tedious to write by hand. Conversely, this presents little problem to a program scanning and generating the text, as it consistently maintains the X=Y format. However, on the whole a shorthand notation for the above is preferable, and this is represented as follows:

   response = [ acknowledge ]

(Note that the brackets are still important, as they show that acknowledge comes from a CHOICE statement. Note also that this shorthand form of X=NULL cannot be used in a SEQUENCE.) An implementation must recognise both formats.

OPTIONAL items are either present or not present, so an example of an absent item cannot usefully be shown.

When multiple items of the same type are included in a message using the SEQUENCE OF encoding, this can be done simply by including the item multiple times, as in:

   node_alerts = 0 node_alerts = 5000 node_alerts = 12

This is quite wasteful in terms of characters, and so the following compacted encoding can be used:

   node_alerts = 0 = 5000 = 12

The rule that allows this is that if you get the = token when you expected to retrieve a tag, you use the most recently collected tag for the current level of SEQUENCE / CHOICE nesting.

As a final, extended example, the two 'complex' components shown above can be encoded as:

   complex = ( admin_node = 20 user_id = 6
               mode = ( video = TRUE audio = TRUE data = FALSE ) )
           = ( admin_node = 5 user_id = 5 )

To form a complete encoding, each of the parameters within the outermost definition is encoded and then concatenated to the end of the previous output. Note that the name of the outermost definition is not encoded, as this conveys no information.

A complete example of the message would be (taking into account the AS ? specifications):

   125 "Zebedee"
   user-name = 'Pete Cordell'
   gUID = x0f1b6c0dbcad01230f1b6c0dbcad0123
   activated = TRUE
   modes = ( highmode = TRUE lowmode = FALSE )
   response = [ informGroup = 137 ]
   id = 12
   protocol = 3-0-0
   node_alerts = 0 = 5000 = 12
   complex = ( admin_node = 20 user_id = 6
               mode = ( video = TRUE audio = TRUE data = FALSE ) )
           = ( admin_node = 5 user_id = 5 )
   mine.bigco.com = 3
   -- See comment in paragraph below about marking the end of a message

In the above message there is no implicit identification of the end of the message. Therefore an additional closing bracket (either ")" or "]") must be appended to the message to delimit its end.
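
The encoding rules of this section can be illustrated with a short, non-normative sketch. Python is used here purely for brevity. The mapping from Python types to message types, the Choice class and all function names are assumptions made for illustration and are not part of this specification. Only tagged parameters are covered; BMPString, OBJECT IDENTIFIER and untagged "AS ?" parameters are omitted.

```python
from dataclasses import dataclass


@dataclass
class Choice:
    """Hypothetical wrapper: the selected option of a CHOICE."""
    name: str
    value: object = None  # None stands for NULL


def encode_value(value):
    """Encode one value (without its tag) in the ASCII line format."""
    if isinstance(value, bool):   # test bool first: bool is an int subclass
        return "TRUE" if value else "FALSE"
    if isinstance(value, int):    # INTEGER: printable decimal string
        return str(value)
    if isinstance(value, bytes):  # OCTET STRING: "x" then two hex digits per octet
        return "x" + value.hex()
    if isinstance(value, str):    # IA5String: double quotes with C-style escapes
        escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                        .replace("\r", "\\r").replace("\n", "\\n")
                        .replace("\t", "\\t"))
        return '"' + escaped + '"'
    if isinstance(value, Choice):
        if value.value is None:   # shorthand form of "name = NULL"
            return "[ " + value.name + " ]"
        return "[ " + encode_param(value.name, value.value) + " ]"
    if isinstance(value, dict):   # SEQUENCE: members enclosed in ( )
        parts = [encode_param(tag, member) for tag, member in value.items()]
        return "( " + " ".join(p for p in parts if p) + " )"
    raise TypeError("no encoding defined for " + type(value).__name__)


def encode_param(tag, value):
    """Encode a tagged parameter; a list is treated as a SEQUENCE OF."""
    if isinstance(value, list):
        if not value:             # an empty SEQUENCE OF contributes nothing
            return ""
        # Compact form: the tag once, then "= value" for every member.
        return tag + " " + " ".join("= " + encode_value(v) for v in value)
    return tag + " = " + encode_value(value)


def encode_message(params):
    """Concatenate the outermost parameters and append the closing bracket."""
    parts = [encode_param(tag, value) for tag, value in params.items()]
    return " ".join(p for p in parts if p) + " )"
```

A call such as encode_message({"activated": True, "node_alerts": [0, 5000, 12]}) yields a message in the compact SEQUENCE OF form, terminated by the delimiting bracket.
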
With the closing bracket appended, the last lines of the message should in fact be:

   mine.bigco.com = 3
   )

Note that for each tagged parameter the order is not important. Therefore, the above message could equally be represented as:

   125 "Zebedee"
   activated = TRUE
   node_alerts = 0
   user-name = 'Pete Cordell'
   sequence_no =
   modes = ( highmode = TRUE lowmode = FALSE )
   id = 12
   protocol = 3-0-0
   node_alerts = 5000 = 12
   complex = ( admin_node = 20 user_id = 6
               mode = ( video = TRUE audio = TRUE data = FALSE ) )
   response = [ informGroup = 137 ]
   complex = ( admin_node = 5 user_id = 5
               -- We can have comments too
             )
   gUID = x0f1b6c0dbcad01230f1b6c0dbcad0123
   )

6. Other Forms of Line Encoding

It should be noted that what we really have here is a separation of the message definition from the way it is represented on the wire. Therefore, the above line-encoding scheme is not the only possible form. Indeed, it may be possible to use multiple encodings for the syntax within the same system. This would allow a debug mode which could use the scheme described above with additional comments, and a more processor-efficient or bandwidth-efficient scheme (if such a scheme can be found) for normal operation.

7. Forming Complete Messages

There is a tendency within the ASN.1 world to see ASN.1 as the only way to generate messages. This is not always appropriate, such as when large amounts of data are streamed from a disk, or the contents of a message are being digitally signed. By considering the above scheme as one tool within a message encoding tool chest that contains a number of tools, the burdens on this form of message encoding can be significantly reduced.

Thus, when using the above scheme, especially when authentication might be required, it is suggested that the above be placed in a wrapper as one of possibly a number of message fragments. The scheme adopted will depend on whether a binary representation is permissible or whether it is necessary to have a complete ASCII encoding.
Possible schemes are not discussed further here, but the point is raised because it is an important issue that must be considered right at the start of protocol development, as retrofitting such a feature is rarely possible.

8. BNF Notation for Message Definition Syntax

Below is a summary of the message definition syntax in a BNF style. Note that, although considerable effort has been put into getting the definition correct, it may still contain some errors.

   Syntax-definition = *( root-def )

   root-def = name "::="
              [ "SEQUENCE" [ "SIZE" "(" digits ".." digits ")" ] "OF" ]
              type

   name = identifier

   type = compound-type / simple-type / string-type / referenced-type

   compound-type = seq-type / choice-type

   seq-type = "SEQUENCE" "{" seq-line-list "}"

   seq-line-list = seq-line / seq-line-list "," seq-line

   seq-line = seq-element / extension

   seq-element = name [ "AS" ( tag / "?" ) ] [ numerical-tag ]
                 [ "SEQUENCE" [ "SIZE" "(" digits ".." digits ")" ] "OF" ]
                 type [ "PLUGIN" ] [ "OPTIONAL" ] [ "IGNORE" ]

   choice-type = "CHOICE" "{" choice-line-list "}"

   choice-line-list = choice-line / choice-line-list "," choice-line

   choice-line = choice-element / extension

   choice-element = name [ "AS" tag ] [ numerical-tag ] type [ "PLUGIN" ]

   tag = identifier

   numerical-tag = "[" digits "]"

   simple-type = "BOOLEAN" / "OCTET" / "NULL" / integer-type

   integer-type = "INTEGER" [ "(" digits ".." digits ")" ]

   string-type = char-string / non-char-string

   char-string = ( "IA5String" / "BMPString" )
                 [ "(" "SIZE" "(" digits [ ".." digits ] ")" ")" ]

   non-char-string = ( "OCTET STRING" / "EMBEDDED" )
                     [ "SIZE" "(" digits [ ".." digits ] ")" ]

   referenced-type = identifier

   extension = "..."

   identifier = ( "A-Z" / "a-z" )
                *( "A-Z" / "a-z" / "0-9" / "-" / "." / "_" )
                ; i.e. Alphanumeric string with leading
                ; alphabetic character

   digits = [ "-" ] ( decimal / hex )

   decimal = *( "0-9" )

   hex = "0x" *( "0-9" / "A-F" / "a-f" )

9. Techniques for Encoding and Decoding Messages

The above message definition syntax is formal enough that message compilers can be built to convert messages on the wire to and from programming language constructs such as C structures and unions. However, this is not the only way such encoding and decoding can be done, and may not even be the best way.

When encoding, simple common tools such as sprintf can be used. When decoding, the biggest problem is locating a desired parameter in an efficient manner. One way to do this is to recognise that the encoded message is a form of tree that can be readily represented using a multidimensional linked list in which each element stores the name and location of a parameter in the message. In most cases one parameter follows another, and a pointer to the 'next' item can capture this. In the case of SEQUENCEs and CHOICEs, the structure of the tree forks, and so another pointer is required to point to the forked set of parameters.

Once parsed, the tree can be used to readily locate parameters. A parameter can be identified by the route through the tree that has to be traversed to locate it. The names along this route can be concatenated into a string (using a suitable separator) to indicate the desired parameter. It should also be noted that, due to SEQUENCE OF, there might be more than one parameter with the same name. Therefore it is necessary to specify the instance of the parameter you desire.

Using these principles, a function call for accessing the value of a boolean in the message above might look like:

   wasFound = getBoolean( parseTreeRootPointer,
                          "complex:mode:video",  // Parameter name
                          0,                     // Instance
                          &myBoolean )           // Where result is placed

Some example code that demonstrates the above principles has been developed and is located at:

   http://www.tech-know-ware.com/messaging.html

10. Alternative Message Definition Syntax

The above message definition syntax is very closely related to ASN.1.
However, this is not the only form such a syntax can take. Megaco has adopted a more BNF-style notation for the preliminary definition of messages. This section adopts the spirit of the Megaco message definition scheme and formalises it so that it becomes more amenable to the techniques described above.

It should be noted that in defining such a syntax there is a risk that readers treat it as BNF when in fact it is quite different. This should be considered when deciding whether or not to adopt this scheme.

Any given parameter may appear once in a message, be optional, or appear multiple times. This can be captured as follows:

   parameter = "(" parameter-description ")" /
                     ; Parameter always appears once
               "[" parameter-description "]" /
                     ; Parameter is optional
               "*(" parameter-description ")" /
                     ; Appears zero or more times
               "1*(" parameter-description ")" /
                     ; Appears 1 or more times
               digits "*" digits "(" parameter-description ")"
                     ; Appears a range number of times

Parameters appear either one after another, or only one of a possible set of parameters appears. Thus we have the constructs:

   parameter-list = name "=" *parameter

and:

   parameter-switch = name "=" parameter-switch-list

   parameter-switch-list = parameter-switch-list "/" parameter / parameter

Each parameter has a name, a tag that is used to represent it on the wire, and a type. By default the tag that represents it on the wire is the same as its name. Alternatively, a parameter may have no tag on the wire. Therefore, a parameter-description is as follows:

   parameter-description =
               ( name type [PLUGIN] ) /      ; Tag on wire is same as name
               ( name tag type [PLUGIN] ) /  ; Use specified tag on wire
               ( name ? type )               ; Has no tag on wire

The type of a parameter may be as follows:

   type = "UNSIGNED8" / "UNSIGNED16" / "UNSIGNED32" / "UNSIGNED64" /
          "SIGNED8" / "SIGNED16" / "SIGNED32" / "SIGNED64" /
          "BOOLEAN" / "NULL" / "OID" /
          strings / compound / referenced

   strings = string-type /                          ; Unbounded string
             [ digits [ * digits ] ] string-type /  ; Bounded string
             digits string-type                     ; Fixed sized string

   string-type = "ASCII" / "Unicode" / "OCTETS" / "EMBEDDED"

   compound = embedded-parameter-list / embedded-parameter-switch

   embedded-parameter-list = "{" *parameter "}"

   embedded-parameter-switch = "{" parameter-switch-list "}"

A complete message definition has the form:

   message-definition = 1*( parameter-list / parameter-switch )

and to wrap up:

   name = tag = referenced =
          ( "A-Z" / "a-z" ) *( "A-Z" / "a-z" / "0-9" / "-" / "." / "_" )

Using this notation to express the example message definition above yields:

   startup =
       ( sequence_no ? UNSIGNED16 )
       ( host-name ? 1*128ASCII )
       ( user-name 1*64Unicode )
       ( gUID 16OCTETS )
       ( activated BOOLEAN )
       ( modes {
           ( highmode BOOLEAN )
           ( lowmode BOOLEAN )
       } )
       ( response {
           ( acknowledge NULL ) /       ; NULL indicates no further data
           ( silent NULL ) /
           ( informGroup UNSIGNED16 )   ; Address to send
                                        ; group response to
       } )
       [ id UNSIGNED8 ]
       ( protocol OID )   ; Set to { ietf (3) wg (0) newproj (0) }
       *( node_alerts UNSIGNED16 )
       1*4( complex {
           ( admin_node UNSIGNED8 )
           ( user_id UNSIGNED8 )
           [ mode {
               ( video BOOLEAN )
               ( audio BOOLEAN )
               ( data BOOLEAN )
           } ]
       } )
       [ my-extension mine.bigco.com UNSIGNED8 PLUGIN ]

11. Security Considerations

This specification defines a method for encoding messages into clear ASCII text. If such messages need to be authenticated or encrypted, the output of the message encoding performed by this specification will need some form of post-processing. Readers are directed to the section "Forming Complete Messages" for further information on this topic.

12. Conclusions

The above shows that the benefits of ASCII line coding can be combined with the benefits of a formal syntax definition, thus simplifying both the definition and implementation of protocols.

13. Author's Address

   Pete Cordell
   Tech-Know-Ware Ltd
   Ipswich
   U.K.
   e-mail: pete@tech-know-ware.com
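
As a non-normative illustration of the decoding technique described in the section "Techniques for Encoding and Decoding Messages", the following is a minimal sketch of a parse tree with path-based lookup. It is written in Python for brevity and is not the example code referred to in that section. Nested lists of (tag, value) pairs stand in for the linked list with 'next' and fork pointers, and the function names tokenize, parse_message, find_all and get_boolean are hypothetical. The sketch assumes a message whose parameters are all tagged and whose end is marked by the closing ")"; untagged "AS ?" values and "--" comments are not handled.

```python
import re

TOKEN = re.compile(r"""
    "(?:\\.|[^"\\])*"      # IA5String: double quotes with backslash escapes
  | '[^']*'                # BMPString: single quotes
  | [()\[\]=]              # brackets and the = token
  | [^\s()\[\]="']+        # bare word: tag, number, TRUE/FALSE, x-prefixed hex
""", re.VERBOSE)


def tokenize(text):
    """Split an encoded message into tokens."""
    return TOKEN.findall(text)


def parse_value(tokens, pos):
    """Parse one value; SEQUENCE and CHOICE become nested parameter lists."""
    if tokens[pos] == "(":
        return parse_params(tokens, pos + 1, ")")
    if tokens[pos] == "[":
        return parse_params(tokens, pos + 1, "]")
    return tokens[pos], pos + 1   # leaf values are kept as raw token strings


def parse_params(tokens, pos, closer):
    """Parse parameters up to `closer`; return ([(tag, value), ...], new pos)."""
    params, last_tag = [], None
    while pos < len(tokens) and tokens[pos] != closer:
        if tokens[pos] == "=":    # bare "=": reuse the latest tag at this level
            if last_tag is None:
                raise ValueError("'=' found before any tag")
            tag = last_tag
        else:
            tag = tokens[pos]
            pos += 1
            if pos >= len(tokens) or tokens[pos] != "=":
                params.append((tag, None))   # CHOICE shorthand for NULL
                last_tag = tag
                continue
        value, pos = parse_value(tokens, pos + 1)   # pos + 1 skips the "="
        params.append((tag, value))
        last_tag = tag
    return params, pos + 1        # step past the closing bracket


def parse_message(text):
    """Build the parse tree for a complete, ")"-terminated message."""
    tree, _ = parse_params(tokenize(text), 0, ")")
    return tree


def find_all(params, names):
    """Collect, in message order, every value reached by the list of names."""
    head, rest = names[0], names[1:]
    found = []
    for tag, value in params:
        if tag != head:
            continue
        if not rest:
            found.append(value)
        elif isinstance(value, list):   # descend into SEQUENCE / CHOICE
            found.extend(find_all(value, rest))
    return found


def get_boolean(tree, path, instance=0):
    """Return True/False for the given instance of `path`, or None if absent."""
    found = find_all(tree, path.split(":"))
    if instance >= len(found) or found[instance] not in ("TRUE", "FALSE"):
        return None
    return found[instance] == "TRUE"
```

With the tree returned by parse_message, a call such as get_boolean(tree, "complex:mode:video", 0) plays the role of the getBoolean call shown in that section. In this sketch the instance counts matches of the whole path in message order, which is one possible reading of "instance"; the specification does not define it more precisely.
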