Internet-Draft                                       H. Alvestrand 
draft-alvestrand-i18n-howto-01.txt                   Cisco Systems 
Target Category: Informational                       November 2001 
                                                 Expires: May 2002 
 
 
Protocol Redesigner's Handbook - volume i18n 


Guidelines for internationalization of protocols 

 
Status of this Memo 

     The file name of this memo is draft-alvestrand-i18n-howto-01.txt 

     This document is an Internet-Draft and is in full conformance with 
     all provisions of Section 10 of RFC 2026. 

     Internet-Drafts are working documents of the Internet Engineering 
     Task Force (IETF), its areas, and its working groups.  Note that 
     other groups may also distribute working documents as Internet-
     Drafts. 

     Internet-Drafts are draft documents valid for a maximum of six 
     months and may be updated, replaced, or obsoleted by other 
     documents at any time.  It is inappropriate to use Internet-Drafts 
     as reference material or to cite them other than as "work in 
     progress." 

     The list of current Internet-Drafts can be accessed at 
     http://www.ietf.org/ietf/1id-abstracts.txt 

     The list of Internet-Draft Shadow Directories can be accessed at 
     http://www.ietf.org/shadow.html. 

Discussion on this draft should be directed to the mailing list
intloc-discuss@ops.ietf.org. This is NOT an open mailing list. 
Guidelines for protocol internationalization     Harald Alvestrand 
draft-alvestrand-i18n-howto-01.txt                Expires May 2002 
 
Abstract 

This document attempts to give guidelines for the people who have to 
deal with existing protocols where issues of languages and character 
sets were not considered from the beginning, and tries to help them a 
little along the way. Some of the advice might also be useful for 
people designing new protocols. 

With new protocols, the document might help in getting the 
internationalization right in the first attempt; at this stage, we all 
know that protocols MUST be internationalized. 


 
Protocol Redesigner's Handbook - volume i18n

Guidelines for internationalization of protocols

Status of this Memo

Abstract

1. Introduction

2. Classes of information

3. Designing Internet internationalization

 3.1  Basic concepts for the Internet

 3.2  Internationalization components outside IETF scope

 3.3  Operations likely to be impacted by internationalization

 3.4  How to tell whether you have a script problem or a language
      problem

4. Specific sorting, matching and canonicalization options

 4.1  Internationalized encodings

 4.2  Normalization

5. Tricks to shoehorn stuff into older protocols

6. Security Considerations

7. Acknowledgements

8. Author's Address

9. References

 
1. Introduction 

 
Human beings on our planet have, past and present, used a number of 
languages. 


These have been represented in a number of media using a variety of 
encoding systems, most commonly in scripts using some kinds of 
characters. 

These days, humans want to use the Internet to communicate between 
themselves, and to interact with information stores on the Internet, 
and see no reason to learn a new language in order to do so. 

This means that they have to use Internet protocols to communicate. And 
they will want to represent the scripts they are used to from off the 
Internet when they use the Internet protocols. 

And they expect the Right Thing to happen. 

This document talks about what doing the Right Thing means. 


2. Classes of information 

Most protocols are designed with pieces that belong in various 
categories: 

<<this section should have examples>> 

A. Protocol elements, defined by the protocol designer, which should 
  never be shown to the user, and are never changed. 
  Examples: Verbs in the SMTP protocol [RFC2822], SNMP object 
  identifiers [SNMP] 

B. Managed-namespace identifiers, defined by some orderly process, 
  intended to be used by any protocol user anywhere, often through 
  interfaces that hide the actual values, but sometimes directly. 
  Examples: Language tags [RFC3066], URI schemes [URLREG] 

C. Global-scope identifiers, intended for visibility to any user who has 
  a use for them anywhere, but not completely managed by a central 
  authority 
  Examples: DNS names, URLs, IP addresses, user@domain email addresses 

D. Local-scope identifiers, intended for visibility to a small set of 
  users, but may be visible in several contexts 
  Reasonable usage of such identifiers means that it is possible to 
  appeal to some shared context in order to decide what it "means" 
  Examples: login account names, filenames within a directory, port 
  numbers on a host 

E. Data elements, intended for visibility within a certain context only 
  Examples: Text of email messages, Web page content, instant messages, 
  subject lines in mail 

Internationalizing an identifier or a data element in this context 
means making it capable of representing information relevant to any 
user, no matter which script or language this user uses. This may 
involve dealing with character representation, processing rules, 
language tagging, language negotiation or other functions as 
appropriate. 

For each element to be considered, there are 3 alternatives: 

1. State that the element is immutable, invisible and inviolable, and 
  therefore internationalization is irrelevant (and the 
  protocol/product designer should try REALLY hard to make sure the 
  user never knows or needs to know the value) 

2. State that the element has to be in a very limited representation 
  (such as the A-Z 0-9 character repertoire) so that it can be globally 
  recognized and entered (822 headers, language tags) 
  (the protocol designer might reasonably want the user to get at the 
  value of the element, but should not depend on the user associating 
  anything meaningful with the identifier) 

3. State that the element is a textual element for which the user 
  decides the appropriate content.  Basically, it has to be 
  internationalized. 

 
Internationalization requirements started out with data content (MIME 
for email, for instance), and are working their way up the chain. For a 
long time (see [IAB-ARCH], for instance), we thought that global-scope 
identifiers like DNS names should be kept in category 2 (limited 
repertoire), but increasing pressure from the community of people who 
do not use ASCII in their daily lives has led to a reconsideration here 
(IDN). 

The current thinking of the group discussing this document, which is 
suggested as IETF policy, is that protocol elements (A) and most if not 
all managed-namespace identifiers (B) should be treated according to 
alternative 2 above; their values should be binary, numeric or 
invariant-subset ASCII. This makes testing and debugging easier, and 
does not limit the expressive power of any protocol. 

Note: Experience in the IETF is that implementers are lousy at hiding 
things from the users, and users are often very fond of finding the 
things implementers think should be hidden. That most people now know 
that http:// means "you can look it up in a browser" is unsurprising; 
the colloquial use of "404" (the HTTP error code for "document does not 
exist") as a synonym for "not where he should be" is perhaps more so. 


3. Designing Internet internationalization 

3.1 Basic concepts for the Internet 

The fundamental difference between common 
internationalization/localization and Internet protocol 
internationalization is this: 

ON THE INTERNET, THE TWO ENDS OF THE COMMUNICATION CANNOT BE ASSUMED TO 
BE IN THE SAME PLACE. 

This means, in particular, that: 

. The two ends of the communication may not share a common external 
  context such as a "locale"; quite commonly, the two ends are in 
  different countries, and may not even know (or care) what country the 
  other end of the conversation is in. 

. The two ends of the communication do not necessarily have ANY common 
  knowledge except for the implementation of the protocol. With 
  implementations in local networks, not even Internet access can be 
  assumed, so even references to Internet-accessible resources are not 
  guaranteed to work. 

This means that: 

. ALL information required for correct operation of the protocol must 
  be specified in the protocol documentation, or be carried in the 
  communication between the parties 

. When user preferences are involved, and multiple values are possible, 
  the specification must guarantee a least common subset of 
  identifiers, and properly handle the enumeration of identifiers (for 
  instance by IANA registration). 

One note that has more to do with psychology of developers than with 
correct specification: 

It is better to fill in a field than to specify a default in a protocol 
specification. 

At times, protocols have stated a "default value" and added a parameter
to change this value. Sometimes, for instance with the HTTP content-type
field, which had a "charset" parameter for the "text/html" type,
implementers reinterpreted the absence of the parameter as "anything
goes", and let their implementations ship anything they wanted without
labeling it, leaving the recipient to guess at charsets. This had
predictably dire consequences, and has led some people to believe that
it is better to "waste" the bytes required to always specify explicitly
what a parameter is, instead of relying on a default.
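
As an illustration only, a receiver that refuses to guess could be
sketched as follows. The function name declared_charset is
hypothetical; the parsing relies on Python's standard email package,
and rejecting unlabeled input is one possible policy, not something
this memo mandates.

```python
from email.message import Message


def declared_charset(content_type: str) -> str:
    """Return the charset named in a Content-Type value, or raise if unlabeled."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # None when no charset parameter is present
    if charset is None:
        # Refuse to guess: an absent label is treated as an error, not a default.
        raise ValueError("no charset parameter in %r" % content_type)
    return charset


# The library reports the label in lower case.
print(declared_charset("text/html; charset=UTF-8"))
```

Calling declared_charset("text/html") raises ValueError in this sketch,
which makes the missing label visible instead of silently applying a
default.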

 
When discussing internationalization, it is also very important to use 
common terminology. The terminology of this field is littered with 
seemingly simple words that are used for different things by different 
people, with "character set", "script" and "language" being high on the 
list of abused terms. Refer to [Hoffman]. 

3.2 Internationalization components outside IETF scope 

Internationalizing a program or a service involves much more than the 
protocols. But these other matters are not IETF issues, and do not 
impinge upon the IETF standards process except indirectly. 

In particular: 

. The IETF does not standardize user interfaces. This means that input 
  methods, display methods and display characteristics are out of scope 
  for the IETF. (However, information about such methods and 
  characteristics may at times have to be communicated using parameters 
  of IETF protocols.) 

. The IETF does not standardize APIs, except for the rare case of an 
  API to a protocol 

This also means that the presentation of data, and conversions upon 
data performed in order to do presentation, is outside the scope of 
IETF standards, while conversions upon data in order to do protocol 
operations are in scope (and may possibly be reused for presentation 
purposes). The IETF standards are chiefly concerned with communicating 
the data needed, not how the data are presented. But the separation can 
be unenforceable at times; we have a long history of defining data 
representations "as seen by the user" - see, for instance, RFC 1685, 
which talks about how to write down X.400 email addresses. 

3.3 Operations likely to be impacted by internationalization 

A basic level of internationalization is text representation. A 
protocol where it is not possible to send an Arabic letter SAD 
(U+0635), and let the recipient recognize this as such, is useless for 
communication in Arabic. 

This was addressed in RFC 2277, "IETF Policy on Character Sets and 
Languages". 

This is sufficient for handling text where that text is not treated 
further by the protocol endpoint entities. 

But there are a number of common operations that require the protocol 
designer to do more thinking and specification when dealing with an 
internationalized context: 


. Equality tests, for instance deciding whether a typed string is 
  identical to a username, or (even worse) a password 

. Matching. If the protocol has any operation where one party gives a 
  text element, and the other party performs an action based on the 
  content of that text element, matching must take place. This needs 
  specification. 
  Typical sources of confusion include: 

  . What characters match (does a SPACE match a NON-BREAKING SPACE? 
     Does A match a? Does LATIN LETTER A match GREEK LETTER ALPHA?) 

  . What, if anything, is used as "units" in a match? The concept of 
     "word" can get very tricky with languages like Thai, which often 
     do not use word separator characters. 

  . How many characters there are. This is especially a problem when 
     one uses "regular expressions", which can specify (for instance) 
     "A and B, with exactly one character between them": does A 
     followed by COMBINING RING ABOVE followed by B match or not? 

. Collation (sometimes called sorting). If the protocol requires 
  elements to appear in a consistent order, collation needs 
  specification. Collation will often need far more information than 
  matching in order to provide the results the user expects; a 
  collation based on codepoint value ("binary sort") is useless to the 
  user except for the rare case where he does not care what the order 
  is, as long as it is consistent. 
  A common example is the case sensitivity problem; on Unix with the 
  "C" locale, "Bread" is sorted before "apples", while under Windows, 
  "Bread" is sorted after "apples", because Windows disregards case 
  when sorting file names. 

. Canonical forms. If the protocol ever expects to binary compare two 
  objects for equality, or compute checksums over the objects as done 
  for digital signatures, the implementations will often want to 
  increase the probability that if a human looking at the data in the 
  object thinks that it is unchanged, it actually compares equal. The 
  most common method of doing this is to define a single "canonical" 
  form for the data. 

. Field truncation. In single-byte encodings, one is guaranteed that a 
  field value produced by truncating a longer value is at least a valid 
  string. With multibyte encodings, this is not the case; with 
  variable-length encodings like UTF-8, there is no way to know without 
  inspecting the string where legal truncation points may be. (In UTF-8 
  one can find a legal point by inspecting relatively few octets around 
  the cut point; in ISO 2022 based encodings, it may require 
  significantly more effort) 

. Checks for legal and illegal characters. In some cases, one wants to 
  specify things like "no spaces". One then has to consider whether 
  this means no SPACE (U+0020), no space separators (Unicode general
  category Zs), or no whitespace at all (a class that includes TAB, for
  instance).

. Bi-directional issues. If a protocol element (for instance a URI or 
  a domain name) contains multiple elements of different 
  directionality, what is the directionality of the separator elements? 
  (This makes display REALLY awkward.) 
  An example treatment of this problem can be found in [IRI-BIDI]. 
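
Three of the operations above (counting characters, collation and
field truncation) can be illustrated with a short Python sketch. The
helper name truncate_utf8 is hypothetical, and the "perceived
character" count is a rough approximation that only skips combining
marks; it is not a full text segmentation algorithm.

```python
import unicodedata

# "A", COMBINING RING ABOVE, "B": three code points, two perceived characters.
s = "A\u030aB"
code_points = len(s)
# Rough count of user-perceived characters: skip combining marks.
perceived = sum(1 for ch in s if unicodedata.combining(ch) == 0)

# Codepoint ("binary") order versus a case-insensitive order.
names = ["apples", "Bread"]
binary_order = sorted(names)                      # uppercase sorts first
caseless_order = sorted(names, key=str.casefold)  # case is disregarded


def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut UTF-8 data to at most `limit` octets without splitting a character."""
    if len(data) <= limit:
        return data
    cut = limit
    # Continuation octets look like 10xxxxxx; back up until the first
    # dropped octet is the start of a character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


field = "a\u0635b".encode("utf-8")   # 4 octets: SAD takes two of them
print(code_points, perceived, binary_order, caseless_order)
print(truncate_utf8(field, 2), truncate_utf8(field, 3))
```

Note that truncate_utf8 only keeps the octet string valid; it may still
separate a base character from a following combining mark, which is a
matter for the matching and canonicalization rules discussed above.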

3.4 How to tell whether to identify a script or a language 

In many applications, the application is well served as long as a 
string can be entered, stored and displayed correctly to the end user. 
In other applications, there is some degree of interaction between the 
meaning of the string and the action to be applied to it; in these 
cases information about language is critical to make a correct 
decision. 

Approaches to language identification usually fall into 3 categories: 

  . Guess the language (this requires a reasonably large chunk of text 
     for accurate determination; with closely related languages, such 
     as Norwegian and Danish, the required chunk can be in the hundreds 
     of words) 

  . Let a recipient (human) user identify the language, and apply the 
     appropriate action manually 

  . Make the application language independent, dealing with "words", 
     and let the user define (for instance by configuration or by 
     choice of words in search interfaces) what words should be 
     considered. 

Which one is appropriate depends on context. 

Typical operations where language information is needed: 

  - Dispatching on language: Trying to route an incoming query to a 
     person who can understand it. 

  - High quality display: due to the nature of the Han unification 
     performed in Unicode, some native speakers claim that one must use 
     different fonts for representing the same character codepoint in 
     Japanese and in Chinese. The same problem occurs in some languages 
     for the Cyrillic fonts. 

  - Text to speech processing 

  - Selecting an appropriate name: "Feuerwehr" versus "Fire station" 
     in a German airport; "Bruxelles" versus "Brussels" on a map of 
     Belgium 

 
Things to consider when you decide what language information you need: 

  - How much does it matter if you don't know the language? 

  - How precise do you need language to be? If you mark something as 
     "US English", will the Right Thing happen when the recipient 
     understands only "English"? If you mark it as "Nynorsk" (language 
     code "nn"), will the recipient who indicates a desire for 
     Norwegian ("no") or English ("en") see the right content? 
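
The precision question can be made concrete with a small sketch of
prefix matching on language tags, in the style of the language-range
rule described in [RFC3066]. The function name range_matches is
hypothetical, and a real protocol would have to specify its own
matching rule.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """True if the requested range equals the tag or is a prefix ending at '-'."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


# A recipient asking for "en" is served content tagged "en-US" ...
print(range_matches("en", "en-US"))
# ... but a recipient asking for "no" is not served content tagged "nn",
# because the two tags share no prefix.
print(range_matches("no", "nn"))
```

Under this rule the "US English" case works, while the Nynorsk case
does not; any relation between "nn" and "no" would have to come from
knowledge outside the tags themselves.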

Examples of things that are really script issues: 

  - Displaying <alpha> and <omega> on either side of a picture: as 
     long as the correct shapes are generated, the user does not care 
     which language they are considered to be in 

  - I have a business card with <Alef Sad> on it, and the keyboard's 
     keycaps have ASCII legends, and I don't know how to use it to 
     enter Arabic characters 

HTML 4.01 Section 8.1, "Specifying the language of content: the lang 
attribute", 

http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1 

gives a reasonable treatment of language tagging in the context of 
HTML. 

Many problem areas that turned out to have a script solution can be 
regarded as solved (at protocol level) when the carrier is defined to 
support ISO 10646. 


4. Specific choices in sorting, matching and canonicalization options 

The cardinal rule of protocol internationalization should be: 

DO NOT INVENT ANYTHING IF YOU CAN AVOID IT. 

There are a number of ready-made things available, and a number of 
pitfalls that these things have already dealt with. 
However, there is no substitute for actually understanding the tools 
you are using. 

(specifics: Unicode identifier definition, UTF-8, ACAP/IMAP comparator 
registry, IDN nameprep, UTR-15 canonicalization, case-folding...
suggestions!) 


4.1 Internationalized encodings 

When you transport I18N script across the wire, you don't actually 
transport the script itself. You are transporting the bits which 
represent the script. 

How the bits are assembled and disassembled from scripts is dependent 
on character sets and encodings. 

 
I18N is not just a simple "8-bit clean" problem. 

ISO10646 is a character set with a very large number of characters 
(94,000+ of which have defined meanings in Unicode 3.1) and thus
"8-bit" is technically not sufficient. An encoding is how you transport
an I18N script through your constrained environment. 

 
It is STRONGLY recommended that ISO10646, and ONLY that, be used as a 
reference character repertoire. 

 
When one encoding that is easy to retrofit into an ASCII/8-bit 
environment is desired, and variable length encoding is acceptable, 
UTF-8 is the preferred encoding. 

 
In other contexts, a four-octet encoding, possibly supplemented by a 
compression function, might be appropriate (UCS-4/UTF-32BE). This MUST 
ONLY be used in big-endian order. (Note that functions that involve 
encryption almost always include a compression function.) 

 
UTF-16 suffers from the endianness problem (UTF-16BE vs UTF-16LE), and 
from the likelihood of badly implemented surrogate support; UTF-16 is 
NOT RECOMMENDED. 
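
To make the differences between these encodings visible, the following
Python sketch prints the octets produced for two characters: ARABIC
LETTER SAD (U+0635) and, as an arbitrary example of a character
outside the 16-bit range, GOTHIC LETTER AHSA (U+10330). The helper
name octets is hypothetical.

```python
def octets(ch: str, codec: str) -> str:
    """Hex dump of one character in the given encoding."""
    return ch.encode(codec).hex()


SAD = "\u0635"        # ARABIC LETTER SAD
AHSA = "\U00010330"   # GOTHIC LETTER AHSA, needs surrogates in UTF-16

for name, ch in (("SAD", SAD), ("AHSA", AHSA)):
    for codec in ("utf-8", "utf-32-be", "utf-16-be", "utf-16-le"):
        print(name, codec, octets(ch, codec))
```

The two UTF-16 rows differ only in octet order, which is the
endianness problem noted above, and the second character occupies two
16-bit units, which is where surrogate support comes in.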

 
Having two encodings defined inside a single protocol is a REALLY, 
REALLY BAD IDEA. DO NOT DO THIS. 

If you allow multiple encodings for a piece of text, the encoding must 
be labelled. The MIME protocol has shown that this, while workable, is
a bad idea. Sending software will use obscure encodings that the
receiving software cannot handle. Worse yet, sending software will 
encode something with an obscure label for which there is a more common 
equivalent, but this still prevents the receiver from interpreting it. 
 
Using a single encoding avoids this problem. 

4.2 Normalization 

Normalization is needed when you want canonical forms of scripts, where 
one gets string input from multiple sources and wants to compare them
or show them to each other, e.g. in cases when you need to do matching
on "functional equality", comparison or sorting of I18N elements. If 
normalization is needed, a good starting reference would be Unicode 
UTR-15.

 
UTR-15 specifies multiple forms of normalization; this document 
recommends normalization form C when dealing with text, and form KC (or 
equivalently, restricting the character repertoire to some subset of 
that which is invariant under KC normalization) when limiting 
namespaces for identifiers. 

 
ISO/IEC 10646 contains characters that look similar or identical to 
each other. For example, U+0041 (LATIN CAPITAL LETTER A) looks just 
like U+0391 (GREEK CAPITAL LETTER ALPHA) in most fonts; there are 
literally hundreds of other examples. In some cases, characters that 
have very similar meaning but different looks can be normalized with 
minimal loss of functionality, but full normalization to prevent 
visually-similar characters is not feasible without losing character 
meaning and thus possibly confusing typical users. 

 
Note that normalization is not enough to convert the matching problem 
into a binary comparison problem; see section 3.3.

 
Do remember that normalization is a one-way function which will not 
preserve the original form. 
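
The points in this section can be tried out with Python's standard
unicodedata module, which offers the UTR-15 forms under the names
"NFC" and "NFKC". This is a small sketch, and the characters chosen
are only examples.

```python
import unicodedata

composed = "\u00c5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"   # A followed by COMBINING RING ABOVE

# Unequal as raw strings, equal after normalization form C.
same_raw = composed == decomposed
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

# Form KC also folds compatibility characters; this cannot be undone.
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

# Look-alikes from different scripts stay distinct under every form.
latin_a, greek_alpha = "\u0041", "\u0391"
still_distinct = (unicodedata.normalize("NFKC", latin_a)
                  != unicodedata.normalize("NFKC", greek_alpha))

print(same_raw, same_nfc, nfkc_ligature, still_distinct)
```

Once the ligature has been replaced by the two letters "fi", nothing in
the result says that the input was a ligature, which is the one-way
property mentioned above.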

4.3 Choosing limited character sets for "names" 

In quite a few cases, there is a need to support a limited character 
set for something like "names", where more characters than ASCII are 
needed, but large swaths of special things (spaces, punctuation and so 
on) must be excluded. 

There is a lot of work already done on this; in particular, the Unicode 
"identifier" definition [REFHERE] and the limited range of characters 
used in the IDN domain name definition [REFHERE] are candidates. 
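
As a rough illustration of what such a limited repertoire does,
Python's str.isidentifier() can serve as a stand-in. It implements
Python's own identifier rule, which is based on Unicode identifier
properties but is not identical to either candidate named above, so
treat the results as indicative only.

```python
candidates = [
    "Gr\u00fc\u00dfe",        # Latin letters beyond ASCII
    "\u0635\u0627\u062f",     # Arabic letters
    "two words",              # contains a space
    "semi;colon",             # contains punctuation
    "9lives",                 # starts with a digit
]

for name in candidates:
    # True means the string is acceptable as a Python identifier.
    print(repr(name), name.isidentifier())
```

Letters from any script pass, while spaces and punctuation do not,
which is the kind of behaviour usually wanted for "names".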


4.4 Comparator functions 

As alluded to above, deciding how to compare two strings is a hard 
task. 

What's more, the number of ways in which people want to compare strings 
is growing, not shrinking. 

This means that within a protocol that is intended to serve many 
purposes, you may need a means to name ways of comparing strings. This 
need has been seen before, and attempts to fill it include: 

  . The ACAP/IMAP comparator registry [REFHERE] 

  . The ISO Cultural Conventions registry [REFHERE] 

The target of the latter is far wider than the issue of string 
comparators, but it also includes this. 
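
A named-comparator mechanism might be sketched as a table from names
to key functions. The two names below follow the style of the ACAP
registry as this editor understands it; the table COMPARATORS and the
function equal are hypothetical and are not taken from any registry
document.

```python
def ascii_casemap(s: str) -> str:
    """Map ASCII letters a-z to uppercase; leave everything else alone."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


# Comparator name -> function producing the key that is compared.
COMPARATORS = {
    "i;octet": lambda s: s.encode("utf-8"),
    "i;ascii-casemap": ascii_casemap,
}


def equal(comparator: str, a: str, b: str) -> bool:
    """Compare two strings under the named comparator."""
    key = COMPARATORS[comparator]
    return key(a) == key(b)


print(equal("i;octet", "Bread", "bread"))
print(equal("i;ascii-casemap", "Bread", "bread"))
print(equal("i;ascii-casemap", "\u00c5", "\u00e5"))
```

The last call shows why more comparators keep being requested: an
ASCII-only case mapping says nothing about letters outside ASCII.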


5. Tricks to shoehorn stuff into older protocols 

Very rarely is the protocol redesigner given a clear slate, upon which 
he can deploy properly targeted internationalization. 

Most of his effort must be spent in figuring out how to create 
modifications to the protocol that allow the protocol to offer the 
features requested by the international user community, while still not 
causing undue disruption for users who use older versions of the 
protocol. 

5.1 Redefining "text" as UTF-8 

Most protocols with text in them defined without thought for 
internationalization have one of three definitions of text: 

  . ASCII 

  . Latin1 (ISO 8859-1); this is common for protocols developed in 
     Western Europe 

  . Unspecified octets said to carry text 

 
The last category may in practice be like the first, because nothing 
but ASCII was ever used, or the first may be like the last, because 
people were quietly ignoring the "ASCII only" requirement. 

In theory, one can shoehorn internationalized text into the first case
by defining that any non-ASCII byte be considered part of a UTF-8
character (an extension), and into the last case by defining that only
UTF-8 is legal to carry (a restriction). In practice, the issue is
fuzzier.

  . What will be the reaction of old implementations on seeing 
     extended characters? Ignore, barf or crash? 

  . To what degree will old implementations send non-ASCII, non-UTF-8 
     data to new implementations? 
     What will happen when they do? 

In protocols that do version negotiation, there is a theoretical answer 
that says that you "just" move to a new version of the protocol, and 
negotiation will take care of it. However, this is not trivial: 

  . When version upgrade has never been done before, the negotiation 
     machinery is untested. 

  . When version upgrade has been common, implementations may choose 
     to ignore a "minor" version number difference. 

  . When the strings involved are identifiers, communication between 
     old and new versions is troublesome: what should one do when an 
     identifier cannot be represented in the old version of the 
     protocol, yet needs to be referred to? 

  . When protocol violations, such as putting Latin-1 in an ASCII-only 
     field, have been common in an old version, how should the new 
     version behave when faced with such violations? 

The problems are endemic to any protocol with versions, but are often 
brought to the fore by internationalization. 
This has tempted many to go the route of "just" declaring a different 
interpretation of strings, without changing the protocol version number 
or doing option negotiation to enable the feature. 

The case of Latin-1 (or, equivalently, Shift-JIS) is especially 
troublesome, because there are byte sequences that can be interpreted 
either as UTF-8 or as Latin-1. This means that even implementations 
ready to tackle both encodings can be "fooled" into displaying 
incorrect text to their users. This is worrying. 
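
The ambiguity is easy to reproduce. In the Python sketch below, the
same two octets are valid both as UTF-8 and as Latin-1, and a
"try UTF-8 first" heuristic (the hypothetical function guess_decode)
cannot tell which one the sender meant.

```python
octets = b"\xc3\xa9"

as_utf8 = octets.decode("utf-8")       # one character: e with acute accent
as_latin1 = octets.decode("latin-1")   # two characters: A with tilde, copyright sign


def guess_decode(data: bytes) -> str:
    """Try UTF-8 first, fall back to Latin-1 (a heuristic, not a guarantee)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


print(len(as_utf8), len(as_latin1))
print(guess_decode(b"caf\xe9"))   # not valid UTF-8, so read as Latin-1
print(guess_decode(octets))       # valid UTF-8, even if Latin-1 was meant
```

The heuristic works for most Latin-1 text, because most Latin-1 octet
sequences are not valid UTF-8, but it silently picks the wrong reading
for sequences like the one above.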

In protocols with "feature negotiation", such as SMTP or LDAP, the 
problem of versioning grows more complex: Any extension must be 
considered for its interaction with any other extension. Does the 
"character set" option interact with the "regexp search" option? With 
the "return results later" option? With the "foobar" option? 

The effort of evaluating, and implementing, an option can quickly 
turn into a function of the square of the number of options. 
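
The growth is simple arithmetic: with n options there are n(n-1)/2
pairs whose interaction may need to be considered. A short Python
sketch, using the option names from the text above and a hypothetical
helper interaction_pairs:

```python
from itertools import combinations


def interaction_pairs(options):
    """All unordered pairs of options whose interaction needs review."""
    return list(combinations(options, 2))


options = ["character set", "regexp search", "return results later", "foobar"]
pairs = interaction_pairs(options)
print(len(pairs), "pairs for", len(options), "options")
```

This counts only pairwise interactions; interactions among three or
more options would add further cases.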


5.2 Example retrofits 

More examples of protocol internationalization can be found in [I18N-
CASES]. 

5.2.1 MIME 

The Multipurpose Internet Mail Extensions are probably the most widely 
deployed set of retrofits of internationalization in a preexisting 
protocol. 

It delivered: 

  . The ability to have multiple character sets in mail bodies 

  . The ability to have multiple character sets in parts of mail 
     headers 

It failed to deliver: 

  . A unique encoding of a text to a transferred string; the sender 
     can make multiple encodings from the same message body. This has 
     implications for attempts to use digital signatures, among other 
     things. 

  . A language tagging ability for mail header components. A later 
     attempt to add this has failed to see visible deployment. 

In some areas, it seems that MIME has delivered "labeled
non-interoperability", giving senders the ability to specify what they
send, but not providing a means to fit that to a generally accepted
subset, or to limit the sending to what can usefully be understood by
the recipient. But it has been very widely deployed, and has improved 
interoperability among internationalized mail software enormously. 

A more thorough analysis is given in [I18N-CASES]. 
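
The lack of a unique encoding can be seen in a few lines of Python.
The sketch below uses header-style encoded words instead of message
bodies, purely because they are short; the Q-style encoder here is a
simplified illustration, and decode_word is a hypothetical helper
built on the standard email.header module.

```python
import base64
from email.header import decode_header

text = "Gr\u00fc\u00dfe"
raw = text.encode("utf-8")

# Two encoded-word forms of the same text: base64 ("b") and a simple "q" form.
b_word = "=?utf-8?b?%s?=" % base64.b64encode(raw).decode("ascii")
q_word = "=?utf-8?q?%s?=" % "".join(
    chr(o) if o < 128 and chr(o).isalnum() else "=%02X" % o for o in raw
)


def decode_word(word: str) -> str:
    """Decode a single encoded word back to text."""
    data, charset = decode_header(word)[0]
    return data.decode(charset)


print(b_word)
print(q_word)
print(decode_word(b_word) == decode_word(q_word))
```

Both strings decode to the same text, yet they differ octet for octet,
so a signature computed over one transferred form does not verify
against the other.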

5.2.2 SNMP version 3/SMI version 2 

In the original Simple Network Management Protocol, a lot of fields 
were labeled "text". 

In the US context, these were naturally considered ASCII; in other 
contexts, usually ASCII was used, but on occasion, other charsets such 
as Latin-1 or iso-2022-jp could be found. 

In the course of moving SMI version 2 to Draft and Standard, two 
considerations were added: 

  . A DISPLAY-HINT called "t" was added, indicating that the expected 
     display format of the variable was as a UTF-8 string. 

 
  . An understanding that putting text that was neither ASCII nor
     UTF-8 into a text variable was not consistent with the protocol.
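
A management application that wants to act on this understanding can
test the octets of a text variable before displaying them. The
following is a minimal sketch in Python; the function name
classify_text and its three return values are hypothetical, not part
of SNMP or of any MIB.

```python
def classify_text(octets: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'other' for a text variable's octets.

    Hypothetical helper for illustration only.
    """
    try:
        octets.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        octets.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


plain = classify_text(b"router-1")
utf8 = classify_text("bl\u00e5b\u00e6r".encode("utf-8"))
latin1 = classify_text("bl\u00e5b\u00e6r".encode("iso-8859-1"))
# iso-2022-jp uses only 7-bit octets plus escape sequences.
jp = classify_text("\u65e5\u672c\u8a9e".encode("iso2022_jp"))
```

Note that such a test catches Latin-1 but not iso-2022-jp: the latter
consists entirely of 7-bit octets and therefore passes as ASCII.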

In the course of updating older MIBs, there was extensive discussion 
about whether to add new variables with display-hint UTF-8 or to 
redefine variables that had previously been understood as "text, any 
charset" or "ascii" to be UTF-8. 

<<here I need the MIB folks to tell me what was actually done!!!!!>> 

<<examples wanted!>> 

STILL BRAIN DUMP: 

 
Beware of third answers to what have previously been binary questions
(history: the NIS yes/no hostname answer did a really rotten job on
TEMPFAIL).

Undisplayable characters - hieroglyphs at the user interface. 

Both in names and other contexts - names are worse. 

The copy/paste problem - including where the paste buffer is in the 
brain of the user. 

"there are things better done in protocol/servers, and things better 
done in UI/client software/user brain, and the harder problem is 
realizing which category they belong to". 


6. Security Considerations 

The security implications of improperly done internationalizations can 
be considerable. 

For instance: 

. If one does not specify whether input lengths are counted in 
  characters or octets, buffer overflows are likely. 

. If multiple representations of the same character are allowed, 
  multiple items can appear to the user to have the same name, even 
  though they are distinct. This can be used as an attack. 
  (Note that this is hard to avoid - see section 4.2 for more on this.) 

. Signature failures (erroneous success or erroneous failure) due to 
  improper canonicalization are a security problem, too; a server 
  canonicalizing a name before comparing will never be able to match on 
  a certificate containing an uncanonicalized name, for instance. 


. Code being forced down "interesting" code paths because a string is 
  used in normalized form in part of the code and unnormalized 
  elsewhere. (example: the overlong UTF-8 code sequence, where one 
  encodes leading zeroes so that (for instance) a carriage return can 
  be slipped past the code that checks that a command line is just one 
  line. This was the reason for outlawing overlong UTF-8 sequences in 
  the Unicode Standard, version 3.1, section D.36) 
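
Three of the points above can be made concrete with a short sketch in
Python. The sample strings and all function names are assumptions
made for this illustration; in particular, lenient_decode is a
deliberately careless decoder written to show the problem, not code
taken from any real implementation.

```python
import unicodedata

# 1. Characters versus octets: one string, two different "lengths".
name = "F\u00e4ltstr\u00f6m"
char_count = len(name)                    # counted in code points
octet_count = len(name.encode("utf-8"))   # counted in octets

# 2. Multiple representations of what the user sees as one character.
precomposed = "\u00e4"    # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"    # "a" followed by COMBINING DIAERESIS
equal_raw = precomposed == decomposed
equal_nfc = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

# 3. Overlong UTF-8: 0xC0 0x8D spells U+000D (carriage return) in two
#    octets instead of one.
command = b"ls\xc0\x8drm"


def has_raw_line_break(octets: bytes) -> bool:
    """A one-line check applied to the octets, before decoding."""
    return b"\r" in octets or b"\n" in octets


def lenient_decode(octets: bytes) -> str:
    """Hypothetical decoder that accepts overlong 2-octet forms.

    Handles only 1- and 2-octet sequences; enough for the example.
    """
    out = []
    i = 0
    while i < len(octets):
        b = octets[i]
        if b & 0xE0 == 0xC0 and i + 1 < len(octets):
            out.append(chr(((b & 0x1F) << 6) | (octets[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)


def strict_decode_fails(octets: bytes) -> bool:
    """True if Python's own UTF-8 decoder rejects the octets."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

Here the name is 9 characters but 11 octets long; the two forms of
the letter compare unequal until both are normalized; and the command
passes the octet-level line check while the lenient decoder turns it
into two lines. A decoder that rejects overlong sequences, as
Python's does, closes the last hole.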


7. Acknowledgements 

This document has benefited from many rounds of review and comments in 
various fora of the IETF and the Internet working groups. 

Any list of contributors is bound to be incomplete; please regard the 
following as only a selection from the group of people who have 
contributed to make this document what it is today. 

In alphabetical order: 

Martin Duerst (apologies for the lack of internationalization) 
Patrik Faltstrom (aftloi) 
Paul Hoffman
John Klensin
James Seng (aftloi) 


8. Author's Address 

Harald Tveit Alvestrand 
Cisco Systems 
Weidemanns vei 27 
7043 Trondheim 
NORWAY 

EMail: Harald@Alvestrand.no 
Phone: +47 73 50 33 52 


9. References 

 
[ISO 639] 

     ISO 639:1988 (E/F) - Code for the representation of names of 
     languages - The International Organization for Standardization, 
     1st edition, 1988-04-01 Prepared by ISO/TC 37 - Terminology 
     (principles and coordination). 

     Note that a new version (ISO 639-1:2000) is in preparation at the 
     time of this writing. 
 
[ISO 639-2] 

     ISO 639-2:1998 - Codes for the representation of names of 
     languages -- Part 2: Alpha-3 code  - edition 1, 1998-11-01, 66 
     pages, prepared by a Joint Working Group of ISO TC46/SC4 and ISO 
     TC37/SC2. 

      
[ISO 3166] 

     ISO 3166:1988 (E/F) - Codes for the representation of names of 
     countries - The International Organization for Standardization, 
     3rd edition, 1988-08-15. 

[RFC 1521] 

     Borenstein, N., and N. Freed, "MIME Part One: Mechanisms for 
     Specifying and Describing the Format of Internet Message Bodies", 
     RFC 1521, Bellcore, Innosoft, September 1993. 

[RFC 2026] 

     The Internet Standards Process -- Revision 3. S. Bradner. October 
     1996. 

[RFC 2028] 

     The Organizations Involved in the IETF Standards Process. R. 
     Hovey, S. Bradner. October 1996. 

[RFC 2119] 

     Key words for use in RFCs to Indicate Requirement Levels. S. 
     Bradner. March 1997. 

[RFC 2234] 

     Augmented BNF for Syntax Specifications: ABNF. D. Crocker, Ed., P. 
     Overell. November 1997. 

[RFC 2616] 

     Hypertext Transfer Protocol -- HTTP/1.1. R. Fielding, J. Gettys,  
     J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee. June 
     1999. 

[RFC 2860] 

     Memorandum of Understanding Concerning the Technical Work of the 
     Internet Assigned Numbers Authority. B. Carpenter, F. Baker, M. 
     Roberts. June 2000. 

 
[IRI-BIDI] 

     Internet Identifiers and Bidirectionality. Martin Duerst. Work In 
     Progress 
     (draft-duerst-iri-bidi-00.txt) 

