Internet Draft                                                M. Duerst
Expires in six months                              University of Zurich
                                                               July 1997


             Normalization of Internationalized Identifiers


Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months.
Internet-Drafts may be updated, replaced, or obsoleted by other
documents at any time. It is not appropriate to use Internet-Drafts as
reference material or to cite them other than as a "working draft" or
"work in progress".

To learn the current status of any Internet-Draft, please check the
1id-abstracts.txt listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net
(Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

Distribution of this document is unlimited. Please send comments to
the author at the address given at the end of this document, or to the
uri mailing list at uri@bunyip.com.

This document is currently a very early draft, intended to stimulate
discussion only. It is intended to become part of a suite of documents
related to the internationalization of URLs.

Abstract

The Universal Character Set (UCS) makes it possible to extend the
repertoire of characters used in non-local identifiers beyond US-
ASCII. The UCS contains a large overall number of characters, many
codepoints for backwards compatibility, and various mechanisms to cope
with the features of the writing systems of the world. All this
together can lead to ambiguities in representation. Such ambiguities
are not a problem when representing running text; existing standards
have therefore only defined equivalences. For use in identifiers,
which are compared using their binary representation, this is not
sufficient. This document defines a normalization algorithm and gives
usage guidelines to avoid such ambiguities.

Table of contents

1. Introduction
   1.1 Motivation
   1.2 List of Potential Ambiguities
   1.3 Categories
       1.3.1 Category Overview
       1.3.2 Category List
   1.4 Applicability and Conformance
   1.5 Notation
2. Normalization Rules
   2.1 Normalization of Combining Sequences
   2.2 Hangul Jamo Normalization
   2.3 Arabic Ligature and Presentation Form Normalization
3. Forbidden Characters and Character Combinations
4. Dangerous Characters and Character Combinations
5. Discouraged Characters and Character Combinations
   5.1 Similar Letters in Different Alphabets
6. No Normalization nor Restriction
   6.1 Case Folding
Acknowledgements
Bibliography
Author's Address
1. Introduction

1.1 Motivation

For the identification of resources in networks, many kinds of
identifiers are in use. Locally, identifiers can contain characters
from all kinds of languages and scripts, but as long as different
encodings for the same characters exist, such characters cannot be
used in identifiers across a wider network. Therefore, network
identifiers had to be limited to a very restricted character
repertoire, usually a subset of US-ASCII.

With the definition of the Universal Character Set (UCS) [ISO10646]
[Unicode2], it becomes possible to extend the character repertoire of
such identifiers. In some cases, this has already been done, for
example in Java and for URNs [URN-Syntax]; other cases are under
study. While identifiers for resources of full worldwide interest
should continue to be limited to a very restricted set of widely known
characters, names for resources mainly used in a language-local or
script-local context may provide significant additional user
convenience if they can make use of a wider character repertoire.

The UCS contains a large overall number of characters, many codepoints
for backwards compatibility, and various mechanisms to allow it to
cope with the features of the writing systems of the world. All of
these lead to ambiguities that in some cases can be resolved by
careful display, printing, and examination by the reader, but in other
cases are intended to be unnoticeable by the reader. Such ambiguities
can be dealt with in systems processing running text by using various
kinds of equivalences and normalizations, which may differ by
implementation.

However, identifier processing software usually compares the binary
representation of identifiers to establish that two identifiers are
identical. In some cases, some additional processing may be done to
account for the specifics of identifier syntax variation. To upgrade
all such software to take into account the equivalences and
ambiguities in the UCS would be extremely tedious. For some classes of
identifiers, it is impossible because their binary representation is
transparent in the sense that it may allow legacy character encodings
besides a character encoding based on the UCS to be used and/or it may
allow arbitrary binary data to be contained in identifiers.

In order to facilitate the use of identifiers containing characters
from the UCS, this document therefore intends to develop clear
specifications for a normalization algorithm removing basic
ambiguities, and guidelines for the use of characters with potential
ambiguity.

A key design goal of the algorithm was and is that for most
identifiers in current use, applying the algorithm results in the
identity transform (i.e. the identifier is already normalized). This
allows existing identifiers to remain in use, and internationalized
identifiers to be introduced in new settings, even before all the
details of the normalization algorithm have been agreed upon.

Other goals when designing the algorithms and rules have been as
follows:

- Avoid bad surprises for users who cannot understand why two
  identifiers that look exactly the same do not match. The user in
  this case is an average user without any specific knowledge of
  character encoding, but with a basic dose of "computer literacy"
  (e.g. knowing that 0 and O are distinct keys on a keyboard).

- Restrict normalization to cases where it is really necessary; cover
  remaining ambiguities by guidelines.

- Define normalization so that it can be implemented using widely
  accessible documentation.

- Take measures for best possible compatibility with future additions
  to the UCS.
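As a concrete illustration of the ambiguity described above, the
following sketch (Python, purely illustrative and not part of this
specification) shows two encodings of the same visible string that a
binary comparison treats as different; the sample string is
hypothetical.

   # Two representations of the visible string "resume" with acute
   # accents: one uses the precomposed character U+00E9, the other
   # the base letter U+0065 followed by COMBINING ACUTE ACCENT U+0301.
   precomposed = "r\u00e9sum\u00e9"
   decomposed = "re\u0301sume\u0301"

   # Both render identically, but software that compares binary
   # (codepoint) representations sees two different identifiers.
   print(precomposed == decomposed)    # False
   print(precomposed.encode("utf-8"))  # b'r\xc3\xa9sum\xc3\xa9'
   print(decomposed.encode("utf-8"))   # b're\xcc\x81sume\xcc\x81'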
There are some issues this document does not currently address, in
particular bidirectionality. It is not yet clear whether this will be
included in this document or treated separately.

1.2 List of Potential Ambiguities

To give an idea of the extent of the problem, this section lists
potential character ambiguities, roughly ordered so that those cases
that are more difficult to distinguish come first. The difficulty of
distinguishing certain characters or combinations may depend greatly
on context.

- Precomposed/decomposed diacritic character representation
- Hangul jamo vs. johab and jamo representation alternatives
- CJK compatibility ideographs
- Other backwards compatibility duplicated characters
- Separately coded Indic length/AI/AU marks
- Glyphs for vertical variants
- Croatian digraphs, other ligatures (Latin, Arabic,...)
- Various variant punctuation (apostrophes, middle dots, spaces,...)
- Half-width/full-width characters (Latin, Katakana and Hangul)
- Vertical variants (U+FE30...)
- Presence or absence of joiner/non-joiner
- Superscript/subscript variants (numbers and IPA)
- Small form variants (U+FE50...)
- Upper case/lower case
- Similar letters from different scripts (varying degrees)
  (e.g. "A" in Latin, Greek, and Cyrillic)
- Letterlike symbols, Roman numerals (varying degrees)
- Enclosed alphanumerics, katakana, hangul,...
- Squared katakana (units,...), squared Latin abbreviations,...
- CJK ideograph variants (varying degrees, in particular general
  simplifications, backwards-compatibility non-unifications,
  JIS 78/83 problems)
- Ignorable whitespace, hyphens,... (sorting)
- Ignorable accents,... (sorting)

1.3 Categories

1.3.1 Category Overview

This specification distinguishes various categories of ambiguous
characters or strings. For each category, it will list or describe:

- The characters and character combinations in the category
- The context, if necessary
- The nature of the ambiguity
- The necessary actions or recommendations

1.3.2 Category List

The following categories are currently under investigation:

- Normalized: Characters and character combinations in this category
  are not allowed in identifiers; they MUST be converted to a
  normalized form. Examples include characters with strong
  equivalences.

- Forbidden: Characters and character combinations in this category
  are not allowed at all in identifiers; identifiers containing them
  are illegal. Examples include characters that cause problems to
  software, such as control characters, and cases that need
  normalization but where normalization is too difficult to specify
  algorithmically.

- Dangerous: Characters and character combinations in this category
  are seriously advised against. Software would usually alert a user
  to an attempt to use such a character, but not force the user to
  remove it.

- Discouraged: Characters and character combinations in this category
  are advised against, but not as strongly as to necessitate an
  alert.
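The following sketch (Python, purely illustrative) shows one way an
implementation might represent these four categories and report them
to a user. The classification table is a hypothetical placeholder, not
normative data, and the function name "check_identifier" is an
assumption of this sketch.

   from enum import Enum

   class Category(Enum):
       NORMALIZED = "must be converted to its normalized form"
       FORBIDDEN = "identifier containing it is illegal"
       DANGEROUS = "alert the user, but do not force removal"
       DISCOURAGED = "advise against, without an alert"

   # Hypothetical classification data; a real table would be derived
   # from the rules in chapters 2 to 5 of this document.
   CLASSIFICATION = {
       "\u0000": Category.FORBIDDEN,    # control character (chapter 3)
       "\uFF21": Category.DANGEROUS,    # FULLWIDTH LATIN CAPITAL LETTER A
       "\u0410": Category.DISCOURAGED,  # CYRILLIC CAPITAL LETTER A
   }

   def check_identifier(identifier):
       """Return (character, category) pairs for every character of
       the identifier that falls into one of the four categories."""
       return [(ch, CLASSIFICATION[ch])
               for ch in identifier if ch in CLASSIFICATION]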
1.4 Applicability and Conformance

Where identifiers are used just to transmit data from one point to
another, e.g. in the case of the query component of an URL resulting
from a FORM reply, there is no need to apply the normalization rules
and guidelines defined in this document.

Identifiers containing a wide range of characters should be used with
care and only for an audience that is understood to be able to
transcribe them without problems.

1.5 Notation

Codepoints from the UCS are denoted as U+XXXX, where XXXX is their
hexadecimal representation, according to [Unicode2].

Ranges of characters are expressed as U+XXXX-U+YYYY. A block of
characters may also be identified by its first codepoint, followed by
"...". Official ISO character names are given in all upper case.

2. Normalization Rules

This chapter defines several normalization algorithms. They deal with
different kinds of phenomena, or different scripts. They are defined
so that the sequence of their application does not change the
normalization result; each algorithm has to be applied at least once.
Applying an algorithm a second time will not change the result any
more.

The algorithms are to a certain extent written in a procedural
fashion. This does not imply that an implementation has to follow each
step. The only thing that is relevant is whether an implementation
produces the same outputs on the same inputs for all possible inputs,
i.e. for arbitrary strings of arbitrary length. An implementation may
also combine the various algorithms into a single one if the result is
the same as applying each of the algorithms at least once.

2.1 Normalization of Combining Sequences

The UCS contains a general mechanism for encoding diacritic
combinations from base letters and modifying diacritics, as well as
many combinations as precomposed codepoints. The following algorithm
normalizes such combinations:

Step 1: Starting from the beginning of the identifier, find a maximal
sequence of a base character (possibly decomposable) followed by
modifying letters.

Step 2: Fully decompose the sequence found in Step 1, using all
canonical decompositions defined in [Unicode2] and all canonical
decompositions defined for future additions to the UCS.

Step 3: Sort the sequence of modifying letters found in Step 2
according to the canonical ordering algorithm of Section 3.9 of
[Unicode2].

Step 4: If the base character is a Hebrew character, go to Step 6.

Step 5: Try to recombine as much as possible of the sequence resulting
from Step 3 into a precomposed character by finding the longest
initial match with any canonical decomposition sequence defined in
[Unicode2], ignoring decomposition sequences of length 1.

Step 6: Use the result obtained so far as output and continue with
Step 1.

     NOTE -- In Step 5, the decomposition sequences in [Unicode2] have
     to be recursively expanded for each character (except for
     decomposition sequences of length 1) before application.
     Otherwise, a character such as U+1E1C, LATIN CAPITAL LETTER E
     WITH CEDILLA AND BREVE, will not be recomposed correctly.

     NOTE -- In Step 5, canonical decompositions defined for future
     additions to the UCS are explicitly not considered. This is done
     to ease forwards compatibility. It is assumed that systems
     knowing about newly defined precompositions will be able to
     decompose them correctly in Step 2, but that it would be hard to
     change identifiers on older systems using a decomposed
     representation.

     NOTE -- Maybe additions to the canonical equivalences have to be
     defined, and/or more exceptions such as Hebrew have to be added.

     NOTE -- A different definition of Step 5 may lead to shorter
     normalizations for some identifiers. The current definition was
     chosen for simplicity and implementation speed. (This may be
     subject to discussion, in particular if somebody has an
     implementation and is ready to share the code.)

     NOTE -- The above algorithm can be sped up by shortcuts, in
     particular by noting that most precomposed characters which are
     not followed by modifying letters are already normalized.

     NOTE -- The exception for precomposed letters that have a
     decomposition sequence of length 1 in Step 5 is necessary to
     avoid e.g. the letter "K" being "aggregated" to KELVIN SIGN
     (U+212A).
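The algorithm above predates, but closely resembles, the canonical
decomposition, canonical ordering, and recomposition later made
available in programming-language libraries. The following sketch
(Python, using the standard unicodedata module) is only an
approximation of Steps 1 through 6 for discussion purposes: it does
not model the Hebrew exception of Step 4, and it relies on the
library's own composition exclusions rather than the exact rules of
Step 5.

   import unicodedata

   def normalize_combining_sequences(identifier):
       """Approximate Steps 1-6: canonically decompose (Steps 1-2),
       put modifying letters into canonical order (Step 3), then
       recompose into precomposed characters where possible (Step 5).
       The Hebrew exception of Step 4 is not modeled."""
       decomposed = unicodedata.normalize("NFD", identifier)
       return unicodedata.normalize("NFC", decomposed)

   # E + COMBINING CEDILLA + COMBINING BREVE recomposes to U+1E1C,
   # LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE (see the first
   # NOTE above on recursive expansion of decomposition sequences).
   assert normalize_combining_sequences("E\u0327\u0306") == "\u1E1C"

   # A plain "K" is left alone rather than being "aggregated" to
   # KELVIN SIGN, matching the length-1 exception of Step 5.
   assert normalize_combining_sequences("K") == "K"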
2.2 Hangul Jamo Normalization

Hangul Jamo (U+1100-U+11FF) provide ample possibilities for ambiguous
notations and therefore must be carefully normalized. The following
algorithm should be used:

Step 1: A sequence of Hangul jamo is split up into syllables according
to the definition of syllable boundaries on page 3-12 of [Unicode2].
Each of these syllables is processed according to Steps 2-4.

Step 2: Fillers are inserted as necessary to form a canonical syllable
as defined on page 3-12 of [Unicode2].

Step 3: Sequences of choseong, jungseong, and jongseong (leading
consonants, vowels, and trailing consonants) are replaced by a single
choseong, jungseong, and jongseong respectively according to the
compatibility decompositions given in [Unicode2]. If this is not
possible, this is a forbidden sequence.

Step 4: The sequence is replaced by a Hangul Syllable (U+AC00-U+D7AF)
if this is possible according to the algorithm given on pp. 3-12/3 of
[Unicode2].

     NOTE -- We are not currently dealing with compatibility Jamo
     (U+3130...).
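Step 4 corresponds to the well-known arithmetic mapping from a leading
consonant, a vowel, and an optional trailing consonant to a
precomposed Hangul Syllable. The following sketch (Python, purely
illustrative) shows only that final composition step for a single
canonical syllable; the filler insertion and cluster compression of
Steps 2 and 3 are not covered, and the function name is an assumption
of this sketch.

   S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
   L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

   def compose_syllable(lead, vowel, trail=None):
       """Compose one choseong (lead), one jungseong (vowel) and an
       optional jongseong (trail) into a precomposed Hangul Syllable,
       as in Step 4."""
       l_index = ord(lead) - L_BASE
       v_index = ord(vowel) - V_BASE
       t_index = ord(trail) - T_BASE if trail else 0
       if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
               and 0 <= t_index < T_COUNT):
           raise ValueError("not a composable modern jamo sequence")
       return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
                  + t_index)

   # U+1112 HIEUH + U+1161 A + U+11AB NIEUN -> U+D55C (Korean "han")
   assert compose_syllable("\u1112", "\u1161", "\u11ab") == "\ud55c"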
2.3 Arabic Ligature and Presentation Form Normalization

It is not yet clear whether a normalization algorithm should be
defined here, or whether ligatures and presentation forms should
simply be forbidden.

3. Forbidden Characters and Character Combinations

To be completed.

4. Dangerous Characters and Character Combinations

Half-width and full-width compatibility characters (U+FF00...) can
easily be mistaken for one another and are frequently interchanged.
The version not in the compatibility section (i.e. half-width for
Latin and symbols, full-width for Katakana, Hangul, "LIGHT VERTICAL",
arrows, black square, and white circle) should be used wherever
possible. Because half-width Latin characters may be needed in certain
parts of certain identifiers anyway, keyboard settings in places where
identifiers are input should be set to produce half-width Latin
characters by default, making the input of full-width characters more
tedious.

Also, while the difference between half-width and full-width
characters is clearly visible on computers in contexts that use
fixed-pitch displays, it is not well reproduced on paper or with
high-quality printing. Identifiers should never differ by a
half-width/full-width difference only.

To be completed.

5. Discouraged Characters and Character Combinations

To be completed.

5.1 Similar Letters in Different Alphabets

Similar letters in different alphabets (e.g. Latin/Greek/Cyrillic A)
are discouraged in contexts where their assignment to a given alphabet
is or may be ambiguous. This means that mixed-alphabet identifiers are
discouraged, in particular in cases where the use of each alphabet is
not clearly marked, e.g. by separators.

In the case of single letters mixed with numbers and symbols, such as
typically appear in part numbers, it should be assumed that such
letters are Latin with first priority, and Cyrillic with second
priority. Priority could also be different for different locations.
[What is best, fixed priorities or regional ones?]

Lower-case identifiers should be preferred to upper-case identifiers
because lower-case letters are more distinct.

6. No Normalization nor Restriction

This chapter lists cases where in some circumstances normalization is
applied or may seem advisable, but which are explicitly not
normalized, for example because a consistent worldwide normalization
is not possible.

6.1 Case Folding

This document assumes that case is distinguished, and does not have to
be folded or normalized. However, for some identifiers or parts
thereof, case folding may be taking place. In the absence of any
specific knowledge about this, it is very much advisable, both for
automatic processing and for user behaviour, to copy identifiers
without changing case in any way.

On the other hand, it is advisable for identifier creators to choose
simple and consistent casing. Intermittent casing can be copied
visually, but is difficult to transmit aurally.

The decision whether or not to make some part of an identifier
case-sensitive can be taken freely when identifiers are limited to the
basic Latin alphabet. In many cases, there is a tendency to
extrapolate this to the Latin script in general. However, the Latin
script at large contains several special cases which are
language-dependent (e.g. Turkish dotted and dotless I/i) or invalidate
the one-to-one correspondence of upper case and lower case (e.g.
German sharp s). For identifiers with a repertoire extending beyond
the basic Latin alphabet, it is therefore highly advisable to strictly
distinguish case, i.e. to make identifiers case-sensitive.

Acknowledgements

I am grateful in particular to the following persons for contributing
ideas, advice, criticism and help: Mark Davis, Larry Masinter, Michael
Kung, Edward Cherlin, Alain LaBonte, Francois Yergeau, (to be
completed).

Bibliography

[ISO10646]    ISO/IEC 10646-1:1993. International standard --
              Information technology -- Universal multiple-octet coded
              character set (UCS) -- Part 1: Architecture and basic
              multilingual plane.

[Unicode2]    The Unicode Standard, Version 2.0, Addison-Wesley,
              Reading, MA, 1996.

[URN-Syntax]  R. Moats, "URN Syntax", RFC 2141, May 1997.

Author's Address

Martin J. Duerst
Multimedia-Laboratory
Department of Computer Science
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich
Switzerland

Tel: +41 1 257 43 16
Fax: +41 1 363 00 35
E-mail: mduerst@ifi.unizh.ch

     NOTE -- Please write the author's name with u-Umlaut wherever
     possible, e.g. in HTML as D&uuml;rst.