Internet Draft                                                M. Duerst
Expires in six months                              University of Zurich
                                                               July 1997


             Normalization of Internationalized Identifiers


Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months.
Internet-Drafts may be updated, replaced, or obsoleted by other
documents at any time. It is not appropriate to use Internet-Drafts as
reference material or to cite them other than as a "working draft" or
"work in progress".

To learn the current status of any Internet-Draft, please check the
1id-abstracts.txt listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net
(Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

Distribution of this document is unlimited. Please send comments to
the author at the address given at the end of this document, or to the
uri mailing list at uri@bunyip.com.

This document is currently a very early draft, intended to stimulate
discussion only. It is intended to become part of a suite of documents
related to the internationalization of URLs.

Abstract

The Universal Character Set (UCS) makes it possible to extend the
repertoire of characters used in non-local identifiers beyond US-
ASCII. The UCS contains a large overall number of characters, many
codepoints for backwards compatibility, and various mechanisms to cope
with the features of the writing systems of the world. All this
together can lead to ambiguities in representation. Such ambiguities
are not a problem when representing running text; existing standards
have therefore only defined equivalences. For use in identifiers,
which are compared using their binary representation, this is not
sufficient. This document defines a normalization algorithm and gives
usage guidelines to avoid such ambiguities.

Table of contents

1. Introduction
   1.1 Motivation
   1.2 List of Potential Ambiguities
   1.3 Categories
       1.3.1 Category Overview
       1.3.2 Category List
   1.4 Applicability and Conformance
   1.5 Notation
2. Normalization Rules
   2.1 Normalization of Combining Sequences
   2.2 Hangul Jamo Normalization
   2.3 Arabic Ligature and Presentation Form Normalization
3. Forbidden Characters and Character Combinations
4. Dangerous Characters and Character Combinations
5. Discouraged Characters and Character Combinations
   5.1 Similar Letters in Different Alphabets
6. No Normalization nor Restriction
   6.1 Case Folding
Acknowledgements
Bibliography
Author's Address
1. Introduction

1.1 Motivation

For the identification of resources in networks, many kinds of
identifiers are in use. Locally, identifiers can contain characters
from all kinds of languages and scripts, but as long as different
encodings for the same characters exist, such characters cannot be
used in identifiers across a wider network. Therefore, network
identifiers had to be limited to a very restricted character
repertoire, usually a subset of US-ASCII.

With the definition of the Universal Character Set (UCS) [ISO10646]
[Unicode2], it becomes possible to extend the character repertoire of
such identifiers. In some cases, this has already been done, for
example in Java and for URNs [URN-Syntax]; other cases are under
study. While identifiers for resources of full worldwide interest
should continue to be limited to a very restricted set of widely known
characters, names for resources mainly used in a language-local or
script-local context may provide significant additional user
convenience if they can make use of a wider character repertoire.

The UCS contains a large overall number of characters, many codepoints
for backwards compatibility, and various mechanisms to allow it to
cope with the features of the writing systems of the world. All of
these lead to ambiguities that in some cases can be resolved by
careful display, printing, and examination by the reader, but in other
cases are intended to be unnoticeable by the reader. Such ambiguities
can be dealt with in systems processing running text by using various
kinds of equivalences and normalizations, which may differ by
implementation.

However, identifier processing software usually compares the binary
representation of identifiers to establish that two identifiers are
identical. In some cases, some additional processing may be done to
account for the specifics of identifier syntax variation. To upgrade
all such software to take into account the equivalences and
ambiguities in the UCS would be extremely tedious. For some classes of
identifiers, it is impossible because their binary representation is
transparent in the sense that it may allow legacy character encodings
besides a character encoding based on the UCS to be used and/or it may
allow arbitrary binary data to be contained in identifiers.

In order to facilitate the use of identifiers containing characters
from the UCS, this document therefore intends to develop clear
specifications for a normalization algorithm removing basic
ambiguities, and guidelines for the use of characters with potential
ambiguity.

A key design goal of the algorithm was and is that for most
identifiers in current use, applying the algorithm results in the
identity transform (i.e. the identifier is already normalized). This
allows existing identifiers to remain in use, and internationalized
identifiers to be introduced in new settings, even before all the
details of the normalization algorithm have been agreed upon.

Other goals when designing the algorithms and rules have been as
follows:

- Avoid bad surprises for users who cannot understand why two
  identifiers that look exactly the same do not match. The user in
  this case is an average user without any specific knowledge of
  character encoding, but with a basic dose of "computer literacy"
  (e.g. knowing that 0 and O are distinct keys on a keyboard).

- Restrict normalization to cases where it is really necessary; cover
  remaining ambiguities by guidelines.

- Define normalization so that it can be implemented using widely
  accessible documentation.

- Take measures for best possible compatibility with future additions
  to the UCS.
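As a concrete illustration of the ambiguity described above, the
following sketch (Python, purely illustrative and not part of this
specification) shows two encodings of the same visible string that a
binary comparison treats as different; the sample string is
hypothetical.

   # Two representations of the visible string "resume" with acute
   # accents: one uses the precomposed character U+00E9, the other
   # the base letter U+0065 followed by COMBINING ACUTE ACCENT U+0301.
   precomposed = "r\u00e9sum\u00e9"
   decomposed = "re\u0301sume\u0301"

   # Both render identically, but software that compares binary
   # (codepoint) representations sees two different identifiers.
   print(precomposed == decomposed)    # False
   print(precomposed.encode("utf-8"))  # b'r\xc3\xa9sum\xc3\xa9'
   print(decomposed.encode("utf-8"))   # b're\xcc\x81sume\xcc\x81'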
There are some issues this document does not currently address, in
particular bidirectionality. It is not yet clear whether this will be
included in this document or treated separately.

1.2 List of Potential Ambiguities

To give an idea of the extent of the problem, this section lists
potential character ambiguities, roughly ordered so that those cases
that are more difficult to distinguish come first. The difficulty of
distinguishing certain characters or combinations may depend greatly
on context.

- Precomposed/decomposed diacritic character representation
- Hangul jamo vs. johab and jamo representation alternatives
- CJK compatibility ideographs
- Other backwards compatibility duplicated characters
- Separately coded Indic length/AI/AU marks
- Glyphs for vertical variants
- Croatian digraphs, other ligatures (Latin, Arabic,...)
- Various variant punctuation (apostrophes, middle dots, spaces,...)
- Half-width/full-width characters (Latin, Katakana and Hangul)
- Vertical variants (U+FE30...)
- Presence or absence of joiner/non-joiner
- Superscript/subscript variants (numbers and IPA)
- Small form variants (U+FE50...)
- Upper case/lower case
- Similar letters from different scripts (varying degrees)
  (e.g. "A" in Latin, Greek, and Cyrillic)
- Letterlike symbols, Roman numerals (varying degrees)
- Enclosed alphanumerics, katakana, hangul,...
- Squared katakana (units,...), squared Latin abbreviations,...
- CJK ideograph variants (varying degrees, in particular general
  simplifications, backwards-compatibility non-unifications,
  JIS 78/83 problems)
- Ignorable whitespace, hyphens,... (sorting)
- Ignorable accents,... (sorting)

1.3 Categories

1.3.1 Category Overview

This specification distinguishes various categories of ambiguous
characters or strings. For each category, it will list or describe:

- The characters and character combinations in the category
- The context, if necessary
- The nature of the ambiguity
- The necessary actions or recommendations

1.3.2 Category List

The following categories are currently under investigation:

- Normalized: Characters and character combinations in this category
  are not allowed in identifiers; they MUST be converted to a
  normalized form. Examples include characters with strong
  equivalences.

- Forbidden: Characters and character combinations in this category
  are not allowed at all in identifiers; identifiers containing them
  are illegal. Examples include characters that cause problems to
  software, such as control characters, and cases that need
  normalization but where normalization is too difficult to specify
  algorithmically.

- Dangerous: Characters and character combinations in this category
  are seriously advised against. Software would usually alert a user
  to an attempt to use such a character, but not force the user to
  remove it.

- Discouraged: Characters and character combinations in this category
  are advised against, but not as strongly as to necessitate an
  alert.
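The following sketch (Python, purely illustrative) shows one way an
implementation might represent these four categories and report them
to a user. The classification table is a hypothetical placeholder, not
normative data, and the function name "check_identifier" is an
assumption of this sketch.

   from enum import Enum

   class Category(Enum):
       NORMALIZED = "must be converted to its normalized form"
       FORBIDDEN = "identifier containing it is illegal"
       DANGEROUS = "alert the user, but do not force removal"
       DISCOURAGED = "advise against, without an alert"

   # Hypothetical classification data; a real table would be derived
   # from the rules in chapters 2 to 5 of this document.
   CLASSIFICATION = {
       "\u0000": Category.FORBIDDEN,    # control character (chapter 3)
       "\uFF21": Category.DANGEROUS,    # FULLWIDTH LATIN CAPITAL LETTER A
       "\u0410": Category.DISCOURAGED,  # CYRILLIC CAPITAL LETTER A
   }

   def check_identifier(identifier):
       """Return (character, category) pairs for every character of
       the identifier that falls into one of the four categories."""
       return [(ch, CLASSIFICATION[ch])
               for ch in identifier if ch in CLASSIFICATION]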
1.4 Applicability and Conformance

Where identifiers are used just to transmit data from one point to
another, e.g. in the case of the query component of an URL resulting
from a FORM reply, there is no need to apply the normalization rules
and guidelines defined in this document.

Identifiers containing a wide range of characters should be used with
care and only for an audience that is understood to be able to
transcribe them without problems.

1.5 Notation

Codepoints from the UCS are denoted as U+XXXX, where XXXX is their
hexadecimal representation, according to [Unicode2].

Ranges of characters are expressed as U+XXXX-U+YYYY. A block of
characters may also be identified by its first codepoint, followed by
"...". Official ISO character names are given in all upper case.

2. Normalization Rules

This chapter defines several normalization algorithms. They deal with
different kinds of phenomena, or different scripts. They are defined
so that the sequence of their application does not change the
normalization result; each algorithm has to be applied at least once.
Applying an algorithm a second time will not change the result any
more.

The algorithms are to a certain extent written in a procedural
fashion. This does not imply that an implementation has to follow each
step. The only thing that is relevant is whether an implementation
produces the same outputs on the same inputs for all possible inputs,
i.e. for arbitrary strings of arbitrary length. An implementation may
also combine the various algorithms into a single one if the result is
the same as applying each of the algorithms at least once.

2.1 Normalization of Combining Sequences

The UCS contains a general mechanism for encoding diacritic
combinations from base letters and modifying diacritics, as well as
many combinations as precomposed codepoints. The following algorithm
normalizes such combinations:

Step 1: Starting from the beginning of the identifier, find a maximal
sequence of a base character (possibly decomposable) followed by
modifying letters.

Step 2: Fully decompose the sequence found in Step 1, using all
canonical decompositions defined in [Unicode2] and all canonical
decompositions defined for future additions to the UCS.

Step 3: Sort the sequence of modifying letters found in Step 2
according to the canonical ordering algorithm of Section 3.9 of
[Unicode2].

Step 4: If the base character is a Hebrew character, go to Step 6.

Step 5: Try to recombine as much as possible of the sequence resulting
from Step 3 into a precomposed character by finding the longest
initial match with any canonical decomposition sequence defined in
[Unicode2], ignoring decomposition sequences of length 1.

Step 6: Use the result obtained so far as output and continue with
Step 1.

     NOTE -- In Step 5, the decomposition sequences in [Unicode2] have
     to be recursively expanded for each character (except for
     decomposition sequences of length 1) before application.
     Otherwise, a character such as U+1E1C, LATIN CAPITAL LETTER E
     WITH CEDILLA AND BREVE, will not be recomposed correctly.

     NOTE -- In Step 5, canonical decompositions defined for future
     additions to the UCS are explicitly not considered. This is done
     to ease forwards compatibility. It is assumed that systems
     knowing about newly defined precompositions will be able to
     decompose them correctly in Step 2, but that it would be hard to
     change identifiers on older systems using a decomposed
     representation.

     NOTE -- Maybe additions to the canonical equivalences have to be
     defined, and/or more exceptions such as Hebrew have to be added.

     NOTE -- A different definition of Step 5 may lead to shorter
     normalizations for some identifiers. The current definition was
     chosen for simplicity and implementation speed. (This may be
     subject to discussion, in particular if somebody has an
     implementation and is ready to share the code.)

     NOTE -- The above algorithm can be sped up by shortcuts, in
     particular by noting that most precomposed characters which are
     not followed by modifying letters are already normalized.

     NOTE -- The exception for precomposed letters that have a
     decomposition sequence of length 1 in Step 5 is necessary to
     avoid e.g. the letter "K" being "aggregated" to KELVIN SIGN
     (U+212A).
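The algorithm above predates, but closely resembles, the canonical
decomposition, canonical ordering, and recomposition later made
available in programming-language libraries. The following sketch
(Python, using the standard unicodedata module) is only an
approximation of Steps 1 through 6 for discussion purposes: it does
not model the Hebrew exception of Step 4, and it relies on the
library's own composition exclusions rather than the exact rules of
Step 5.

   import unicodedata

   def normalize_combining_sequences(identifier):
       """Approximate Steps 1-6: canonically decompose (Steps 1-2),
       put modifying letters into canonical order (Step 3), then
       recompose into precomposed characters where possible (Step 5).
       The Hebrew exception of Step 4 is not modeled."""
       decomposed = unicodedata.normalize("NFD", identifier)
       return unicodedata.normalize("NFC", decomposed)

   # E + COMBINING CEDILLA + COMBINING BREVE recomposes to U+1E1C,
   # LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE (see the first
   # NOTE above on recursive expansion of decomposition sequences).
   assert normalize_combining_sequences("E\u0327\u0306") == "\u1E1C"

   # A plain "K" is left alone rather than being "aggregated" to
   # KELVIN SIGN, matching the length-1 exception of Step 5.
   assert normalize_combining_sequences("K") == "K"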
2.2 Hangul Jamo Normalization

Hangul Jamo (U+1100-U+11FF) provide ample possibilities for ambiguous
notations and therefore must be carefully normalized. The following
algorithm should be used:

Step 1: A sequence of Hangul jamo is split up into syllables according
to the definition of syllable boundaries on page 3-12 of [Unicode2].
Each of these syllables is processed according to Steps 2-4.

Step 2: Fillers are inserted as necessary to form a canonical syllable
as defined on page 3-12 of [Unicode2].

Step 3: Sequences of choseong, jungseong, and jongseong (leading
consonants, vowels, and trailing consonants) are replaced by a single
choseong, jungseong, and jongseong respectively according to the
compatibility decompositions given in [Unicode2]. If this is not
possible, this is a forbidden sequence.

Step 4: The sequence is replaced by a Hangul Syllable (U+AC00-U+D7AF)
if this is possible according to the algorithm given on pp. 3-12/3 of
[Unicode2].

     NOTE -- We are not currently dealing with compatibility Jamo
     (U+3130...).
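Step 4 corresponds to the well-known arithmetic mapping from a leading
consonant, a vowel, and an optional trailing consonant to a
precomposed Hangul Syllable. The following sketch (Python, purely
illustrative) shows only that final composition step for a single
canonical syllable; the filler insertion and cluster compression of
Steps 2 and 3 are not covered, and the function name is an assumption
of this sketch.

   S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
   L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

   def compose_syllable(lead, vowel, trail=None):
       """Compose one choseong (lead), one jungseong (vowel) and an
       optional jongseong (trail) into a precomposed Hangul Syllable,
       as in Step 4."""
       l_index = ord(lead) - L_BASE
       v_index = ord(vowel) - V_BASE
       t_index = ord(trail) - T_BASE if trail else 0
       if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
               and 0 <= t_index < T_COUNT):
           raise ValueError("not a composable modern jamo sequence")
       return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
                  + t_index)

   # U+1112 HIEUH + U+1161 A + U+11AB NIEUN -> U+D55C (Korean "han")
   assert compose_syllable("\u1112", "\u1161", "\u11ab") == "\ud55c"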
2.3 Arabic Ligature and Presentation Form Normalization

It is not yet clear whether a normalization algorithm should be
defined here, or whether ligatures and presentation forms should
simply be forbidden.

3. Forbidden Characters and Character Combinations

To be completed.

4. Dangerous Characters and Character Combinations

Half-width and full-width compatibility characters (U+FF00...) can
easily be mistaken for one another and are frequently interchanged.
The version not in the compatibility section (i.e. half-width for
Latin and symbols, full-width for Katakana, Hangul, "LIGHT VERTICAL",
arrows, black square, and white circle) should be used wherever
possible. Because half-width Latin characters may be needed in certain
parts of certain identifiers anyway, keyboard settings in places where
identifiers are input should be set to produce half-width Latin
characters by default, making the input of full-width characters more
tedious.

Also, while the difference between half-width and full-width
characters is clearly visible on computers in contexts that use
fixed-pitch displays, it is not well reproduced on paper or with
high-quality printing. Identifiers should never differ by a
half-width/full-width difference only.

To be completed.

5. Discouraged Characters and Character Combinations

To be completed.

5.1 Similar Letters in Different Alphabets

Similar letters in different alphabets (e.g. Latin/Greek/Cyrillic A)
are discouraged in contexts where their assignment to a given alphabet
is or may be ambiguous. This means that mixed-alphabet identifiers are
discouraged, in particular in cases where the use of each alphabet is
not clearly marked, e.g. by separators.

In the case of single letters mixed with numbers and symbols, such as
typically appear in part numbers, it should be assumed that such
letters are Latin with first priority, and Cyrillic with second
priority. Priority could also be different for different locations.
[What is best, fixed priorities or regional ones?]

Lower-case identifiers should be preferred to upper-case identifiers
because lower-case letters are more distinct.

6. No Normalization nor Restriction

This chapter lists cases where in some circumstances normalization is
applied or may seem advisable, but which are explicitly not
normalized, for example because a consistent worldwide normalization
is not possible.

6.1 Case Folding

This document assumes that case is distinguished, and does not have to
be folded or normalized. However, for some identifiers or parts
thereof, case folding may be taking place. In the absence of any
specific knowledge about this, it is very much advisable, both for
automatic processing and for user behaviour, to copy identifiers
without changing case in any way.

On the other hand, it is advisable for identifier creators to choose
simple and consistent casing. Intermittent casing can be copied
visually, but is difficult to transmit aurally.

The decision whether or not to make some part of an identifier
case-sensitive can be taken freely when identifiers are limited to the
basic Latin alphabet. In many cases, there is a tendency to
extrapolate this to the Latin script in general. However, the Latin
script at large contains several special cases which are
language-dependent (e.g. Turkish dotted and dotless I/i) or invalidate
the one-to-one correspondence of upper case and lower case (e.g.
German sharp s). For identifiers with a repertoire extending beyond
the basic Latin alphabet, it is therefore highly advisable to strictly
distinguish case, i.e. to make identifiers case-sensitive.

Acknowledgements

I am grateful in particular to the following persons for contributing
ideas, advice, criticism and help: Mark Davis, Larry Masinter, Michael
Kung, Edward Cherlin, Alain LaBonte, Francois Yergeau, (to be
completed).

Bibliography

[ISO10646]    ISO/IEC 10646-1:1993. International standard --
              Information technology -- Universal multiple-octet coded
              character set (UCS) -- Part 1: Architecture and basic
              multilingual plane.

[Unicode2]    The Unicode Standard, Version 2.0, Addison-Wesley,
              Reading, MA, 1996.

[URN-Syntax]  R. Moats, "URN Syntax", RFC 2141, May 1997.

Author's Address

Martin J. Duerst
Multimedia-Laboratory
Department of Computer Science
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich
Switzerland

Tel: +41 1 257 43 16
Fax: +41 1 363 00 35
E-mail: mduerst@ifi.unizh.ch

     NOTE -- Please write the author's name with u-Umlaut wherever
     possible, e.g. in HTML as D&uuml;rst.