Network Working Group                                         M. Crispin
Internet-Draft                                  University of Washington
Intended status: Proposed Standard                        August 2, 2007
Expires: February 2, 2008
Document: internet-drafts/draft-crispin-collation-unicasemap-05.txt

              i;unicode-casemap - Simple Unicode Collation Algorithm

Status of this Memo

     By submitting this Internet-Draft, each author represents that
     any applicable patent or other IPR claims of which he or she is
     aware have been or will be disclosed, and any of which he or she
     becomes aware will be disclosed, in accordance with Section 6 of
     BCP 79.

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as
     Internet-Drafts.

     Internet-Drafts are draft documents valid for a maximum of six months
     and may be updated, replaced, or obsoleted by other documents at any
     time.  It is inappropriate to use Internet-Drafts as reference
     material or to cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.

     A revised version of this draft document will be submitted to the RFC
     editor as a Proposed Standard for the Internet Community.  Discussion
     and suggestions for improvement are requested, and should be sent to
     ietf-imapext@IMC.ORG.

     Distribution of this memo is unlimited.


Abstract

     This document describes "i;unicode-casemap", a simple
     case-insensitive collation for Unicode strings.  It provides
     equality, substring and ordering operations.


Introduction

     The "i;ascii-casemap" collation described in [COMPARATOR] is quite
     simple to implement and provides case-independent comparisons for the
     26 Latin alphabetics.  It is specified as the default and/or baseline
     comparator in some application protocols, e.g., [IMAP-SORT].

     However, the "i;ascii-casemap" collation does not produce satisfactory
     results with non-ASCII characters.  It is possible, with a modest
     extension, to provide a more sophisticated collation with greater
     multilingual applicability than "i;ascii-casemap".  This extension
     provides case-independent comparisons for a much greater number of
     characters.  It also collates characters with diacriticals with the
     non-diacritical character forms.

     This collation, "i;unicode-casemap", is intended to be an alternative
     to, and preferred over, "i;ascii-casemap".  It does not replace the
     "i;basic" collation described in [BASIC].


1. Unicode Casemap Collation Description

     The "i;unicode-casemap" collation is a simple collation which is
     case-insensitive in its treatment of characters.  It provides
     equality, substring and ordering operations.  The validity test
     operation always returns a valid result.

     This collation allows strings in arbitrary (and mixed) character
     sets, as long as the character set for each string is identified and
     it is possible to convert the string to Unicode.

     Each input string is prepared by converting it to "titlecased
     canonicalized UTF-8" according to the following steps:
        (1) A UTF-8 form of the input string is produced.
            (a) If the input string is in UTF-8, it is checked for
                validity according to the rules in [UTF-8]; there
                must not be any overlong (or other invalid) UTF-8
                sequences.
            (b) If the input string is not in UTF-8, it is converted into
                UTF-8.
            (c) If a UTF-8 string has invalid UTF-8 sequences, or a
                non-UTF-8 string can not be converted into UTF-8, no
                further preparation is done for this string.  Step (2)
                is NOT performed on this string, and the original
                is used unchanged with the i;octet comparator.
        (2) The valid UTF-8 string from step (1) is converted, using
            UnicodeData.txt ([UNICODE-DATA]) as follows on a
            per-character basis:
            (a) If the codepoint has a titlecase property in
                UnicodeData.txt (this is normally the same as the
                uppercase property) the codepoint is converted to the
                titlecased codepoint.
            (b) If the titlecased codepoint has a decomposition property
                of any type in UnicodeData.txt, the codepoint is
                recursively converted to the decomposed codepoints
                (effectively Normalization Form KD).
                Example: codepoint U+212B (ANGSTROM SIGN) has a
                decomposition of U+00C5 (LATIN CAPITAL LETTER A WITH RING
                ABOVE), which in turn has a decomposition of U+0041 (LATIN
                CAPITAL LETTER A) U+030A (COMBINING RING ABOVE).  Neither
                U+0041 nor U+030A has any decomposition property.
                Therefore, U+212B is converted to U+0041 U+030A by this
                step.
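
     The preparation steps above can be sketched in Python as follows.
     This is a minimal illustration under stated assumptions, not a
     reference implementation.  The names prepare, simple_titlecase and
     decompose are hypothetical.  Python's unicodedata module reflects
     the Unicode version bundled with the interpreter rather than the
     current UnicodeData.txt, and it does not expose the titlecase
     field directly, so str.title() is used as an approximation that may
     differ from UnicodeData.txt for a small number of codepoints.

```python
import unicodedata


def simple_titlecase(ch: str) -> str:
    """Approximate the titlecase field of UnicodeData.txt (assumption)."""
    titled = ch.title()
    # str.title() applies full case mappings, which can expand to several
    # codepoints (e.g. U+00DF); in that case keep the codepoint unchanged.
    return titled if len(titled) == 1 else ch


def decompose(ch: str) -> str:
    """Step (2)(b): recursively expand the decomposition field."""
    field = unicodedata.decomposition(ch)
    if not field:
        return ch
    # Drop the optional <type> tag; any decomposition type is followed.
    parts = [p for p in field.split() if not p.startswith("<")]
    return "".join(decompose(chr(int(p, 16))) for p in parts)


def prepare(raw: bytes, charset: str = "utf-8") -> bytes:
    """Return the octets that would be handed to i;octet for this input."""
    try:
        # Step (1): strict decoding rejects overlong or invalid sequences.
        text = raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Step (1)(c): the original is used unchanged.
        return raw
    # Step (2): titlecase, then decompose, one codepoint at a time.
    prepared = "".join(decompose(simple_titlecase(c)) for c in text)
    return prepared.encode("utf-8")
```

     Under these assumptions, prepare() maps both U+212B and U+00E5 to
     the octets for U+0041 U+030A, and returns an input containing an
     overlong sequence such as C0 80 unchanged.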

     Following the above preparation process on each string, the equality,
     ordering and substring operations are as for i;octet.
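
     As an illustration, the three operations on already-prepared octet
     strings might look like the Python sketch below.  The function
     names and return conventions (-1/0/+1 for ordering, a match offset
     or -1 for substring) are assumptions made for this example; the
     authoritative definition of i;octet is in [COMPARATOR].

```python
def octet_equal(a: bytes, b: bytes) -> bool:
    # Equality: the prepared octet strings must be identical.
    return a == b


def octet_order(a: bytes, b: bytes) -> int:
    # Ordering: octet-by-octet comparison of unsigned values; when one
    # string is a prefix of the other, the shorter one sorts first.
    return (a > b) - (a < b)


def octet_substring(needle: bytes, haystack: bytes) -> int:
    # Substring: offset of the first match, or -1 if there is none.
    return haystack.find(needle)


# Prepared form of a string beginning with U+212B or U+00E5.
prepared = b"A\xcc\x8aNGSTROM"
starts_like_a = octet_substring(b"A", prepared) == 0
after_apple = octet_order(b"APPLE", prepared) < 0
```

     Because the comparison is done on octets, the combining mark that
     follows "A" takes part in ordering and substring matching like any
     other sequence of octets.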

     Although the defined behavior of this collation uses the [UTF-8]
     representation of the string, this is not intended to prohibit an
     implementation from using a different internal representation of
     Unicode, as long as it produces the same results that would result
     from using [UTF-8].  Note, however, that a UTF-16 internal
     representation is unsuitable for this collation because UTF-16
     surrogates cause codepoints in the upper end of the BMP to collate
     after non-BMP codepoints.
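
     The ordering difference can be seen with a short Python check.  The
     two codepoints are chosen only for illustration; big-endian UTF-16
     is used so that comparing bytes matches comparing 16-bit code
     units.

```python
bmp_high = "\ufffd"      # a codepoint in the upper end of the BMP
non_bmp = "\U00010000"   # the first codepoint beyond the BMP

# UTF-8 octets: EF BF BD versus F0 90 80 80.
utf8_order = bmp_high.encode("utf-8") < non_bmp.encode("utf-8")

# UTF-16 code units: FFFD versus the surrogate pair D800 DC00.
utf16_order = bmp_high.encode("utf-16-be") < non_bmp.encode("utf-16-be")
```

     Here utf8_order is true and utf16_order is false: the surrogate
     pair begins with D800, which is lower than FFFD, so the non-BMP
     codepoint sorts first in UTF-16 but last in UTF-8.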

     Because this collation is not locale sensitive, care should be taken
     when using OS-supplied functions to implement it.  Functions such as
     strcasecmp and toupper are sometimes locale sensitive and may
     inconsistently casemap letters.

     The i;unicode-casemap collation is well suited to use with many
     Internet protocols and computer languages.  Use with natural language
     is often inappropriate; even though the collation apparently supports
     languages such as Swahili and English, in real-world use it tends to
     mis-sort a number of types of string:

     o  people and place names containing scripts that are not collated
        according to "alphabetical order".
     o  words with characters that have diacriticals.  However,
        i;unicode-casemap generally does a better job than i;ascii-casemap
        for most (but not all) languages.  For example, German umlaut
        letters will sort correctly, but some Scandinavian letters will
        not.
     o  names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
        in English).
     o  strings containing other non-letter symbols; e.g., euro and pound
        sterling symbols, quotation marks other than '"', dashes/hyphens,
        etc.
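
     A short Python sketch illustrates two of these cases.  The key
     function approx_key is hypothetical and only approximates this
     collation: it uses str.upper() and Normalization Form KD in place
     of the per-codepoint titlecase and decomposition steps, which can
     differ for some characters.

```python
import unicodedata


def approx_key(s: str) -> bytes:
    # Assumption: upper() plus NFKD stands in for the titlecase and
    # decomposition steps of this collation.
    return unicodedata.normalize("NFKD", s.upper()).encode("utf-8")


# "\u00c5ngstr\u00f6m" begins with LATIN CAPITAL LETTER A WITH RING ABOVE.
words = ["Zebra", "Lyon", "\u00c5ngstr\u00f6m", "Lloyd", "Apple"]
ordered = sorted(words, key=approx_key)
```

     The resulting order places the word beginning with U+00C5 between
     "Apple" and "Lloyd", because U+00C5 is decomposed to U+0041 U+030A,
     whereas Swedish alphabetical order places that letter after "Z".
     "Lloyd" also sorts before "Lyon", which matches English but not
     Welsh usage.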

2. Unicode Casemap Collation Registration

     <?xml version='1.0'?>
     <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
     <collation rfc="XXXX" scope="local" intendedUse="common">
       <identifier>i;unicode-casemap</identifier>
       <title>Unicode Casemap</title>
       <operations>equality order substring</operations>
       <specification>RFC XXXX</specification>
       <owner>IETF</owner>
       <submitter>mrc@cac.washington.edu</submitter>
     </collation>

3. Security Considerations

     The security considerations for [UTF-8], [STRINGPREP] and
     [UNICODE-SECURITY] apply and are normative to this specification.


4. IANA Considerations

     The i;unicode-casemap collation defined in section 2 should be added
     to the registry of collations defined in [COMPARATOR].


5. Normative References

     The following documents are normative to this document:

     [COMPARATOR]          Newman, C., "Internet Application Protocol
                           Collation Registry", RFC 4790, February 2007.

     [STRINGPREP]          Hoffman, P. and M. Blanchet, "Preparation of
                           Internationalized Strings ("stringprep")",
                           RFC 3454, December 2002.

     [UTF-8]               Yergeau, F., "UTF-8, a transformation format
                           of ISO 10646", STD 63, RFC 3629, November 2003.

     [UNICODE-DATA]        <http://www.unicode.org/Public/UNIDATA/
                           UnicodeData.txt>

                           Although the UnicodeData.txt file referenced
                           here is part of the Unicode standard, it is
                           subject to change as new characters are added
                           to Unicode and errors are corrected in Unicode
                           revisions.  As a result, it may be less stable
                           than might otherwise be implied by the
                           standards status of this specification.

     [UNICODE-SECURITY]    Davis, M. and M. Suignard, "Unicode Security
                           Considerations", February 2006,
                           <http://www.unicode.org/reports/tr36/>.


6. Informative References

     [BASIC]               Newman, C., Duerst, M., and A. Gulbrandsen,
                           "i;basic - the Unicode Collation Algorithm",
                           draft-gulbrandsen-collation-basic, Work in
                           Progress.

     [IMAP-SORT]           Crispin, M., "Internet Message Access Protocol
                           - SORT and THREAD Extensions",
                           draft-ietf-imapext-sort, Work in Progress (in
                           RFC Editor queue).


Appendices

Author's Address

     Mark R. Crispin
     Networks and Distributed Computing
     University of Washington
     4545 15th Avenue NE
     Seattle, WA  98105-4527

     Phone: +1 (206) 543-5762

     EMail: MRC@CAC.Washington.EDU


Full Copyright Statement

     Copyright (C) The IETF Trust (2007).

     This document is subject to the rights, licenses and restrictions
     contained in BCP 78, and except as set forth therein, the authors
     retain all their rights.

     This document and the information contained herein are provided on an
     "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
     OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
     THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
     OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
     THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
     WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

     The IETF takes no position regarding the validity or scope of any
     Intellectual Property Rights or other rights that might be claimed to
     pertain to the implementation or use of the technology described in
     this document or the extent to which any license under such rights
     might or might not be available; nor does it represent that it has
     made any independent effort to identify any such rights.  Information
     on the procedures with respect to rights in RFC documents can be
     found in BCP 78 and BCP 79.

     Copies of IPR disclosures made to the IETF Secretariat and any
     assurances of licenses to be made available, or the result of an
     attempt made to obtain a general license or permission for the use of
     such proprietary rights by implementers or users of this
     specification can be obtained from the IETF on-line IPR repository at
     http://www.ietf.org/ipr.

     The IETF invites any interested party to bring to its attention any
     copyrights, patents or patent applications, or other proprietary
     rights that may cover technology that may be required to implement
     this standard.  Please address the information to the IETF at
     ietf-ipr@ietf.org.


Acknowledgement

     Funding for the RFC Editor function is currently provided by the
     Internet Society.