Network Working Group                                       Chris Newman
Request for Comments: DRAFT                             Sun Microsystems
                                                           Martin Duerst
                                                Aoyama Gakuin University
                                                        Arnt Gulbrandsen
                                                  Oryx Mail Systems GmbH
                                                           November 2006


               i;basic - the Unicode Collation Algorithm
                draft-gulbrandsen-collation-basic-00.txt


Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt. The list of
   Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


Copyright Notice

   Copyright (C) The Internet Society 2006.


Abstract

   The Unicode Collation Algorithm is a widely usable collation
   covering all of Unicode. It produces tolerable results for many
   locales as-is, and can be further improved using locale-specific
   tables.

   This document registers the UCA in the IETF's collation registry.


Newman et al.               Expires May 2007                    [Page 1]

Internet-draft                                             November 2006


Table of Contents

   1.  Conventions Used in This Document
   2.  i;basic: The Unicode Collation Algorithm
   3.  Registration
   4.  Security Considerations
   5.  IANA Considerations
   6.  Acknowledgements
   7.  References
       7.1.  Normative References
   8.  Authors' Addresses


1.  Conventions Used in This Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [KEYWORDS].


2.  i;basic: The Unicode Collation Algorithm

   The basic collation is intended to provide tolerable results for a
   number of languages for all three operations (equality, substring
   and ordering) so it is suitable as a mandatory-to-implement
   collation for protocols which include ordering support.

   The ordering operation of the basic collation is the Unicode
   Collation Algorithm version 14 [UCAv14]. The equality and substring
   operations are created as described in UCAv14 section 8. While that
   section is informative to UCAv14, it is normative to this collation
   specification.

   This collation is based on Unicode version 3.2, with the following
   tables relevant:

   1.  For the normalization step,
       http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt
       is used. Column 5 is used to determine the canonical
       decomposition, while column 3 contains the canonical combining
       classes necessary to attain canonical order.

   2.  The table of characters which require a logical order exception
       is a subset of the table in
       http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.txt
       and is included here:

       0E40..0E44 ; Logical_Order_Exception
       # Lo [5] THAI CHARACTER SARA E..THAI CHARACTER SARA AI MAIMALAI
       0EC0..0EC4 ; Logical_Order_Exception
       # Lo [5] LAO VOWEL SIGN E..LAO VOWEL SIGN AI

       # Total code points: 10

   3.  The table used to translate normalized code points to a sort key
       is http://www.unicode.org/reports/tr10/allkeys-3.1.1.txt.
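   The first two tables above can be read with very little code. The
   following is a minimal Python sketch, not part of this
   specification. It assumes that "column 3" and "column 5" are the
   zero-based, semicolon-separated fields of a UnicodeData record, and
   that a decomposition field starting with "<" is a compatibility
   mapping rather than a canonical one. All function and constant names
   are hypothetical, and the two sample records are simplified
   illustrations in the UnicodeData format rather than lines copied
   from UnicodeData-3.2.0.txt.

```python
from typing import List, NamedTuple, Optional, Tuple

# The two ranges of the logical order exception table in list item 2.
LOGICAL_ORDER_EXCEPTION_LINES = [
    "0E40..0E44 ; Logical_Order_Exception # Lo [5] THAI CHARACTER SARA E..",
    "0EC0..0EC4 ; Logical_Order_Exception # Lo [5] LAO VOWEL SIGN E..",
]

# Illustrative records only (assumption: trailing fields simplified).
SAMPLE_E_ACUTE = "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9"
SAMPLE_ACUTE = "0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;;;;;"
SAMPLE_NBSP = "00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;;;;;"


def parse_ranges(lines: List[str]) -> List[Tuple[int, int]]:
    """Parse 'XXXX..YYYY ; Property # comment' lines into inclusive ranges."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # comment-only or blank line
        span = data.split(";", 1)[0].strip()
        first, _, last = span.partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


LOE_RANGES = parse_ranges(LOGICAL_ORDER_EXCEPTION_LINES)


def is_logical_order_exception(ch: str) -> bool:
    """True if the single character ch falls in one of the listed ranges."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in LOE_RANGES)


class Record(NamedTuple):
    code_point: int
    combining_class: int                     # field 3
    canonical_decomposition: Optional[Tuple[int, ...]]  # field 5, canonical only


def parse_unicode_data_line(line: str) -> Record:
    """Extract the two fields the normalization step needs from one record."""
    fields = line.split(";")
    decomposition = fields[5]
    canonical = None
    # A leading "<tag>" marks a compatibility mapping, which is skipped here.
    if decomposition and not decomposition.startswith("<"):
        canonical = tuple(int(cp, 16) for cp in decomposition.split())
    return Record(int(fields[0], 16), int(fields[3]), canonical)
```

   Summing the sizes of the parsed ranges gives the ten code points
   stated in the table, which is a cheap consistency check when the
   table is copied into an implementation.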

   UCAv14 includes a number of configurable parameters and steps
   labelled as potentially optional. The following list summarizes the
   defaults used by this collation:

   -  The logical order exception step is mandatory by default to
      support the largest number of languages.

   -  Steps 2.1.1 to 2.1.3 are mandatory as the repertoire of the basic
      collation is intended to be large.

   -  The second level in the sort key is evaluated forwards by
      default. This can be changed using the "direction2" variable.

   -  The variable weighting uses the "non-ignorable" option by
      default.

   -  The semi-stable option is not used by default.

   -  Support for one level of collation is the default behavior, i.e.,
      the collation is case-insensitive and ignores accents. This can
      be changed using the "matchlevel" variable.

   -  No preprocessing step is used by the basic collation prior to
      applying the UCAv14 algorithm. Note that an application protocol
      specification MAY require pre-processing prior to the use of any
      collations.

   -  The equality and substring algorithms use the "Whole Characters
      Only" feature described in UCAv14 section 8 by default.

   The "uv" variable specifies the version of the UnicodeData file
   used. The legal values are the Unicode version names starting with
   the default, e.g. "4.0" and "4.1", but not "2.0".

   The "version" variable specifies the version of the Unicode
   Collation Algorithm to use. The default is 14, and legal values are
   1 through the latest version.

   The exact collation identifier with these defaults is "i;basic".
   When a specification states that the basic collation is
   mandatory-to-implement, only this specific identifier is
   mandatory-to-implement.

   The default weighting option is "non-ignorable". The "semi-stable"
   sort key option is not used by default.

   Sort keys are generated as described in section 4.3 of the UCA
   specification. (Note that the result is not a string of characters.)


3.  Registration

      i;basic
      Basic
      equality order substring
      RFC XXXX
      IETF
      chris.newman@sun.com
      uv 3.2
      version 14
      direction2 forwards forwards backwards
      matchlevel 3 1 2 3


4.  Security Considerations

   This document raises no security issues that are not already
   described in [COLLATION].


5.  IANA Considerations

   The IANA is requested to add the above i;basic registration to the
   collation registry.


6.  Acknowledgements

   This document was split off from [COLLATION] during its time as a
   draft. Many of the people acknowledged in that RFC helped with this:
   Brian Carpenter, John Cowan, Dave Cridland, Mark Davis, Spencer
   Dawkins, Lisa Dusseault, Lars Eggert, Frank Ellermann, Philip
   Guenther, Tony Hansen, Ted Hardie, Sam Hartman, Kjetil Torgrim
   Homme, Michael Kay, John Klensin, Alexey Melnikov, Jim Melton and
   Abhijit Menon-Sen.


7.  References

7.1.  Normative References

   [COLLATION]  Newman, Duerst, Gulbrandsen, "Internet Application
                Protocol Collation Registry", RFC YYYY, October 2006.

   [KEYWORDS]   Bradner, "Key words for use in RFCs to Indicate
                Requirement Levels", RFC 2119, Harvard University,
                March 1997.

   [UCAv14]     Davis, Whistler, "Unicode Collation Algorithm version
                14", May 2005.


8.  Authors' Addresses

   Chris Newman
   Sun Microsystems
   3401 Centrelake Dr., Suite 410
   Ontario, CA 91761
   US

   Email: chris.newman@sun.com


   Martin Duerst
   Aoyama Gakuin University
   5-10-1 Fuchinobe
   Sagamihara
   Kanagawa 229-8558
   Japan

   Phone: +81 42 759 6329
   Fax:   +81 42 759 6495
   Email: duerst@it.aoyama.ac.jp
   Web:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/

   (Note: Please write "Duerst" with u-umlaut wherever possible, for
   example as "Dürst" in XML and HTML.)


   Arnt Gulbrandsen
   Oryx Mail Systems GmbH
   Schweppermannstr. 8
   D-81671 Muenchen
   Germany

   Fax:   +49 89 4502 9758
   Email: arnt@oryx.com


Open Issues

   This -00 draft is published in order to establish version history.
   Several necessary changes have NOT been made.

   The Unicode version choice needs consideration. 3.2 seems old? And
   can the ten-element table be dropped - why is it there?

   The variable names should be aligned with what
   http://unicode.org/reports/tr35/#Collation_Elements describes. IMO
   the best thing to do is to copy the CLDR names.

   The variable defaults need to be considered when doing the above
   rename.


Change Log

   Changes in -00: No substantive changes from
   draft-newman-i18n-comparator.


Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.


Copyright Statement

   Copyright (C) The Internet Society (2006). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


Disclaimer of Validity

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
   INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.