                                                             Edmon Chung
Internet Draft                                                    Neteka
Intended Category: Informational                              April 2003


         CHARPREP - Character Equivalency Preparations for IDN


STATUS OF THIS MEMO

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The reader is cautioned not to depend on the values that appear in
   examples to be current or complete, since their purpose is primarily
   educational. Distribution of this memo is unlimited.

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   Charprep takes up where Nameprep [NAMEPREP] left off, providing
   additional preventive measures that bridge the user's conceptual
   perception of a multilingual domain name and the domain matching
   process.

   The critical development from Nameprep is that common user
   perception is taken into account. That is, Charprep strives to take
   the 'case-insensitivity' concept of user-friendliness to another
   level for IDNs, because of the inherent complexity and potential
   confusion that could arise from the use of multilingual characters
   in domain names.

   Charprep is designed to be a framework for Zone Administrators (e.g.
   domain registries) to employ relevant equivalency tables to compute,
   from the original string, the variants that could create confusion
   for users.
   The actual management of Reserved Variants (RV) and Zone Variants
   (ZV) together with the original string (Primary Domain) will be
   discussed in Zoneprep [ZONEPREP].

   Furthermore, Charprep and Zoneprep are designed to be a recommended
   feature offered to users by a Zone Administrator (e.g. Domain
   Registries) in the management of Internationalized Domain Names
   (IDN). A key concept is that these are done without affecting the
   IDN protocol specified in [RFC3490], [RFC3491] and [RFC3492].

Chung                                                           [Page 1]

IDNOP-CHARPREP                                                April 2003

Table of Contents

   1. Introduction
   1.1 Terminology
   1.2 Nomenclature
   1.3 Disclaimer
   2. Importance of Charprep
   3. Equivalency versus Prohibition
   4. Character Equivalency Preparations
   5. Charprep Tables and Profiles
   5.1 Codepoints Inclusion Table
   6.2 Charprep Table
   6.3 Publishing of Charprep Profiles
   6.4 Generation of Charprep Equivalence Set
   7. IANA Considerations
   8. Security Considerations
   Acknowledgements

1. Introduction

   During the discussions to establish an IDN protocol, a great number
   of problematic issues surrounding name equivalency were uncovered.
   The current Nameprep document decided to constrain its scope of
   application:

   "Although it would be easy to use the process in this step to
   "correct" perceived mis-features or bugs in the current character
   standards, [Nameprep] expressly does not do so."
   Charprep will continue to uphold the spirit of Nameprep to, "allow
   as wide of a range of characters as possible to be allowed in host
   names... The user should not be limited to only entering exactly the
   characters that might have been used, but to instead be able to
   enter characters that unambiguously [represents] the characters in
   the [perceived] host name." In other words, the user should be able
   to use different but perceptually equivalent characters (codepoints)
   and still arrive at the perceived domain.

   This document does not include the specific character equivalency
   preparation (Charprep) tables, nor does it provide explicit policies
   for the use of the Charprep tables. Rather, it briefly describes the
   problem of character equivalency for IDNs and suggests a framework
   for the publishing of Charprep tables for different languages.

1.1 Terminology

   The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
   and "MAY" in this document are to be interpreted as described in RFC
   2119 [RFC2119].

1.2 Nomenclature

   As in the Unicode Standard [UNICODE], Unicode code points are
   denoted by "U+" followed by four to six hexadecimal digits.

   The following terms carry specific definitions within this document:

   Zone Administrator - A domain operator or service that manages
   sub-domain delegations. This includes domain registries such as TLD
   registries, as well as domain operators of SLDs that issue third
   level domains, etc.

   Registration - Entry of a domain into the zone file of an
   authoritative name server.

   Resolution - Matching or lookup of domain names within the name
   server.

   IDN - Internationalized Domain Names: domain names containing one or
   more characters outside the A-Z, a-z, 0-9 and "-" repertoire.

1.3 Disclaimer

   This document does NOT intend to provide any discussion on
   equivalence policies of any scripts, nor does it intend to suggest
   any type of policies.
   Zone Administrators SHOULD consult with and understand the needs of
   their user base before deciding on and publishing their own
   policies. Examples provided in this document are for explanation
   only.

2. Importance of Charprep

   The importance of, and need for, Charprep is best illustrated
   through the following simple example:

   Suppose a person obtained a domain consisting of the Greek letters
   Alpha and Beta under .example from the .example Zone Manager. The
   person now advertises the domain with Alpha and Beta in capital
   letters (<0391><0392>.example). A user seeing this perceives the
   domain as AB.example, attempts to access that domain, and fails.

   It is true that the Greek and Latin characters are not technically
   equivalent, but because of their perceived equivalence they cause
   confusion for the user, thereby defeating the purpose of having a
   human-friendly domain name system.

   More importantly, this could create a security issue whereby a
   domain name is maliciously registered to confuse the end user. For
   example, suppose the AB.example site is an e-Commerce site. A
   malevolent registrant may register the Greek-letter look-alike
   domain and set up a link to it on a competing site. The end user
   will not be able to realize that s/he is being brought to a
   different site, because the display will always look like
   "AB.example".

   Charprep provides a framework for the publishing of Charprep tables
   that can be used by Zone Administrators to create, from the original
   submitted domain (Primary Domain), a set of variants that may cause
   user confusion. Further management of this set of variants with
   regard to zone file entries is discussed in Zoneprep.

3. Equivalency versus Prohibition

   A common misconception is that equivalence preparations prohibit the
   use of mapped characters. This is NOT true. For example, even if one
   character is deemed equivalent to another, and vice versa, this does
   not prohibit a Zone Administrator from offering a domain name that
   contains either character, or both.
   To resolve possible conflicts, the first-come, first-served rule
   employed by most zone administrators today may naturally apply.

   Another common misconception is that character equivalence
   consideration requires word or phrase semantic (orthographic)
   equivalence. This is also NOT true. Charprep does not give much
   regard to the end phrase or word, but focuses on the character
   itself. Therefore, even though a character may be semantically
   different, it MAY still be considered equivalent (e.g. visually
   identical letters from different scripts). Conversely, even though a
   character may be visually different, it MAY still be considered
   equivalent (as in the case of Traditional versus Simplified Chinese
   characters).

4. Character Equivalency Preparations

   Throughout the IDN discussions, character equivalency issues were
   repeatedly brought up. While they were appropriately dismissed as a
   core protocol concern, the importance of Charprep has never been
   discounted, especially by zone operators who have started to deploy
   IDNs and from a policy point of view, such as in the discussions at
   ICANN.

   Charprep is important because characters that may be perceptually
   equivalent, whether visually or contextually, may occupy different
   "codepoints" (as specified in Unicode). This makes them
   "technically" distinct and unique "characters", yet in real life
   they are perceived and considered to be the same.

   For example, the Greek capital letter Alpha (U+0391) is visually
   identical to the English capital letter A (U+0041), yet they occupy
   two different codepoints in the Unicode scheme. The implication is
   that <0391>.example and <0041>.example are technically two distinct
   domain names even though, when displayed, they may appear identical:
   "A.example" and "A.example". Furthermore, the Cyrillic capital
   letter "A" (U+0410) is also visually identical to both of them.
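   The distinction between appearance and codepoint can be checked
   directly. The following is a minimal sketch, assuming Python 3 and
   its standard unicodedata module; the variable names are illustrative
   and are not part of Charprep.

```python
import unicodedata

# Latin, Greek and Cyrillic capital "A": three codepoints that usually
# render with the same glyph. (Illustrative list, not a Charprep table.)
lookalikes = ["\u0041", "\u0391", "\u0410"]

for ch in lookalikes:
    # Print the codepoint in U+XXXX form next to its Unicode name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# String comparison works on codepoints, so the three are all distinct
# even though a reader would perceive them as the same letter.
all_distinct = len(set(lookalikes)) == len(lookalikes)
```

   Because domain matching compares codepoints rather than glyphs, a
   label built from any one of these letters will not match a label
   built from another unless a Zone Administrator handles the
   equivalence explicitly.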
   For another example, within the Chinese language, one particular
   character may have a number of different visual representations, yet
   they are conceptually equivalent. The most noticeable case is the
   Traditional Chinese versus the Simplified Chinese representation of
   a character (e.g. 發 [U+767C("fa"-prosper)] and
   发 [U+53D1("fa"-prosper | hair)]). To complicate matters, these
   relationships may not be one-to-one, because within different
   contexts a character may take on a semantically different meaning,
   therefore creating additional variants of the root character (e.g.
   发 [U+53D1("fa"-prosper | hair)] and 髮 [U+9AEE("fa"-hair)]).

   Furthermore, parts of the Japanese and Korean languages utilize a
   subset of the Chinese character repertoire. Two characters that may
   be considered perceptually equivalent in the context of the Chinese
   language may, however, be considered distinct and unique in Japanese
   Kanji (e.g. 國 [U+570B("guo"-country)("goku"-a name)] and
   国 [U+56FD("guo"-country)("koku"-country)]).

   It is therefore very important to preserve the perceptual
   expectations of the end user for multilingual domain names, and to
   maintain the user-friendly spirit of domain names, so that they
   continue to be a useful and human-friendly means of direct
   navigation and resource addressing over the Internet.

5. Charprep Tables and Profiles

   Charprep deals with perceptual equivalency of characters. Characters
   are units of visual or graphical representation of the written form
   of languages. Scripts best define the collection of a set of
   characters. Charprep profiles MAY utilize ISO 15924, "Codes for the
   representation of names of scripts" [ISO15924], as the guide for
   identifying scripts and managing Charprep tables. Multiple scripts
   may share one Charprep profile and vice versa. Charprep profiles MAY
   also define their own Codepoints Inclusion Table.

   Each Charprep Profile SHOULD consist of the following three
   elements:

      1. Charprep Report
      2. Codepoints Inclusion
      3. Charprep Table

   The Charprep Report should describe the policy and provide some
   rationale for the equivalency determinations it makes. If the
   Charprep Report simply identifies a set of one or more script codes
   [ISO15924], a Codepoints Inclusion Table is not necessary. If a
   finer-grained approach is desired, a Codepoints Inclusion Table
   SHOULD be included. A Codepoints Inclusion Table simply provides the
   set of codepoints that is intended for the corresponding Charprep
   Table.

   {Note: Current documents of reference include [TSCONV], [JPCHAR] and
   [HANGULCHAR], along with [IDN-ADMIN]}

5.1 Codepoints Inclusion Table

   The Codepoints Inclusion Table should simply be a list of the
   codepoints that are intended to be included within the Charprep
   profile:

      #Codepoints Inclusion Table for XXX
      #version x.x
      #script: XXX YYY
      U+XXXX; Optional Remarks
      U+XXXX; Optional Remarks
      U+XXXX; Optional Remarks
      ...

   Note that a Codepoints Inclusion Table name and a version number
   MUST be included as part of the header of the table. Optionally, the
   scripts considered within the table may be included. If multiple
   scripts are used, a space-separated list of the script codes
   [ISO15924] should be provided.

6.2 Charprep Table

   The Charprep Table MUST have 3 columns. Each entry MUST fill the
   first 2 columns; the third column is optional:

     Codept       Equivalent Set               Remarks
   +--------+-------------------------+------------------------------+
   | U+XXXX | U+XXXX U+XXXX U+XXXX ...| Optional Remarks             |
   :        :                         :                              :

   There should be one entry for each Nameprep-ed codepoint considered
   in the Charprep Table. The Equivalent Set column consists of a set
   of one or more space-delimited codepoints corresponding to the
   codepoint in the first column. For multi-codepoint entries, the
   convention U+XXXX+XXXX is used. Optional Remarks may be provided for
   each entry.
   For example:

     Codept     Charprep Variants              Remarks
   +--------+-------------------------+------------------------------+
   | U+0061 | U+03B1 U+0430           | Greek & Cyrillic             |
   +--------+-------------------------+------------------------------+
   | U+03B1 | U+0061 U+0430           | English & Cyrillic           |
   +--------+-------------------------+------------------------------+
   | U+0430 | U+0061 U+03B1           | English & Greek              |
   +--------+-------------------------+------------------------------+
   :        :                         :                              :

   Note that the number of entries in the Charprep Table might NOT be
   the same as in the Codepoints Inclusion Table for the same Charprep
   profile. Note also that a Charprep Table might not be necessary if
   the policy of the Charprep profile is simply to have a Codepoints
   Inclusion Table.

6.3 Publishing of Charprep Profiles

   A Zone Administrator, especially a Top-Level Domain Registry, SHOULD
   publish Charprep profiles for all scripts (languages) in which it
   allows registrations, and make them publicly available so that end
   users can understand the registration policies.

   The Codepoints Inclusion Tables and Charprep Tables SHOULD exist in
   flat file format with the semi-colon used as a column delimiter.
   For example:

      #Charprep Table for XXX
      #version x.x
      #script: XXX YYY
      U+0061; U+03B1 U+0430; Greek & Cyrillic
      U+03B1; U+0061 U+0430; English & Cyrillic
      U+0430; U+0061 U+03B1; English & Greek

6.4 Generation of Charprep Equivalence Set

   Charprep does not discuss the specific policies for managing DNS
   zone files or how the generated variants are managed within them.
   The Charprep tables and profiles enable Zone Administrators to
   create a set of variants from a given IDN.
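   The generation step can be sketched in code. The following is a
   minimal illustration in Python, not a normative algorithm: the
   function names and the SAMPLE_TABLE constant are hypothetical, the
   table content is the sample flat file from Section 6.3, and the
   sketch assumes that every first-column entry is a single codepoint
   and that labels have already been processed by Nameprep.

```python
from itertools import product

# Sample flat-file Charprep Table, copied from Section 6.3.
SAMPLE_TABLE = """\
#Charprep Table for XXX
#version x.x
#script: XXX YYY
U+0061; U+03B1 U+0430; Greek & Cyrillic
U+03B1; U+0061 U+0430; English & Cyrillic
U+0430; U+0061 U+03B1; English & Greek
"""


def parse_codepoint(token):
    """Convert 'U+XXXX' (or multi-codepoint 'U+XXXX+XXXX') to a string."""
    return "".join(chr(int(h, 16)) for h in token[2:].split("+"))


def parse_charprep_table(text):
    """Map each first-column codepoint to its list of equivalents."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = [f.strip() for f in line.split(";")]
        key = parse_codepoint(fields[0])
        table[key] = [parse_codepoint(t) for t in fields[1].split()]
    return table


def charprep_variants(label, table):
    """Return every variant of `label`, excluding the label itself."""
    # For each character: the character itself plus its equivalents.
    choices = [[ch] + table.get(ch, []) for ch in label]
    variants = {"".join(combo) for combo in product(*choices)}
    variants.discard(label)
    return variants


table = parse_charprep_table(SAMPLE_TABLE)
variants = charprep_variants("\u03b1\u03b1", table)
for v in sorted(variants):
    # Print in the <XXXX> notation used in this document.
    print("".join(f"<{ord(c):04X}>" for c in v) + ".example")
```

   With the sample table, the label <03B1><03B1> has three choices per
   position, giving nine combinations; removing the original label
   leaves eight variants. The number of variants grows multiplicatively
   with label length, which is one reason the handling of the generated
   set is left to zone policy.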
   For example, based on the examples above, the domain:

      <03B1><03B1>.example [αα.example]

   would generate a set of 8 Charprep Variants:

      <03B1><0061>.example
      <03B1><0430>.example
      <0061><0061>.example
      <0061><03B1>.example
      <0061><0430>.example
      <0430><0061>.example
      <0430><03B1>.example
      <0430><0430>.example

   The management of the variants, and how they should be represented
   in the DNS zone file, will be further discussed in Zoneprep
   [ZONEPREP]. Zoneprep describes a framework for Zone Administrators
   to prepare their zone files based on Zoneprep profiles.

7. IANA Considerations

   There are no explicit IANA considerations required for Charprep.
   IANA may, however, decide to maintain a registry for Charprep
   Profiles as described in Section 6.

8. Security Considerations

   This document does not discuss DNS security issues, and it is
   believed that the proposal does not introduce security problems
   beyond those already existing and/or anticipated from adding
   multilingual characters to DNS and/or using ACE. Charprep
   considerations could, however, help to improve the security and
   authenticity of IDN usage by reducing the confusion caused by
   perceptually equivalent characters.

Acknowledgements

   This document incorporates many of the discussions from the CJK
   community (from CNNIC, TWNIC, JPRS and KRNIC respectively) and by
   the JET (Joint Engineering Team), as well as at different forums
   including IETF and ICANN. More importantly, it draws on discussions
   in the document "Internationalized Domain Names Registration and
   Administration Guideline for Chinese, Japanese and Korean".
   Furthermore, many valuable comments and discussions with the
   following people were incorporated:

      Xiaodong (Sheldon) Lee
      Kenny Huang
      Paul Hoffman
      Mark Davis
      Vincent Chen

References

   [TSCONV]     XiaoDong LEE, et al., "Traditional and Simplified
                Chinese Conversion", November 2001

   [JPCHAR]     Yoshiro Yoneya & Yasuhiro Morishita, JPNIC, "Japanese
                characters in multilingual domain name labels", March
                2, 2001

   [HANGULCHAR] Soobok Lee & GyeongSeog Gim, "Hangeul NAMEPREP
                recommendation version 1.0", June 2001

   [RFC1034]    Mockapetris, P., "Domain Names - Concepts and
                Facilities," STD 13, RFC 1034, USC/ISI, November 1987

   [RFC1035]    Mockapetris, P., "Domain Names - Implementation and
                Specification," STD 13, RFC 1035, USC/ISI, November
                1987

   [RFC2119]    S. Bradner, "Key words for use in RFCs to Indicate
                Requirement Levels," RFC 2119, March 1997

   [RFC2181]    R. Elz, University of Melbourne & R. Bush, RGnet, Inc.,
                "Clarifications to the DNS Specification", July 1997

   [RFC3454]    P. Hoffman, IMC & VPNC & M. Blanchet, Viagenie,
                "Preparation of Internationalized Strings
                ("stringprep")", December 2002

   [RFC3490]    P. Faltstrom, Cisco, P. Hoffman, IMC & VPNC & A.
                Costello, UC Berkeley, "Internationalizing Domain Names
                in Applications (IDNA)", March 2003

   [RFC3491]    P. Hoffman, IMC & VPNC & M. Blanchet, Viagenie,
                "Nameprep: A Stringprep Profile for Internationalized
                Domain Names (IDN)", March 2003

   [RFC3492]    A. Costello, Univ. of California, Berkeley, "Punycode:
                A Bootstring encoding of Unicode for Internationalized
                Domain Names in Applications (IDNA)", March 2003

   [IDN-ADMIN]  Editors: James SENG & John KLENSIN; Authors: K.
                KONISHI, K. HUANG, H. QIAN & Y. KO, "Internationalized
                Domain Names Registration and Administration Guideline
                for Chinese, Japanese and Korean"

Authors:

   Edmon Chung
   Neteka
   Suite 100, 243 College St.,
   Toronto, Ontario, Canada M5T 1R5
   edmon@neteka.com