Network Working Group                             A. ElSherbiny, UN-ESCWA
Internet-Draft                                         M. Farah, UN-ESCWA
draft-farah-adntf-ling-guidelines-00.txt            A. Al Zoman, SaudiNIC
Category: Standards Track                                I. Oueichek, STE
Expires: August 2008                                        February 2008

         Linguistic Guidelines for the Use of Arabic Characters
                          in Internet Domains

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements. Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol. Distribution of this memo is unlimited.

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright Notice

   Copyright (C) The IETF TRUST (2008).

Abstract

   This document constitutes technical specifications for the use of
   Arabic characters in Internet Domain names and provides linguistic
   guidelines for Arabic Domain Names.
   It addresses Arabic-specific linguistic issues pertaining to the use
   of the Arabic language in domain names.

Table of Contents

   1. Introduction
   2. Arabic Language-Specific Issues
      2.1 Linguistic issues
      2.2 Supported Character Set
      2.3 Arabic Linguistic Issues Affected By Technical Constraints
   3. Security Considerations
   4. IANA Considerations
   5. Conclusion
   6. Acknowledgments
   Normative References
   Informative References
   Authors' Addresses
   Full Copyright Statement
   Disclaimer of Validity

1. Introduction

   In March 2003, the Internet Engineering Task Force (IETF) issued a
   set of RFCs for Internationalized Domain Names (IDN) [N1, N2, N3]
   that are intended to become the de facto standard for all languages.
   In 2007, new Internet-Drafts proposing revisions to the IDNA
   protocol were released:

   - Klensin, J., "Internationalizing Domain Names for Applications
     (IDNA): Issues and Rationale", Work in Progress, November 2007.

   - Klensin, J., "Internationalizing Domain Names in Applications
     (IDNA): Protocol", Work in Progress, November 2007.

   - Alvestrand, H. and Karp, C., "An IDNA problem in right-to-left
     scripts", Work in Progress, July 2007.

   - Faltstrom, P., "The Unicode Codepoints and IDN", Work in
     Progress, November 2007.

   This document constitutes a technical specification for the
   implementation of the IDN standards in the case of the Arabic
   language. It allows the use of standard language tables to write
   domain names in Arabic characters, and should therefore be
   considered a logical extension to the IDN standards.

   This document reflects the recommendations of the Arab Working
   Group on Arabic Domain Names (AWG-ADN) established by the League of
   Arab States (LAS), based on the standardisation efforts of the
   United Nations Economic and Social Commission for Western Asia
   (UN-ESCWA) and its Internet-Draft: Farah, et al., "Guidelines for an
   Arabic Internet Domain Name"
   (draft-farah-adntf-adns-guidelines-03.txt).

   The key words "MUST", "REQUIRED", "SHOULD", "RECOMMENDED", and "MAY"
   in this document are to be interpreted as described in RFC 2119.

2. Arabic Language-Specific Issues

   The main objective of the creation of Arabic Domain Names is to have
   a vehicle to increase Internet use amongst all strata of the
   Arabic-speaking communities. A domain name that is not user-friendly
   would add to the ambiguity and strangeness of the Internet for these
   communities, hindering the spread of the Internet and leading to
   their further isolation at the global level. Hence, there have been
   intensive efforts, especially those spearheaded by Dr. Al-Zoman and
   contributed to by UN-ESCWA and its Arabic Domain Names Task Force
   (ADN-TF), to reach consensus on a multitude of linguistic issues,
   with the following goals:

   - To define the accepted Arabic character set to be used for
     writing domain names in Arabic, which is the subject of this
     document.

   - To define the top-level domains of the Arabic domain name tree
     structure (i.e., Arabic gTLDs and ccTLDs). This goal will be
     handled in a separate document.

   The first meeting of the AWG-ADN, held in Damascus in
   January-February 2005, gave special attention to the following:

   (a) Simplification of the domain names, whenever possible, to
       facilitate the interaction of the Arabic user with the Internet.

   (b) Adoption of solutions that do not lead to confusion either in
       reading or in writing, provided that this does not compromise
       the linguistic correctness of the words used.

   (c) Mixing Arabic and non-Arabic letters in a domain name is not
       acceptable.

2.1 Linguistic issues

   A number of linguistic issues have been raised with respect to the
   use of the Arabic language in domain names. This section highlights
   some of them. It is based on the paper of Dr. Al-Zoman [I1] and the
   report of the first meeting of AWG-ADN [N4]. For details, the reader
   is encouraged to review the references.

2.1.1. Tashkeel (diacritics) and Shadda

   Tashkeel and Shadda must not be stored in the zone file. They may
   be supported in the user interface only, and are stripped off at the
   preparation of internationalized strings (stringprep) phase. The
   following are their Unicode representations:

      U+064B ARABIC FATHATAN
      U+064C ARABIC DAMMATAN
      U+064D ARABIC KASRATAN
      U+064E ARABIC FATHA
      U+064F ARABIC DAMMA
      U+0650 ARABIC KASRA
      U+0651 ARABIC SHADDA
      U+0652 ARABIC SUKUN

2.1.2. Kasheeda or Tatweel (Horizontal Character Size Extension)

   Kasheeda (U+0640 ARABIC TATWEEL) must not be used in Arabic domain
   names.

2.1.3. Character folding

   Character folding is the process by which multiple letters (that may
   have some similarity in shape) are folded into one shape. This
   includes:

   - Folding Teh Marbuta (U+0629) and Heh (U+0647) at the end of a
     word;

   - Folding different forms of Hamzah (U+0622, U+0623, U+0625,
     U+0627);

   - Folding Alef Maksura (U+0649) and Yeh (U+064A) at the end of a
     word;

   - Folding Waw with Hamzah Above (U+0624) and Waw (U+0648).
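
   The rules of 2.1.1 to 2.1.3 can be sketched in a few lines of
   Python. This is a minimal illustration, not a reference
   implementation: the function names are hypothetical, the folding
   directions (for example, every Hamzah form onto bare Alef) are
   assumptions that the list above does not fix, and each call is
   assumed to receive a single word. The fold() helper exists only to
   show the effect of folding: the name "Ali" (Ain, Lam, Yeh) and the
   preposition "on" (Ain, Lam, Alef Maksura) are distinct words that
   fold to the same string.

```python
# Sketch only: helper names and folding directions are assumptions,
# not part of the guidelines.

# 2.1.1: Tashkeel and Shadda, U+064B through U+0652.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}
# 2.1.2: Kasheeda (Tatweel).
TATWEEL = "\u0640"


def strip_tashkeel(label: str) -> str:
    """Drop Tashkeel and Shadda, as a preparation step would before storage."""
    return "".join(ch for ch in label if ch not in TASHKEEL)


def contains_tatweel(label: str) -> bool:
    """Return True if the label uses Kasheeda and should be refused."""
    return TATWEEL in label


def fold(word: str) -> str:
    """Apply the four foldings of 2.1.3 to one word (illustration only)."""
    hamza_forms = {"\u0622", "\u0623", "\u0625"}  # assumed target: bare Alef
    chars = ["\u0627" if ch in hamza_forms else ch for ch in word]
    # Waw with Hamzah Above -> Waw
    chars = ["\u0648" if ch == "\u0624" else ch for ch in chars]
    if chars and chars[-1] == "\u0629":    # final Teh Marbuta -> Heh
        chars[-1] = "\u0647"
    elif chars and chars[-1] == "\u0649":  # final Alef Maksura -> Yeh
        chars[-1] = "\u064A"
    return "".join(chars)


ALI = "\u0639\u0644\u064A"  # Ain, Lam, Yeh: the name "Ali"
ALA = "\u0639\u0644\u0649"  # Ain, Lam, Alef Maksura: the preposition "on"

if __name__ == "__main__":
    # Meem + Fatha + Dal + Shadda reduces to Meem + Dal.
    print(strip_tashkeel("\u0645\u064E\u062F\u0651") == "\u0645\u062F")
    # A label containing Tatweel is flagged.
    print(contains_tatweel("\u0645\u0640\u062F"))
    # Two different words become indistinguishable after folding.
    print(ALI != ALA and fold(ALI) == fold(ALA))
```

   Stripping Tashkeel and refusing Tatweel lose no letters, whereas
   fold() maps two different spellings onto one stored name, so only
   one of the two words could ever be registered.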

   With respect to the Arabic language, character folding is not
   acceptable because it changes the meaning of words and violates
   spelling rules. Replacing a character with another of similar shape
   yields a word with a different meaning. As a result, a single word
   would stand for every word obtainable from it through combinations
   of folded characters, and those other words would be masked by it
   [I1]. Misspellings and handwriting errors that mix up different
   characters do occur, but this is not the case in published and
   printed materials.

   One of the motivations of this effort is to preserve the language,
   particularly with the spread of globalization. Character folding
   works against this motivation, since it would have a negative effect
   on the principles and ethics of the language. Technology should work
   to preserve the language, not to destroy it. Thus, character folding
   should not be allowed.

2.2 Supported Character Set

   A domain name written in Arabic must be composed of a sequence of
   the following Unicode characters. These are based on Unicode
   version 5.0.

   TABLE 1: CHARACTERS FROM UNICODE ARABIC TABLE (0600-06FF)

   Unicode   Character Name
   0621      ARABIC LETTER HAMZA
   0622      ARABIC LETTER ALEF WITH MADDA ABOVE
   0623      ARABIC LETTER ALEF WITH HAMZA ABOVE
   0624      ARABIC LETTER WAW WITH HAMZA ABOVE
   0625      ARABIC LETTER ALEF WITH HAMZA BELOW
   0626      ARABIC LETTER YEH WITH HAMZA ABOVE
   0627      ARABIC LETTER ALEF
   0628      ARABIC LETTER BEH
   0629      ARABIC LETTER TEH MARBUTA
   062A      ARABIC LETTER TEH
   062B      ARABIC LETTER THEH
   062C      ARABIC LETTER JEEM
   062D      ARABIC LETTER HAH
   062E      ARABIC LETTER KHAH
   062F      ARABIC LETTER DAL
   0630      ARABIC LETTER THAL
   0631      ARABIC LETTER REH
   0632      ARABIC LETTER ZAIN
   0633      ARABIC LETTER SEEN
   0634      ARABIC LETTER SHEEN
   0635      ARABIC LETTER SAD
   0636      ARABIC LETTER DAD
   0637      ARABIC LETTER TAH
   0638      ARABIC LETTER ZAH
   0639      ARABIC LETTER AIN
   063A      ARABIC LETTER GHAIN
   0641      ARABIC LETTER FEH
   0642      ARABIC LETTER QAF
   0643      ARABIC LETTER KAF
   0644      ARABIC LETTER LAM
   0645      ARABIC LETTER MEEM
   0646      ARABIC LETTER NOON
   0647      ARABIC LETTER HEH
   0648      ARABIC LETTER WAW
   0649      ARABIC LETTER ALEF MAKSURA
   064A      ARABIC LETTER YEH
   0660      ARABIC-INDIC DIGIT ZERO
   0661      ARABIC-INDIC DIGIT ONE
   0662      ARABIC-INDIC DIGIT TWO
   0663      ARABIC-INDIC DIGIT THREE
   0664      ARABIC-INDIC DIGIT FOUR
   0665      ARABIC-INDIC DIGIT FIVE
   0666      ARABIC-INDIC DIGIT SIX
   0667      ARABIC-INDIC DIGIT SEVEN
   0668      ARABIC-INDIC DIGIT EIGHT
   0669      ARABIC-INDIC DIGIT NINE

   Source: A. Al-Zoman, "Supporting the Arabic Language in Domain
   Names", October 2003

   TABLE 2: CHARACTERS FROM UNICODE BASIC LATIN TABLE (0000-007F)

   Unicode   Character Name
   0030      DIGIT ZERO
   0031      DIGIT ONE
   0032      DIGIT TWO
   0033      DIGIT THREE
   0034      DIGIT FOUR
   0035      DIGIT FIVE
   0036      DIGIT SIX
   0037      DIGIT SEVEN
   0038      DIGIT EIGHT
   0039      DIGIT NINE
   002D      HYPHEN-MINUS
   002E      FULL STOP (Dot)

   Source: A.
   Al-Zoman, "Supporting the Arabic Language in Domain Names",
   October 2003

2.3 Arabic Linguistic Issues Affected By Technical Constraints

   In this section, technical aspects of some linguistic issues are
   discussed.

2.3.1. Numerals

   In the Arab countries, two sets of numerical digits are used:

   - Set I: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), mostly used in the western
     part of the Arab world.

   - Set II: (U+0660, U+0661, U+0662, U+0663, U+0664, U+0665, U+0666,
     U+0667, U+0668, U+0669), mostly used in the eastern part of the
     Arab world.

   Although the Arabic zero (U+0660) and the dot (U+002E) can be told
   apart in printed material (the zero is larger and is printed higher
   than the dot), using the Arabic zero in domain names may lead to
   confusion. Folding Set II to Set I eliminates the problem of the
   zero in particular, and that of numerals in general.

   Both sets may be supported in the user interface, but they must be
   folded to a single set (Set I) at the preparation of
   internationalized strings (e.g., "stringprep") phase; that is,
   numerals are stored in the zone file in ASCII format.

2.3.2. The Space Character

   The space character is strictly not allowed in domain names, as it
   is not among the characters permitted in host names. Instead, the
   hyphen (Al-sharta), i.e., U+002D, is proposed as a separator between
   Arabic words. Unlike in ASCII names, confusion can arise if Arabic
   words are typed without a separator. It is acceptable to use the
   hyphen to separate words within the same domain name label.

3. Security Considerations

   No particular security considerations could be identified regarding
   the use of Arabic characters in writing domain names. In particular,
   any potential visual confusion between different character strings
   is avoided by following the guidelines proposed in this document.

4. IANA Considerations

   This document has no action for IANA.

5. Conclusion

   The proposed guidelines are in full accordance with the IETF IDN
   standards and take into account Arabic language-specific issues,
   striking a compromise between the grammatical rules of the Arabic
   language and the ease of use of the language on the Internet.

6. Acknowledgments

   ESCWA ICT Division provided support and funding for the development
   of this document with the objective of reaching a comprehensive
   standard for Arabic Domain Names. Thanks are due to SaudiNIC for its
   continuous efforts in supporting the development of Arabic Domain
   Names.

Normative References

   [N1] Faltstrom, P., Hoffman, P. and A. Costello,
        "Internationalizing Domain Names in Applications (IDNA)",
        RFC 3490, March 2003.

   [N2] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile
        for Internationalized Domain Names (IDN)", RFC 3491,
        March 2003.

   [N3] Costello, A., "Punycode: A Bootstring encoding of Unicode for
        Internationalized Domain Names in Applications (IDNA)",
        RFC 3492, March 2003.

   [N4] League of Arab States, report of the first meeting of AWG-ADN,
        Damascus, February 2005.
        http://www.arabic-domains.org/ar/intrnational-entites.php

Informative References

   [I1] A. Al-Zoman, "Supporting the Arabic Language in Domain Names",
        October 2003,
        http://www.arabic-domains.org/docs/NIC-docs/SupportingArabicDomainNmaes.pdf

   [I2] A. Al-Zoman, "Arabic Top-Level Domains", paper presented in EGM
        on promotion of Digital Arabic Content, the United Nations,
        ESCWA, Beirut, June 2003.

Authors' Addresses

   Abdulaziz H. Al-Zoman, PhD
   Director
   SaudiNIC, General Directorate of Internet Services,
   IT Sector, CITC
   Riyadh, Saudi Arabia
   Email: azoman@citc.gov.sa

   Ayman El-Sherbiny
   Information and Communication Technology Division
   ESCWA, UN-House
   P.O. Box 11-8575
   Beirut, Lebanon
   Email: El-sherbiny@un.org

   Mansour Farah
   Information and Communication Technology Division
   ESCWA, UN-House
   P.O. Box 11-8575
   Beirut, Lebanon
   Email: farah14@un.org

   Ibaa Oueichek
   Syrian Telecom Establishment
   Damascus, Syria
   Email: oueichek@scs-net.org

   Comments are solicited and should be addressed to the working
   group's mailing list at ESCWA-ICTD@un.org and/or the author(s).

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
   IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
   WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
   WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
   ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
   FOR A PARTICULAR PURPOSE.

Disclaimer of Validity

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

   This document expires in August 2008.