Internet Draft                                               D. Crocker
draft-crocker-idn-idn-00.txt                Brandenburg InternetWorking
Expires in six months                                      23 June 2002


                 Internationalized Domain Names (IDN)


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   Globalization of the Internet requires that domain names be able
   to use characters outside the ASCII repertoire. This document
   specifies internationalized domain names (IDNs) and defines
   initial domain name constructs in which IDNs can be used. IDNs use
   characters drawn from a large repertoire (Unicode).

0. Document Change Notes

   This is a revision to draft-ietf-idn-idna-09.txt. It is being
   distributed independently to facilitate discussion. The goal is to
   gain consensus about revisions to the IDN working group document,
   specifically for the following changes:

   a. Split the document into two, one defining Internationalized
      Domain Names (IDN) and the other defining an encoding method
      for IDNs, namely IDNA using ACE.

   b. Distinguish general IDN from its specific use for host names
      (IDN-Host). Use for host names is specified more precisely, in
      terms of a specific syntax BNF rule from the relevant existing
      DNS specification, so that IDN-Host will apply precisely to all
      DNS record fields and protocol units conforming to that BNF.

1. Introduction

   Until now, there has been no standard method for domain names to
   use characters outside the ASCII repertoire. This document defines
   enhancements to the definition of domain names, to support
   internationalized domain names (IDN). The details of the protocol
   encoding of IDNs are specified separately.

2. Terminology

   The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
   and "MAY" in this document are to be interpreted as described in
   RFC 2119 [RFC2119].

   "ASCII" means US-ASCII [USASCII], a coded character set containing
   128 characters associated with code points in the range 0..7F.
   Unicode is an extension of ASCII: it includes all the ASCII
   characters and associates them with the same code points.

   Code point refers to an integral value associated with a character
   in a coded character set.

   Domain name is used as a general term for strings conforming to
   [STD13]. [STD13] talks about "domain names" and "host names", but
   many people use the terms interchangeably. Further, because [STD13]
   was not terribly clear, many people who are sure they know the
   exact definitions of each of these terms disagree on the
   definitions. This document uses the terms separately.

   Domain name slot refers to a protocol element or a function
   argument or a return value (and so on) explicitly designated for
   carrying a domain name. Examples of domain name slots include: the
   QNAME field of a DNS query; the name argument of the
   gethostbyname() library function; the part of an email address
   following the at-sign (@) in the From: field of an email message
   header; and the host portion of the URI in the src attribute of an
   HTML tag. General text that just happens to contain a domain name
   is not a domain name slot; for example, a domain name appearing in
   the plain text body of an email message is not occupying a domain
   name slot.

   Host name is a domain name conforming to [STD13], with the naming
   character set limited to LDH.
   Internationalized host name (IDN-Host) is an IDN conforming to
   [STD13], except that it also supports non-ASCII characters from
   Unicode.

   Internationalized domain name (IDN) is a domain name that has
   characters drawn from the restricted set of Unicode defined in <>.

   Internationalized label is a label composed of characters from the
   Unicode character set; note, however, that not every string of
   Unicode characters can be an internationalized label.

   IDN-native is a domain name slot specified to hold an
   internationalized domain name. The designation may be static (for
   example, in the specification of the protocol or interface) or
   dynamic (for example, as a result of negotiation in an interactive
   session).

   Label is an individual part of a domain name. Labels are usually
   shown separated by dots; for example, the domain name
   "www.example.com" is composed of three labels: "www", "example",
   and "com". (The zero-length root label described in [STD13], which
   can be explicit as in "www.example.com." or implicit as in
   "www.example.com", is not considered a label in this
   specification.) Throughout this document the term "label" is
   shorthand for "text label", and "every label" means "every text
   label". In IDNA, not all text strings can be labels.

   LDH code points are the code points associated with ASCII letters,
   digits, and the hyphen-minus; that is, U+002D, 30..39, 41..5A, and
   61..7A. "LDH" is an abbreviation for "letters, digits, hyphen".

   Unicode is a coded character set [UNICODE] containing tens of
   thousands of characters. A single Unicode code point is denoted by
   "U+" followed by four to six hexadecimal digits, while a range of
   Unicode code points is denoted by two hexadecimal numbers separated
   by "..", with no prefixes.

3. Internationalized Domain Names (IDN)

3.1. Data representation

   This specification extends the set of valid values for domain name
   labels beyond the restricted ASCII specified in [STD3] to include
   [UNICODE].
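   As a non-normative illustration of the terminology in Section 2,
   the following Python sketch splits a domain name into text labels
   and reports whether a label stays within the LDH code points or
   draws on the wider Unicode repertoire. The names DOT_SEPARATORS,
   split_labels, is_ldh_code_point, and is_ldh_label are hypothetical
   and are not defined by this specification. The separator set is
   the one listed in Section 3.2. The sketch does not attempt to
   decide whether a non-LDH string is actually permitted as an
   internationalized label.

```python
from typing import List

# Dot characters that Section 3.2 of this document lists as label
# separators: full stop, ideographic full stop, fullwidth full stop,
# and halfwidth ideographic full stop.
DOT_SEPARATORS = "\u002E\u3002\uFF0E\uFF61"


def split_labels(name: str) -> List[str]:
    """Split a domain name into its text labels (hypothetical helper)."""
    # Map every recognized separator to U+002E, then split once.
    for dot in DOT_SEPARATORS[1:]:
        name = name.replace(dot, ".")
    labels = name.split(".")
    # An explicit zero-length root label ("www.example.com.") is not
    # counted as a label, following the definition of "Label".
    if labels and labels[-1] == "":
        labels.pop()
    return labels


def is_ldh_code_point(ch: str) -> bool:
    """True for U+002D, 30..39, 41..5A, and 61..7A."""
    cp = ord(ch)
    return (
        cp == 0x2D
        or 0x30 <= cp <= 0x39
        or 0x41 <= cp <= 0x5A
        or 0x61 <= cp <= 0x7A
    )


def is_ldh_label(label: str) -> bool:
    """True if the label is non-empty and uses only LDH code points."""
    return len(label) > 0 and all(is_ldh_code_point(ch) for ch in label)


if __name__ == "__main__":
    for label in split_labels("www\u3002b\u00FCcher\uFF0Ecom"):
        kind = "LDH" if is_ldh_label(label) else "non-LDH"
        print(label, kind)
```

   Under these assumptions, a label for which is_ldh_label() returns
   false is only a candidate internationalized label; further
   restrictions on the Unicode repertoire are outside the scope of the
   sketch.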
   Mechanisms for encoding Unicode values in domain names are
   specified separately. Hence this specification provides no detail
   for IDNs in "native" binary form (IDN-Native) or for "encoded"
   Unicode-based IDNs.

3.2. Dot as label separator

   For systems supporting IDN, wherever dot is permitted as a label
   separator, the following characters MUST be recognized as dots:
   U+002E (full stop), U+3002 (ideographic full stop), U+FF0E
   (fullwidth full stop), U+FF61 (halfwidth ideographic full stop).

   << // Are there also multiple Unicode characters permitted for
   at-sign? What about for slash ("/")? If not, then why is the domain
   name lexical analyzer now required to look for 4 characters rather
   than only one? This appears to be a case of putting into the
   protocol something that is, in fact, entirely a user-interface
   issue. That some user interfaces will choose to map U+3002 to ASCII
   dot does not mean that it needs to be in the protocol. // /Dave >>

4. References

4.1. Normative references

   [STD3]     Bob Braden, "Requirements for Internet Hosts --
              Communication Layers" (RFC 1122) and "Requirements for
              Internet Hosts -- Application and Support" (RFC 1123),
              STD 3, October 1989.

   [STD13]    Paul Mockapetris, "Domain names - concepts and
              facilities" (RFC 1034) and "Domain names -
              implementation and specification" (RFC 1035), STD 13,
              November 1987.

4.2. Informative references

   [DNSSEC]   Don Eastlake, "Domain Name System Security Extensions",
              RFC 2535, March 1999.

   [RFC2119]  Scott Bradner, "Key words for use in RFCs to Indicate
              Requirement Levels", RFC 2119, March 1997.

   [UAX9]     Unicode Standard Annex #9, The Bidirectional Algorithm.

   [UNICODE]  The Unicode Standard, Version 3.1.0: The Unicode
              Consortium. The Unicode Standard, Version 3.0. Reading,
              MA, Addison-Wesley Developers Press, 2000. ISBN
              0-201-61633-5, as amended by: Unicode Standard Annex
              #27: Unicode 3.1.

   [USASCII]  Vint Cerf, "ASCII format for Network Interchange",
              RFC 20, October 1969.

5. Security Considerations

   Security on the Internet partly relies on the DNS. Thus, any change
   to the characteristics of the DNS can change the security of much
   of the Internet.

   This memo describes an algorithm that encodes characters that are
   not valid according to STD3 and STD13 into octet values that are
   valid. No security issues such as string length increases or new
   allowed values are introduced by the encoding process or the use of
   these encoded values, apart from those introduced by the ACE
   encoding itself.

   Domain names are used by users to connect to Internet servers. The
   security of the Internet would be compromised if a user entering a
   single internationalized name could be connected to different
   servers based on different interpretations of the internationalized
   domain name.

6. Authors' Addresses

   Patrik Faltstrom
   Cisco Systems
   Arstaangsvagen 31 J
   S-117 43 Stockholm
   Sweden
   paf@cisco.com

   Paul Hoffman
   Internet Mail Consortium and VPN Consortium
   127 Segre Place
   Santa Cruz, CA 95060 USA
   phoffman@imc.org

   Adam M. Costello
   University of California, Berkeley
   idna-spec.amc @ nicemice.net