Internet Draft                                  Edmon Chung, Neteka Inc.
                                                David Leung, Neteka Inc.
                                                               June 2001

          ACE Utilizing All 37 Alphanumeric Characters (ACE37)

STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The reader is cautioned not to depend on the values that appear in
examples to be current or complete, since their purpose is primarily
educational. Distribution of this memo is unlimited.

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

ACE37 is a combination of DUDE-02, AMC-W/V and LACE. ACE37 utilizes the
simple one-pass algorithm of DUDE, the character block considerations
of AMC-W/V and the base-32 compression of LACE. It also makes full use
of the entire LDH set currently allowed in the DNS (A-Z, a-z, 0-9 and
"-") within its character repertoire to optimize performance and
compression. Even in the worst case, an ACE37 name can hold 21
characters, including Chinese, Japanese and Korean names.

Two Excel spreadsheets for ACE37 encoding and decoding can be found at
http://www.dnsii.org/ace37/ace37-encode.xls and
http://www.dnsii.org/ace37/ace37-decode.xls respectively.

While DUDE-02 provides a very efficient differential mechanism, its
compression is inefficient because it does not use all 5 bits of each
base-32 character for character information.
The AMC series is highly efficient in compression but requires
complicated mode changes and is therefore inefficient to process. LACE
is moderate on both counts: it requires a two-pass mechanism but
utilizes base-32 for good compression.

Chung & Leung                                                   [Page 1]
ACE37      ACE Utilizing All 37 Alphanumeric Characters       July 2001

ACE37 uses simple character block shifting to achieve the compression
efficiency of the AMC series, retains the one-pass, one-mode XOR
differential mechanism used by DUDE, and embraces the base-32
compression used by LACE for efficient use of character bits.

Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

LDH: Letters, Digits and Hyphens: a string of characters that consists
only of hyphens ("-"), English letters (A-Z, a-z) and digits (0-9), and
which might not be the result of an algorithm for transcoding
multilingual characters. For example: whatever-you-want.example

ACE - ASCII Compatible Encoding: a string of characters resulting from
a particular algorithm for transforming multilingual character
information into an alphanumeric form acceptable to the existing DNS.
For example: bq--3bhc2zmh.tld. In essence, ACE is a subset of LDH.

Hexadecimal values are shown preceded by "0x". For example, 0x60 is
decimal 96. Binary values are shown preceded by "0b". For example,
0b1000 is decimal 8.

As in the Unicode Standard [UNICODE], Unicode code points are denoted
by "U+" followed by four to six hexadecimal digits, while a range of
code points (or hexadecimal numbers) is denoted by two hexadecimal
numbers separated by "..", with no prefixes.

Octets: sequences of 8 bits; Quintets: sequences of 5 bits; Quartets:
sequences of 4 bits; Duplets: sequences of 2 bits.

XOR: bitwise exclusive or. Given 2 nonnegative integers A and B,
A XOR B is the nonnegative integer whose binary representation has a 1
wherever the binary representations of A and B disagree, and a 0
wherever they agree.
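As a small illustration of the XOR definition, the following Python sketch computes a diff from two values and then recovers the second value from the first. The function names and the two sample values are chosen here for illustration only and are not part of the draft.

```python
def xor_diff(prev, n):
    """Return prev XOR n: bits set wherever prev and n disagree."""
    return prev ^ n


def apply_diff(prev, diff):
    """XOR is its own inverse, so applying the diff to prev recovers n."""
    return prev ^ diff


prev, n = 0xC138, 0xACC4          # arbitrary sample values
diff = xor_diff(prev, n)
print(hex(diff))                   # 0x6dfc
print(hex(apply_diff(prev, diff)))  # 0xacc4
```

Because applying the same diff twice cancels out, a decoder that knows the previous value can always recover the next one from the diff alone.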
Table Of Contents

1. Introduction
2. Code Block Shifting
3. Base-32 Characters
4. Base-4 Characters
5. LDH Considerations
6. Encoding Procedure
7. Decoding Procedure
8. Examples
9. Summary & Comparisons
10. Security Considerations
11. References

1. Introduction

ACE37 takes into account the recommendations and findings of the ACE
design team to create a "super-ACE" that incorporates the key
advantages of the various ACEs considered, without complicated mode
changes. The encoding (Section 6) and decoding (Section 7) processes
are largely similar to, and as simple as, those of DUDE-02.
The encoding process of ACE37, in comparison with that of DUDE-02, can
be summarized as follows:

ACE37 Encoding Procedure         | DUDE Encoding Procedure
---------------------------------+---------------------------------
(1) let initial prev = 0x00      | (1) let initial prev = 0x60
(2) if n = LDH output "-n"       | (2) if n = hyphen output "-"
(3) code block shift to obtain   | (3) diff = prev XOR n
    ACE37 shifted n (Section 2)  | (4) prepend "0" to the last
(4) diff = prev XOR n            |     quartet and "1" to others
(5) output in appropriate base-4 | (5) output a base-32 character
    and base-32 form             |     for each corresponding
    (Sections 3&4)               |     quintet
(6) let prev = n                 | (6) let prev = n

Similarly, the decoding process can be described and compared:

ACE37 Decoding Procedure         | DUDE Decoding Procedure
---------------------------------+---------------------------------
(1) let initial prev = 0x00      | (1) let initial prev = 0x60
(2) if char = hyphen discard "-" | (2) if char = hyphen consume
    and output next char         |     and output 0x002D
(3) consume and convert char into| (3) consume and convert to
    duplets and quintets         |     quintets until encountering
    (according to Sections 3&4)  |     a quintet with "0"
(4) concatenate to form diff     |     as first bit
    (based on Sections 4.1&4.2)  | (4) strip all first bits off
(5) let prev = prev XOR diff     | (5) concatenate to form diff
(6) reverse code block shifting  | (6) let prev = prev XOR diff
(7) output Unicode code point    | (7) output Unicode code point

The features of ACE37 include:

Unique & Reversible - the ACE37 encoding scheme yields a unique and
consistent result string for a given set of Unicode code points. The
encoded string can be decoded back to the original Unicode code points
without loss of character data.

Simple - ACE37 utilizes a one-pass system and the XOR differential
function to encode and decode. Code block shifting is done by a simple
calculation instead of mapping or creation of arbitrary reference
points. Complex mode changes are not required.

Spacious - With code block shifting coupled with a base-32 scheme,
ACE37 can accommodate up to 21 unique Han characters (including CJK)
within the 63 octets allowed by the DNS. Latin-based scripts can reach
up to 31 characters.

Completeness - any sequence of Unicode code points (U+0000..U+10FFFF)
can be encoded. Restrictions on allowed code points are not discussed,
but it is expected that Nameprep [Nameprep] will be used prior to ACE37
encoding.

In essence, ACE37 captures the key criteria discussed by the working
group's ACE design team: reversibility, simplicity and compression
capability. Moreover, ACE37 utilizes a very simple code block shifting
mechanism (Section 2) to allow any 21 CJK ideographs to be encoded
within the 63-octet constraint.

2. Code Block Shifting

The DNS was not originally designed for multilingual characters, and
Unicode was not designed with the DNS in mind, so code points were not
allocated in an ACE-friendly way. The AMC series [AMC-W] [AMC-V]
utilizes a number of reference points to achieve better compression
efficiency by anticipating and minimizing the delta between
characters.

ACE37 uses a much simpler approach. The entire character block
U+3000..U+9FFF for CJK ideographs is shifted down by 0x3000. That is,
U+3000 becomes 0x0000, U+4000 becomes 0x1000, and so on. To compensate
for the downwards shift, the general script and symbol characters in
U+0000..U+2FFF are shifted upwards by 0x7000. Therefore, U+0100 becomes
0x7100, U+2000 becomes 0x9000, and so on. All other code points
(U+A000..U+10FFFF) are unchanged.
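The shift described above can be sketched in a few lines of Python. This is a minimal illustration under the ranges stated in the text; the function names shift and unshift are chosen here and do not appear in the draft.

```python
def shift(n):
    """Map a Unicode code point to its ACE37 code-block-shifted value."""
    if n <= 0x2FFF:          # general scripts and symbols move up
        return n + 0x7000
    if n <= 0x9FFF:          # CJK block U+3000..U+9FFF moves down
        return n - 0x3000
    return n                 # U+A000..U+10FFFF are unchanged


def unshift(n):
    """Reverse the shift: map a shifted value back to a code point."""
    if n <= 0x6FFF:
        return n + 0x3000
    if n <= 0x9FFF:
        return n - 0x7000
    return n


print(hex(shift(0x0100)))   # 0x7100
print(hex(shift(0x4000)))   # 0x1000
print(hex(unshift(0x9000)))  # 0x2000
```

The two ranges simply swap places, so the mapping is a bijection on U+0000..U+10FFFF and needs no lookup table.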

Original Unicode Allocation           | ACE37 Code Block Shifted
--------------------------------------|----------------------------
U+0000..U+2FFF   General Scripts,     | 0x7000..0x9FFF
                 Symbols              | (shifted up by 0x7000)
U+3000..U+9FFF   CJK Misc,            | 0x0000..0x6FFF
                 CJK Ideographs       | (shifted down by 0x3000)
U+A000..U+10FFFF Hangul and all other | 0xA000..0x10FFFF
                 code points          | (unchanged)

This shifting effectively moves the entire Han library to within
0x6FFF, so any Han character can be represented in 15 bits, or exactly
3 base-32 characters (details on base-32 characters are in Section 3).

For example, the Chinese character with the original Unicode code
point U+8F49 is shifted to 0x5F49 and can be represented in 3 quintets,
and in turn with 3 base-32 characters:

   Unicode Code Point:      U+8F49
   ACE37 Shifted:           0x5F49
   Corresponding Quartets:  0101 1111 0100 1001
   Resulting Quintets:      10111 11010 01001
   Base-32:                 nq9 (further discussed in Section 3)

This in turn means that any Chinese character can be represented with
3 base-32 characters, making the total possible characters within a
label at least 21, even without the further compression introduced by
the XOR differential process (Section 6).

The ACE37 code block shifting process can be described as follows:

   for each input code point = n
      if n <= 0x9FFF
         n = n - 0x3000       /* downwards shifting */
         if n < 0
            n = 0xA000 + n    /* compensation for U+0000..U+2FFF */

The character block shifting introduced here is extremely simple and
uses a calculation that requires no mapping function. At the same time,
it adjusts the Unicode allocation so that it becomes more ACE friendly.

3. Base-32 Characters

Base-32 characters are used in LACE for compression, while DUDE-02 and
the AMC series only utilize them for quartet flagging to indicate the
last quartet of each encoded code point. ACE37 utilizes base-32
characters for compression, while base-4 characters, which are
introduced in Section 4, determine the compressed code point brackets.

The following table shows the 32 base-32 characters and their
corresponding quintets:

   Base-32 Character =to= Corresponding Quintet

   0 = 00000    8 = 01000    g = 10000    o = 11000
   1 = 00001    9 = 01001    h = 10001    p = 11001
   2 = 00010    a = 01010    i = 10010    q = 11010
   3 = 00011    b = 01011    j = 10011    r = 11011
   4 = 00100    c = 01100    k = 10100    s = 11100
   5 = 00101    d = 01101    l = 10101    t = 11101
   6 = 00110    e = 01110    m = 10110    u = 11110
   7 = 00111    f = 01111    n = 10111    v = 11111

With this layout of base-32 characters, it is also possible to
implement a computation-based base-32 conversion instead of having to
resort to mapping and lookup tables:

   For each quintet = q
      if q <= 0x0F
         then hex dump q to form base-32 character
      if 0x10 <= q <= 0x1F
         then q = q - 0x10
         and char(q + 0x67) to form base-32 character

Note that 0x67 is the code value for the letter "g". For example, if
the quintet is 0b10001, its base-32 character can be obtained by:

   0x10 <= q = 0b10001 = 0x11 <= 0x1F
   therefore q = q - 0x10 = 0x11 - 0x10 = 0x01
   and base-32 character = char(0x01 + 0x67) = char(0x68) = "h"

4. Base-4 Characters

ACE37 goes beyond the 32 base-32 characters to include the remaining 4
letters of the alphabet, {w,x,y,z}. These base-4 characters enable
ACE37 to better utilize the existing "resources" (the allowed
characters) to represent IDN character information, therefore making
its encoding more efficient.
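Returning briefly to the computed base-32 conversion of Section 3, the calculation can be sketched in Python as follows. The function names are chosen here for illustration; the inverse direction (character to quintet) is not spelled out in the text and is an assumption derived from the same table.

```python
def quintet_to_char(q):
    """Compute the base-32 character for a quintet without a lookup table."""
    if not 0 <= q <= 0x1F:
        raise ValueError("a quintet must be in 0..31")
    if q <= 0x0F:
        return format(q, "x")          # hex digit: 0-9, a-f
    return chr(q - 0x10 + 0x67)        # 0x67 is "g", so 0x10..0x1F -> g..v


def char_to_quintet(c):
    """Inverse mapping (assumed): base-32 character back to its quintet."""
    c = c.lower()
    if len(c) == 1 and c in "0123456789abcdef":
        return int(c, 16)
    if len(c) == 1 and "g" <= c <= "v":
        return ord(c) - 0x67 + 0x10
    raise ValueError("not a base-32 character")


print(quintet_to_char(0b10001))                        # h
print("".join(quintet_to_char(q) for q in range(32)))
```

The second print reproduces the full base-32 alphabet in table order, which is a quick way to check the arithmetic against the table.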

The set of base-4 characters is {w,x,y,z}, and they are used to
represent the following duplets (duplets are groups of 2 bits):

   Base-4 Character =to= Corresponding Duplet

   w = 00
   x = 01
   y = 10
   z = 11

4.1 Base-4 Indicators

Base-4 characters, while carrying character information, also double
as indicators for code point brackets. In DUDE-02, an extra bit is
prepended to each quartet; the last quartet of each encoded code point
is prepended with "0", marking the end of the code point. In ACE37,
base-4 characters determine the length (number of ACE37 characters) of
the encoded code point. To be more precise, the encoded bits are in
fact the "diff" and not the code point itself (diff carries the same
meaning as in DUDE-02 and is further discussed in Sections 6 & 7).

The following table explains how base-4 characters are combined with
base-32 characters to form a representation of a diff (key: [b4] = one
base-4 character, [b32] = one base-32 character):

diff value               |bits| ACE37 Form
-------------------------|----|----------------------------
diff<=0x7F               |  7 | [b4][b32]
0x80<=diff<=0x7FFF       | 15 | [b32][b32][b32]
0x8000<=diff<=0x1FFFF    | 17 | w[b4][b32][b32][b32]
0x20000<=diff<=0xFFFFF   | 20 | ww[b32][b32][b32][b32]
0x100000<=diff<=0x10FFFF | 22 | [b4]w[b32][b32][b32][b32]

Note that the "bits" column gives the maximum number of significant
bits for the given diff value. For example, when diff<=0x7F the
maximum value is 0b1111111, so the number of significant bits is 7.

Note also that to encode a 17-bit diff, the letter "w" is used as an
indicator to distinguish the sequence from the 7-bit diff, where a
base-32 character is expected to follow a base-4 character. Since "w"
represents "00", which carries no value, it is never used as the
base-4 character of a 17-bit diff (a leading "00" would mean that there
are only 15 significant bits, in which case the 15-bit form should be
used). This is the case for the 20-bit form as well.
The "w" is used as an arbitrary indicator in the 22-bit form and MUST
be discarded during decoding.

By analyzing the ACE37 form, an encoded string can be returned to its
original form. There is no overlap between forms, and the form can be
determined precisely. The following 5 rules dictate the 5 different
ACE37 forms:

(1) Encode: if diff<=0x7F
    Decode: if first character is [b4] AND next character is NOT [b4]
    Then it MUST be in 7-bit form: [b4][b32]

(2) Encode: if 0x80<=diff<=0x7FFF
    Decode: if first character is [b32]
    Then it MUST be in 15-bit form: [b32][b32][b32]

(3) Encode: if 0x8000<=diff<=0x1FFFF
    Decode: if first character is "w" AND next character is [b4]
            AND NOT "w"
    Then it MUST be in 17-bit form: w[b4][b32][b32][b32]

(4) Encode: if 0x20000<=diff<=0xFFFFF
    Decode: if first character is "w" AND next character is "w"
    Then it MUST be in 20-bit form: ww[b32][b32][b32][b32]

(5) Encode: if 0x100000<=diff<=0x10FFFF
    Decode: if first character is [b4] AND NOT "w"
            AND next character is "w"
    Then it MUST be in 22-bit form: [b4]w[b32][b32][b32][b32]

Note that the ACE37 scheme can encode a diff of up to 22 significant
bits, or 0x3FFFFF. Unicode code points are expected to range only
between 0x0000..0x10FFFF, therefore ACE37 is able to handle any
Unicode code point.

Additionally, base-4 characters (and sometimes base-32 characters) can
be used for mixed-case annotation. This optional mixed-case annotation
mechanism is discussed in Appendix B.

4.2 First Code Point Considerations

There are additional considerations for the first code point that is
encoded or decoded, to ensure that a first code point within the first
Unicode plane (U+0000..U+FFFF) does not occupy more than 4 ACE37
characters. This special consideration affects only Rules (1), (3) and
(4) explained in Section 4.1. Rule (1) is discarded for the first code
point, therefore any diff up to 0x7FFF will be in the form
[b32][b32][b32]. The form for Rule (3) becomes simply
[b4][b32][b32][b32] without the "w" indicator.
Similarly, the form for Rule (4) becomes w[b32][b32][b32][b32], with
one less "w".

The first code point considerations can be summarized in the following
4 rules:

(a) Encode: if diff<=0x7FFF
    Decode: if first character is [b32]
    Then it MUST be in 15-bit form: [b32][b32][b32]

(b) Encode: if 0x8000<=diff<=0x1FFFF
    Decode: if first character is [b4] AND NOT "w"
    Then it MUST be in 17-bit form: [b4][b32][b32][b32]

(c) Encode: if 0x20000<=diff<=0xFFFFF
    Decode: if first character is "w"
    Then it MUST be in 20-bit form: w[b32][b32][b32][b32]

(d) Encode & Decode: same as Rule (5) in Section 4.1

Besides the special considerations for base-4 character usage, the
setting of prev is also treated specially for the first code point. As
laid out in Section 6, prev is evaluated in order to detect the first
code point. If prev = 0x00, it is assumed that this is the first code
point, as 0x00 SHOULD NOT be a permitted character for input.

When an LDH is the first code point, a special consideration is
needed. Normally, if n = LDH is encountered (Section 5), it is output
as "-n" and prev is not changed. However, if the first code point is an
LDH, then after outputting "-n", prev is updated to the code block
shifted value of lowercase(n), as in the decoding pseudo code of
Section 7. This ensures that only the first code point coming in will
have prev = 0x00.

5. LDH Considerations

Finally, the 37th character of the entire LDH repertoire, the hyphen,
is used to indicate LDH exceptions.

Extending the hyphen consideration of DUDE-02, ACE37 gives special
consideration to the entire LDH repertoire. All LDH characters are
encoded "as is" with the addition of a leading hyphen. For example,
the character "a" is encoded within ACE37 as "-a". The hyphen
character "-" is encoded as "--".

This ensures that each LDH character takes up only 2 character spaces
within an ACE37 encoded string, and also allows administrators to see
the actual characters, similar to the AMC series.
Unlike the AMC series, however, the hyphen is not used to indicate an
ongoing mode change, but applies only to the following character. This
retains the simplicity of the DUDE-02 single-mode, single-pass
philosophy.

6. Encoding Procedure

Similar to DUDE, all ordering of bits and quartets is big-endian. The
following describes the encoding procedure (the diff forms are those
of Section 4: Base-4 Characters):

   Set initial value for prev = 0x00
   for each input code point = n
     if n is an LDH {A-Z, a-z, 0-9, -}
        output "-n"                  (Section 5: LDH Considerations)
        if prev = 0x00               (Section 4.2: First Code Point)
           let prev = code block shifted lowercase(n)
     else
        perform code block shifting  (Section 2: Code Block Shifting)
        let diff = prev XOR n        (n after code block shifting)
        if diff<=0x7F
           and if this is the first code point (Section 4.2)
              then output 15-bit form: [b32][b32][b32]
           else, output 7-bit form: [b4][b32]
        if 0x80<=diff<=0x7FFF
           output 15-bit form: [b32][b32][b32]
        if 0x8000<=diff<=0x1FFFF
           and if this is the first code point (Section 4.2)
              output 17-bit form: [b4][b32][b32][b32]
           else, output 17-bit form: w[b4][b32][b32][b32]
        if 0x20000<=diff<=0xFFFFF
           and if this is the first code point (Section 4.2)
              output 20-bit form: w[b32][b32][b32][b32]
           else, output 20-bit form: ww[b32][b32][b32][b32]
        if 0x100000<=diff<=0x10FFFF
           output 22-bit form: [b4]w[b32][b32][b32][b32]
        let prev = n
   end and obtain next n and return to: "for each input code point = n"

The following is a more comprehensive pseudo code:

   let prev = 0x00
   for each input integer n (in order) do begin
     if n = "-" or "0..9" or "A..Z" or "a..z"
       then output "hyphen"+"char(n)"
       if prev = 0x00
         let prev = code block shifted lowercase(n)
     else begin
       if n = 0x00 then error and abort
       if n <= 0x9FFF
         n = n - 0x3000
         if n < 0 then n = 0xA000 + n
       let diff = prev XOR n
       if diff <= 0x7F
         if prev = 0x00
           then output with 3 base-32 characters
         else, output first 2 bits with a base-4 character {wxyz}
           and remaining 5 bits with 1 base-32 character
       if 0x80 <= diff <= 0x7FFF
         then output all 15 bits with base-32 characters
       if 0x8000 <= diff <= 0x1FFFF
         if prev = 0x00
           then output first 2 bits with a base-4 {xyz} (except w)
           and output remaining 15 bits with base-32
         else, output "w"
           and output first 2 bits with a base-4 {xyz} (except w)
           and output remaining 15 bits with base-32
       if 0x20000 <= diff <= 0xFFFFF
         if prev = 0x00
           then output "w"
         else, output "ww"
         and output all 20 bits with base-32 characters
       if 0x100000 <= diff <= 0x10FFFF
         then output first 2 bits with a base-4 {xyz} (except w)
         and output "w"
         and output remaining 20 bits with base-32
       let prev = n
     end
   end

Nameprep [Nameprep] is not discussed in this document, but it is
expected that it will be implemented for IDN. Regardless of the code
point presented, an encoder MUST NOT produce an incorrect output. The
encoder must fail if it encounters a negative input value.

The initial value used is 0x00 so that all domains beginning with a
CJK ideograph or a character within row 0 (U+0000..U+0FFF) will be
shorter. Note that after the code block shifting (Section 2), the
entire Han library is within 0x0000..0x6FFF, while row 0 is fitted to
0x7000..0x7FFF. Therefore, by using an initial value of 0x00, the diff
for all Han and row 0 characters will be at most 0x7FFF. The initial
value is also used as a check point for the first code point
considerations (Section 4.2).

Additionally, an optional mixed-case annotation mechanism is discussed
in Appendix B.

7. Decoding Procedure

A thorough description of the decoding rules, except for the final
reversal of the code block shifting, has been presented in Sections
4.1 and 4.2.
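
To make the rules of Sections 4.1 and 4.2 and the encoding procedure of Section 6 concrete, the following Python sketch implements both directions and can be used for round-trip checks. It is an illustration under stated assumptions, not a reference implementation: all function names are chosen here, the first code point is detected by prev = 0x00 as in Section 4.2, a leading LDH sets prev to its code block shifted lowercase value (the convention used by the decoding pseudo code), and the final re-encode-and-compare check is omitted.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"   # Section 3
B4 = "wxyz"                                # Section 4
LDH = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-")

def shift(n):
    """Code block shifting (Section 2)."""
    if n <= 0x2FFF:
        return n + 0x7000
    return n - 0x3000 if n <= 0x9FFF else n

def unshift(n):
    """Reverse code block shifting."""
    if n <= 0x6FFF:
        return n + 0x3000
    return n - 0x7000 if n <= 0x9FFF else n

def b32(value, count):
    """Render value as `count` big-endian quintets."""
    return "".join(B32[(value >> (5 * k)) & 31] for k in reversed(range(count)))

def encode(code_points):
    out, prev = [], 0
    for cp in code_points:
        if not 0 < cp <= 0x10FFFF:
            raise ValueError("code point out of range")
        if chr(cp) in LDH:                        # Section 5: emit "-n"
            out.append("-" + chr(cp))
            if prev == 0:                         # LDH as first code point
                prev = shift(ord(chr(cp).lower()))
            continue
        n = shift(cp)
        diff, first = prev ^ n, prev == 0
        if diff <= 0x7F and not first:            # 7-bit form
            out.append(B4[diff >> 5] + B32[diff & 31])
        elif diff <= 0x7FFF:                      # 15-bit form
            out.append(b32(diff, 3))
        elif diff <= 0x1FFFF:                     # 17-bit form
            out.append(("" if first else "w") + B4[diff >> 15] + b32(diff & 0x7FFF, 3))
        elif diff <= 0xFFFFF:                     # 20-bit form
            out.append(("w" if first else "ww") + b32(diff, 4))
        else:                                     # 22-bit form
            out.append(B4[diff >> 20] + "w" + b32(diff & 0xFFFFF, 4))
        prev = n
    return "".join(out)

def decode(text):
    out, prev, i = [], 0, 0

    def take(count):
        """Consume `count` base-32 characters and return their value."""
        nonlocal i
        chunk = text[i:i + count].lower()
        if len(chunk) < count or any(c not in B32 for c in chunk):
            raise ValueError("malformed ACE37 string")
        i += count
        value = 0
        for c in chunk:
            value = (value << 5) | B32.index(c)
        return value

    while i < len(text):
        c, nxt, first = text[i].lower(), text[i + 1:i + 2].lower(), prev == 0
        if c == "-":                              # LDH literal follows
            ch = text[i + 1:i + 2]
            if ch not in LDH:
                raise ValueError("malformed ACE37 string")
            out.append(ord(ch))
            i += 2
            if first:
                prev = shift(ord(ch.lower()))
            continue
        if c in B32:                              # 15-bit form
            diff = take(3)
        elif c == "w" and first:                  # first code point, 20-bit
            i += 1
            diff = take(4)
        elif c == "w" and nxt == "w":             # 20-bit form
            i += 2
            diff = take(4)
        elif c == "w" and nxt in ("x", "y", "z"):  # 17-bit form
            i += 2
            diff = (B4.index(nxt) << 15) | take(3)
        elif c == "w":                            # 7-bit form, duplet 00
            i += 1
            diff = take(1)
        elif c in "xyz" and nxt == "w":           # 22-bit form
            i += 2
            diff = (B4.index(c) << 20) | take(4)
        elif c in "xyz":                          # first: 17-bit, else 7-bit
            i += 1
            diff = (B4.index(c) << (15 if first else 5)) | take(3 if first else 1)
        else:
            raise ValueError("malformed ACE37 string")
        prev ^= diff
        out.append(unshift(prev))
    return out
```

With the code points of example (E) in Section 8 as input, encode returns "06hw4zmyv-d-ewnwox3", and decode applied to that string returns the original code points. One design point worth noting: because the first code point is detected purely by prev = 0x00, a code point that shifts to 0x0000 (U+3000) resets that state in both directions, so the encoder and decoder stay consistent with each other.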
The following description is a brief representation of the decoding
procedure:

   let prev = 0x00
   while the input string is not exhausted
     if present character = hyphen       (Section 5: LDH
       discard and output next character  Considerations)
     else,
       depending on the presented form   (Section 4)
       convert into duplets and quintets (Sections 4 & 3)
       and concatenate to form diff
       let prev = prev XOR diff
       reverse code block shifting:      (Section 2)
       if prev<=0x9FFF
         and if prev<=0x6FFF
           output character = prev + 0x3000
         else, output character = prev - 0x7000
       else output character = prev
       output character
   End

The following is a more comprehensive pseudo code for the decoding
procedure:

   let prev = 0x00
   while the input string is not exhausted do begin
     if present character = hyphen    /*Section 5:LDH Considerations*/
       then consume and discard hyphen
       and obtain the next character and output character
       if prev = 0x00                 /*Section 4.2:First Code Point*/
         let prev = code block shifted lowercase output character
     else,
       if present character = Base-32 characters (0..v)
         consume present character and next 2 characters
         and convert them to quintets according to Base-32
         concatenate the resulting quintets to form diff
                                  /*15 bit form, 0x80<=diff<=0x7FFF*/
       if present character = Base-4 characters {xyz} and NOT w
         consume present character
         and convert it to a duplet according to Base-4
         if prev = 0x00
           obtain and consume next 3 characters
           and convert them to quintets according to Base-32
           concatenate duplet with the 3 quintets to form diff
              /*first code point: 17 bit form, 0x8000<=diff<=0x1FFFF*/
         else,
           if next character = Base-32 character (0..v)
             then consume and convert to quintet according to Base-32
             concatenate duplet with the quintet to form diff
                                           /*7 bit form, diff<=0x7F*/
           else, obtain next character
             if next character = Base-4 characters {xyz} and NOT w
               then fail and indicate error
             else, if next character = w
               then consume and discard w and obtain next 4 characters
               consume and convert characters to quintets
               according to Base-32
               concatenate duplet with the 4 quintets to form diff
                          /*22 bit form, 0x100000<=diff<=0x10FFFF*/
       if present character = w
         discard "w" and obtain next character
         if prev = 0x00
           obtain and consume next 4 characters
           and convert characters to quintets based on Base-32
           concatenate the 4 quintets to form diff
                                 /*first code point: 20 bit form,*/
                                 /*0x20000<=diff<=0xFFFFF        */
         else, if next character = Base-4 characters {xyz} and NOT w
           consume and convert to duplet according to Base-4
           and obtain and consume next 3 characters
           and convert to quintets according to Base-32
           concatenate duplet with the 3 quintets to form diff
                              /*17 bit form, 0x8000<=diff<=0x1FFFF*/
         else, if next character = w
           then consume and discard w
           and obtain and consume next 4 characters
           and convert to quintets according to Base-32
           concatenate the 4 quintets to form diff
                             /*20 bit form, 0x20000<=diff<=0xFFFFF*/
         else, if next character = Base-32 character (0..v)
           then convert to quintet according to Base-32
           set quintet to diff             /*7 bit form, diff<=0x7F*/
       fail upon encountering a non-ACE37 character or end-of-input
       let prev = prev XOR diff
       if prev <= 0x9FFF               /*reversal of the code    */
         and if prev <= 0x6FFF         /*block shifting described*/
           output = prev + 0x3000      /*in Section 2            */
         else, output = prev - 0x7000
       else, output prev
     end
   end
   encode the output sequence and compare it to the input string
   fail if they do not match (case insensitively)

8. Examples

ACE37 is likely to be implemented with an ACE prefix in the form
"xx--". The actual prefix to be used is not discussed in this
document. The following examples are taken from the mailing list as
well as from DUDE-02 and the AMC series.
The resulting ACE37 string is compared with that using DUDE:

(A) JPNIC (the registry of .jp domain)
Unicode: U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3
         U+30C8 U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9
         U+30E1 U+30FC U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF
         U+30FC
ACE37:   i9urut6hm8jfaqv0m9dv1wewbx7wjyjwbynx6zsy8wtybygwky8y8ycy3
         (57 char)
DUDE-02: (error: result string exceeds 59 characters*)

* Note: 59 characters is the maximum allowable when the ACE prefix
  "xx--" is included

(B) A health-insurance organization in Tokyo
Unicode: U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3
         U+30B9 U+7523 U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44
         U+5408
ACE37:   drhaetvihk1o67ka44y9xfzahcqv2e6883micbaud7apuqac (48 char)
DUDE-02: (error: result string exceeds 59 characters)

(C) 6 hangul syllables
Unicode: U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC
ACE37:   xg9orfsqssvfg3i8t2c (19 char)
DUDE-02: 6txiy79ny53nz79a8wizwwn (23 char)

(D) maji[de]koi[suru]5[byou][mae] (Latin, hiragana, kanji)
Unicode: U+006D U+0061 U+006A U+0069 U+3067 U+006B U+006F U+0069
         U+3059 U+308B U+0035 U+79D2 U+524D
ACE37:   -m-a-j-is0a-k-o-ixu06i-5iapqsv (30 char)
DUDE-02: pnmdvssqvssnegvsva7cvs5qz38hu53r (32 char)

(E) [pafii]de[runba] (Latin, katakana)
Unicode: U+30D1 U+30D5 U+30A3 U+30FC U+0064 U+0065 U+30EB U+30F3
         U+30D0
ACE37:   06hw4zmyv-d-ewnwox3 (19 char)
DUDE-02: vs5bezgxrvs3ibvs2qtiud (22 char)

(F) [sono][supiido][de] (hiragana, katakana)
Unicode: U+305D U+306E U+30B9 U+30D4 U+30FC U+30C9 U+3067
ACE37:   02txj06nzdx8xl05e (17 char)
DUDE-02: vsvpvd7hypuivf4q (16 char)

(G) 2 Arbitrary Plane Two Code Points
Unicode: U+261AF U+261BF
ACE37:   w4odfwg (7 char)
DUDE-02: uyt6rta (7 char)

(H) Czech: Pro[ccaron]prost[ecaron]nemluv[iacute][ccaron]esky
Unicode: U+0050 U+0072 U+006F U+010D U+0070 U+0072 U+006F U+0073
         U+0074 U+011B U+006E U+0065 U+006D U+006C U+0075 U+0076
         U+00ED U+010D U+0065 U+0073 U+006B U+0079
ACE37:   -p-r-o0bt-p-r-o-s-twm-n-e-m-l-u-v0fm0f0-e-s-k-y (47 char)
DUDE-02: vauctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (44 char)

(I) Chinese
Unicode: U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D
         U+6587
ACE37:   7mmfm7oh3n7is3ts5gh57h47ata (27 char)
DUDE-02: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k (36 char)

9. Summary & Comparisons

In summary, ACE37 is based on the DUDE-02 process with an improved
compression scheme for code point sequences that are less likely to
cluster closely together, such as CJK ideographs.

The design team has indicated that 30 characters should generally be
sufficient. There is considerable concern from the Asian community
that 14-15 characters is too limiting, and little indication from the
Latin community that length is a real concern. ACE37 has therefore set
its objective to raise the possible number of characters in a
worst-case scenario to around 20.

ACE37 has succeeded in creating a very simple variation based on the
primary ACEs identified by the design team, one that achieves
dramatically better performance for CJK characters while maintaining
the simplicity of DUDE.

Key Improvements of ACE37 over DUDE-02
- much more spacious for Han characters. Improved worst-case scenario
  to 21 Han ideographs by introducing code block shifting and fully
  utilizing base-32 characters
- no need to arbitrarily prepend flagging bits to identify code point
  brackets.
  Instead, base-4 characters and diff forms are used
- base-32 and base-4 characters can be easily computed instead of
  mapped using lookup tables

Key Improvements of ACE37 over the AMC series
- a simpler process, utilizing the one-pass differential mechanism
  from DUDE-02
- a much simpler code block shifting process is used in ACE37 to
  achieve a goal similar to that of the complex multiple reference
  point system used by the AMC series
- base-32 and base-4 characters can be easily computed instead of
  mapped using lookup tables

Key Improvements of ACE37 over LACE
- a simpler process, utilizing the one-pass differential mechanism
  from DUDE-02
- much more spacious for Han characters. Improved worst-case scenario
  to 21 Han ideographs by introducing code block shifting and fully
  utilizing base-32 characters
- base-32 and base-4 characters can be easily computed instead of
  mapped using lookup tables

Two Excel spreadsheets for ACE37 encoding and decoding can be found at
http://www.dnsii.org/ace37/ace37-encode.xls and
http://www.dnsii.org/ace37/ace37-decode.xls respectively. They
illustrate the simplicity of ACE37 and provide a handy tool for
checking ACE37 encoding and decoding algorithms. The ACE37-encode
spreadsheet also includes a DUDE-encode worksheet.

10. Security Considerations

This document does not discuss DNS security issues, and it is believed
that the proposal does not introduce security problems beyond those
that already exist or are anticipated from adding multilingual
characters to the DNS and/or using ACE.

11. References

[AMC-W]    Adam M. Costello, "AMC-ACE-W version 0.1.0", May 31, 2001.

[AMC-V]    Adam M. Costello, "AMC-ACE-V version 0.1.0", May 31, 2001.

[DUDE-02]  Mark Welter, Brian W. Spolarich & Adam M. Costello,
           "Differential Unicode Domain Encoding (DUDE)", June 7,
           2001.

[LACE]     Mark Davis, IBM & Paul Hoffman, IMC & VPNC, "LACE:
           Length-based ASCII Compatible Encoding for IDN", January 5,
           2001.

[Nameprep] Paul Hoffman, IMC & VPNC & Marc Blanchet, ViaGenie,
           "Preparation of Internationalized Host Names", February 24,
           2001.

Appendix A. Acknowledgements

The ACE37 draft is a combination of DUDE-02, the AMC series and LACE,
and takes into consideration the report of the ACE design team. The
authors would therefore like to thank the authors of DUDE-02 - Mark
Welter, Brian W. Spolarich & Adam M. Costello; the author of the AMC
series - Adam M. Costello; the authors of LACE - Mark Davis & Paul
Hoffman; and the ACE design team and its advisors - Adam M. Costello,
Paul Hoffman, Makoto Ishisone, David Laurence, Brian Spolarich, Rick
Wesson, Marc Blanchet, Patrik Faltstrom and Erik Nordmark for their
inspiration.

Appendix B. Mixed-case annotation

This section is taken from DUDE and modified for ACE37.

In order to use ACE37 to represent case-insensitive Unicode strings,
higher layers need to case-fold the Unicode strings prior to ACE37
encoding. The encoded string can, however, use mixed-case base-4
characters as an annotation telling how to convert the folded Unicode
string into a mixed-case Unicode string for display purposes.

Each Unicode code point (unless it is an LDH) is represented by a
sequence of base-4 and base-32 characters, the first of which is
usually a base-4 character, which is always a letter {wxyz} (as
opposed to a digit). If that letter is uppercase, it is a suggestion
that the Unicode character be mapped to uppercase (if possible); if
the letter is lowercase, it is a suggestion that the Unicode character
be mapped to lowercase (if possible).

If the code point is an LDH, for example "a", it will be represented
as "-a". To mark the case for an LDH, simply set the LDH to the
desired case following the "-".
For example, if an uppercase "A" is desired, the encoded form SHOULD
be "-A".

Note that a code point representation may contain no base-4 character
at all. That is the case for the 15-bit diff form. In this case, the
base-32 characters are used for the case suggestion (if possible), in
the same way as a base-4 character. However, there is a very remote
possibility that all 3 base-32 characters are digits. If this happens,
case unfolding will be aborted. Since case annotation is an optional
feature used for display purposes only, this is not considered a major
concern. Moreover, the chance of this happening is truly remote, at
only (32639/27)/1114109, or about 0.1%.

ACE37 encoders and decoders are not required to support these
annotations, and higher layers need not use them.

For example: In order to suggest that example (H) in Section 8:
"Examples" be displayed as:

Czech: Pro[Ccaron]prost[Ecaron]nemLUV[iacute][ccaron]esky

one could capitalize the ACE37 encoding as:

ACE37: -P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y (47 char)

Authors:

Edmon Chung
Neteka Inc.
2462 Yonge St. Toronto, Ontario, Canada M4P 2H5
edmon@neteka.com

David Leung
Neteka Inc.
2462 Yonge St. Toronto, Ontario, Canada M4P 2H5
david@neteka.com