INTERNET-DRAFT                                              Soobok Lee
draft-ietf-idn-lsb-ace-01.txt
Expires 2002-Jan-03                                        2001-Jul-03

           Improving ACE using code point reordering v1.0

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

Distribution of this document is unlimited. Please send comments to
the author or to the idn working group at idn@ops.ietf.org.

Abstract

This document describes code point reordering to improve ACE
compression algorithms. The reordering is based on character
frequency and character adjacency statistics for major character
sets. It can be implemented with simple character mapping tables
alone, without adding complexity to existing ACE algorithms.

When applied to DUDE and AMC-ACE-W, this reordering greatly improves
both ACEs' compression ratios for Hangul, Chinese, Vietnamese,
Katakana and European domains. Interestingly, reordered DUDE shows a
compression ratio better than or equal to that of both bare AMC-ACE-W
and reordered AMC-ACE-W.
Contents

   Differences from version 0.9
   Overview
   Hangul
   Basic Latin
   Extended Latin and Combining Diacritical Marks
   Unified Han
   Other character sets
   Modified Encoding procedure of DUDE implementation of this idea
   Modified Decoding procedure of DUDE implementation of this idea
   Modified Encoding and Decoding algorithms of AMC-ACE-W of this idea
   Example strings
   Security considerations
   References
   Author
   LDUDE: Example implementation into DUDE-02
   LAMCW: Example implementation into AMC-ACE-W

Differences from version 0.9

Version 1.0 differs from version 0.9 in four respects:

   1) For Hangul, it no longer uses Hangul jamo frequency order.
      Instead, it adopts a new reordering based on both Hangul
      character frequency and adjacency in words used as business
      names in Korea.

   2) For Unified Han, as for Hangul, it adopts a new reordering
      table that reflects Han character frequency and adjacency in
      words used as business names in China/Taiwan/Japan.

   3) New support for Japanese katakana.

   4) An additional implementation of reordering with AMC-ACE-W.

Overview

Shorter ACE labels save memory and reduce Internet traffic in
application and core Internet protocols, even for domain names of
average length.

Together, the 11172 Hangul syllables and the 24000 or more CJK Han
characters occupy roughly half of the entire Unicode space. They are
ordered lexicographically, not by frequency, so the frequently used
characters are spread evenly throughout those wide ranges. This makes
various ACE compression techniques work poorly for them.

The 256 most frequent Hangul syllables have a cumulative frequency of
88.2%, and the top 512 reach 99.9%. The 256 most frequent Han letters
have a cumulative frequency of 58.2%, and the top 512, 1024, 2048 and
4096 reach 72.8%, 85.9%, 95.4% and 99.4%, respectively.
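To make the cost of widely spread code points concrete, the following sketch counts how many base-32 characters DUDE spends on one code point, following the encoding procedure given later in this document (one character per nybble of prev XOR n, with a minimum of one). The function name dude_chars_needed is hypothetical and is not part of ldude.c.

```c
/* Hypothetical helper, not part of ldude.c.
 * Returns the number of base-32 characters DUDE emits for code point n
 * when the previously encoded code point was prev: one character per
 * nybble of (prev XOR n), and at least one. */
static int dude_chars_needed(unsigned long prev, unsigned long n)
{
    unsigned long diff = prev ^ n;
    int k = 1;

    while ((diff >>= 4) != 0)
        ++k;
    return k;
}
```

In example K2, U+D55C followed by U+AD6D gives an XOR difference of 0x7831, so the second syllable costs four characters. If the two syllables are reordered to 0xB000 and 0xB001 (values inferred from HANGUL_REORDER_BASE and the first two entries of the frequency table in ldude.c, so treat them as an assumption), the difference is 1 and the second syllable costs one character.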
Even the Latin code range, including 'a' - 'z', follows a
lexicographical order that does not reflect the fact that 's', 't'
and 'r' are used more frequently than 'j', 'k' and 'h'.

Most ACE algorithms compress well when frequently used characters are
located in narrow code ranges. In particular, to reduce the DUDE XOR
distance, we can make the narrow area fit in aligned blocks of 16,
256 or 4096 code points.

Unified Han and Hangul

The 4096 most frequently used Traditional Chinese/Simplified Chinese/
Japanese Kanji letters are reordered into a single aligned block of
4096 code points. Their combinations are estimated to form almost 99%
of modern Chinese business names.

The 888 most frequently used Hangul syllables are reordered into the
lower portion of a single aligned block of 4096 code points. Their
combinations are expected to cover almost 99% of modern Hangul
business names.

Every block of 256 code points in these reordered areas is designed
to reflect not only character frequency order but also adjacency
preference, derived from statistics on major business category names
and famous regional names in East Asia.

For example, 'jeon-ja' (electronics) is a frequent Korean industrial
category name. In a purely frequency-oriented ordering, its two
component Hangul syllables 'jeon' and 'ja' would be placed far apart
from each other. In this adjacency-adjusted frequency ordering, they
are placed together in a single row (u+???0 ~ u+???f) in order to
reduce the XOR distance to a single quintet (for DUDE).

The Han/Hangul frequency mapping tables and their statistical data
are constructed from business names found in Internet directory sites
at {cn|tw|kr}.yahoo.com.

Japanese Katakana and Hiragana

Japanese hiragana (u+3040 ~ u+309f) shows a relatively even frequency
distribution in Japanese business names. It is often replaced with
its Kanji (Japanese Han letter) equivalent in registered business
names.
I have no mapping table for hiragana yet.

Japanese katakana (u+30a0 ~ u+30ff) is widely used to express foreign
or English words in Japanese. The cumulative frequency of the 10 most
frequent katakana is estimated to be around 40%.

The katakana frequency mapping table is constructed from business
names found in Internet directory sites such as www.yahoo.co.jp.

Basic Latin

The Basic Latin row u+0070 ~ u+007f contains 'p', 'r', 's', 't' and
'u', which are used more frequently in European nouns than '`', 'j',
'k', 'f' and 'g' in the u+0060 ~ u+006f row. That row also contains
the most frequently used letters, 'a' ~ 'o'.

If these two sets of 5 characters are swapped character-wise, 'p',
'r', 's', 't' and 'u' move into the u+0060 ~ u+006f row. Any
character sequence drawn only from this single aligned block of 16
code points has an XOR distance or code window length shorter than
0x10, which lets DUDE and other ACEs compress it well.

Extended Latin and Combining Diacritical Marks

The first 6 rows of Latin Extended-A (u+0100 ~ u+015f) are swapped
with 6 rows from Basic Latin and Latin-1 Supplement (u+0000 ~ u+002f
and u+0080 ~ u+00af). The first 3 rows of Combining Diacritical Marks
(u+0300 ~ u+032f) are swapped with 3 rows from Latin-1 Supplement
(u+00b0 ~ u+00df).

This moves the frequently used parts of Latin Extended-A and
Combining Diacritical Marks into the first aligned block of 256 code
points (u+0000 ~ u+00ff). Any character sequence drawn from this
single block has an XOR distance or code window length shorter than
0x100.

This improvement especially benefits East European languages and
Vietnamese, which use Latin Extended-A and Combining Diacritical
Marks.

Other character sets

For Arabic, Cyrillic, Hindi and others, we can devise frequency
mapping tables similar to the one for katakana.

Modified Encoding procedure of DUDE implementation of this idea

All ordering of nybbles and quintets is big-endian (most significant
first). A nybble is 4 bits. XOR is bitwise exclusive or.

This modification is hyphen-safe.
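As a rough illustration of the encoding procedure in this section, the sketch below applies only the Basic Latin swap before the XOR step. It is a simplified stand-in, not the ldude.c implementation: the names sketch_reorder and sketch_encode are hypothetical, the pairing p/`, r/j, s/k, t/f, u/g is an assumed reading of "character-wise", uppercase flags are ignored, and the base-32 alphabet is written as a string literal, which assumes an ASCII host.

```c
#include <stddef.h>

/* DUDE base-32 alphabet: a-k, m-n, p-z, 2-9 (assumes an ASCII host). */
static const char base32_chars[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Hypothetical reorder(): Basic Latin swap only.
 * The pairing follows the listed order and is an assumption. */
static unsigned long sketch_reorder(unsigned long n)
{
    static const unsigned char hi[5] = { 0x70, 0x72, 0x73, 0x74, 0x75 }; /* p r s t u */
    static const unsigned char lo[5] = { 0x60, 0x6A, 0x6B, 0x66, 0x67 }; /* ` j k f g */
    int i;

    for (i = 0; i < 5; ++i) {
        if (n == hi[i]) return lo[i];
        if (n == lo[i]) return hi[i];
    }
    return n;
}

/* Encodes len code points into out (capacity cap, including the NUL).
 * Returns the number of characters written, not counting the NUL,
 * or -1 if the output does not fit. */
static int sketch_encode(const unsigned long *in, size_t len,
                         int use_reorder, char *out, size_t cap)
{
    unsigned long prev = 0x60, n, diff, tmp;
    size_t pos = 0, i;
    int k, j;

    if (cap == 0) return -1;
    for (i = 0; i < len; ++i) {
        n = in[i];
        if (n == 0x2D) {                 /* hyphen-minus is literal */
            if (pos + 1 >= cap) return -1;
            out[pos++] = '-';
            continue;                    /* prev is left unchanged */
        }
        if (use_reorder) n = sketch_reorder(n);
        diff = prev ^ n;
        /* k = number of nybbles needed to hold diff, at least one */
        for (k = 1, tmp = diff >> 4; tmp != 0; tmp >>= 4) ++k;
        if (pos + (size_t)k >= cap) return -1;
        /* Most significant nybble first; all but the last get the 0x10 bit. */
        for (j = k - 1; j >= 0; --j) {
            unsigned nyb = (unsigned)((diff >> (4 * j)) & 0xF);
            out[pos++] = base32_chars[(j > 0 ? 0x10 : 0) | nyb];
        }
        prev = n;
    }
    out[pos] = '\0';
    return (int)pos;
}
```

For the input 'r', 's', 't' (0x72, 0x73, 0x74) this sketch produces four characters without the swap and three with it, because the swapped values all fall in the same row of 16 as the initial prev value 0x60.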
Hyphen encoding and decoding are not affected by this modification.

    let prev = 96
    for each input integer n (in order) do begin
      if n == 45 then output hyphen minus
      else begin
        n = reorder(n)   // ******** ADDED **********
        let diff = prev XOR n
        extract the least significant nybbles of diff, as few as are
          sufficient to hold all the nonzero bits (but at least one)
        prepend 0 to the last nybble and 1 to the rest
        output base-32 characters corresponding to the quintets
        let prev = n
      end
    end

The encoder must either correctly handle all integer values that can
be represented in the type of its input, or it must check whether the
input contains values that it cannot handle and return an error if
so. Under no circumstances may it produce incorrect output.

Modified Decoding procedure of DUDE implementation of this idea

    let prev = 96
    while the input string is not exhausted do begin
      if the next character is hyphen-minus then output 45
      else begin
        input characters and convert them to quintets until
          encountering a quintet beginning with 0
        fail upon encountering a non-base-32 character or end-of-input
        strip the first bit of each quintet
        concatenate the resulting nybbles to form diff
        let prev = prev XOR diff
        output restore_order(prev)   // ******** MODIFIED **********
      end
    end
    encode the output sequence and compare it to the input string
    fail if they are not equal

Modified Encoding and decoding algorithms of AMC-ACE-W for this idea

(This modification does not affect the literal mode of AMC-ACE-W.)

    procedure initialize(refpoint,style,literal):
      let refpoint[1..5] = (0xE0, 0xA0, 0, 0, 0x10000)
      let style = 0
      let literal = false

    procedure update(refpoint,style,n,k):
      # Update the active style and reference points based on
      # the latest code point (n) and the number of base-32
      # characters used to represent it (k).
      let style = k < 3 ? 0 : k > 3 ? 1 : style
      let refpoint[1] = (n >> 4) << 4
      if (k > 2) then let refpoint[2] =
        n is in 00A0..017F ? 0xA0 : (n >> 8) << 8
      if (k > 3) then let refpoint[3] =
        n is in 3000..9FFF ? 0x4E00 :
        style == 1 and n is in 0xA000..0xD7FF ? 0x8800 :
        (n >> 12) << 12

    procedure encode:
      constant maxdelta[0][1..5] = (0xF, 0xFF, 0xFFF, 0xFFFF, 0xFFFFF)
      constant maxdelta[1][2..5] = (     0xFF, 0x4FFF, 0xFFFF, 0xFFFFF)
      initialize(refpoint,style,literal)
      for each input code point n (in order) do begin
        # Check code point range to avoid array bounds errors later:
        if n is not in 0..10FFFF then fail
        if n == 0x2D then output two hyphen-minuses
        else if n represents an LDH character then begin
          # Letter/digit is encoded literally, so get into literal mode.
          if not literal then output hyphen-minus
          let literal = true
          output the character represented by n
        end
        else begin
          # Non-LDH code point is encoded in base-32.
          # Compute the number of base-32 characters to use:
          n = reorder(n)   // ADDED *************************
          for k = 1 + style to infinity do begin
            let delta = n - refpoint[k]
            if delta is in 0..maxdelta[style][k] then break
          end
          # Switch to base-32 mode if necessary:
          if literal then output hyphen-minus
          let literal = false
          # Check for the extended delta of style 1 window 3:
          if k == 3 and delta >= 0x1000 then
            represent (delta - 0x1000) in base 32 as three quintets
          else begin
            # Normal case, four bits per quintet:
            represent delta in base 16 as k quartets
            prepend 0 to the last quartet and 1 to each of the others
          end
          output a base-32 character corresponding to each quintet
          update(refpoint,style,n,k)
        end
      end

    procedure decode:
      initialize(refpoint,style,literal)
      while the input string is not exhausted do begin
        read the next character into c
        # Unpaired hyphen-minus toggles the mode:
        if c is hyphen-minus and the next character is not then
          read the next character into c and toggle literal
        # Double hyphen-minus represents 0x2D:
        if c is hyphen-minus then
          read the next character and append 0x2D to history
        else if literal then append the code point of c to history
        else begin
          # Decode a base-32 sequence.
          convert c to a quintet
          while a quintet beginning with 0 has not been seen do
            read and convert up to four more characters
          concatenate the lowest four bits of each quintet to form delta
          # Check for the extended delta of style 1 window 3:
          if style == 1 and there was only one quintet then begin
            read two characters and convert them to two more quintets
            concatenate delta and the two quintets to form a new delta
            let delta = delta + 0x1000
          end
          let k = the number of quintets decoded
          let n = refpoint[k] + delta
          update(refpoint,style,n,k)
          output restore_order(n)   // MODIFIED *****************
        end
      end
      # Enforce the uniqueness of the encoding:
      encode the output sequence and compare it to the input string
      fail if they are not equal

Example strings

These Hangul examples show an improvement of about 30%~58% in the
DUDE compression ratio. LDUDE and LAMCW denote DUDE-02 and AMC-ACE-W
with reordering applied, respectively. AMCW denotes AMC-ACE-W. In
most examples LDUDE outperforms LAMCW.

(K1) Korean String 1: ( 24 hangul syllables )

    u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4
    u+C774 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C
    u+B2E4 u+BA74 u+C5BC u+B9C8 u+B098 u+C88B

    DUDE-02 : 6txiy79ny53nz79a8wizwwnzzuavyizv3atuuiz2vby27jz66iz8sit\
              usauiyz5i23az96iz6ze3xaz2td ( 82 chars )
    LDUDE   : 5suhxb9jt2pydtwetwkxhtsrxhbyhvsmvvk7r2ityd6atqt8etvittk
              ( 55 chars, 33.9% shorter )
    AMCW    : 6tvifgem42ixihhakfnh6nhhem5wrk6fmpmpwim6m5wrmwxn5u8eivw\
              mp6iqige2nem ( 67 chars )
    LAMCW   : 5swhtg8r5tycsb5swfgirxi5sxhsabyg5vypgcz2isa5tyd4d5p5sxj\
              gmbgd5 ( 61 chars )

(K2) Korean String 2: ( 9 hangul syllables )

    U+D55C U+AD6D U+C778 U+D130 U+B137 U+C815 U+BCF4 U+C13C
    U+D130

    DUDE-02 : 7xvNz2vBy4tFtywIyssHz3uCzw8Bz76ItssN ( 36 chars )
    LDUDE   : 5syAB3BIJ7BB7NF ( 15 chars, 58.3% shorter )
    AMCW    : 7xxNFmpM52QjsGjzNaxJhwKj6Qjs ( 28 chars )
    LAMCW   : 5ssAsB3AIBwAB3PI ( 16 chars )

(K3) Korean String 3: ( 18 hangul syllables )

    U+C804 U+AD6D U+C2E4 U+C9C1 U+B178 U+C219 U+C790 U+B300
    U+CC45 U+C885 U+AD50 U+C2DC U+BBFC U+B2E8 U+CCB4
    U+D611 U+C758 U+D68C

    DUDE-02 : 62yEyxyJy92J5uFz25JzvyBx2Jzw3Az9wFw6Ayx7Fy92Nz3uA3tEz8\
              xNt44FttwJtt7E ( 68 chars )
    LDUDE   : 5szAtBtvBt7Mt2Qv4Qu7KtFt5It3MuEvAtvDyJCtuC4G4J
              ( 46 chars, 32% shorter )
    AMCW    : 62sEFmpKzeNqbGm2Ks3M6sG2aPcfNefFksKy6I96GziPfwRstM42Rwn
              ( 55 chars )
    LAMCW   : 5stAsB5tvAGhmGmgG2mGatsE5t7JGbhsDvD5tsAyIK5swJ8RwG
              ( 50 chars )

(K4) Korean String 4: ( 7 hangul syllables )

    U+D558 U+C774 U+B2C9 U+C2A4 U+BC18 U+B3C4 U+CCB4

    DUDE-02 : 7xvItuuNzx5PzsyPz85N97Nz9zA ( 27 chars )
    LDUDE   : 5s3C4F5Q7PtwRtMK ( 16 chars, 40% shorter )
    AMCW    : 7xxIM5wGyjKxeJa2G8ePfw ( 22 chars )
    LAMCW   : 5s9CxH8JvE5tzMyAK ( 17 chars )

(K5) Korean String 5: ( 13 hangul syllables )

    U+D658 U+ACBD U+C6B4 U+B3D9 U+C5F0 U+D569 U+BC18 U+D575
    U+D2B9 U+BCC4 U+C704 U+C6D0 U+D68C

    DUDE-02 : 7yvIz48Fy4sJzxyPzyuJts3Jy3zBy3yPz6Ny8zPz56At7EtsxN
              ( 50 chars )
    LDUDE   : 5s7NB4EDvHFtxDv5Kv6NtIt4R5GwK ( 29 chars, 42% shorter )
    AMCW    : 7yxIFf7MxwG83MrsRmjJa2RmxQx3JgeM2eMysRwn ( 40 chars )
    LAMCW   : 5s5N5PtJKuPI5tzMGybGiptF5s5KsNwG ( 32 chars )

About 35%~50% improvement in DUDE compression ratio is achieved in
these UniHan examples.
(TC1) Traditional Chinese String 1: ( 16 letters )

    u+5354 u+91c7 u+5065 u+5eb7 u+4e8b u+696d u+670d u+52d9
    u+7db2 u+002d u+5354 u+91c7 u+6709 u+9650 u+516c u+53f8

    DUDE-02 : xvve6u3d6t4c87ctsvnuz8g8yavx7eu9ym-u88g6u3d9y6q9txj6z\
              vnu3e ( 58 chars)
    LDUDE   : xs8qy7ny9jhyi6f6bb8h-4iy7nyxkbed
              ( 32 chars, 44.8% shorter)
    AMCW    : xvxen8huyfafzs2mc5pcipw7jh7u--xxen8hcijqcsvynx9i
              ( 48 chars )
    LAMCW   : xs2q2xcu4m4n6esb6abug--2q2xcusijpq ( 34 chars )

(TC2) Traditional Chinese String 2: ( 21 letters )

    u+5317 u+4eac u+5e02 u+91ab u+85e5 u+7d93 u+6fdf u+6280
    u+8853 u+7d93 u+71df u+516c u+53f8 u+5fa1 u+91ab u+7db2
    u+7d61 u+83ef u+91ab u+7db2 u+8def

    DUDE-02 : xvzht75mts4q694jtwwq92zgtuwn7xr847d9x6a6wnus5du3e6xj6\
              8sk86tj7d982qtuwe86tj9sxp ( 78 chars)
    LDUDE   : xtwicfz6b99a38g27c2vdd8cz7mzuqdt6izuiy6iz5nz5fy6by6ib
              ( 53 chars, 32.0% shorter)
    AMCW    : xvths4naacn7mj9fh6veq9beakuvh6ve89vynx9iapbn7mh7uyb2v\
              8rn7mh7um9r ( 64 chars )
    LAMCW   : xtuiukr28q5tqu9i4ukutjk9i3uduspqv6g28quug33kuur28quugh
              ( 54 chars )

(TC3) Traditional Chinese String 3: ( 18 letters )

    u+795e u+8fb2 u+7db2 u+990a u+8eab u+4fdd u+5065 u+7db2
    u+5065 u+5eb7 u+4e16 u+754c u+5065 u+5eb7 u+8a2d u+8a08
    u+5bb6 u+60e0

    DUDE-02 : z3vq9y8n9usa8w5itz4b6tzgt95iu77hu77h87cts4bv5xkuxuj87\
              c7w3kuf7t5qv5xg ( 68 chars )
    LDUDE   : xwsiw5e9kzyqz8fhb2p2phtvgxtbwuah8qbtwmyg
              ( 40 chars, 41.1% shorter )
    AMCW    : z3xqnpuh7uq2knfmt7puyfh7uuyfafzstgf4nuyfafzmbpsi75gys\
              8a ( 55 chars )
    LAMCW   : xwyiu7nug3wiu4pkmug4mnv3ky2mu4mnwcdvsiyq ( 40 chars )

(SC1) Simplified Chinese String 1 : ( 16 letters )

    u+4e2d u+534e u+4eba u+6c11 u+5171 u+548c u+56fd u+5bf9
    u+5916 u+8d38 u+6613 u+7ecf u+6d4e u+5408 u+4f5c u+90e8

    DUDE-02 : w8wpt7ydt79euu4mv7yax9puzb7seu8r7wuq85umt27ntv2bv3wgt\
              5xe795e ( 60 chars )
    LDUDE   : xswjuzru6nu7fv7kv4gutrwgb7mbwiu6cuzqqxm
              ( 39 chars, 35.0% shorter )
    AMCW    : w8up29ps5kdst5uh7ygsup29pm3cb39n8tknpb39hkygswhdysupa\
              qd ( 55 chars )
    LAMCW   : xsujwxgu3kwwrv3fwvduunykm5ab9jwvmuwfmta ( 39 chars )

(SC2) Simplified Chinese String 2 : ( 18 letters )
    u+4e2d u+56fd u+4eba u+6c11 u+5927 u+5b66 u+4e2d u+56fd
    u+8d22 u+653f u+91d1 u+878d u+653f u+7b56 u+7814 u+7a76
    u+4e2d u+5fc3

    DUDE-02 : w8wpt27at2whuu4mvxvguwbtxwmt27a757r82tp9w8qtyxn8u5ct\
              8yjvwcuycvwxmtt8q ( 69 chars )
    LDUDE   : xswjf5gu7fu6rb4ifz8dx6ju8gnu8kwugy8fd8rd
              ( 40 chars, 42.2% shorter )
    AMCW    : w8up29ps5kdst5uh7ygsup29pm3cb39n8tknpb39hkygswhdysupaqd
              ( 55 chars )
    LAMCW   : xsujun3kwwru2abujn36rwsgu8anwsg2uau6fgujk ( 41 chars )

About 20%~35% improvement in DUDE compression ratio is achieved in
these Japanese Kanji/Katakana examples.

(JP1) Japanese String 1: ( 25 letters )

    U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3
    U+30C8 U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9
    U+30E1 U+30FC U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF
    U+30FC

    DUDE-02 : z3xQu97Pv4vGuuyRu5xRu6Jxz8BQMuHtDxDMxHuGzNwItPwMxAtE\
              wIwIwNwD (60 chars)
    LDUDE   : xs8Nu2Cu4RvMGBysxGyCKtHtQCPFtAyPyKtPBGPyAyAyFyR
              ( 47 chars, 21.6% shorter)
    AMCW    : z3vQ28DDyxs5KB9fCjnvs6P6DI8R9N4RE9D7F4J8B9N5H8H9D5M9\
              D5R9N ( 57 chars )
    LAMCW   : xs2NwsQu4B3KNPvs6M4JD5E4KIFA5A7P5H4KMPA6A4A6F4K
              ( 47 chars )

(JP2) Japanese String 2: ( 15 letters )

    U+8CA1 U+56E3 U+6CD5 U+4EBA U+5317 U+6D77 U+9053 U+81EA
    U+7136 U+4FDD U+8B77 U+63A8 U+9032 U+5354 U+4F1A

    DUDE-02 : 266B74wCv4vGuuyRt74Pv8yA97uEtt5J9s7Nv88M6w4K827R9v3K\
              6vyGt6wQ (60 chars)
    LDUDE   : xs3Hu9Ju4RvMt5CFvuGvsRxtGw5Iz2Ev6BzIwtJE
              ( 40 chars, 33.3% shorter)
    AMCW    : 264B28DDyxs5KxtHD5zNuvI9kE3yt7PMmzBpiNtuxxEttK
              ( 46 chars )
    LAMCW   : xs9HwsQu4B3KvuIPwsMvsEytCu4K3uQy8R3Hu2QK ( 40 chars )

(JP3) Japanese String 3: ( 17 letters )

    U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3
    U+30B9 U+7523 U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44
    U+5408

    DUDE-02 : yztBu37P78xB9svIv29Ey22EwJuRyKwx3Kt6wQv3sI87CttyK734\
              H85vQu3wN (61 chars)
    LDUDE   : xttHxPvtFu9CDyssAyEyHyRys9PxQ4KHGEu4CuwJ
              ( 40 chars, 34.4% shorter)
    AMCW    : z3vQ28DDyxs5KB9fCjnvs6P6DI8R9N4RE9D7F4J8B9N5H8H9D5M9\
              D5R9N ( 57 chars )
    LAMCW   : xs2NwsQu4B3KNPvs6M4JD5E4KIFA5A7P5H4KMPA6A4A6F4K
              ( 47 chars )

LDUDE also shows the
same good compression ratio for the Latin family of scripts.

(L1) Vietnamese: ( 38 syllables using diacritical marks )

    Taisaohokhngthchi\
    noitingVit

    U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068
    u+006F u+0323 u+006B u+0068 u+00F4 u+006E u+0067 u+0074
    u+0068 u+00EA u+0309 u+0063 u+0068 u+0069 u+0309 u+006E
    u+006F u+0301 u+0069 u+0074 u+0069 u+00EA u+0301 u+006E
    u+0067 U+0056 u+0069 u+00EA u+0323 u+0074

    DUDE-02 : vEvfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
              yitptp2dv8mvyrjvBvr2dv6jvxh ( 82 chars )
    LDUDE   : uGuh5c5kckqhh5n4atm3n3ktmtdq2cxd7kmb7a7hb7q7irr2dxm7rt\
              muDvr2dvj5f (66 chars , 16 chars(19%) shorter)

(L2) Spanish: ( using basic Latin & Latin Supplement )

    PorqunopuedensimplementehablarenEspaol

    U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F
    u+0070 u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069
    u+006D u+0070 u+006C u+0065 u+006D u+0065 u+006E u+0074
    u+0065 u+0068 u+0061 u+0062 u+006C u+0061 u+0072 u+0065
    u+006E U+0045 u+0073 u+0070 u+0061 u+00F1 u+006F u+006C

    DUDE-02 : vAvrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmuMvgdt\
              b3a3qd (61 chars)
    LDUDE   : uAurftmtg2q2hbrhcbbmfcepnjiimidpjdqpmrmuMuqmb3a3qd
              (51 chars, 10 chars (16%) shorter)

(L3) Czech: (using Latin Extended A)

    Proprostnemluvesky

    U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073
    u+0074 u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076
    u+00ED u+010D u+0065 u+0073 u+006B u+0079

    DUDE-02 : vAuctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (45 chars)
    LDUDE   : uAukfycypkfepzpzfmibmtb3m8ayiqtik (34 chars, 24% shorter)

Security considerations

Reordered code points are restored to their original values during
ACE decoding, so this improvement does not introduce any new security
problems into ACE.

References

[DUDE02] Mark Welter, Brian Spolarich, Adam Costello, "DUDE:
Differential Unicode Domain Encoding", 2001-May-31,
draft-ietf-idn-dude-02.
[AMCACEW] Adam Costello, "AMC-ACE-W version 0.1.0", 2001-May-31,
draft-ietf-idn-amc-ace-w-00, latest version at
http://www.cs.berkeley.edu/~amc/charset/amc-ace-w.

[UNICODE] The Unicode Consortium, "The Unicode Standard",
http://www.unicode.org/unicode/standard/standard.html.

[IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host Names
In Applications (IDNA)", draft-ietf-idn-idna-01.

[NAMEPREP] Paul Hoffman, Marc Blanchet, "Preparation of
Internationalized Host Names", Feb 2001, draft-ietf-idn-nameprep-03.

Author

Soobok Lee
Postel Services, Inc.
http://www.postel.co.kr
Tel: +82-11-9774-2737

LDUDE: Example implementation into DUDE-02

This idea is applicable to any ACE. LDUDE is the name of the DUDE-02
implementation of this idea. The embedded Hangul, Han and Latin
frequency tables are subject to change with further studies in the
next revision of this draft.

On Unix, save this example source code as ldude.c:

    % cc -o ldude ldude.c
    % ./ldude -e < input_file > output_file
    % ./ldude -d < output_file

An input file should contain u+????-form code points delimited with
spaces or newlines.

/* begin of ldude.c */
/******************************************************/
/* ldude.c 1.0 (2001-Jul-3)                           */
/* Soobok Lee                                         */
/* dude.c from Adam M. Costello                       */
/******************************************************/

/* This is ANSI C code (C89) implementing */
/* DUDE (draft-ietf-idn-ldude-01).        */

/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include
#include

enum dude_status {
  dude_success,
  dude_bad_input,
  dude_big_output /* Output would exceed the space provided.
*/ }; enum case_sensitivity { case_sensitive, case_insensitive }; #if UINT_MAX >= 0x1FFFFF typedef unsigned int u_code_point; #else typedef unsigned long u_code_point; #endif enum dude_status dude_encode( unsigned int input_length, const u_code_point input[], const unsigned char uppercase_flags[], unsigned int *output_size, char output[] ); /* dude_encode() converts Unicode to DUDE (without any */ /* signature). The input must be represented as an array */ /* of Unicode code points (not code units; surrogate pairs */ /* are not allowed), and the output will be represented as */ /* null-terminated ASCII. The input_length is the number of code */ /* points in the input. The output_size is an in/out argument: */ /* the caller must pass in the maximum number of characters */ /* that may be output (including the terminating null), and on */ /* successful return it will contain the number of characters */ /* actually output (including the terminating null, so it will be */ /* one more than strlen() would return, which is why it is called */ /* output_size rather than output_length). The uppercase_flags */ /* array must hold input_length boolean values, where nonzero */ /* means the corresponding Unicode character should be forced */ /* to uppercase after being decoded, and zero means it is */ /* caseless or should be forced to lowercase. Alternatively, */ /* uppercase_flags may be a null pointer, which is equivalent */ /* to all zeros. The encoder always outputs lowercase base-32 */ /* characters except when nonzero values of uppercase_flags */ /* require otherwise. The return value may be any of the */ /* dude_status values defined above; if not dude_success, then */ /* output_size and output may contain garbage. On success, the */ /* encoder will never need to write an output_size greater than */ /* input_length*k+1 if all the input code points are less than 1 */ /* << (4*k), because of how the encoding is defined. 
*/ enum dude_status dude_decode( enum case_sensitivity case_sensitivity, char scratch_space[], const char input[], unsigned int *output_length, u_code_point output[], unsigned char uppercase_flags[] ); /* dude_decode() converts DUDE (without any signature) to */ /* Unicode. The input must be represented as null-terminated */ /* ASCII, and the output will be represented as an array of */ /* Unicode code points. The case_sensitivity argument influences */ /* the check on the well-formedness of the input string; it */ /* must be case_sensitive if case-sensitive comparisons are */ /* allowed on encoded strings, case_insensitive otherwise. */ /* The scratch_space must point to space at least as large */ /* as the input, which will get overwritten (this allows the */ /* decoder to avoid calling malloc()). The output_length is */ /* an in/out argument: the caller must pass in the maximum */ /* number of code points that may be output, and on successful */ /* return it will contain the actual number of code points */ /* output. The uppercase_flags array must have room for at */ /* least output_length values, or it may be a null pointer if */ /* the case information is not needed. A nonzero flag indicates */ /* that the corresponding Unicode character should be forced to */ /* uppercase by the caller, while zero means it is caseless or */ /* should be forced to lowercase. The return value may be any */ /* of the dude_status values defined above; if not dude_success, */ /* then output_length, output, and uppercase_flags may contain */ /* garbage. On success, the decoder will never need to write */ /* an output_length greater than the length of the input (not */ /* counting the null terminator), because of how the encoding is */ /* defined. 
*/ /**********************************************************/ /* Implementation (would normally go in its own .c file): */ #include /* Character utilities: */ /* base32[q] is the lowercase base-32 character representing */ /* the number q from the range 0 to 31. Note that we cannot */ /* use string literals for ASCII characters because an ANSI C */ /* compiler does not necessarily use ASCII. */ static const char base32[] = { 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, /* a-k */ 109, 110, /* m-n */ 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, /* p-z */ 50, 51, 52, 53, 54, 55, 56, 57 /* 2-9 */ }; /* base32_decode(c) returns the value of a base-32 character, in the */ /* range 0 to 31, or the constant base32_invalid if c is not a valid */ /* base-32 character. */ enum { base32_invalid = 32 }; static unsigned int base32_decode(char c) { if (c < 50) return base32_invalid; if (c <= 57) return c - 26; if (c < 97) c += 32; if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid; return c - 97 - (c > 108) - (c > 111); } /* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */ /* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */ /* then ASCII A-Z are considered equal to a-z respectively. 
*/ static int unequal( enum case_sensitivity case_sensitivity, const char s1[], const char s2[] ) { char c1, c2; if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0; for (;;) { c1 = *s1; c2 = *s2; if (c1 >= 65 && c1 <= 90) c1 += 32; if (c2 >= 65 && c2 <= 90) c2 += 32; if (c1 != c2) return 1; if (c1 == 0) return 0; ++s1, ++s2; } } /* LANGUAGE-SPECIFIC IMPROVEMENTS TO DUDE BASED ON CODE REORDERING */ int isHANGUL(u_code_point s) { int SIndex = s - 0xAC00; if (SIndex < 0 || SIndex >= 11172) { return 0; } return 1; }; int isUNIHAN(u_code_point s) { if (s >= 0x4E00 && s <= 0x9FAF) { return 1; } return 0; }; int isKATAKANA(u_code_point s) { if (s >= 0x30A0 && s <= 0x30FF) { return 1; } return 0; }; int isHINDI(u_code_point s) { if (s >= 0x0900 && s <= 0x0970) { return 1; } return 0; }; int isLatins(u_code_point s) { if (s < 0x370) { return 1; } return 0; }; // Most frequent 888 Hangeul syllables in Korean BizName #define HG 888 u_code_point hangeul_freq[HG] = { 0xd55c,0xad6d,0xd559,0xad50,0xb300,0xace0,0xb4f1,0xcd08, 0xc911,0xb824,0xd654,0xd604,0xc6d0,0xbb38,0xc721,0xbcd1, 0xc804,0xc790,0xae30,0xacf5,0xc0b0,0xc5c5,0xacc4,0xbb3c, 0xb958,0xc6b4,0xb3d9,0xcc28,0xc220,0xd56d,0xbd80,0xd68d, 0xac74,0xc124,0xcee8,0xd305,0xac15,0xc0dd,0xba85,0xc885, 0xd569,0xc601,0xb18d,0xbb34,0xc5ed,0xc5f0,0xb9f9,0xc120, 0xc11c,0xc6b8,0xbe44,0xc2dc,0xc2a4,0xd15c,0xd14d,0xd0dd, 0xc8fc,0xc2dd,0xd3ec,0xce20,0xbc30,0xb2ec,0xc368,0xaf43, 0xc815,0xbcf4,0xd1b5,0xc2e0,0xc0c1,0xc0ac,0xd68c,0xc138, 0xc6a9,0xd611,0xcd9c,0xd310,0xc9c4,0xb791,0xb9e4,0xd5d8, 0xb0b4,0xc154,0xc1fc,0xd551,0xb0a0,0xb110,0xb370,0xc774, 0xd648,0xb9c8,0xbc14,0xc624,0xc0bf,0xc9d0,0xc2ed,0xc548, 0xc18c,0xd504,0xd2b8,0xc6e8,0xbbf8,0xb514,0xc5b4,0xc544, 0xd53c,0xd30c,0xcf54,0xb9ac,0xceec,0xce7c,0xcf00,0xba54, 0xd22c,0xc740,0xd589,0xce74,0xb4dc,0xadf8,0xb8f9,0xb9b0, 0xc6d4,0xb79c,0xc5ec,0xc88b,0xace8,0xce90,0xb9bc,0xd578, 0xac1c,0xbc1c,0xc5d8,0xc9c0,0xae00,0xb85c,0xbc8c,0xc810, 
0xd574,0xd138,0xd0c8,0xd1a0,0xd3f0,0xc678,0xacfc,0xc694, 0xc778,0xb137,0xb2f7,0xd154,0xb808,0xcf64,0xcef4,0xd4e8, 0xd130,0xc5d4,0xd14c,0xbc45,0xd06c,0xc13c,0xb2e5,0xd0c0, 0xc7a5,0xc57d,0xd488,0xc81c,0xc194,0xb8e8,0xc158,0xbc29, 0xc1a1,0xc77c,0xd074,0xb7fd,0xb355,0xd615,0xd328,0xd3c9, 0xc0bc,0xc131,0xb0a8,0xbd81,0xac8c,0xc784,0xd50c,0xb77c, 0xc6cc,0xb7ec,0xc704,0xc628,0xd658,0xacbd,0xcda9,0xbdf0, 0xc1c4,0xc564,0xc528,0xc640,0xce58,0xb125,0xc5d0,0xc5e0, 0xd050,0xc54c,0xd2f0,0xc720,0xbe0c,0xc5d1,0xbe14,0xd29c, 0xbcc0,0xd638,0xbc95,0xb960,0xae08,0xad11,0xcc9c,0xc18d, 0xc591,0xd65c,0xccad,0xc988,0xc139,0xd734,0xcf5c,0xb354, 0xd0dc,0xd398,0xb274,0xb9e5,0xbca8,0xcd95,0xc6f0,0xbca0, 0xb860,0xb2c9,0xad7f,0xc9c1,0xc2f8,0xc820,0xbe5b,0xc758, 0xbc84,0xc6f9,0xd558,0xac00,0xc744,0xbc31,0xb124,0xd035, 0xc288,0xc218,0xd37c,0xcee4,0xbba4,0xb2c8,0xb9c1,0xb450, 0xbbfc,0xb4e0,0xb95c,0xc655,0xd45c,0xc900,0xc584,0xd2f1, 0xd765,0xd0d1,0xc870,0xbcf5,0xad6c,0xd2b9,0xbaa9,0xb78c, 0xbd09,0xd6c4,0xd0b9,0xd038,0xd48d,0xbcc4,0xc554,0xc96c, 0xd070,0xd61c,0xc5b8,0xb798,0xc560,0xbca4,0xcc98,0xd3f4, 0xaddc,0xd6fc,0xbc00,0xc5c4,0xcde8,0xb984,0xcc3d,0xc30d, 0xb2dd,0xd2f8,0xcea0,0xc824,0xc728,0xd0a4,0xc6c5,0xd64d, 0xc2e4,0xc708,0xd30d,0xcc38,0xd5e4,0xb7f4,0xc625,0xad00, 0xb3cc,0xc608,0xd380,0xc62c,0xc2b9,0xc11d,0xb839,0xb9db, 0xc4f0,0xc0e4,0xadf9,0xd5a5,0xd53d,0xb80c,0xd718,0xb9de, 0xcda4,0xbe4c,0xcd94,0xb9cc,0xd1b1,0xb108,0xafbc,0xba38, 0xc6b0,0xc724,0xd329,0xd480,0xc82f,0xc874,0xc8e4,0xce85, 0xb4e4,0xbcf8,0xbc94,0xb825,0xc559,0xaca8,0xcfe0,0xd584, 0xb3c4,0xb098,0xbaa8,0xb2e4,0xc7ac,0xad8c,0xb178,0xbab0, 0xb2e8,0xc9d1,0xccb4,0xc74c,0xb8cc,0xc99d,0xac70,0xae40, 0xb2f9,0xc57c,0xb974,0xbc15,0xc800,0xac80,0xc785,0xb529, 0xb86f,0xcca0,0xbd88,0xbc18,0xbc88,0xc775,0xbd84,0xc791, 0xc0f5,0xb9ad,0xba55,0xac04,0xad70,0xd6a8,0xb2f4,0xb204, 0xcf58,0xd478,0xc0c8,0xd560,0xac10,0xd0c1,0xcfe8,0xc5fc, 0xc5f4,0xac08,0xc545,0xd5c8,0xd544,0xb809,0xd63c,0xb294, 0xb3c5,0xd568,0xcf13,0xc0c9,0xcd0c,0xb4c0,0xb7ed,0xac01, 
0xc735,0xb780,0xc2ec,0xba74,0xba3c,0xaca9,0xce68,0xc871, 0xd76c,0xd669,0xd5ec,0xcc44,0xc9c8,0xc789,0xc561,0xb0c9, 0xb840,0xc83c,0xb208,0xd314,0xcc30,0xc801,0xc555,0xacac, 0xd640,0xc8fd,0xc808,0xbe59,0xd540,0xc5bc,0xc2f1,0xb864, 0xadfc,0xd5cc,0xc300,0xc190,0xbe45,0xac1d,0xd0a8,0xcc99, 0xc2ac,0xb09a,0xad74,0xce60,0xc811,0xc2a8,0xc26c,0xb9bd, 0xb85d,0xb784,0xb179,0xace1,0xacb0,0xd2bc,0xd134,0xd0c4, 0xce5c,0xcc45,0xcc2c,0xc6cd,0xc6c0,0xc568,0xc12c,0xb77d, 0xd3b8,0xd32c,0xd150,0xc7a1,0xbe48,0xb9d0,0xb7c9,0xb180, 0xd38c,0xbbf9,0xbaac,0xba40,0xb989,0xb799,0xb144,0xae38, 0xce21,0xc6c3,0xc308,0xc12f,0xc0b4,0xbc0d,0xb978,0xb760, 0xb378,0xb09c,0xd034,0xbc25,0xb9dd,0xb728,0xb2a5,0xb290, 0xd790,0xcd98,0xc637,0xc21c,0xb9e8,0xb9d8,0xb298,0xb150, 0xae09,0xac24,0xd2c0,0xcea1,0xc20d,0xc1e0,0xbcbd,0xbc38, 0xb871,0xb81b,0xb7a8,0xb304,0xd6c8,0xd3ed,0xd0f1,0xcf10, 0xcef5,0xcd5c,0xcd1d,0xc82c,0xc36c,0xc140,0xc0d8,0xbe75, 0xbe60,0xbe10,0xbd95,0xb7f0,0xb7b5,0xb610,0xb3c8,0xb374, 0xb12c,0xb099,0xb044,0xd788,0xd2f4,0xd1a4,0xd0d0,0xc9dc, 0xc58f,0xc2b4,0xc1a5,0xb3d4,0xafc0,0xadc0,0xd508,0xd3fc, 0xd3d0,0xd39c,0xd399,0xd31c,0xd1a8,0xd131,0xce94,0xcd09, 0xccd0,0xcca8,0xcc60,0xcc3e,0xcc29,0xc9f8,0xc9d5,0xc81d, 0xc7a0,0xc644,0xc2b5,0xbc34,0xb9c9,0xb828,0xb2d8,0xb205, 0xae4c,0xd608,0xd31d,0xc90c,0xc88c,0xc73c,0xc5fd,0xc14b, 0xc0f7,0xbc1d,0xba64,0xb561,0xb524,0xb118,0xb0ad,0xb07c, 0xade0,0xac9c,0xac78,0xcfe1,0xcf69,0xcf04,0xc9f1,0xc695, 0xc573,0xc55e,0xc53d,0xc329,0xc290,0xc19c,0xc0ad,0xbb18, 0xb86c,0xb7fc,0xb545,0xb17c,0xaebc,0xae68,0xacf6,0xd799, 0xd761,0xd655,0xd5db,0xd56b,0xd1f4,0xd0b4,0xce78,0xcc0c, 0xc990,0xc63b,0xc61b,0xc384,0xbd99,0xbd90,0xbcfc,0xb8e9, 0xb7a9,0xb69c,0xb5cc,0xb5a1,0xb518,0xb515,0xb451,0xb3fc, 0xb371,0xb358,0xb2ed,0xb188,0xb0e5,0xaf42,0xace4,0xd720, 0xd700,0xd234,0xd1a1,0xcf70,0xcf08,0xce04,0xc9d3,0xc98c, 0xc813,0xc7bc,0xc70c,0xc570,0xc500,0xc3e0,0xc3d8,0xc2f9, 0xc27d,0xc250,0xc22f,0xc058,0xbe68,0xbe54,0xbcbc,0xbabd, 0xba58,0xba4d,0xb9b4,0xb8f8,0xb460,0xb380,0xb1cc,0xb192, 
0xb140,0xb128,0xb0c5,0xb0a9,0xb05d,0xaf2c,0xae54,0xad34, 0xac90,0xd575,0xd401,0xd3a8,0xd1b0,0xd0e0,0xcfc4,0xccbc, 0xcc4c,0xcc1c,0xcbd4,0xc9da,0xc989,0xc717,0xc635,0xc5ff, 0xc232,0xbafc,0xb8b0,0xb7ad,0xb5bc,0xb530,0xb4dd,0xb465, 0xb41c,0xb2d0,0xb057,0xb04c,0xad81,0xac13,0xd749,0xd6cc, 0xd6a1,0xd601,0xd5f4,0xd54c,0xd47c,0xd3ab,0xd384,0xd31f, 0xd300,0xd15d,0xd140,0xd0ed,0xd0ec,0xcffc,0xcf8c,0xce89, 0xce84,0xce75,0xce69,0xcd78,0xcd2c,0xcc10,0xc9dd,0xc999, 0xc8e0,0xc878,0xc7dd,0xc7c1,0xc7ad,0xc7a3,0xc794,0xc641, 0xc639,0xc610,0xc5b5,0xc58d,0xc575,0xc530,0xc38c,0xc2f6, 0xc2ef,0xc258,0xc22d,0xc219,0xc0cc,0xc0b6,0xbfcc,0xbf55, 0xbe7c,0xbe57,0xbdd4,0xbd24,0xbca7,0xbc1f,0xbc1b,0xbbac, 0xbab8,0xba67,0xb9f7,0xb9d1,0xb9bf,0xb98e,0xb987,0xb86d, 0xb81d,0xb818,0xb801,0xb730,0xb6f0,0xb6b1,0xb54c,0xb534, 0xb454,0xb3cb,0xb385,0xb364,0xb2f5,0xb2db,0xb214,0xb18b, 0xb11d,0xb0c4,0xb0b5,0xaee8,0xae45,0xacfd,0xac71,0xac19, 0xac11,0xd79d,0xd78c,0xd69f,0xd48b,0xd3a0,0xd301,0xd0e4, 0xd0d5,0xd03c,0xcf65,0xcf1c,0xcea3,0xcd1b,0xcc64,0xcabd, 0xc9c7,0xc950,0xc918,0xc8c4,0xc80a,0xc7c8,0xc74d,0xc719, 0xc6b1,0xc651,0xc619,0xc5e3,0xc580,0xc557,0xc52c,0xc388, 0xc2fc,0xc19d,0xc178,0xc174,0xc0ec,0xc0d0,0xc068,0xbf08, 0xbed0,0xbcd5,0xbc40,0xbc2d,0xbbff,0xbbc0,0xbb58,0xbb44, 0xba5c,0xba4b,0xba39,0xb9f5,0xb9d9,0xb97c,0xb959,0xb93c, 0xb8e1,0xb819,0xb738,0xb527,0xb51c,0xb458,0xb284,0xb1e8 }; #define HANGUL_REORDER_BASE 0XB000 u_code_point reorder_hangul(u_code_point s) { u_code_point i=HANGUL_REORDER_BASE; int k=0; for(k=0; k=0 && k=0 && k=0 && k=0 && k> 4, k = 1; tmp != 0; ++k, tmp >>= 4); fprintf(stderr,"diff %x,%x = prev %x ^ codept %x \n", k,diff,prev,codept); if (max_out - out < k) return dude_big_output; shift = uppercase_flags && uppercase_flags[in] ? 32 : 0; /* shift controls the case of the last base-32 digit. */ /* Each quintet has the form 1xxxx except the last is 0xxxx. */ /* Computing the base-32 digits in reverse order is easiest. 
*/
    out += k;
    output[out - 1] = base32[diff & 0xF] - shift;

    for (j = 2;  j <= k;  ++j) {
      diff >>= 4;
      output[out - j] = base32[0x10 | (diff & 0xF)];
    }

    prev = codept;
  }

  /* Append the null terminator: */
  if (max_out - out < 1) return dude_big_output;
  output[out++] = 0;

  *output_size = out;
  return dude_success;
}

/* Decoder: */

enum dude_status dude_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] )
{
  u_code_point prev, q, diff;
  char c;
  unsigned int max_out, in, out, scratch_size;
  enum dude_status status;

  prev = 0x60;
  max_out = *output_length;

  for (c = input[in = 0], out = 0;  c != 0;  c = input[++in], ++out) {

    /* At the start of each iteration, in and out are the number of */
    /* items already input/output, or equivalently, the indices of */
    /* the next items to be input/output. */

    if (max_out - out < 1) return dude_big_output;

    if (c == 0x2D) output[out] = c;  /* hyphen-minus is literal */
    else {
      /* Base-32 sequence.
      Decode quintets until 0xxxx is found: */

      for (diff = 0;  ;  c = input[++in]) {
        q = base32_decode(c);
        if (q == base32_invalid) {
          return dude_bad_input;
        }
        diff = (diff << 4) | (q & 0xF);
        if (q >> 4 == 0) break;
      }

      /* Original DUDE: prev = output[out] = prev ^ diff; */
      prev = prev ^ diff;
      output[out] = restore_order(prev);  /* LSB */
    }

    /* Case of last character determines uppercase flag: */
    if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
  }

  /* Enforce the uniqueness of the encoding by re-encoding */
  /* the output and comparing the result to the input: */

  scratch_size = ++in;
  status = dude_encode(out, output, uppercase_flags,
                       &scratch_size, scratch_space);
  if (status != dude_success || scratch_size != in ||
      unequal(case_sensitivity, scratch_space, input)
     ) return dude_bad_input;

  *output_length = out;
  return dude_success;
}

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a */
/* command-line option.
*/

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive
                          /* suitable for host names */
};

static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads code points and writes a DUDE string.\n"
    "%s -d reads a DUDE string and writes code points.\n"
    "Input and output are plain text in the native character set.\n"
    "Code points are in the form u+hex separated by whitespace.\n"
    "A DUDE string is a newline-terminated sequence of LDH characters\n"
    "(without any signature).\n"
    "The case of the u in u+hex is the force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

/* The following string is used to convert LDH */
/* characters between ASCII and the native charset: */

static const char ldh_ascii[] =
  "................"
  "................"
  ".............-.."
  "0123456789......"
  ".ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ....."
  ".abcdefghijklmno"
  "pqrstuvwxyz";

int main(int argc, char **argv)
{
  enum dude_status status;
  int r;
  char *p;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  if (argv[1][2] != 0) usage(argv);

  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned long codept;
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size], uplus[3];
    unsigned int input_length, output_size, i;

    /* Read the input code points: */

    input_length = 0;

    for (;;) {
      r = scanf("%2s%lx", uplus, &codept);
      if (ferror(stdin)) fail(io_error);
      if (r == EOF || r == 0) break;

      if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
        fail(invalid_input);
      }

      if (input_length == unicode_max_length) fail(too_big);

      if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
      else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
      else fail(invalid_input);

      input[input_length++] = codept;
    }

    /* Encode: */

    output_size = ace_max_size;
    status = dude_encode(input_length, input, uppercase_flags,
                         &output_size, output);
    if (status == dude_bad_input) fail(invalid_input);
    if (status == dude_big_output) fail(too_big);
    assert(status == dude_success);

    /* Convert to native charset and output: */

    for (p = output;  *p != 0;  ++p) {
      i = *p;
      assert(i <= 122 && ldh_ascii[i] != '.');
      *p = ldh_ascii[i];
    }

    r = puts(output);
    /* strlen() returns size_t, so cast it to match the format: */
    fprintf(stderr, "length: %lu\n", (unsigned long) strlen(output));
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size], *pp;
    u_code_point output[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int input_length, output_length, i;

    /* Read the DUDE input string and convert to ASCII: */

    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (feof(stdin)) fail(invalid_input);
    input_length = strlen(input);
    if (input[input_length - 1] != '\n') fail(too_big);
    input[--input_length] = 0;

    for (p = input;  *p != 0;  ++p) {
      pp = strchr(ldh_ascii, *p);
      if (pp == 0)
        fail(invalid_input);
      *p = pp - ldh_ascii;
    }

    /* Decode: */

    output_length = unicode_max_length;
    status = dude_decode(test_case_sensitivity, scratch, input,
                         &output_length, output, uppercase_flags);
    if (status == dude_bad_input) fail(invalid_input);
    if (status == dude_big_output) fail(too_big);
    assert(status == dude_success);

    /* Output the result: */

    for (i = 0;  i < output_length;  ++i) {
      r = printf("%s+%04lX\n", uppercase_flags[i] ? "U" : "u",
                 (unsigned long) output[i] );
      if (r < 0) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}

/* end of ldude.c */


LAMCW: Example implementation into AMC-ACE-W

This idea is applicable to any ACE. LAMCW is the name of the
AMC-ACE-W implementation of this idea. The embedded Hangul, Han, and
Latin frequency tables are subject to change with further study in
the next revision of this draft.

On Unix, save this example source code as lamcw.c, then build and
run it:

    % cc -o lamcw lamcw.c
    % ./lamcw -e < input_file > output_file
    % ./lamcw -d < output_file

An input file should contain code points in u+???? form, delimited
by spaces or newlines.

/* begin of lamcw.c */

/******************************************************/
/* lamcw.c 1.0 (2001-Jul-3) */
/* Soobok Lee */
/* amcw.c from Adam M. Costello */
/******************************************************/

/* This is ANSI C code (C89) implementing AMC-ACE-W version 0.1.*. */

/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include <limits.h>

enum amc_ace_status {
  amc_ace_success,
  amc_ace_bad_input,
  amc_ace_big_output  /* Output would exceed the space provided.
*/
};

enum case_sensitivity { case_sensitive, case_insensitive };

#if UINT_MAX >= 0x1FFFFF
typedef unsigned int u_code_point;
#else
typedef unsigned long u_code_point;
#endif

enum amc_ace_status amc_ace_w_encode(
  unsigned int input_length,
  const u_code_point input[],
  const unsigned char uppercase_flags[],
  unsigned int *output_size,
  char output[] );

/* amc_ace_w_encode() converts Unicode to AMC-ACE-W (without */
/* any signature). The input must be represented as an array */
/* of Unicode code points (not code units; surrogate pairs */
/* are not allowed), and the output will be represented as */
/* null-terminated ASCII. The input_length is the number of */
/* code points in the input. The output_size is an in/out */
/* argument: the caller must pass in the maximum number of */
/* characters that may be output (including the terminating */
/* null), and on successful return it will contain the number of */
/* characters actually output (including the terminating null, */
/* so it will be one more than strlen() would return, which is */
/* why it is called output_size rather than output_length). The */
/* uppercase_flags array must hold input_length boolean values, */
/* where nonzero means the corresponding Unicode character should */
/* be forced to uppercase after being decoded, and zero means it */
/* is caseless or should be forced to lowercase. Alternatively, */
/* uppercase_flags may be a null pointer, which is equivalent */
/* to all zeros. The letters a-z and A-Z are always encoded */
/* literally, regardless of the corresponding flags. The encoder */
/* always outputs lowercase base-32 characters except when */
/* nonzero values of uppercase_flags require otherwise. The */
/* return value may be any of the amc_ace_status values defined */
/* above; if not amc_ace_success, then output_size and output may */
/* contain garbage.
On success, the encoder will never need to */
/* write an output_size greater than input_length*5+1, because of */
/* how the encoding is defined. */

enum amc_ace_status amc_ace_w_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] );

/* amc_ace_w_decode() converts AMC-ACE-W (without any signature) */
/* to Unicode. The input must be represented as null-terminated */
/* ASCII, and the output will be represented as an array of */
/* Unicode code points. The case_sensitivity argument influences */
/* the check on the well-formedness of the input string; it */
/* must be case_sensitive if case-sensitive comparisons are */
/* allowed on encoded strings, case_insensitive otherwise. */
/* The scratch_space must point to space at least as large */
/* as the input, which will get overwritten (this allows the */
/* decoder to avoid calling malloc()). The output_length is */
/* an in/out argument: the caller must pass in the maximum */
/* number of code points that may be output, and on successful */
/* return it will contain the actual number of code points */
/* output. The uppercase_flags array must have room for at */
/* least output_length values, or it may be a null pointer */
/* if the case information is not needed. A nonzero flag */
/* indicates that the corresponding Unicode character should */
/* be forced to uppercase by the caller, while zero means it */
/* is caseless or should be forced to lowercase. The letters */
/* a-z and A-Z are output already in the proper case, but their */
/* flags will be set appropriately so that applying the flags */
/* would be harmless. The return value may be any of the */
/* amc_ace_status values defined above; if not amc_ace_success, */
/* then output_length, output, and uppercase_flags may contain */
/* garbage.
On success, the decoder will never need to write */
/* an output_length greater than the length of the input (not */
/* counting the null terminator), because of how the encoding is */
/* defined. */

/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/* base32[q] is the lowercase base-32 character representing */
/* the number q from the range 0 to 31. Note that we cannot */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII. */

static const char base32[] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

/* base32_decode(c) returns the value of a base-32 character, in the */
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
/* base-32 character. */

enum { base32_invalid = 32 };

static unsigned int base32_decode(char c)
{
  if (c < 50) return base32_invalid;
  if (c <= 57) return c - 26;
  if (c < 97) c += 32;
  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
  return c - 97 - (c > 108) - (c > 111);
}

/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
/* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */
/* then ASCII A-Z are considered equal to a-z respectively.
*/

static int unequal( enum case_sensitivity case_sensitivity,
                    const char s1[], const char s2[] )
{
  char c1, c2;

  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

  for (;;) {
    c1 = *s1;
    c2 = *s2;
    if (c1 >= 65 && c1 <= 90) c1 += 32;
    if (c2 >= 65 && c2 <= 90) c2 += 32;
    if (c1 != c2) return 1;
    if (c1 == 0) return 0;
    ++s1, ++s2;
  }
}

/* LANGUAGE-SPECIFIC IMPROVEMENTS TO DUDE BASED ON CODE REORDERING */

int isHANGUL(u_code_point s)
{
  int SIndex = s - 0xAC00;
  if (SIndex < 0 || SIndex >= 11172) {
    return 0;
  }
  return 1;
}

int isUNIHAN(u_code_point s)
{
  if (s >= 0x4E00 && s <= 0x9FAF) {
    return 1;
  }
  return 0;
}

int isKATAKANA(u_code_point s)
{
  if (s >= 0x30A0 && s <= 0x30FF) {
    return 1;
  }
  return 0;
}

int isHINDI(u_code_point s)
{
  if (s >= 0x0900 && s <= 0x0970) {
    return 1;
  }
  return 0;
}

int isLatins(u_code_point s)
{
  if (s < 0x370) {
    return 1;
  }
  return 0;
}

/* Most frequent 888 Hangul syllables in Korean business names */
#define HG 888
u_code_point hangeul_freq[HG] = {
0xd55c,0xad6d,0xd559,0xad50,0xb300,0xace0,0xb4f1,0xcd08,
0xc911,0xb824,0xd654,0xd604,0xc6d0,0xbb38,0xc721,0xbcd1,
0xc804,0xc790,0xae30,0xacf5,0xc0b0,0xc5c5,0xacc4,0xbb3c,
0xb958,0xc6b4,0xb3d9,0xcc28,0xc220,0xd56d,0xbd80,0xd68d,
0xac74,0xc124,0xcee8,0xd305,0xac15,0xc0dd,0xba85,0xc885,
0xd569,0xc601,0xb18d,0xbb34,0xc5ed,0xc5f0,0xb9f9,0xc120,
0xc11c,0xc6b8,0xbe44,0xc2dc,0xc2a4,0xd15c,0xd14d,0xd0dd,
0xc8fc,0xc2dd,0xd3ec,0xce20,0xbc30,0xb2ec,0xc368,0xaf43,
0xc815,0xbcf4,0xd1b5,0xc2e0,0xc0c1,0xc0ac,0xd68c,0xc138,
0xc6a9,0xd611,0xcd9c,0xd310,0xc9c4,0xb791,0xb9e4,0xd5d8,
0xb0b4,0xc154,0xc1fc,0xd551,0xb0a0,0xb110,0xb370,0xc774,
0xd648,0xb9c8,0xbc14,0xc624,0xc0bf,0xc9d0,0xc2ed,0xc548,
0xc18c,0xd504,0xd2b8,0xc6e8,0xbbf8,0xb514,0xc5b4,0xc544,
0xd53c,0xd30c,0xcf54,0xb9ac,0xceec,0xce7c,0xcf00,0xba54,
0xd22c,0xc740,0xd589,0xce74,0xb4dc,0xadf8,0xb8f9,0xb9b0,
0xc6d4,0xb79c,0xc5ec,0xc88b,0xace8,0xce90,0xb9bc,0xd578,
0xac1c,0xbc1c,0xc5d8,0xc9c0,0xae00,0xb85c,0xbc8c,0xc810,
0xd574,0xd138,0xd0c8,0xd1a0,0xd3f0,0xc678,0xacfc,0xc694, 0xc778,0xb137,0xb2f7,0xd154,0xb808,0xcf64,0xcef4,0xd4e8, 0xd130,0xc5d4,0xd14c,0xbc45,0xd06c,0xc13c,0xb2e5,0xd0c0, 0xc7a5,0xc57d,0xd488,0xc81c,0xc194,0xb8e8,0xc158,0xbc29, 0xc1a1,0xc77c,0xd074,0xb7fd,0xb355,0xd615,0xd328,0xd3c9, 0xc0bc,0xc131,0xb0a8,0xbd81,0xac8c,0xc784,0xd50c,0xb77c, 0xc6cc,0xb7ec,0xc704,0xc628,0xd658,0xacbd,0xcda9,0xbdf0, 0xc1c4,0xc564,0xc528,0xc640,0xce58,0xb125,0xc5d0,0xc5e0, 0xd050,0xc54c,0xd2f0,0xc720,0xbe0c,0xc5d1,0xbe14,0xd29c, 0xbcc0,0xd638,0xbc95,0xb960,0xae08,0xad11,0xcc9c,0xc18d, 0xc591,0xd65c,0xccad,0xc988,0xc139,0xd734,0xcf5c,0xb354, 0xd0dc,0xd398,0xb274,0xb9e5,0xbca8,0xcd95,0xc6f0,0xbca0, 0xb860,0xb2c9,0xad7f,0xc9c1,0xc2f8,0xc820,0xbe5b,0xc758, 0xbc84,0xc6f9,0xd558,0xac00,0xc744,0xbc31,0xb124,0xd035, 0xc288,0xc218,0xd37c,0xcee4,0xbba4,0xb2c8,0xb9c1,0xb450, 0xbbfc,0xb4e0,0xb95c,0xc655,0xd45c,0xc900,0xc584,0xd2f1, 0xd765,0xd0d1,0xc870,0xbcf5,0xad6c,0xd2b9,0xbaa9,0xb78c, 0xbd09,0xd6c4,0xd0b9,0xd038,0xd48d,0xbcc4,0xc554,0xc96c, 0xd070,0xd61c,0xc5b8,0xb798,0xc560,0xbca4,0xcc98,0xd3f4, 0xaddc,0xd6fc,0xbc00,0xc5c4,0xcde8,0xb984,0xcc3d,0xc30d, 0xb2dd,0xd2f8,0xcea0,0xc824,0xc728,0xd0a4,0xc6c5,0xd64d, 0xc2e4,0xc708,0xd30d,0xcc38,0xd5e4,0xb7f4,0xc625,0xad00, 0xb3cc,0xc608,0xd380,0xc62c,0xc2b9,0xc11d,0xb839,0xb9db, 0xc4f0,0xc0e4,0xadf9,0xd5a5,0xd53d,0xb80c,0xd718,0xb9de, 0xcda4,0xbe4c,0xcd94,0xb9cc,0xd1b1,0xb108,0xafbc,0xba38, 0xc6b0,0xc724,0xd329,0xd480,0xc82f,0xc874,0xc8e4,0xce85, 0xb4e4,0xbcf8,0xbc94,0xb825,0xc559,0xaca8,0xcfe0,0xd584, 0xb3c4,0xb098,0xbaa8,0xb2e4,0xc7ac,0xad8c,0xb178,0xbab0, 0xb2e8,0xc9d1,0xccb4,0xc74c,0xb8cc,0xc99d,0xac70,0xae40, 0xb2f9,0xc57c,0xb974,0xbc15,0xc800,0xac80,0xc785,0xb529, 0xb86f,0xcca0,0xbd88,0xbc18,0xbc88,0xc775,0xbd84,0xc791, 0xc0f5,0xb9ad,0xba55,0xac04,0xad70,0xd6a8,0xb2f4,0xb204, 0xcf58,0xd478,0xc0c8,0xd560,0xac10,0xd0c1,0xcfe8,0xc5fc, 0xc5f4,0xac08,0xc545,0xd5c8,0xd544,0xb809,0xd63c,0xb294, 0xb3c5,0xd568,0xcf13,0xc0c9,0xcd0c,0xb4c0,0xb7ed,0xac01, 
0xc735,0xb780,0xc2ec,0xba74,0xba3c,0xaca9,0xce68,0xc871, 0xd76c,0xd669,0xd5ec,0xcc44,0xc9c8,0xc789,0xc561,0xb0c9, 0xb840,0xc83c,0xb208,0xd314,0xcc30,0xc801,0xc555,0xacac, 0xd640,0xc8fd,0xc808,0xbe59,0xd540,0xc5bc,0xc2f1,0xb864, 0xadfc,0xd5cc,0xc300,0xc190,0xbe45,0xac1d,0xd0a8,0xcc99, 0xc2ac,0xb09a,0xad74,0xce60,0xc811,0xc2a8,0xc26c,0xb9bd, 0xb85d,0xb784,0xb179,0xace1,0xacb0,0xd2bc,0xd134,0xd0c4, 0xce5c,0xcc45,0xcc2c,0xc6cd,0xc6c0,0xc568,0xc12c,0xb77d, 0xd3b8,0xd32c,0xd150,0xc7a1,0xbe48,0xb9d0,0xb7c9,0xb180, 0xd38c,0xbbf9,0xbaac,0xba40,0xb989,0xb799,0xb144,0xae38, 0xce21,0xc6c3,0xc308,0xc12f,0xc0b4,0xbc0d,0xb978,0xb760, 0xb378,0xb09c,0xd034,0xbc25,0xb9dd,0xb728,0xb2a5,0xb290, 0xd790,0xcd98,0xc637,0xc21c,0xb9e8,0xb9d8,0xb298,0xb150, 0xae09,0xac24,0xd2c0,0xcea1,0xc20d,0xc1e0,0xbcbd,0xbc38, 0xb871,0xb81b,0xb7a8,0xb304,0xd6c8,0xd3ed,0xd0f1,0xcf10, 0xcef5,0xcd5c,0xcd1d,0xc82c,0xc36c,0xc140,0xc0d8,0xbe75, 0xbe60,0xbe10,0xbd95,0xb7f0,0xb7b5,0xb610,0xb3c8,0xb374, 0xb12c,0xb099,0xb044,0xd788,0xd2f4,0xd1a4,0xd0d0,0xc9dc, 0xc58f,0xc2b4,0xc1a5,0xb3d4,0xafc0,0xadc0,0xd508,0xd3fc, 0xd3d0,0xd39c,0xd399,0xd31c,0xd1a8,0xd131,0xce94,0xcd09, 0xccd0,0xcca8,0xcc60,0xcc3e,0xcc29,0xc9f8,0xc9d5,0xc81d, 0xc7a0,0xc644,0xc2b5,0xbc34,0xb9c9,0xb828,0xb2d8,0xb205, 0xae4c,0xd608,0xd31d,0xc90c,0xc88c,0xc73c,0xc5fd,0xc14b, 0xc0f7,0xbc1d,0xba64,0xb561,0xb524,0xb118,0xb0ad,0xb07c, 0xade0,0xac9c,0xac78,0xcfe1,0xcf69,0xcf04,0xc9f1,0xc695, 0xc573,0xc55e,0xc53d,0xc329,0xc290,0xc19c,0xc0ad,0xbb18, 0xb86c,0xb7fc,0xb545,0xb17c,0xaebc,0xae68,0xacf6,0xd799, 0xd761,0xd655,0xd5db,0xd56b,0xd1f4,0xd0b4,0xce78,0xcc0c, 0xc990,0xc63b,0xc61b,0xc384,0xbd99,0xbd90,0xbcfc,0xb8e9, 0xb7a9,0xb69c,0xb5cc,0xb5a1,0xb518,0xb515,0xb451,0xb3fc, 0xb371,0xb358,0xb2ed,0xb188,0xb0e5,0xaf42,0xace4,0xd720, 0xd700,0xd234,0xd1a1,0xcf70,0xcf08,0xce04,0xc9d3,0xc98c, 0xc813,0xc7bc,0xc70c,0xc570,0xc500,0xc3e0,0xc3d8,0xc2f9, 0xc27d,0xc250,0xc22f,0xc058,0xbe68,0xbe54,0xbcbc,0xbabd, 0xba58,0xba4d,0xb9b4,0xb8f8,0xb460,0xb380,0xb1cc,0xb192, 
0xb140,0xb128,0xb0c5,0xb0a9,0xb05d,0xaf2c,0xae54,0xad34, 0xac90,0xd575,0xd401,0xd3a8,0xd1b0,0xd0e0,0xcfc4,0xccbc, 0xcc4c,0xcc1c,0xcbd4,0xc9da,0xc989,0xc717,0xc635,0xc5ff, 0xc232,0xbafc,0xb8b0,0xb7ad,0xb5bc,0xb530,0xb4dd,0xb465, 0xb41c,0xb2d0,0xb057,0xb04c,0xad81,0xac13,0xd749,0xd6cc, 0xd6a1,0xd601,0xd5f4,0xd54c,0xd47c,0xd3ab,0xd384,0xd31f, 0xd300,0xd15d,0xd140,0xd0ed,0xd0ec,0xcffc,0xcf8c,0xce89, 0xce84,0xce75,0xce69,0xcd78,0xcd2c,0xcc10,0xc9dd,0xc999, 0xc8e0,0xc878,0xc7dd,0xc7c1,0xc7ad,0xc7a3,0xc794,0xc641, 0xc639,0xc610,0xc5b5,0xc58d,0xc575,0xc530,0xc38c,0xc2f6, 0xc2ef,0xc258,0xc22d,0xc219,0xc0cc,0xc0b6,0xbfcc,0xbf55, 0xbe7c,0xbe57,0xbdd4,0xbd24,0xbca7,0xbc1f,0xbc1b,0xbbac, 0xbab8,0xba67,0xb9f7,0xb9d1,0xb9bf,0xb98e,0xb987,0xb86d, 0xb81d,0xb818,0xb801,0xb730,0xb6f0,0xb6b1,0xb54c,0xb534, 0xb454,0xb3cb,0xb385,0xb364,0xb2f5,0xb2db,0xb214,0xb18b, 0xb11d,0xb0c4,0xb0b5,0xaee8,0xae45,0xacfd,0xac71,0xac19, 0xac11,0xd79d,0xd78c,0xd69f,0xd48b,0xd3a0,0xd301,0xd0e4, 0xd0d5,0xd03c,0xcf65,0xcf1c,0xcea3,0xcd1b,0xcc64,0xcabd, 0xc9c7,0xc950,0xc918,0xc8c4,0xc80a,0xc7c8,0xc74d,0xc719, 0xc6b1,0xc651,0xc619,0xc5e3,0xc580,0xc557,0xc52c,0xc388, 0xc2fc,0xc19d,0xc178,0xc174,0xc0ec,0xc0d0,0xc068,0xbf08, 0xbed0,0xbcd5,0xbc40,0xbc2d,0xbbff,0xbbc0,0xbb58,0xbb44, 0xba5c,0xba4b,0xba39,0xb9f5,0xb9d9,0xb97c,0xb959,0xb93c, 0xb8e1,0xb819,0xb738,0xb527,0xb51c,0xb458,0xb284,0xb1e8 }; #define HANGUL_REORDER_BASE 0XB000 u_code_point reorder_hangul(u_code_point s) { u_code_point i=HANGUL_REORDER_BASE; int k=0; for(k=0; k=0 && k=0 && k=0 && k=0 && k 3 ? 1 : *style; refpoint[1] = (n >> 4) << 4; if (k > 2) refpoint[2] = n - 0xA0 < 0xE0 ? 0xA0 : (n >> 8) << 8; if (k > 3) refpoint[3] = n - 0x3000 < 0x7000 ? 0x4E00 : *style == 1 && n - 0xA000 < 0x3800 ? 
      0x8800 : (n >> 12) << 12;
}

/* Main encode function: */

enum amc_ace_status amc_ace_w_encode(
  unsigned int input_length,
  const u_code_point input[],
  const unsigned char uppercase_flags[],
  unsigned int *output_size,
  char output[] )
{
  unsigned int style, literal, max_out, in, out, k, j;
  u_code_point n, delta;
  const u_code_point maxdelta[2][6] = {{0,0xF,0xFF,0xFFF,0xFFFF,0xFFFFF},
                                       {0,0,0xFF,0x4FFF,0xFFFF,0xFFFFF}};
  char shift;

  /* Initialize the state: */

  u_code_point refpoint[6] = {0, 0xE0, 0xA0, 0, 0, 0x10000};
  style = literal = 0;
  max_out = *output_size;

  for (in = out = 0;  in < input_length;  ++in) {

    /* At the start of each iteration, in and out are the number of */
    /* items already input/output, or equivalently, the indices of */
    /* the next items to be input/output. */

    n = input[in];

    /* Check the code point range to avoid array bounds errors later: */
    if (n > 0x10FFFF) return amc_ace_bad_input;

    if (n == 0x2D) {
      /* Hyphen-minus is doubled. */
      if (max_out - out < 2) return amc_ace_big_output;
      output[out++] = 0x2D;
      output[out++] = 0x2D;
    }
    else if ( n <= 122 && ( n >= 97 || n == 45 ||
                            (n >= 48 && n <= 57) ||
                            (n >= 65 && n <= 90) ) ) {
      /* Encode an LDH character literally. */
      if (max_out - out < 1 + !literal) return amc_ace_big_output;
      /* Switch to literal mode if necessary: */
      if (!literal) output[out++] = 0x2D;
      literal = 1;
      output[out++] = n;
    }
    else {
      /* Encode a non-LDH character using base-32. */
      /* First compute the number of base-32 characters (k): */

      n = reorder(n);  /* ADDED */

      for (k = 1 + style;  ;  ++k) {
        delta = n - refpoint[k];
        if (delta <= maxdelta[style][k]) break;
      }

      if (max_out - out < k + literal) return amc_ace_big_output;
      /* Switch to base-32 mode if necessary: */
      if (literal) output[out++] = 0x2D;
      literal = 0;
      shift = uppercase_flags && uppercase_flags[in] ? 32 : 0;

      /* Check for the extended delta of style 1 window 3: */

      if (k == 3 && delta >= 0x1000) {
        /* The top 16k of window 3 is encoded as 0xxxx xxxxx xxxxx.
        */
        delta -= 0x1000;
        output[out++] = base32[delta >> 10] - shift;
        output[out++] = base32[(delta >> 5) & 0x1F];
        output[out++] = base32[delta & 0x1F];
      }
      else {
        /* Each quintet has the form 1xxxx except the last is 0xxxx. */
        /* Computing the base-32 digits in reverse order is easiest. */

        out += k;
        output[out - 1] = base32[delta & 0xF] - shift;

        for (j = 2;  j <= k;  ++j) {
          delta >>= 4;
          output[out - j] = base32[0x10 | (delta & 0xF)];
        }
      }

      update(refpoint, &style, n, k);
    }
  }

  /* Append the null terminator: */
  if (max_out - out < 1) return amc_ace_big_output;
  output[out++] = 0;

  *output_size = out;
  return amc_ace_success;
}

/* Main decode function: */

enum amc_ace_status amc_ace_w_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] )
{
  u_code_point q, delta;
  char c;
  unsigned int style, literal, max_out, in, out, k, scratch_size;
  enum amc_ace_status status;

  /* Initialize the state: */

  u_code_point refpoint[6] = {0, 0xE0, 0xA0, 0, 0, 0x10000};
  style = literal = 0;
  max_out = *output_length;

  for (c = input[in = 0], out = 0;  c != 0;  c = input[++in], ++out) {

    /* At the start of each iteration, in and out are the number of */
    /* items already input/output, or equivalently, the indices of */
    /* the next items to be input/output. c is the same as input[in] */
    /* except when "extra" characters have been consumed (see below). */

    if (c == 0x2D && input[in + 1] != 0x2D) {
      /* Unpaired hyphen-minus toggles mode. */
      literal = !literal;
      c = input[++in];
    }

    if (max_out - out < 1) return amc_ace_big_output;

    if (c == 0x2D) {
      /* Double hyphen-minus represents a hyphen-minus. */
      ++in;
      output[out] = 0x2D;
    }
    else {
      if (literal) output[out] = c;
      else {
        /* Decode a base-32 sequence.
        */
        /* First decode quintets until 0xxxx is found: */

        for (delta = 0, k = 1;  ;  c = input[++in], ++k) {
          q = base32_decode(c);
          if (q == base32_invalid || k > 5) return amc_ace_bad_input;
          delta = (delta << 4) | (q & 0xF);
          if (q >> 4 == 0) break;
        }

        if (style == 1 && k == 1) {
          /* Style 1 has no window 1, so it must be the extended */
          /* delta of window 3, encoded as 0xxxx xxxxx xxxxx. */
          /* Consume the two "extra" characters: */

          for (;  k < 3;  ++k) {
            q = base32_decode(input[++in]);
            if (q == base32_invalid) return amc_ace_bad_input;
            delta = (delta << 5) | q;
          }

          delta += 0x1000;
        }

        output[out] = refpoint[k] + delta;
        update(refpoint, &style, output[out], k);
        output[out] = restore_order(output[out]);  /* ADDED */
      }
    }

    /* Case of last non-extra character determines uppercase flag: */
    if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
  }

  /* Enforce the uniqueness of the encoding by re-encoding */
  /* the output and comparing the result to the input: */

  scratch_size = ++in;
  status = amc_ace_w_encode(out, output, uppercase_flags,
                            &scratch_size, scratch_space);
  if (status != amc_ace_success || scratch_size != in ||
      unequal(case_sensitivity, scratch_space, input)
     ) return amc_ace_bad_input;

  *output_length = out;
  return amc_ace_success;
}

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a */
/* command-line option.
*/

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive
                          /* suitable for host names */
};

static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads code points and writes an AMC-ACE-W string.\n"
    "%s -d reads an AMC-ACE-W string and writes code points.\n"
    "Input and output are plain text in the native character set.\n"
    "Code points are in the form u+hex separated by whitespace.\n"
    "An AMC-ACE-W string is a newline-terminated sequence of LDH\n"
    "characters (without any signature).\n"
    "The case of the u in u+hex is the force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

/* The following string is used to convert LDH */
/* characters between ASCII and the native charset: */

static const char ldh_ascii[] =
  "................"
  "................"
  ".............-.."
  "0123456789......"
  ".ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ....."
  ".abcdefghijklmno"
  "pqrstuvwxyz";

int main(int argc, char **argv)
{
  enum amc_ace_status status;
  int r;
  char *p;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  if (argv[1][2] != 0) usage(argv);

  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned long codept;
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size], uplus[3];
    unsigned int input_length, output_size, i;

    /* Read the input code points: */

    input_length = 0;

    for (;;) {
      r = scanf("%2s%lx", uplus, &codept);
      if (ferror(stdin)) fail(io_error);
      if (r == EOF || r == 0) break;

      if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
        fail(invalid_input);
      }

      if (input_length == unicode_max_length) fail(too_big);

      if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
      else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
      else fail(invalid_input);

      input[input_length++] = codept;
    }

    /* Encode: */

    output_size = ace_max_size;
    status = amc_ace_w_encode(input_length, input, uppercase_flags,
                              &output_size, output);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);

    /* Convert to native charset and output: */

    for (p = output;  *p != 0;  ++p) {
      i = *p;
      assert(i <= 122 && ldh_ascii[i] != '.');
      *p = ldh_ascii[i];
    }

    r = puts(output);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size], *pp;
    u_code_point output[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int input_length, output_length, i;

    /* Read the AMC-ACE-W input string and convert to ASCII: */

    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (feof(stdin)) fail(invalid_input);
    input_length = strlen(input);
    if (input[input_length - 1] != '\n') fail(too_big);
    input[--input_length] = 0;

    for (p = input;  *p != 0;  ++p) {
      pp = strchr(ldh_ascii, *p);
      if (pp == 0) fail(invalid_input);
      *p = pp -
           ldh_ascii;
    }

    /* Decode: */

    output_length = unicode_max_length;
    status = amc_ace_w_decode(test_case_sensitivity, scratch, input,
                              &output_length, output, uppercase_flags);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);

    /* Output the result: */

    for (i = 0;  i < output_length;  ++i) {
      r = printf("%s+%04lX\n", uppercase_flags[i] ? "U" : "u",
                 (unsigned long) output[i] );
      if (r < 0) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}

/* end of lamcw.c */
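
Both example programs call a reordering hook around the unmodified ACE core: reorder() before encoding and restore_order() after decoding. The effect of that hook can be illustrated with a much smaller table. The following is only a sketch under stated assumptions and is not the reorder_hangul() of the listings above: it assumes a swap-style mapping in which each listed code point trades places with one slot of a dense block, and the demo_* names, DEMO_BASE, and the 4-entry table (the first four entries of hangeul_freq) are hypothetical. demo_quintets() follows the DUDE rule that each base-32 character carries 4 bits of the XOR difference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for u_code_point. */
typedef unsigned long demo_code_point;

/* Hypothetical toy table: the first four entries of hangeul_freq. */
#define DEMO_N    4
#define DEMO_BASE 0xB000UL

static const demo_code_point demo_freq[DEMO_N] = {
  0xD55CUL, 0xAD6DUL, 0xD559UL, 0xAD50UL
};

/* Assumed swap-style mapping: a listed code point moves to slot   */
/* DEMO_BASE + k, and the code point that used to own that slot    */
/* moves to the vacated position. All other code points are kept.  */
/* The swap is its own inverse only while no table entry lies      */
/* inside [DEMO_BASE, DEMO_BASE + DEMO_N), which holds here.       */
demo_code_point demo_reorder(demo_code_point s)
{
  size_t k;

  for (k = 0; k < DEMO_N; ++k) {
    if (demo_freq[k] == s) return DEMO_BASE + k;
  }
  if (s >= DEMO_BASE && s < DEMO_BASE + DEMO_N) {
    return demo_freq[s - DEMO_BASE];
  }
  return s;
}

/* Because the mapping is a swap, restoring is the same operation. */
demo_code_point demo_restore_order(demo_code_point s)
{
  return demo_reorder(s);
}

/* Number of base-32 characters DUDE needs for one XOR difference: */
/* one character per 4 bits, and at least one character.           */
unsigned int demo_quintets(demo_code_point diff)
{
  unsigned int k = 1;
  demo_code_point tmp;

  for (tmp = diff >> 4; tmp != 0; tmp >>= 4) ++k;
  return k;
}
```

With this toy table, U+D55C and U+AD6D are mapped to adjacent slots, so the XOR difference between them shrinks from four base-32 characters to one. A full-size table would need the no-overlap condition noted in the comment checked before relying on the swap being reversible.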