Internet Draft                                          Serge Winitzki
draft-winitzki-koi8c-encoding-00.txt
Expires: April 2002

Extended Cyrillic Character Set KOI8-C

Status of this Memo

This memo is an Internet-Draft and is subject to all provisions of
Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Author

Serge Winitzki

Abstract

This document provides information about the character encoding
KOI8-C (KOI8 Cyrillic), proposed for use with the Russian (including
old orthography), Ukrainian, Belorussian, Serbian, and Macedonian
languages, together with special punctuation marks. KOI8-C is
compatible with KOI8-R [1] and KOI8-U [2] in the area of Russian,
Ukrainian and Belorussian letters. It extends them with letters for
old Russian orthography, Yugoslavian Cyrillic letters, and
typographical symbols in positions compatible with CP1251, for use in
legacy applications.

Proposed MIME character set name: koi8-c

Introduction

This document provides information about a proposed new character
encoding, KOI8-C, an extension of the KOI8-R and KOI8-U standards.
This extension supports all Russian letters (including those needed
for old Russian orthography), the Cyrillic letters used in the
Belorussian, Macedonian, Serbian and Ukrainian languages, and certain
frequently used typographic characters borrowed from the CP1251
encoding.

The KOI8-C encoding is compatible with the existing KOI8-RU and
CP1251 encodings in the relevant characters.

Motivation

The KOI8 family of encodings has long been used for electronic
exchange of Cyrillic texts [1,2]. The following considerations have
led the author to propose an extension to KOI8.

1) A large area of the KOI8 encoding table (most of the 0x80-0xBF
range) is, for historical reasons, occupied by pseudographic symbols
that modern software does not use. Most KOI8 font implementations
omit these symbols without any impact on user productivity. These
positions in the encoding table could be used to represent more
frequently used characters.

2) The recent dominance of the "MS Windows" operating environment
has led to wide adoption of word processors that use the "code page
1251" encoding to render Cyrillic text. Many Internet documents are
thus converted to KOI8 from CP1251 and frequently include
typographical signs such as apostrophes, quotes, or dashes, which are
not represented in the KOI8 encodings and which automatic converters
leave unchanged. The codes of these typographical symbols fall in the
unused KOI8 pseudographics area.

3) Texts in old Russian orthography (pre-1918) contain four Cyrillic
letters not represented in any of the widely used Cyrillic encodings.
Although Unicode-based tools would in principle be adequate for
rendering these characters, current software mostly lacks the
necessary support. It would be convenient to have an 8-bit encoding
that represents the old Russian characters, so that they can be
placed directly into a font encoding map and a keyboard layout
compatible with a wide range of current software.

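Consideration 2 can be illustrated with a short sketch. It is not part
of the proposal: it only uses the "cp1251" and "koi8_r" codecs that
ship with Python, and the helper name describe() is made up for this
example. It shows how a byte that a converter leaves unchanged is read
under each encoding.

```python
import unicodedata


def describe(byte_value):
    """Return the Unicode names of one byte read as CP1251 and as KOI8-R.

    Hypothetical helper for illustration only; it relies on the codecs
    bundled with Python, not on any KOI8-C implementation.
    """
    raw = bytes([byte_value])
    as_cp1251 = raw.decode("cp1251")   # what the author of the text typed
    as_koi8r = raw.decode("koi8_r")    # what a KOI8-R reader is shown
    return unicodedata.name(as_cp1251), unicodedata.name(as_koi8r)


# 0x97 is EM DASH in CP1251. A converter that remaps only the letters
# leaves this byte as it is, so a KOI8-R reader sees a different symbol.
cp1251_name, koi8r_name = describe(0x97)
print("CP1251:", cp1251_name)
print("KOI8-R:", koi8r_name)
```

Under KOI8-C the same byte keeps its CP1251 meaning, which is the
behaviour the proposal aims for.
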
Implementation

The author has implemented the KOI8-C encoding according to these
guidelines: (1) compatibility with the KOI8-R and KOI8-U character
sets; (2) compatibility with the CP1251 character set in the area of
typographical symbols and Yugoslavian Cyrillics; (3) the ability to
convert fonts to other Cyrillic encodings.

The lower part of the KOI8-C character set is a complete copy of
ASCII in the range of printable characters (0x20 -- 0x7E); position
0x7F holds NOT SIGN. The range (0x00 -- 0x1F) is occupied by
pseudographics and other rarely used special symbols.

The upper part of the KOI8-C character set contains all Russian,
Belorussian and Ukrainian letters at the positions defined in KOI8-R
and KOI8-U; frequently used typographical symbols (quotes, dashes,
and currency symbols) and Yugoslavian Cyrillics as defined by the
CP1251 encoding; and old Russian letters. Most box drawing characters
from KOI8-R, as well as some mathematical symbols, were removed.

The resulting character set contains all ISO 8859-5 characters except
for SOFT HYPHEN and covers CP1251 except for 5 punctuation characters
(all also in CP1252).

The Web page contains the author's development efforts related to the
KOI8-C encoding and texts in old Russian orthography. The free bitmap
fonts of the Cronyx family for the X window system were adapted to
the KOI8-C encoding, implementing a full KOI8-C map (256 characters)
in all fonts (the "xcyr" project). An extension of the keyboard
layout containing the old Russian letters was proposed. A
spellchecking dictionary for the old Russian orthography using the
KOI8-C encoding was developed.

Relation to other efforts

This encoding was designed as a modification of [1,2]. An independent
font development project, "CYR-RFX", uses an alternative encoding,
"KOI8-O", with similar objectives of compatibility with KOI8-R and
CP1251 but without any Yugoslavian Cyrillic characters.

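As an illustration only (this is not the author's implementation), the
following sketch shows how rows in the "0xNN UXXXX # NAME" format used
by the codepage specification in this document might be turned into a
byte-to-character decoder. The names TABLE_EXCERPT, parse_table and
decode are hypothetical, and only a handful of rows are included, so
any byte outside the excerpt and outside printable ASCII is reported
as U+FFFD.

```python
import re

# A few rows copied from the KOI8-C specification table. A real decoder
# would load all rows; this excerpt is an assumption made for brevity.
TABLE_EXCERPT = """\
0x84 U201E # DOUBLE LOW-9 QUOTATION MARK
0x93 U201C # LEFT DOUBLE QUOTATION MARK
0x97 U2014 # EM DASH
0xA2 U0463 # CYRILLIC SMALL LETTER YAT
0xC9 U0438 # CYRILLIC SMALL LETTER I
0xCD U043C # CYRILLIC SMALL LETTER EM
0xD2 U0440 # CYRILLIC SMALL LETTER ER
"""

# Matches the byte value and the UCS code point at the start of a row.
ROW = re.compile(r"^0x([0-9A-Fa-f]{2})\s+U([0-9A-Fa-f]{4})\b")


def parse_table(text):
    """Build a {byte value: character} mapping from specification rows."""
    mapping = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            byte_value = int(match.group(1), 16)
            mapping[byte_value] = chr(int(match.group(2), 16))
    return mapping


def decode(data, mapping):
    """Decode bytes using the mapping, falling back to printable ASCII."""
    out = []
    for byte_value in data:
        if byte_value in mapping:
            out.append(mapping[byte_value])
        elif 0x20 <= byte_value <= 0x7E:
            out.append(chr(byte_value))
        else:
            out.append("\ufffd")  # not covered by this partial table
    return "".join(out)


mapping = parse_table(TABLE_EXCERPT)
# Three modern Russian letters followed by the old-orthography letter yat.
print(decode(bytes([0xCD, 0xC9, 0xD2, 0xA2]), mapping))
```

A table-driven approach like this keeps the decoder independent of
any particular revision of the codepage: only the rows change.
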
Specification of the KOI8-C codepage

The characters of the KOI8-C codepage (positions 0x01 through 0xFF)
are described according to the ISO 10646 Universal Character Set
(UCS).

0x01 U25C6 # BLACK DIAMOND
0x02 U2592 # MEDIUM SHADE
0x03 U00D7 # MULTIPLICATION SIGN
0x04 U00F7 # DIVISION SIGN
0x05 U2030 # PER MILLE SIGN
0x06 U2248 # ALMOST EQUAL TO
0x07 U00B5 # MICRO SIGN
0x08 U00B1 # PLUS-MINUS SIGN
0x09 U00B6 # PILCROW SIGN
0x0A U2021 # DOUBLE DAGGER
0x0B U2518 # BOX DRAWINGS LIGHT UP AND LEFT
0x0C U2510 # BOX DRAWINGS LIGHT DOWN AND LEFT
0x0D U250C # BOX DRAWINGS LIGHT DOWN AND RIGHT
0x0E U2514 # BOX DRAWINGS LIGHT UP AND RIGHT
0x0F U253C # BOX DRAWINGS LIGHT VERTICAL AND HORIZONTAL
0x10 UFFFD # REPLACEMENT CHARACTER
0x11 UFFFD # REPLACEMENT CHARACTER
0x12 U2500 # BOX DRAWINGS LIGHT HORIZONTAL
0x13 UFFFD # REPLACEMENT CHARACTER
0x14 UFFFD # REPLACEMENT CHARACTER
0x15 U251C # BOX DRAWINGS LIGHT VERTICAL AND RIGHT
0x16 U2524 # BOX DRAWINGS LIGHT VERTICAL AND LEFT
0x17 U2534 # BOX DRAWINGS LIGHT UP AND HORIZONTAL
0x18 U252C # BOX DRAWINGS LIGHT DOWN AND HORIZONTAL
0x19 U2502 # BOX DRAWINGS LIGHT VERTICAL
0x1A U2264 # LESS-THAN OR EQUAL TO
0x1B U2265 # GREATER-THAN OR EQUAL TO
0x1C U03C0 # GREEK SMALL LETTER PI
0x1D U2260 # NOT EQUAL TO
0x1E U00A4 # CURRENCY SIGN
0x1F U00B2 # SUPERSCRIPT TWO
0x20 U0020 # SPACE
0x21 U0021 # EXCLAMATION MARK
0x22 U0022 # QUOTATION MARK
0x23 U0023 # NUMBER SIGN
0x24 U0024 # DOLLAR SIGN
0x25 U0025 # PERCENT SIGN
0x26 U0026 # AMPERSAND
0x27 U0027 # APOSTROPHE
0x28 U0028 # LEFT PARENTHESIS
0x29 U0029 # RIGHT PARENTHESIS
0x2A U002A # ASTERISK
0x2B U002B # PLUS SIGN
0x2C U002C # COMMA
0x2D U002D # HYPHEN-MINUS
0x2E U002E # FULL STOP
0x2F U002F # SOLIDUS
0x30 U0030 # DIGIT ZERO
0x31 U0031 # DIGIT ONE
0x32 U0032 # DIGIT TWO
0x33 U0033 # DIGIT THREE
0x34 U0034 # DIGIT FOUR
0x35 U0035 # DIGIT FIVE
0x36 U0036 # DIGIT SIX
0x37 U0037 # DIGIT SEVEN
0x38 U0038 # DIGIT EIGHT
0x39 U0039 # DIGIT NINE
0x3A U003A # COLON
0x3B U003B # SEMICOLON
0x3C U003C # LESS-THAN SIGN
0x3D U003D # EQUALS SIGN
0x3E U003E # GREATER-THAN SIGN
0x3F U003F # QUESTION MARK
0x40 U0040 # COMMERCIAL AT
0x41 U0041 # LATIN CAPITAL LETTER A
0x42 U0042 # LATIN CAPITAL LETTER B
0x43 U0043 # LATIN CAPITAL LETTER C
0x44 U0044 # LATIN CAPITAL LETTER D
0x45 U0045 # LATIN CAPITAL LETTER E
0x46 U0046 # LATIN CAPITAL LETTER F
0x47 U0047 # LATIN CAPITAL LETTER G
0x48 U0048 # LATIN CAPITAL LETTER H
0x49 U0049 # LATIN CAPITAL LETTER I
0x4A U004A # LATIN CAPITAL LETTER J
0x4B U004B # LATIN CAPITAL LETTER K
0x4C U004C # LATIN CAPITAL LETTER L
0x4D U004D # LATIN CAPITAL LETTER M
0x4E U004E # LATIN CAPITAL LETTER N
0x4F U004F # LATIN CAPITAL LETTER O
0x50 U0050 # LATIN CAPITAL LETTER P
0x51 U0051 # LATIN CAPITAL LETTER Q
0x52 U0052 # LATIN CAPITAL LETTER R
0x53 U0053 # LATIN CAPITAL LETTER S
0x54 U0054 # LATIN CAPITAL LETTER T
0x55 U0055 # LATIN CAPITAL LETTER U
0x56 U0056 # LATIN CAPITAL LETTER V
0x57 U0057 # LATIN CAPITAL LETTER W
0x58 U0058 # LATIN CAPITAL LETTER X
0x59 U0059 # LATIN CAPITAL LETTER Y
0x5A U005A # LATIN CAPITAL LETTER Z
0x5B U005B # LEFT SQUARE BRACKET
0x5C U005C # REVERSE SOLIDUS
0x5D U005D # RIGHT SQUARE BRACKET
0x5E U005E # CIRCUMFLEX ACCENT
0x5F U005F # LOW LINE
0x60 U0060 # GRAVE ACCENT
0x61 U0061 # LATIN SMALL LETTER A
0x62 U0062 # LATIN SMALL LETTER B
0x63 U0063 # LATIN SMALL LETTER C
0x64 U0064 # LATIN SMALL LETTER D
0x65 U0065 # LATIN SMALL LETTER E
0x66 U0066 # LATIN SMALL LETTER F
0x67 U0067 # LATIN SMALL LETTER G
0x68 U0068 # LATIN SMALL LETTER H
0x69 U0069 # LATIN SMALL LETTER I
0x6A U006A # LATIN SMALL LETTER J
0x6B U006B # LATIN SMALL LETTER K
0x6C U006C # LATIN SMALL LETTER L
0x6D U006D # LATIN SMALL LETTER M
0x6E U006E # LATIN SMALL LETTER N
0x6F U006F # LATIN SMALL LETTER O
0x70 U0070 # LATIN SMALL LETTER P
0x71 U0071 # LATIN SMALL LETTER Q
0x72 U0072 # LATIN SMALL LETTER R
0x73 U0073 # LATIN SMALL LETTER S
0x74 U0074 # LATIN SMALL LETTER T
0x75 U0075 # LATIN SMALL LETTER U
0x76 U0076 # LATIN SMALL LETTER V
0x77 U0077 # LATIN SMALL LETTER W
0x78 U0078 # LATIN SMALL LETTER X
0x79 U0079 # LATIN SMALL LETTER Y
0x7A U007A # LATIN SMALL LETTER Z
0x7B U007B # LEFT CURLY BRACKET
0x7C U007C # VERTICAL LINE
0x7D U007D # RIGHT CURLY BRACKET
0x7E U007E # TILDE
0x7F U00AC # NOT SIGN
0x80 U0402 # CYRILLIC CAPITAL LETTER DJE
0x81 U0403 # CYRILLIC CAPITAL LETTER GJE
0x82 U00B8 # CEDILLA
0x83 U0453 # CYRILLIC SMALL LETTER GJE
0x84 U201E # DOUBLE LOW-9 QUOTATION MARK
0x85 U2026 # HORIZONTAL ELLIPSIS
0x86 U2020 # DAGGER
0x87 U00A7 # SECTION SIGN
0x88 U20AC # EURO SIGN
0x89 U00A8 # DIAERESIS
0x8A U0409 # CYRILLIC CAPITAL LETTER LJE
0x8B U2039 # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8C U040A # CYRILLIC CAPITAL LETTER NJE
0x8D U040C # CYRILLIC CAPITAL LETTER KJE
0x8E U040B # CYRILLIC CAPITAL LETTER TSHE
0x8F U040F # CYRILLIC CAPITAL LETTER DZHE
0x90 U0452 # CYRILLIC SMALL LETTER DJE
0x91 U2018 # LEFT SINGLE QUOTATION MARK
0x92 U2019 # RIGHT SINGLE QUOTATION MARK
0x93 U201C # LEFT DOUBLE QUOTATION MARK
0x94 U201D # RIGHT DOUBLE QUOTATION MARK
0x95 U2022 # BULLET
0x96 U2013 # EN DASH
0x97 U2014 # EM DASH
0x98 U00A3 # POUND SIGN
0x99 U00B7 # MIDDLE DOT
0x9A U0459 # CYRILLIC SMALL LETTER LJE
0x9B U203A # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9C U045A # CYRILLIC SMALL LETTER NJE
0x9D U045C # CYRILLIC SMALL LETTER KJE
0x9E U045B # CYRILLIC SMALL LETTER TSHE
0x9F U045F # CYRILLIC SMALL LETTER DZHE
0xA0 U00A0 # NO-BREAK SPACE
0xA1 U0475 # CYRILLIC SMALL LETTER IZHITSA
0xA2 U0463 # CYRILLIC SMALL LETTER YAT
0xA3 U0451 # CYRILLIC SMALL LETTER IO
0xA4 U0454 # CYRILLIC SMALL LETTER UKRAINIAN IE
0xA5 U0455 # CYRILLIC SMALL LETTER DZE
0xA6 U0456 # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
0xA7 U0457 # CYRILLIC SMALL LETTER YI
0xA8 U0458 # CYRILLIC SMALL LETTER JE
0xA9 U00AE # REGISTERED SIGN
0xAA U2122 # TRADE MARK SIGN
0xAB U00AB # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
0xAC U0473 # CYRILLIC SMALL LETTER FITA
0xAD U0491 # CYRILLIC SMALL LETTER GHE WITH UPTURN
0xAE U045E # CYRILLIC SMALL LETTER SHORT U
0xAF U00B4 # ACUTE ACCENT
0xB0 U00B0 # DEGREE SIGN
0xB1 U0474 # CYRILLIC CAPITAL LETTER IZHITSA
0xB2 U0462 # CYRILLIC CAPITAL LETTER YAT
0xB3 U0401 # CYRILLIC CAPITAL LETTER IO
0xB4 U0404 # CYRILLIC CAPITAL LETTER UKRAINIAN IE
0xB5 U0405 # CYRILLIC CAPITAL LETTER DZE
0xB6 U0406 # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
0xB7 U0407 # CYRILLIC CAPITAL LETTER YI
0xB8 U0408 # CYRILLIC CAPITAL LETTER JE
0xB9 U2116 # NUMERO SIGN
0xBA U00A2 # CENT SIGN
0xBB U00BB # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
0xBC U0472 # CYRILLIC CAPITAL LETTER FITA
0xBD U0490 # CYRILLIC CAPITAL LETTER GHE WITH UPTURN
0xBE U040E # CYRILLIC CAPITAL LETTER SHORT U
0xBF U00A9 # COPYRIGHT SIGN
0xC0 U044E # CYRILLIC SMALL LETTER YU
0xC1 U0430 # CYRILLIC SMALL LETTER A
0xC2 U0431 # CYRILLIC SMALL LETTER BE
0xC3 U0446 # CYRILLIC SMALL LETTER TSE
0xC4 U0434 # CYRILLIC SMALL LETTER DE
0xC5 U0435 # CYRILLIC SMALL LETTER IE
0xC6 U0444 # CYRILLIC SMALL LETTER EF
0xC7 U0433 # CYRILLIC SMALL LETTER GHE
0xC8 U0445 # CYRILLIC SMALL LETTER HA
0xC9 U0438 # CYRILLIC SMALL LETTER I
0xCA U0439 # CYRILLIC SMALL LETTER SHORT I
0xCB U043A # CYRILLIC SMALL LETTER KA
0xCC U043B # CYRILLIC SMALL LETTER EL
0xCD U043C # CYRILLIC SMALL LETTER EM
0xCE U043D # CYRILLIC SMALL LETTER EN
0xCF U043E # CYRILLIC SMALL LETTER O
0xD0 U043F # CYRILLIC SMALL LETTER PE
0xD1 U044F # CYRILLIC SMALL LETTER YA
0xD2 U0440 # CYRILLIC SMALL LETTER ER
0xD3 U0441 # CYRILLIC SMALL LETTER ES
0xD4 U0442 # CYRILLIC SMALL LETTER TE
0xD5 U0443 # CYRILLIC SMALL LETTER U
0xD6 U0436 # CYRILLIC SMALL LETTER ZHE
0xD7 U0432 # CYRILLIC SMALL LETTER VE
0xD8 U044C # CYRILLIC SMALL LETTER SOFT SIGN
0xD9 U044B # CYRILLIC SMALL LETTER YERU
0xDA U0437 # CYRILLIC SMALL LETTER ZE
0xDB U0448 # CYRILLIC SMALL LETTER SHA
0xDC U044D # CYRILLIC SMALL LETTER E
0xDD U0449 # CYRILLIC SMALL LETTER SHCHA
0xDE U0447 # CYRILLIC SMALL LETTER CHE
0xDF U044A # CYRILLIC SMALL LETTER HARD SIGN
0xE0 U042E # CYRILLIC CAPITAL LETTER YU
0xE1 U0410 # CYRILLIC CAPITAL LETTER A
0xE2 U0411 # CYRILLIC CAPITAL LETTER BE
0xE3 U0426 # CYRILLIC CAPITAL LETTER TSE
0xE4 U0414 # CYRILLIC CAPITAL LETTER DE
0xE5 U0415 # CYRILLIC CAPITAL LETTER IE
0xE6 U0424 # CYRILLIC CAPITAL LETTER EF
0xE7 U0413 # CYRILLIC CAPITAL LETTER GHE
0xE8 U0425 # CYRILLIC CAPITAL LETTER HA
0xE9 U0418 # CYRILLIC CAPITAL LETTER I
0xEA U0419 # CYRILLIC CAPITAL LETTER SHORT I
0xEB U041A # CYRILLIC CAPITAL LETTER KA
0xEC U041B # CYRILLIC CAPITAL LETTER EL
0xED U041C # CYRILLIC CAPITAL LETTER EM
0xEE U041D # CYRILLIC CAPITAL LETTER EN
0xEF U041E # CYRILLIC CAPITAL LETTER O
0xF0 U041F # CYRILLIC CAPITAL LETTER PE
0xF1 U042F # CYRILLIC CAPITAL LETTER YA
0xF2 U0420 # CYRILLIC CAPITAL LETTER ER
0xF3 U0421 # CYRILLIC CAPITAL LETTER ES
0xF4 U0422 # CYRILLIC CAPITAL LETTER TE
0xF5 U0423 # CYRILLIC CAPITAL LETTER U
0xF6 U0416 # CYRILLIC CAPITAL LETTER ZHE
0xF7 U0412 # CYRILLIC CAPITAL LETTER VE
0xF8 U042C # CYRILLIC CAPITAL LETTER SOFT SIGN
0xF9 U042B # CYRILLIC CAPITAL LETTER YERU
0xFA U0417 # CYRILLIC CAPITAL LETTER ZE
0xFB U0428 # CYRILLIC CAPITAL LETTER SHA
0xFC U042D # CYRILLIC CAPITAL LETTER E
0xFD U0429 # CYRILLIC CAPITAL LETTER SHCHA
0xFE U0427 # CYRILLIC CAPITAL LETTER CHE
0xFF U042A # CYRILLIC CAPITAL LETTER HARD SIGN

Security Considerations

This memo raises no known security issues.

Acknowledgments

The author is grateful to Markus Kuhn (Computer Science Laboratory,
University of Cambridge, UK) for help on creating the KOI8-C encoding
table.

References

[1] Chernov, A., "Registration of a Cyrillic Character Set",
    RFC 1489, July 1993.

[2] Chernov, R., "Ukrainian Character Set KOI8-U", RFC 2319,
    April 1998.

Author's Address

Serge Winitzki
4 Arizona Ter. #2
Arlington, MA 02474
USA