Network Working Group M. Blanchet Internet-Draft Viagenie Obsoletes: 3454 (if approved) July 5, 2010 Intended status: Standards Track Expires: January 6, 2011 Precis Framework: Handling Internationalized Strings in Protocols draft-blanchet-precis-framework-00.txt Abstract Using Unicode codepoints in protocol strings requires preparation of the string. This document describes the Precis Protocol Framework that prepares various classes of strings used in protocol elements. A protocol specification chooses a class of strings and then implements the corresponding preparation steps described in this document. This document is based on the IDNAbis approach. It obsoletes the Stringprep algorithm. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on January 6, 2011. Copyright Notice Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must Blanchet Expires January 6, 2011 [Page 1] Internet-Draft Precis Framework July 2010 include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English. Blanchet Expires January 6, 2011 [Page 2] Internet-Draft Precis Framework July 2010 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. String Classes . . . . . . . . . . . . . . . . . . . . . . . . 4 3. Domain U-Label, A-Label and Name . . . . . . . . . . . . . . . 5 4. Email Addresses . . . . . . . . . . . . . . . . . . . . . . . 5 5. Restricted Identifier . . . . . . . . . . . . . . . . . . . . 5 6. Less-Restrictive Identifier . . . . . . . . . . . . . . . . . 5 7. Normalization Form and Case Folding . . . . . . . . . . . . . 5 8. Codepoint Properties . . . . . . . . . . . . . . . . . . . . . 5 9. Category definitions Used to Calculate Derived Property Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 9.1. LetterDigits (A) . . . . . . . . . . . . . . . . . . . . . 7 9.2. Unstable (B) . . . . . . . . . . . . . . . . . . . . . . . 8 9.3. IgnorableProperties (C) . . . . . . . . . . . . . . . . . 8 9.4. IgnorableBlocks (D) . . . . . . . . . . . . . . . . . . . 8 9.5. LDH (E) . . . . . . . . . . . . . . . . . . . . . . . . . 9 9.6. Exceptions (F) . . . . . . . . . . . . . . . . . . . . . . 9 9.7. BackwardCompatible (G) . . . . . . . . . . . . . . . . . . 10 9.8. JoinControl (H) . . . . . . . . . . . . . . . . . . . . . 10 9.9. OldHangulJamo (I) . . . . . . . . . . . . . . . . . . . . 11 9.10. Unassigned (J) . . . . . . . . . . . . . . . . . . . . . . 11 10. Calculation of the Derived Property . . . . . . . . . . . . . 11 11. Codepoints . . . . . . . . . . . . . . . . . . . . . . . . . . 12 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 12.1. IDNA derived property value registry . . . . . . . . . . . 12 12.2. IDNA Context Registry . . . . . . . . . . . . . . . . . . 12 12.2.1. Template for context registry . . . . . . . . . . . . 13 13. Security Considerations . . . . . . . . . . . . . . . . . . . 13 14. Discussion home for this draft . . . . . . . . . . . . . . . . 13 15. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 13 Appendix A. Contextual Rules Registry . . . . . . . . . . . . . 13 Appendix A.1. ZERO WIDTH NON-JOINER . . . . . . . . . . . . . . . 16 Appendix A.2. ZERO WIDTH JOINER . . . . . . . . . . . . . . . . . 16 Appendix A.3. MIDDLE DOT . . . . . . . . . . . . . . . . . . . . . 16 Appendix A.4. GREEK LOWER NUMERAL SIGN (KERAIA) . . . . . . . . . 17 Appendix A.5. HEBREW PUNCTUATION GERESH . . . . . . . . . . . . . 17 Appendix A.6. HEBREW PUNCTUATION GERSHAYIM . . . . . . . . . . . . 17 Appendix A.7. KATAKANA MIDDLE DOT . . . . . . . . . . . . . . . . 18 Appendix A.8. ARABIC-INDIC DIGITS . . . . . . . . . . . . . . . . 18 Appendix A.9. EXTENDED ARABIC-INDIC DIGITS . . . . . . . . . . . . 18 Appendix B. Codepoints 0x0000 - 0x10FFFF . . . . . . . . . . . . 19 Appendix B.1. Codepoints in Unicode Character Database (UCD) format . . . . . . . . . . . . . . . . . . . . . . . 19 16. Informative References . . . . . . . . . . . . . . . . . . . . 19 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 20 Blanchet Expires January 6, 2011 [Page 3] Internet-Draft Precis Framework July 2010 1. Introduction [draft-ietf-blanchet-newprep-problem-statement] describes the rationale behind updating Stringprep[RFC3454] to a new framework. Current Stringprep profiles and their corresponding protocol specifications share similar class of strings. This framework is based on the assumption that the use of internationalized strings in most protocols can be grouped into a few set of string classes. By defining a few string classes and their corresponding preparation algorithms instead of specific profiles for each protocol, o protocols specifications do not need to have a special i18n section or implementation, since they would reference one of this document string classes and corresponding processing. o protocols benefit for sharing implementation code and tables. o end-users will have a better knowledge of which codepoints are allowed in various contexts (instead of a specific profile per protocol as of with Stringprep profiles o versioning for future versions of the Unicode database is simpler o protocols that have familiarity with others (such as username identifiers used in various authentication schemes in protocols) can use the same string class, and therefore obtain consistency for end-users and implementors. This framework takes heavily on the IDNAbis tables[IDNABISTABLES], therefore, could help implementors by sharing common code for all string classes, including domain labels and names. EDITOR NOTE:This current version of the document copy a lot of normative text from draft-ietf-idnabis-tables. The editor would highly prefer reference instead of copy, but at least for the purpose of discussion, copied text. Moreover, the idnabis-table draft contains references to IDN labels in many places which may make problematic for normative reference. To be looked at as we go. 2. String Classes The following classes of strings are identified: o domain U-label o domain A-label o domain name o email address o restricted identifier o less-restrictive identifier Blanchet Expires January 6, 2011 [Page 4] Internet-Draft Precis Framework July 2010 3. Domain U-Label, A-Label and Name TBD:define the class. For these string classes, implement [IDNA2008]. 4. Email Addresses TBD:define the class by instantiating and refering to the EAI, SMTP. For this classes of strings, implement [EAI]? 5. Restricted Identifier This class of strings, named RI in this document, corresponds to an identifier which contains language-type characters, no spacing characters, no "@", no "punctuation", no display characters. The normative description of this class is in the corresponding mapping tables. In section XX below, allowed Unicode codepoints for this string class are identified as PVALID or RI_PVALID. Disallowed codepoints are identified as DISALLOWED or RI_DISALLOWED. 6. Less-Restrictive Identifier This class of strings, named LRI in this document, corresponds to an identifier which contains language-type characters, no spacing characters, no "@", but contains various "punctuation" and display characters. The normative description of this class is in the corresponding mapping tables. In section XX below, allowed Unicode codepoints for this string class are identified as PVALID or LRI_PVALID. Disallowed codepoints are identified as DISALLOWED or LRI_DISALLOWED. 7. Normalization Form and Case Folding TBD: discuss NFC vs NFKC, case folding", 8. Codepoint Properties This document reviews and classifies the collections of code points Blanchet Expires January 6, 2011 [Page 5] Internet-Draft Precis Framework July 2010 in the Unicode character set by examining various properties of the code points. It then defines an algorithm for determining a derived property value. It specifies a procedure, and not a table, of code points so that the algorithm can be used to determine code point sets independent of the version of Unicode that is in use. This document is not intended to specify precisely how these property values are to be applied in protocol strings. That information should be defined in the protocol specification that instantiate a string class of this document. The value of the property is to be interpreted as follows. o PROTOCOL VALID: Those that are allowed to be used in any string class. Code points with this property value are permitted for general use in any string class. The abbreviated term PVALID is used to refer to this value in the rest of this document. o SPECIFIC CLASS PROTOCOL VALID: Those that are allowed to be used in specific string classes. Code points with this property value are permitted for use in specific string classes. The abbreviated term *_PVALID, where * = (RI, LRI) is used to refer to this value in the rest of this document. o CONTEXTUAL RULE REQUIRED: Some characteristics of the character, such as it being invisible in certain contexts or problematic in others, requires that it not be used in labels unless specific other characters or properties are present. The abbreviated term CONTEXT is used to refer to this value in the rest of this document. There are two subdivisions of CONTEXTUAL RULE REQUIRED, one for Join_controls (called CONTEXTJ) and for other characters (called CONTEXTO). o DISALLOWED: Those that should clearly not be included in any string class. Code points with this property value are not permitted in any string class. o SPECIFIC CLASS DISALLOWED: Those that should clearly not be included in specific string classes. Code points with this property value are not permitted in any string class. The abbreviated term *_DISALLOWED, where * = (RI, LRI) is used to refer to this value in the rest of this document. o UNASSIGNED: Those code points that are not designated (i.e. are unassigned) in the Unicode Standard. The mechanisms described here allow determination of the value of the property for future versions of Unicode (including characters added after Unicode 5.2). Changes in Unicode properties that do not affect the outcome of this process do not affect this framework. For example, a character can have its Unicode General_Category value (see [Unicode52]) change from So to Sm, or from Lo to Ll, without affecting the algorithm results. Moreover, even if such changes were Blanchet Expires January 6, 2011 [Page 6] Internet-Draft Precis Framework July 2010 to result, the BackwardCompatible list (Section 9.7) can be adjusted to ensure the stability of the results. Some code points need to be allowed in exceptional circumstances, but should be excluded in all other cases; these rules are also described in other documents. The most notable of these are the Join Control characters, U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH NON- JOINER. Both of them have the derived property value CONTEXTJ. A character with the derived property value CONTEXTJ or CONTEXTO (CONTEXTUAL RULE REQUIRED) is not to be used unless an appropriate rule has been established and the context of the character is consistent with that rule. It is invalid to either register a string containing these characters or even to look one up unless such contextual rule is found and satisfied. Please see Appendix A, The Contextual Rules Registry, for more information. 9. Category definitions Used to Calculate Derived Property Value The derived property obtains its value based on a two-step procedure. First, characters are placed in one or more character categories based on either core properties defined by the Unicode Standard or by treating the codepoint as an exception and addressing the codepoint by its codepoint value. These categories are not mutually exclusive. In the second step, set operations are used with these categories to determine the values for an string class specific property. Those operations are specified in Section 10. Unicode property names and property value names may have short abbreviations, such as gc for the General_Category property, and Ll for the Lowercase_Letter property value of the gc property. In the following specification of categories, the operation which returns the value of a particular Unicode character property for a code point is designated by using the formal name of that property (from PropertyAliases.txt) followed by '(cp)'. For example, the value of the General_Category property for a code point is indicated by General_Category(cp). 9.1. LetterDigits (A) A: General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc} These rules identifies characters commonly used in mnemonics and often informally described as "language characters". For more information, see section 4.5 of [Unicode5]. Blanchet Expires January 6, 2011 [Page 7] Internet-Draft Precis Framework July 2010 The categories used in this rule are: o Ll - Lowercase_Letter o Lu - Uppercase_Letter o Lo - Other_Letter o Nd - Decimal_Number o Lm - Modifier_Letter o Mn - Nonspacing_Mark o Mc - Spacing_Mark 9.2. Unstable (B) B: toNFKC(toCaseFold(toNFKC(cp))) != cp This category is used to group the characters that are not stable under NFKC normalization and casefolding. In general, these code points are not suitable for use in any string class. The toCaseFold() operation is defined in Section 3.13 of [Unicode5]. The toNFKC() operation returns the code point in normalization form KC. For more information, see Section 5 of [TR15]. 9.3. IgnorableProperties (C) C: Default_Ignorable_Code_Point(cp) = True or White_Space(cp) = True or Noncharacter_Code_Point(cp) = True This category is used to group code points that are not recommended for use in identifiers. In general, these code points are not suitable for identifiers. The definition for Default_Ignorable_Code_Point can be found in DerivedCoreProperties.txt [1] and is at the time of Unicode 5.2: Other_Default_Ignorable_Code_Point + Cf (Format characters) + Variation_Selector - White_Space - FFF9..FFFB (Annotation Characters) - 0600..0603, 06DD, 070F (exceptional Cf characters that should be visible) 9.4. IgnorableBlocks (D) D: Block(cp) is in {Combining Diacritical Marks for Symbols, Musical Symbols, Ancient Greek Musical Notation} This category is used to identifying code points that are not useful in mnemonics but may be useful for some string classes. Blanchet Expires January 6, 2011 [Page 8] Internet-Draft Precis Framework July 2010 The definition of blocks can be found in Blocks.txt [2] 9.5. LDH (E) E: cp is in {002D, 0030..0039, 0061..007A} This category is used in the second step to preserve the traditional "hostname" (LDH) characters ('-', 0-9 and a-z). In general, these code points are suitable for use for identifiers. 9.6. Exceptions (F) F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660, 0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668, 0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6, 06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 0F0B, 3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035, 303B, 30FB} This category explicitly lists code points for which the category cannot be assigned using only the core property values that exist in the Unicode standard. The values are according to the table below: PVALID -- Would otherwise have been DISALLOWED 00DF; PVALID # LATIN SMALL LETTER SHARP S 03C2; PVALID # GREEK SMALL LETTER FINAL SIGMA 06FD; PVALID # ARABIC SIGN SINDHI AMPERSAND 06FE; PVALID # ARABIC SIGN SINDHI POSTPOSITION MEN 0F0B; PVALID # TIBETAN MARK INTERSYLLABIC TSHEG 3007; PVALID # IDEOGRAPHIC NUMBER ZERO CONTEXTO -- Would otherwise have been DISALLOWED 00B7; CONTEXTO # MIDDLE DOT 0375; CONTEXTO # GREEK LOWER NUMERAL SIGN (KERAIA) 05F3; CONTEXTO # HEBREW PUNCTUATION GERESH 05F4; CONTEXTO # HEBREW PUNCTUATION GERSHAYIM 30FB; CONTEXTO # KATAKANA MIDDLE DOT CONTEXTO -- Would otherwise have been PVALID 0660; CONTEXTO # ARABIC-INDIC DIGIT ZERO 0661; CONTEXTO # ARABIC-INDIC DIGIT ONE 0662; CONTEXTO # ARABIC-INDIC DIGIT TWO 0663; CONTEXTO # ARABIC-INDIC DIGIT THREE 0664; CONTEXTO # ARABIC-INDIC DIGIT FOUR 0665; CONTEXTO # ARABIC-INDIC DIGIT FIVE Blanchet Expires January 6, 2011 [Page 9] Internet-Draft Precis Framework July 2010 0666; CONTEXTO # ARABIC-INDIC DIGIT SIX 0667; CONTEXTO # ARABIC-INDIC DIGIT SEVEN 0668; CONTEXTO # ARABIC-INDIC DIGIT EIGHT 0669; CONTEXTO # ARABIC-INDIC DIGIT NINE 06F0; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ZERO 06F1; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ONE 06F2; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT TWO 06F3; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT THREE 06F4; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FOUR 06F5; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FIVE 06F6; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SIX 06F7; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SEVEN 06F8; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT EIGHT 06F9; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT NINE DISALLOWED -- Would otherwise have been PVALID 0640; DISALLOWED # ARABIC TATWEEL 07FA; DISALLOWED # NKO LAJANYALAN 302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK 302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK 3031; DISALLOWED # VERTICAL KANA REPEAT MARK 3032; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK 3033; DISALLOWED # VERTICAL KANA REPEAT MARK UPPER HALF 3034; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HA 3035; DISALLOWED # VERTICAL KANA REPEAT MARK LOWER HALF 303B; DISALLOWED # VERTICAL IDEOGRAPHIC ITERATION MARK 9.7. BackwardCompatible (G) G: cp is in {} This category includes the code points that property values in versions of Unicode after 5.2 have changed in such a way that the derived property value would no longer be PVALID or DISALLOWED. If changes are made to future versions of Unicode so that code points might change property value from PVALID or DISALLOWED, then this table can be updated and keep special exception values so that the property values for code points stay stable. 9.8. JoinControl (H) H: Join_Control(cp) = True This category consists of Join Control characters (i.e., they are not in LetterDigits (Section 9.1)) but are still required in strings under some circumstances. Blanchet Expires January 6, 2011 [Page 10] Internet-Draft Precis Framework July 2010 9.9. OldHangulJamo (I) I: Hangul_Syllable_Type(cp) is in {L, V, T} This category consists of all conjoining Hangul Jamo (Leading Jamo, Vowel Jamo, and Trailing Jamo). Elimination of conjoining Hangul Jamos from the set of PVALID characters results in restricting the set of Korean PVALID characters just to preformed, modern Hangul syllable characters. Old Hangul syllables, which must be spelled with sequences of conjoining Hangul Jamos, are not PVALID for string classes. 9.10. Unassigned (J) J: General_Category(cp) is in {Cn} and Noncharacter_Code_Point(cp) = False This category consists of code points in the Unicode character set that are not (yet) assigned. It should be noted that Unicode distinguishes between 'unassigned code points' and 'unassigned characters'. The unassigned code points are all but (Cn - Noncharacters), while the unassigned *characters* are all but (Cn + Cs). 10. Calculation of the Derived Property Possible values of the property are: o PVALID o RI_PVALID o LRI_PVALID o CONTEXTJ o CONTEXTO o DISALLOWED o RI_DISALLOWED o LRI_DISALLOWED o UNASSIGNED The algorithm to calculate the value of the derived property is as follows. If the names of a rule (such as Exception) is used, that implies the set of codepoints that the rule define, while the same name as a function call (such as Exception(cp)) imply the value cp has in the Exceptions table. If .cp. .in. Exceptions Then Exceptions(cp); Else If .cp. .in. BackwardCompatible Then BackwardCompatible(cp); Blanchet Expires January 6, 2011 [Page 11] Internet-Draft Precis Framework July 2010 Else If .cp. .in. Unassigned Then UNASSIGNED; Else If .cp. .in. LDH Then PVALID; Else If .cp. .in. JoinControl Then CONTEXTJ; Else If .cp. .in. Unstable Then DISALLOWED; Else If .cp. .in. IgnorableProperties Then DISALLOWED; Else If .cp. .in. IgnorableBlocks Then LRI_PVALID; Else If .cp. .in. OldHangulJamo Then DISALLOWED; Else If .cp. .in. LetterDigits Then PVALID; Else DISALLOWED; 11. Codepoints The Categories and Rules defined in Section 9 and Section 10 apply to all Unicode code points. The table in Appendix B shows, for illustrative purposes, the consequences of the categories and classification rules, and the resulting property values. The list of code points that can be found in Appendix B is non- normative. Section 9 and Section 10 are normative. 12. IANA Considerations 12.1. IDNA derived property value registry IANA is to create a registry with the derived properties for the versions of Unicode that is released after (and including) version 5.2. The derived property value is to be calculated in cooperation with a designated expert[RFC5226] according to the specifications in Section 9 and Section 10 and not by copying the non-normative table found in Appendix B. If during this process (creation of the table of derived property values) followed by a designated expert review, either non-backward compatible changes to the table of derived properties are discovered, or otherwise problems during the creation of the table arises, that is to be flagged to the IESG. Changes to the rules (as specified in Section 9 and Section 10), including BackwardCompatible (Section 9.7) (a set that is at release of this document is empty), require IETF Review, as described in [RFC 5226]. 12.2. IDNA Context Registry For characters that are defined in IDNA derived property value registry (Section 12.1) as CONTEXTO or CONTEXTJ and therefore requiring a contextual rule IANA will create and maintain a list of approved contextual rules. Additions or changes to these rules Blanchet Expires January 6, 2011 [Page 12] Internet-Draft Precis Framework July 2010 require IETF Review, as described in [RFC5226]. A table from which that registry can be initialized, and some further discussion appears in Appendix A. 12.2.1. Template for context registry The following information is to be given when a new rule is created. Name: Unique name of the rule Code point: Rule should be applied when this codepoint exist in label Overview: Description in plain english on what the rule verifies Lookup: Should rule be applied at time of lookup? Rule Set: The set of rules, as described in 13. Security Considerations TBD 14. Discussion home for this draft This document is discussed in the precis@ietf.org mailing list (This section to be removed when published as RFC). 15. Acknowledgements The author of this document would like to acknowledge the comments and contributions of the following people: ... Since this document copies a lot of text and the algorithms from IDNAbis tables, therefore all authors and contributors to the idnabis work are deeply acknowledged. Appendix A. Contextual Rules Registry As discussed in Section 12.2, a registry of rules that define the contexts in which particular PROTOCOL-VALID characters, characters associated with a requirement for Contextual Information, are permitted. These rules are expressed as tests on the label in which the characters appear (all, or any part of, the label may be tested). The grammatical rules are expressed in pseudo code. The conventions used for that pseudo code are explained here. Blanchet Expires January 6, 2011 [Page 13] Internet-Draft Precis Framework July 2010 Each rule is constructed as a Boolean expression that evaluates to either True or False. A simple "True;" or "False;" rule sets the default result value for the rule set. Subsequent conditional rules that evaluate to True or False may re-set the result value. A special value "Undefined" is used to deal with any error conditions, such as an attempt to test a character before the start of a label or after the end of a label. If any term of a rule evaluates to Undefined, further evaluation of the rule immediately terminates, as the result value of the rule will itself be Undefined. cp represents the codepoint to be tested. FirstChar is a special term which denotes the first codepoint in a string. LastChar is a special term which denotes the last codepoint in a string. .eq. represents the equality relation. A .eq. B evaluates to True if A equals B. .is. represents checking position in a string. A .is. B evaluates to True if A and B have same position in the same string. .ne. represents the non-equality relation. A .ne. B evaluates to True if A is not equal to B. .in. represents the set inclusion relation. A .in. B evaluates to True if A is a member of the set B. A functional notation, Function_Name(cp), is used to express either string positions within a string, Boolean character property tests of a codepoint, or a regular expression match. When such function names refer to Boolean character property tests, the function names use the exact Unicode character property name for the property in question, and "cp" is evaluated as the Unicode value of the codepoint to be tested, rather than as its position in the string. When such function names refer to string positions within a string, "cp" is evaluated as its position in the string. RegExpMatch(X) takes as its parameter X a schematic regular Blanchet Expires January 6, 2011 [Page 14] Internet-Draft Precis Framework July 2010 expression consisting of a mix of Unicode character property values and literal Unicode codepoints. Script(cp) returns the value of the Unicode Script property, as defined in Scripts.txt in the Unicode Character Database. Canonical_Combining_Class(cp) returns the value of the Unicode Canonical_Combining_Class property, as defined in UnicodeData.txt in the Unicode Character Database. Before(cp) returns the codepoint of the character immediately preceding cp in logical order in the string representing the string. Before(FirstChar) evaluates to Undefined. After(cp) returns the codepoint of the character immediately following cp in logical order in the string representing the string. After(LastChar) evaluates to Undefined. Note that "Before" and "After" do not refer to the visual display order of the character in a string, which may be reversed or otherwise modified by the bidirectional algorithm for strings including characters from scripts written right-to-left. Instead, 'Before' and 'After' refer to the network order of the character in the string. The clauses "Then True" and "Then False" imply exit from the pseudo- code routine with the corresponding result. Repeated evaluation for all characters in a string makes use of the special construct: For All Characters: Expression; End For; This construct requires repeated evaluation of "Expression" for each codepoint in the string, starting from FirstChar and proceeding to LastChar. The different fields in the rules are to be interpreted as follows: Code point: The codepoint, or codepoints, that this rule is to be applied to. Normally, this implies that if any of the codepoints in a string is as defined, then the rules should be applied. If evaluated to True, the codepoint is ok as used; if evaluated to False, it is not o.k. Blanchet Expires January 6, 2011 [Page 15] Internet-Draft Precis Framework July 2010 Overview: A description of the goal with the rule, in plain English. Lookup: True if application of this rule is recommended at lookup time; False otherwise. Rule Set: The rule set itself, as described above. Appendix A.1. ZERO WIDTH NON-JOINER Code point: U+200C Overview: This may occur in a formally cursive script (such as Arabic) in a context where it breaks a cursive connection as required for orthographic rules, as in the Persian language, for example. It also may occur in Indic scripts in a consonant conjunct context (immediately following a virama), to control required display of such conjuncts. Lookup: True Rule Set: False; If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True; If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C (Joining_Type:T)*(Joining_Type:{R,D})) Then True; Appendix A.2. ZERO WIDTH JOINER Code point: U+200D Overview: This may occur in Indic scripts in a consonant conjunct context (immediately following a virama), to control required display of such conjuncts. Lookup: True Rule Set: False; If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True; Appendix A.3. MIDDLE DOT Code point: U+00B7 Blanchet Expires January 6, 2011 [Page 16] Internet-Draft Precis Framework July 2010 Overview: Between 'l' (U+006C) characters only, used to permit the Catalan character ela geminada to be expressed Lookup: False Rule Set: False; If Before(cp) .eq. U+006C And After(cp) .eq. U+006C Then True; Appendix A.4. GREEK LOWER NUMERAL SIGN (KERAIA) Code point: U+0375 Overview: The script of the following character MUST be Greek. Lookup: False Rule Set: False; If Script(After(cp)) .eq. Greek Then True; Appendix A.5. HEBREW PUNCTUATION GERESH Code point: U+05F3 Overview: The script of the preceding character MUST be Hebrew. Lookup: False Rule Set: False; If Script(Before(cp)) .eq. Hebrew Then True; Appendix A.6. HEBREW PUNCTUATION GERSHAYIM Code point: U+05F4 Overview: The script of the preceding character MUST be Hebrew. Lookup: False Rule Set: False; Blanchet Expires January 6, 2011 [Page 17] Internet-Draft Precis Framework July 2010 If Script(Before(cp)) .eq. Hebrew Then True; Appendix A.7. KATAKANA MIDDLE DOT Code point: U+30FB Overview: Note that the Script of Katakana Middle Dot is not any of "Hiragana", "Katakana" or "Han". The effect of this rule is to require at least one character in the label to be in one of those scripts. Lookup: False Rule Set: False; For All Characters: If Script(cp) .in. {Hiragana, Katakana, Han} Then True; End For; Appendix A.8. ARABIC-INDIC DIGITS Code point: 0660..0669 Overview: Can not be mixed with Extended Arabic-Indic Digits. Lookup: False Rule Set: True; For All Characters: If cp .in. 06F0..06F9 Then False; End For; Appendix A.9. EXTENDED ARABIC-INDIC DIGITS Code point: 06F0..06F9 Overview: Can not be mixed with Arabic-Indic Digits. Lookup: False Rule Set: True; For All Characters: Blanchet Expires January 6, 2011 [Page 18] Internet-Draft Precis Framework July 2010 If cp .in. 0660..0669 Then False; End For; Appendix B. Codepoints 0x0000 - 0x10FFFF If one applies the rules (Section 10) to the code points 0x0000 to 0x10FFFF to Unicode 5.2, the result is as follows. This list is non-normative, and only included for illustrative purposes. Specifically, what is displayed in the third column is not the formal name of the codepoint (as defined in section 4.8 of [Unicode52]). The differences exists for example for the codepoints that have the codepoint value as part of the name (example: CJK UNIFIED IDEOGRAPH-4E00) and the naming of Hangul syllables. For many codepoints, what you see is the official name. Appendix B.1. Codepoints in Unicode Character Database (UCD) format 0000..10FFFF; TBD! 16. Informative References [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002. [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003. [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003. [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003. [RFC3722] Bakke, M., "String Profile for Internet Small Computer Systems Interface (iSCSI) Names", RFC 3722, April 2004. [RFC3920] Saint-Andre, P., Ed., "Extensible Messaging and Presence Protocol (XMPP): Core", RFC 3920, October 2004. [RFC4011] Waldbusser, S., Saperia, J., and T. Hongal, "Policy Based Management MIB", RFC 4011, March 2005. Blanchet Expires January 6, 2011 [Page 19] Internet-Draft Precis Framework July 2010 [RFC4013] Zeilenga, K., "SASLprep: Stringprep Profile for User Names and Passwords", RFC 4013, February 2005. [RFC4505] Zeilenga, K., "Anonymous Simple Authentication and Security Layer (SASL) Mechanism", RFC 4505, June 2006. [RFC4518] Zeilenga, K., "Lightweight Directory Access Protocol (LDAP): Internationalized String Preparation", RFC 4518, June 2006. [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and Recommendations for Internationalized Domain Names (IDNs)", RFC 4690, September 2006. [1] [2] Author's Address Marc Blanchet Viagenie 2600 boul. Laurier, suite 625 Quebec, QC G1V 4W1 Canada Email: Marc.Blanchet@viagenie.ca URI: http://www.viagenie.ca Blanchet Expires January 6, 2011 [Page 20]